Sources of Unsoundness

cclyzer++ is not completely sound. Instead, it prioritizes usefulness over soundness, making it “soundy” (see http://soundiness.org/). The following sections the discuss known, notable sources of unsoundness.

Suballocations and Type Back-Propagation

cclyzer++ creates suballocations based on an allocation’s type. However, Type Back-Propagation can result in an allocation having multiple assigned types, and cclyzer++ doesn’t consider aliasing between the suballocations of the various “variants” of the allocation which have different types. The following example 1 demonstrates this issue.

#include <stdint.h>
#include <stdlib.h>

typedef struct { int32_t i1; int32_t i2; } a;
typedef struct { int16_t j1; int16_t j2; int32_t j3; } b;

int main(int argc, char *argv[]) {
  a *p = malloc(sizeof(a));
  b *q = (b*)p;
  int32_t *x = &p->i1;
  int16_t *y = &q->j2;
  return 0;
}

Due to type back-propagation, cclyzer++ will create three separate abstract allocations to represent allocations to represent the single call to malloc:

  • An “untyped” allocation of type i8, call it *heap_alloc@main[p]

  • An allocation of type a, call it *typed_heap_alloc@main[p, a]

  • An allocation of type b, call it *typed_heap_alloc@main[p, b]

The allocation *heap_alloc@main[p, a] would have field suballocations *heap_alloc@main[p].i1 and *heap_alloc@main[p].i2, and similarly for *typed_heap_alloc@main[p, b]. The issue is that the variable y points to *typed_heap_alloc@main[p, b].j2, but the analysis doesn’t keep track of the overlap between *heap_alloc@main[p, b].j2 and *heap_alloc@main[p, a].i1, so stores to one won’t be reflected in loads from the other.

Language Coverage

cclyzer++ doesn’t handle several tricky features of LLVM. The following features could probably have an impact on the soundness of a pointer analysis results but are not yet supported:

  • Inline assembly

  • The llvm.global_ctors and llvm.global_dtors intrinsic global variables

  • Exception handling instructions:
    • catchpad

    • catchret

    • catchswitch

    • cleanuppad

    • cleanupret

    • landingpad

    • resume

  • Other difficult instructions:
    • addrspacecast

    • callbr

    • inttoptr

    • ptrtoint

    • va_arg

  • Several LLVM intrinsics:
    • llvm.memmove

    • llvm.ptrmask

    • llvm.stackrestore

    • llvm.stacksave

    • llvm.va_end

    • llvm.va_start

    • llvm.vp.ptrtoint

  • Code generator intrinsics

  • Vector-related constant expressions, including
    • extractelement

    • insertelement

    • shufflevector

  • Vector-related instructions, including
    • extractelement

    • insertelement

    • shufflevector

  • Vector-related intrinsics, including
    • llvm.masked.load

    • llvm.masked.store

  • Anything related to parallelism and concurrency, including instructions like cmpxchg

  • ptrtoint and inttoptr constant expressions

  • The addrspacecast constant expression

The analysis also doesn’t use any LLVM pointer metadata like llvm.lifetime.* or dereferencable to improve precision.

Clang can be run with the -fno-vectorize and -fno-slp-vectorize options to disable [auto vectorization](https://llvm.org/docs/Vectorizers.html), preventing the generation of unsupported vector instructions. Similarly, LLVM’s [LowerInvoke](https://llvm.org/doxygen/LowerInvoke_8h_source.html) pass can be used to remove (unsupported) exception handling constructs (i.e., via the invoke and resume instructions).

Poison and Undef

The analysis doesn’t do anything with poison or undef values - it doesn’t create points-to facts for them and it doesn’t detect when they may be created or how they may be propagated.

Type-Based Heuristics

cclyzer++ uses type-compatibility heuristics to filter out “infeasible” points-to facts. In particular, the rules for handling memcpy and getelementptr use a notion of type compatibility to propagate fewer facts. This is probably usually OK, but likely leads to unsoundness in similar situations to the one described for suballocations.

External Functions

External functions without sound signatures may have effects on points-to relations that aren’t captured by the analysis.

Footnotes

1

Example adapted from: Sui, Y., Fan, X., Zhou, H. and Xue, J., 2018. Loop-oriented pointer analysis for automatic simd vectorization. ACM Transactions on Embedded Computing Systems (TECS), 17(2), pp.1-31.