Analysis Implementation

cclyzer++ is written in Soufflé Datalog to achieve a balance between ease of implementation and performance, see Strictly declarative specification of sophisticated points-to analyses for the reasoning behind this choice. cclyzer++ uses a variety of extensions to vanilla Datalog, each is detailed below.

Context-Sensitivity via Records

Contexts are represented as a cons-list of context items using Soufflé’s records:

// From context/interface.dl
.type ContextItem <: symbol
.type Context = [
    head : ContextItem,
    tail : Context
]

In principle, there’s no need to k-bound these contexts - one could instead not insert repeated context entries and still guarantee termination in principle. In practice, k-bounding is necessary for the analysis to terminate in a reasonable time on all but the most trivial of programs.

The Context datatype represents both contexts and allocation contexts.

Unification-Based Analysis via Subsumption and Choice

The unification-based analysis is fairly unusual. Such an analysis wouldn’t normally be written in vanilla Datalog, which lacks access to the datastructures that make it performant (see discussion in the points-to tutorial). However, Soufflé provides features that make it possible to write a high-performance unification-based analysis.

The unification-based analysis works by merging or unifying allocations that flow to the same place. Consider the following program:

int main() {
  int u;
  int v;
  int *x;
  x = &u;
  x = &v;
  int *y = x;
  return 0;
}

In the unification-based analysis, the allocations *stack_alloc@main[u] and *stack_alloc@main[v] both flow to the variable x, so they would be unified, and treated identically by the rest of the analysis. Ignoring context-sensitivity and heap cloning for the sake of presentation, the output of the analysis would conceptually be something like

ptr_points_to: (empty)
var_points_to:
- &u points to *stack_alloc@main[u]
- &v points to *stack_alloc@main[v]
- x points to the set {*stack_alloc@main[u], *stack_alloc@main[v]}
- y points to the set {*stack_alloc@main[u], *stack_alloc@main[v]}

The stack allocations of u and v are said to be merged, unified, or in the same equivalence class. This is represented by a binary relation unify over (allocation, allocation context) pairs (or just allocations when not considering heap cloning).

Subsumption

Soufflé’s eqrel data structure seems suited to represent merged allocations—a unification-based analysis featured in the evaluation section of the paper that introduced it. However, Galois encountered some performance issues when implementing an analysis based on eqrel. Luckily, Soufflé has a feature called subsumption which can be used to emulate a union-find structure without some of the difficulties of eqrel.

The following snippet schematically demonstrates how to build a union-find-like structure using subsumption. base is some arbitrary binary relation over numbers, which represents whatever pairs need to be inserted into an equivalence relation. The relation union_find relates numbers (on the RHS) to their equivalence class representatives (on the LHS). union_find eventually guarantees that the equivalence class representative is the minimum of the members of the equivalence class (though any way of uniquely choosing a representative would do).

.decl base(x: number, y: number)
.input base
base(n, n) :- base(n, _); base(_, n).

.decl eq(x: number, y: number) eqrel
eq(x, y) :- base(x, y).

.decl union_find(x: number, y: number) btree_delete
union_find(x, y) :- base(x, y), x <= y.                   // rule #1
union_find(y, x) :- base(x, y), y <= x.                   // rule #2
union_find(x, z) :- union_find(x, y), union_find(y, z).   // rule #3
union_find(x, y) <=                                       // rule #4
  union_find(z, y) :-
    z < x.

union_find requires a bit more ceremony than eq. Rules #1 and #2 insert all the pairs in base. Rule #3 states that if x is the representative of the equivalence class of y and y is the representative of the equivalence class of z, then x is the representative of the equivalence class of z. This is analogous to what happens during the find operation on union-find structures, when existing nodes’ parents are modified. Rule #4 removes redundant facts using subsumption: if x and z are both potential representative for y’s equivalence class, and z is less than x, then delete the fact that says that x is a possible representative.

To check that eq and union_find really represent the same equivalence relation, the following assertions may be added (they “pass” when they are empty):

.decl assert_same1(x: number, y: number)
assert_same1(x, y) :-
  eq(x, y),
  union_find(z, x),
  ! union_find(z, y).

.decl assert_same2(x: number, y: number)
assert_same2(x, y) :-
  union_find(z, x),
  union_find(z, y),
  ! eq(x, y).

Choice

Besides a way to efficiently merge points-to sets, the other key component of a performant unification-based analysis is a way to avoid propagating redundant facts throughout the analysis (i.e., there should be only one points-to fact per unified allocation set). cclyzer++ uses Soufflé’s ``choice-domain` feature <choice-domain>`_ to non-deterministically choose a single (allocation, allocation context) pair from each equivalence class for each (variable, context) pair.

To illustrate this point, again consider the program from the beginning of this section (and again ignore context sensitivity and heap cloning). Then the unification analysis with choice-domain would non-deterministically compute either of the following results:

ptr_points_to: (empty)
var_points_to:
- &u points to *stack_alloc@main[u]
- &v points to *stack_alloc@main[v]
- x points to *stack_alloc@main[u]
- y points to *stack_alloc@main[u]
unify:
- *stack_alloc@main[u] is unified with *stack_alloc@main[v]

or

ptr_points_to: (empty)
var_points_to:
- &u points to *stack_alloc@main[u]
- &v points to *stack_alloc@main[v]
- x points to *stack_alloc@main[v]
- y points to *stack_alloc@main[v]
unify:
- *stack_alloc@main[u] is unified with *stack_alloc@main[v]

A post-processing step replaces each points-to fact with a canonical one, where the canonical (allocation, allocation context) pair is the minimal one under some arbitrary total order. If *stack_alloc@main[u] is less than *stack_alloc@main[v] under this order, then the “finalized” results would have all the variables &u, &v, x, and y pointing to *stack_alloc@main[u]. See unification.dl on the details of total order and finalization process.

Shared Code via Components

The subset- and unification-based analyses share code via components. As a representative example, the StripCtx component provides context-insensitive projections of the core points-to relations. These rules are agnostic to the style of the analysis, they only require that the “input relations” (callgraph_edge, *_points_to) have appropriate arities and types.

.comp StripCtx {
  .decl callgraph_edge(?calleeCtx: Context, ?callee: FunctionDecl, ?callerCtx: Context, ?callerInstr: Instruction) inline
  .decl operand_points_to(?aCtx: Context, ?alloc: Allocation, ?ctx: Context, ?operand: Operand) inline
  .decl ptr_points_to(?aCtx: Context, ?alloc: Allocation, ?ctx: Context, ?ptr: Allocation) inline
  .decl var_points_to(?aCtx: Context, ?alloc: Allocation, ?ctx: Context, ?var: Variable) inline

  .decl stripctx_callgraph_edge(?callee: FunctionDecl, ?callerInstr: Instruction)
  stripctx_callgraph_edge(?callee, ?callerInstr) :-
    callgraph_edge(_, ?callee, _, ?callerInstr).

  .decl stripctx_var_points_to(?alloc: Allocation, ?var: Variable)
  stripctx_var_points_to(?alloc, ?var) :-
    var_points_to(_, ?alloc, _, ?var).

  .decl stripctx_ptr_points_to(?to: Allocation, ?from: Allocation)
  stripctx_ptr_points_to(?to, ?from) :-
    ptr_points_to(_, ?to, _, ?from).

  .decl stripctx_operand_points_to(?to: Allocation, ?from: Operand)
  stripctx_operand_points_to(?to, ?from) :-
    operand_points_to(_, ?to, _, ?from).
}

This component is then instantiated in the core pointer analysis component, which both the subset- and unification-based analyses derive from. The points-to relations are copied to the inputs of the StripCtx component.

.comp PointsTo {
  // ...

  .init stripctx = StripCtx

  stripctx.callgraph_edge(?calleeCtx, ?callee, ?callerCtx, ?callerInstr) :-
    callgraph.callgraph_edge(?calleeCtx, ?callee, ?callerCtx, ?callerInstr).

  stripctx.operand_points_to(?aCtx, ?alloc, ?ctx, ?operand) :-
    operand_points_to(?aCtx, ?alloc, ?ctx, ?operand).

  stripctx.ptr_points_to(?aCtx, ?alloc, ?ctx, ?ptr) :-
    ptr_points_to(?aCtx, ?alloc, ?ctx, ?ptr).

  stripctx.var_points_to(?aCtx, ?alloc, ?ctx, ?var) :-
    var_points_to(?aCtx, ?alloc, ?ctx, ?var).

  // ...
}

.comp SubsetPointsTo : PointsTo {
  // ...
}

.init subset = SubsetPointsTo

.comp UnificationPointsTo : PointsTo {
  // ...
}

.init unification = UnificationPointsTo

// Can refer to subset.stripctx.stripctx_callgraph_edge,
// unification.stripctx.stripctx_var_points_to, etc.

In this way, both analysis have their own, separate version of these rules, while sharing the implementation. Other components require different ways of setting up their input relations depending on the analysis, these are instantiated in subset.dl and unification.dl instead of in the PointsTo component.

All rules that depend (even indirectly) on the points-to relations are parameterized on them. This means that the subset and unification analyses can actually be run at the same time (or in sequence) during the same run of the overall Datalog program. At the moment, this is mostly used for testing - in particular, there are assertions to the effect that the results of the unification analysis are a superset of those of the subset analysis.

Project Files

Like C, the Soufflé language has no native notion of splitting a program into multiple files. Soufflé uses the C pre-processor to concatenate many files into a single translation unit. cclyzer++ has several top-level “project files” that #include different subsets of the Datalog files, for different purposes:

debug.project includes all the files in cclyzer++, and exports almost all of the relations.
subset.project includes only the files necessary to run the subset analysis and export the core points-to relations.
unification.project includes only the files necessary to run the unification analysis and export the core points-to relations.

Assertions

The file points-to/assertions.dl contains several “assertion relations”, which have names starting with assert_. The assertions “fail” when they are inhabited - but it’s up to the consumer of the analysis to actually check if this occurred and take appropriate action. Assertions are not included in the points-to project files, as they can be quite expensive to check and are primarily intended for use in testing the analysis.