# Analysis Implementation

cclyzer++ is written in Soufflé Datalog to achieve a balance between ease of implementation and performance, see Strictly declarative specification of sophisticated points-to analyses for the reasoning behind this choice. cclyzer++ uses a variety of extensions to vanilla Datalog, each is detailed below.

## Context-Sensitivity via Records

Contexts are represented as a cons-list of context items using Soufflé’s records:

```
// From context/interface.dl
.type ContextItem <: symbol
.type Context = [
head : ContextItem,
tail : Context
]
```

In principle, there’s no need to *k*-bound these contexts - one could instead
not insert repeated context entries and still guarantee termination in
principle. In practice, *k*-bounding is necessary for the analysis to terminate
in a reasonable time on all but the most trivial of programs.

The `Context`

datatype represents both contexts and allocation contexts.

## Unification-Based Analysis via Subsumption and Choice

The unification-based analysis is fairly unusual. Such an analysis wouldn’t normally be written in vanilla Datalog, which lacks access to the datastructures that make it performant (see discussion in the points-to tutorial). However, Soufflé provides features that make it possible to write a high-performance unification-based analysis.

The unification-based analysis works by *merging* or *unifying* allocations that
flow to the same place. Consider the following program:

```
int main() {
int u;
int v;
int *x;
x = &u;
x = &v;
int *y = x;
return 0;
}
```

In the unification-based analysis, the allocations `*stack_alloc@main[u]`

and
`*stack_alloc@main[v]`

both flow to the variable `x`

, so they would be
*unified*, and treated identically by the rest of the analysis. Ignoring
context-sensitivity and heap cloning for the sake of presentation, the output of
the analysis would conceptually be something like

`ptr_points_to`

: (empty)`var_points_to`

:`&u`

points to`*stack_alloc@main[u]`

`&v`

points to`*stack_alloc@main[v]`

`x`

points to*the set*`{*stack_alloc@main[u], *stack_alloc@main[v]}`

`y`

points to*the set*`{*stack_alloc@main[u], *stack_alloc@main[v]}`

The stack allocations of `u`

and `v`

are said to be *merged*, *unified*, or
*in the same equivalence class*. This is represented by a binary relation
`unify`

over (allocation, allocation context) pairs (or just allocations when
not considering heap cloning).

### Subsumption

Soufflé’s `eqrel`

data structure seems suited to represent merged
allocations—a unification-based analysis featured in the evaluation section of
the paper that introduced it. However, Galois encountered
some performance issues when implementing an analysis based on
`eqrel`

. Luckily, Soufflé has a feature called subsumption which can be
used to emulate a union-find structure without some of the difficulties of
`eqrel`

.

The following snippet schematically demonstrates how to build a union-find-like
structure using subsumption. `base`

is some arbitrary binary relation over
numbers, which represents whatever pairs need to be inserted into an equivalence
relation. The relation `union_find`

relates numbers (on the RHS) to their
equivalence class representatives (on the LHS). `union_find`

eventually
guarantees that the equivalence class representative is the minimum of the
members of the equivalence class (though any way of uniquely choosing a
representative would do).

```
.decl base(x: number, y: number)
.input base
base(n, n) :- base(n, _); base(_, n).
.decl eq(x: number, y: number) eqrel
eq(x, y) :- base(x, y).
.decl union_find(x: number, y: number) btree_delete
union_find(x, y) :- base(x, y), x <= y. // rule #1
union_find(y, x) :- base(x, y), y <= x. // rule #2
union_find(x, z) :- union_find(x, y), union_find(y, z). // rule #3
union_find(x, y) <= // rule #4
union_find(z, y) :-
z < x.
```

`union_find`

requires a bit more ceremony than `eq`

. Rules #1 and #2 insert
all the pairs in `base`

. Rule #3 states that if `x`

is the representative of
the equivalence class of `y`

and `y`

is the representative of the
equivalence class of `z`

, then `x`

is the representative of the equivalence
class of `z`

. This is analogous to what happens during the `find`

operation
on union-find structures, when existing nodes’ parents are modified. Rule #4
removes redundant facts using subsumption: if `x`

and `z`

are both potential
representative for `y`

’s equivalence class, and `z`

is less than `x`

, then
delete the fact that says that `x`

is a possible representative.

To check that `eq`

and `union_find`

really represent the same equivalence
relation, the following assertions may be added (they “pass” when they are
empty):

```
.decl assert_same1(x: number, y: number)
assert_same1(x, y) :-
eq(x, y),
union_find(z, x),
! union_find(z, y).
.decl assert_same2(x: number, y: number)
assert_same2(x, y) :-
union_find(z, x),
union_find(z, y),
! eq(x, y).
```

### Choice

Besides a way to efficiently merge points-to sets, the other key component of a performant unification-based analysis is a way to avoid propagating redundant facts throughout the analysis (i.e., there should be only one points-to fact per unified allocation set). cclyzer++ uses Soufflé’s ``choice-domain` feature <choice-domain>`_ to non-deterministically choose a single (allocation, allocation context) pair from each equivalence class for each (variable, context) pair.

To illustrate this point, again consider the program from the beginning of this
section (and again ignore context sensitivity and heap
cloning). Then the unification analysis with `choice-domain`

would
non-deterministically compute *either* of the following results:

`ptr_points_to`

: (empty)`var_points_to`

:`&u`

points to`*stack_alloc@main[u]`

`&v`

points to`*stack_alloc@main[v]`

`x`

points to`*stack_alloc@main[u]`

`y`

points to`*stack_alloc@main[u]`

`unify`

:`*stack_alloc@main[u]`

is unified with`*stack_alloc@main[v]`

or

`ptr_points_to`

: (empty)`var_points_to`

:`&u`

points to`*stack_alloc@main[u]`

`&v`

points to`*stack_alloc@main[v]`

`x`

points to`*stack_alloc@main[v]`

`y`

points to`*stack_alloc@main[v]`

`unify`

:`*stack_alloc@main[u]`

is unified with`*stack_alloc@main[v]`

A post-processing step replaces each points-to fact with a *canonical* one,
where the canonical (allocation, allocation context) pair is the minimal one
under some arbitrary total order. If `*stack_alloc@main[u]`

is less than
`*stack_alloc@main[v]`

under this order, then the “finalized” results
would have *all* the variables `&u`

, `&v`

, `x`

, and `y`

pointing to
`*stack_alloc@main[u]`

. See `unification.dl`

on the details of total order
and finalization process.

## Project Files

Like C, the Soufflé language has no native notion of splitting a program into
multiple files. Soufflé uses the C pre-processor to concatenate many files into
a single translation unit. cclyzer++ has several top-level “project files” that
`#include`

different subsets of the Datalog files, for different purposes:

`debug.project`

includes all the files in cclyzer++, and exports almost all of the relations.`subset.project`

includes only the files necessary to run the subset analysis and export the core points-to relations.`unification.project`

includes only the files necessary to run the unification analysis and export the core points-to relations.

## Assertions

The file `points-to/assertions.dl`

contains several “assertion relations”,
which have names starting with `assert_`

. The assertions “fail” when they are
inhabited - but it’s up to the consumer of the analysis to actually check if
this occurred and take appropriate action. Assertions are not included in the
points-to project files, as they can be quite expensive to check
and are primarily intended for use in testing the analysis.