Analysis Implementation
=======================

cclyzer++ is written in `Soufflé Datalog`_ to achieve a balance between ease
of implementation and performance; see *Strictly declarative specification of
sophisticated points-to analyses* for the reasoning behind this choice.
cclyzer++ uses a variety of extensions to vanilla Datalog, each of which is
detailed below.

Context-Sensitivity via Records
*******************************

Contexts are represented as a cons-list of context items using Soufflé's
`records`_:

.. code-block::

   // From context/interface.dl
   .type ContextItem <: symbol
   .type Context = [
     head : ContextItem,
     tail : Context
   ]

In principle, there's no need to *k*-bound these contexts: one could instead
avoid inserting repeated context entries and still guarantee termination. In
practice, *k*-bounding is necessary for the analysis to terminate in a
reasonable time on all but the most trivial of programs. The ``Context``
datatype represents both contexts and allocation contexts.
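To make this concrete, the following is a minimal, self-contained sketch of
how *k*-bounded contexts can be built out of Soufflé records (here with
*k* = 2). The relations ``call_site`` and ``reachable_context`` and their
facts are invented for illustration; they are not the actual cclyzer++ rules.

.. code-block::

   .type ContextItem <: symbol
   .type Context = [ head : ContextItem, tail : Context ]

   // Illustrative input: the call sites of some program.
   .decl call_site(?site: ContextItem)
   call_site("call1").
   call_site("call2").
   call_site("call3").

   .decl reachable_context(?ctx: Context)
   // The empty context is reachable, and so is a one-item context for
   // every call site.
   reachable_context(nil).
   reachable_context([?site, nil]) :-
     call_site(?site).
   // Pushing a call site onto a reachable context keeps only that
   // context's most recent item, so no context ever holds more than two
   // items and the relation stays finite. (The final atom only re-checks
   // the tail, which is always itself reachable.)
   reachable_context([?site, [?prev, nil]]) :-
     call_site(?site),
     reachable_context([?prev, ?tail]),
     reachable_context(?tail).

   // With three call sites and k = 2, this prints 13 (1 + 3 + 3*3).
   .printsize reachable_context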
.. _impl-unification:

Unification-Based Analysis via Subsumption and Choice
*****************************************************

The unification-based analysis is fairly unusual. Such an analysis wouldn't
normally be written in vanilla Datalog, which lacks access to the data
structures that make it performant (see the discussion in the points-to
`tutorial`_). However, Soufflé provides features that make it possible to
write a high-performance unification-based analysis.

The unification-based analysis works by *merging* or *unifying* allocations
that flow to the same place. Consider the following program:

.. code-block:: c

   int main() {
     int u;
     int v;
     int *x;
     x = &u;
     x = &v;
     int *y = x;
     return 0;
   }

In the unification-based analysis, the allocations ``*stack_alloc@main[u]``
and ``*stack_alloc@main[v]`` both flow to the variable ``x``, so they would be
*unified* and treated identically by the rest of the analysis. Ignoring
context-sensitivity and heap cloning for the sake of presentation, the output
of the analysis would conceptually be something like:

* ``ptr_points_to``: (empty)
* ``var_points_to``:

  - ``&u`` points to ``*stack_alloc@main[u]``
  - ``&v`` points to ``*stack_alloc@main[v]``
  - ``x`` points to *the set* ``{*stack_alloc@main[u], *stack_alloc@main[v]}``
  - ``y`` points to *the set* ``{*stack_alloc@main[u], *stack_alloc@main[v]}``

The stack allocations of ``u`` and ``v`` are said to be *merged*, *unified*,
or *in the same equivalence class*. This is represented by a binary relation
``unify`` over (allocation, allocation context) pairs (or just allocations
when not considering heap cloning).

Subsumption
~~~~~~~~~~~

Soufflé's ``eqrel`` `data structure <eqrel_>`_ seems well suited to
representing merged allocations; indeed, a unification-based analysis featured
in the evaluation section of `the paper that introduced it <eqrel-paper_>`_.
However, Galois `encountered some performance issues <eqrel-perf_>`_ when
implementing an analysis based on ``eqrel``. Luckily, Soufflé has a feature
called `subsumption`_, which can be used to emulate a union-find structure
without some of the difficulties of ``eqrel``.

The following snippet schematically demonstrates how to build a
union-find-like structure using subsumption. ``base`` is some arbitrary binary
relation over numbers, representing whatever pairs need to be inserted into an
equivalence relation. The relation ``union_find`` relates numbers (on the RHS)
to their equivalence class representatives (on the LHS). ``union_find``
eventually guarantees that the equivalence class representative is the minimum
of the members of the equivalence class (though any way of uniquely choosing a
representative would do).

.. code-block::

   .decl base(x: number, y: number)
   .input base
   base(n, n) :- base(n, _); base(_, n).

   .decl eq(x: number, y: number) eqrel
   eq(x, y) :- base(x, y).

   .decl union_find(x: number, y: number) btree_delete
   union_find(x, y) :- base(x, y), x <= y.                   // rule #1
   union_find(y, x) :- base(x, y), y <= x.                   // rule #2
   union_find(x, z) :- union_find(x, y), union_find(y, z).   // rule #3
   union_find(x, y) <=                                       // rule #4
     union_find(z, y) :-
       z < x.

``union_find`` requires a bit more ceremony than ``eq``. Rules #1 and #2
insert all the pairs in ``base``. Rule #3 states that if ``x`` is the
representative of the equivalence class of ``y`` and ``y`` is the
representative of the equivalence class of ``z``, then ``x`` is the
representative of the equivalence class of ``z``. This is analogous to what
happens during the ``find`` operation on union-find structures, when existing
nodes' parents are modified. Rule #4 removes redundant facts using
subsumption: if ``x`` and ``z`` are both potential representatives for ``y``'s
equivalence class, and ``z`` is less than ``x``, then delete the fact that
says that ``x`` is a possible representative.

To check that ``eq`` and ``union_find`` really represent the same equivalence
relation, the following assertions may be added (they "pass" when they are
empty):

.. code-block::

   .decl assert_same1(x: number, y: number)
   assert_same1(x, y) :-
     eq(x, y),
     union_find(z, x),
     ! union_find(z, y).

   .decl assert_same2(x: number, y: number)
   assert_same2(x, y) :-
     union_find(z, x),
     union_find(z, y),
     ! eq(x, y).

Choice
~~~~~~

Besides a way to efficiently merge points-to sets, the other key component of
a performant unification-based analysis is a way to avoid propagating
redundant facts throughout the analysis (i.e., there should be only one
points-to fact per unified allocation set). cclyzer++ uses Soufflé's
`choice-domain <choice-domain_>`_ feature to non-deterministically choose a
single (allocation, allocation context) pair from each equivalence class for
each (variable, context) pair.

To illustrate this point, again consider the program from `the beginning of
this section <impl-unification_>`_ (and again ignore context sensitivity and
heap cloning). Then the unification analysis with ``choice-domain`` would
non-deterministically compute *either* of the following results:

* ``ptr_points_to``: (empty)
* ``var_points_to``:

  - ``&u`` points to ``*stack_alloc@main[u]``
  - ``&v`` points to ``*stack_alloc@main[v]``
  - ``x`` points to ``*stack_alloc@main[u]``
  - ``y`` points to ``*stack_alloc@main[u]``

* ``unify``:

  - ``*stack_alloc@main[u]`` is unified with ``*stack_alloc@main[v]``

or

* ``ptr_points_to``: (empty)
* ``var_points_to``:

  - ``&u`` points to ``*stack_alloc@main[u]``
  - ``&v`` points to ``*stack_alloc@main[v]``
  - ``x`` points to ``*stack_alloc@main[v]``
  - ``y`` points to ``*stack_alloc@main[v]``

* ``unify``:

  - ``*stack_alloc@main[u]`` is unified with ``*stack_alloc@main[v]``

A post-processing step replaces each points-to fact with a *canonical* one,
where the canonical (allocation, allocation context) pair is the minimal one
under some arbitrary total order. If ``*stack_alloc@main[u]`` is less than
``*stack_alloc@main[v]`` under this order, then the "finalized" results would
have *all* the variables ``&u``, ``&v``, ``x``, and ``y`` pointing to
``*stack_alloc@main[u]``. See ``unification.dl`` for the details of the total
order and the finalization process.
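The following is a minimal, self-contained illustration of ``choice-domain``;
the relations ``flows_to`` and ``chosen_points_to`` and their facts are
invented for the example and are not the actual cclyzer++ relations. Because
``?var`` is declared as the choice domain, at most one fact is kept per
variable, and which allocation is kept is chosen non-deterministically:

.. code-block::

   // Both allocations flow to both variables, as in the program above.
   .decl flows_to(?alloc: symbol, ?var: symbol)
   flows_to("stack_alloc_u", "x").
   flows_to("stack_alloc_v", "x").
   flows_to("stack_alloc_u", "y").
   flows_to("stack_alloc_v", "y").

   // The functional dependency ?var -> ?alloc is enforced: the first
   // allocation derived for each variable "wins" and later ones are
   // discarded.
   .decl chosen_points_to(?alloc: symbol, ?var: symbol) choice-domain ?var
   chosen_points_to(?alloc, ?var) :- flows_to(?alloc, ?var).

   .output chosen_points_to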
Shared Code via Components
**************************

The subset- and unification-based analyses share code via `components`_. As a
representative example, the ``StripCtx`` component provides
context-insensitive projections of the core points-to relations. These rules
are agnostic to the style of the analysis; they only require that the "input
relations" (``callgraph_edge``, ``*_points_to``) have appropriate arities and
types.

.. code-block::

   .comp StripCtx {
     .decl callgraph_edge(?calleeCtx: Context, ?callee: FunctionDecl, ?callerCtx: Context, ?callerInstr: Instruction) inline
     .decl operand_points_to(?aCtx: Context, ?alloc: Allocation, ?ctx: Context, ?operand: Operand) inline
     .decl ptr_points_to(?aCtx: Context, ?alloc: Allocation, ?ctx: Context, ?ptr: Allocation) inline
     .decl var_points_to(?aCtx: Context, ?alloc: Allocation, ?ctx: Context, ?var: Variable) inline

     .decl stripctx_callgraph_edge(?callee: FunctionDecl, ?callerInstr: Instruction)
     stripctx_callgraph_edge(?callee, ?callerInstr) :-
       callgraph_edge(_, ?callee, _, ?callerInstr).

     .decl stripctx_var_points_to(?alloc: Allocation, ?var: Variable)
     stripctx_var_points_to(?alloc, ?var) :-
       var_points_to(_, ?alloc, _, ?var).

     .decl stripctx_ptr_points_to(?to: Allocation, ?from: Allocation)
     stripctx_ptr_points_to(?to, ?from) :-
       ptr_points_to(_, ?to, _, ?from).

     .decl stripctx_operand_points_to(?to: Allocation, ?from: Operand)
     stripctx_operand_points_to(?to, ?from) :-
       operand_points_to(_, ?to, _, ?from).
   }

This component is then *instantiated* in the core pointer analysis component,
from which both the subset- and unification-based analyses derive. The
points-to relations are copied to the inputs of the ``StripCtx`` component.

.. code-block::

   .comp PointsTo {
     // ...

     .init stripctx = StripCtx

     stripctx.callgraph_edge(?calleeCtx, ?callee, ?callerCtx, ?callerInstr) :-
       callgraph.callgraph_edge(?calleeCtx, ?callee, ?callerCtx, ?callerInstr).

     stripctx.operand_points_to(?aCtx, ?alloc, ?ctx, ?operand) :-
       operand_points_to(?aCtx, ?alloc, ?ctx, ?operand).

     stripctx.ptr_points_to(?aCtx, ?alloc, ?ctx, ?ptr) :-
       ptr_points_to(?aCtx, ?alloc, ?ctx, ?ptr).

     stripctx.var_points_to(?aCtx, ?alloc, ?ctx, ?var) :-
       var_points_to(?aCtx, ?alloc, ?ctx, ?var).

     // ...
   }

   .comp SubsetPointsTo : PointsTo {
     // ...
   }
   .init subset = SubsetPointsTo

   .comp UnificationPointsTo : PointsTo {
     // ...
   }
   .init unification = UnificationPointsTo

   // Can refer to subset.stripctx.stripctx_callgraph_edge,
   // unification.stripctx.stripctx_var_points_to, etc.

In this way, both analyses have their own separate versions of these rules
while sharing the implementation. Other components require different ways of
setting up their input relations depending on the analysis; these are
instantiated in ``subset.dl`` and ``unification.dl`` instead of in the
``PointsTo`` component.

All rules that depend (even indirectly) on the points-to relations are
parameterized on them. This means that the subset and unification analyses can
actually be run *at the same time* (or in sequence) during the same run of the
overall Datalog program. At the moment, this is mostly used for testing; in
particular, there are `assertions`_ to the effect that the results of the
unification analysis are a superset of those of the subset analysis.

.. _project:

Project Files
*************

Like C, the Soufflé language has no native notion of splitting a program into
multiple files. Soufflé uses the C pre-processor to concatenate many files
into a single translation unit. cclyzer++ has several top-level "project
files" that ``#include`` different subsets of the Datalog files for different
purposes (a schematic sketch of such a file follows the list):

- ``debug.project`` includes all the files in cclyzer++ and exports almost all
  of the relations.
- ``subset.project`` includes only the files necessary to run the subset
  analysis and export the core points-to relations.
- ``unification.project`` includes only the files necessary to run the
  unification analysis and export the core points-to relations.
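Schematically, a project file is little more than a list of ``#include``
directives; the file names below are purely illustrative and do not match the
actual layout of the cclyzer++ sources:

.. code-block::

   // Hypothetical sketch of a project file: pull in just the files needed
   // for one configuration of the analysis.
   #include "points-to/subset.dl"
   #include "points-to/export.dl"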
.. _assertions:

Assertions
**********

The file ``points-to/assertions.dl`` contains several "assertion relations",
which have names starting with ``assert_``. The assertions "fail" when they
are inhabited, but it's up to the consumer of the analysis to check whether
this has occurred and to take appropriate action. Assertions are not included
in the points-to `project files <project_>`_, as they can be quite expensive
to check and are primarily intended for use in testing the analysis.

.. _Soufflé Datalog: https://souffle-lang.github.io/
.. _tutorial: http://yanniss.github.io/points-to-tutorial15.pdf
.. _records: https://souffle-lang.github.io/types#record-types
.. _eqrel: https://souffle-lang.github.io/relations#equivalence-relations
.. _eqrel-paper: https://ieeexplore.ieee.org/document/8891656
.. _eqrel-perf: https://github.com/souffle-lang/souffle/issues/2054
.. _subsumption: https://souffle-lang.github.io/subsumption
.. _choice-domain: https://souffle-lang.github.io/choice
.. _components: https://souffle-lang.github.io/components