The Code Property Graph
MATE primarily finds vulnerabilities by static program analysis over the target’s code property graph (CPG), which combines representations of a program’s syntax, control flow, and dependencies into a unified graph structure that can be queried to identify potential flaws. The MATE CPG consists of the target’s:
abstract syntax tree (AST)
call graph (CG)
control-flow graph (CFG)
inter-procedural control-flow graph (ICFG)
inter-procedural dataflow-graph (DFG)
control-dependence graph (CDG)
points-to graph (PTG)
source-code to machine-code mapping
memory layout and DWARF type graph
The CPG is a labeled property graph, meaning a graph where every node or edge has a collection of attributes, i.e. key/value pairs. For example, many of the nodes in the CPG represent parts of the syntax of the input program, like Function nodes. Function nodes have attributes like “name”, which can be accessed through the Python query API.
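To make the idea of a labeled property graph concrete, here is a minimal, self-contained sketch in plain Python. It is not the MATE query API: the classes Node, Edge, and PropertyGraph, the "Calls" edge kind, and the uuid values are all hypothetical names chosen for illustration. Only the Function node kind and its “name” attribute come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A graph node: an identifier, a kind label, and key/value attributes."""
    uuid: str
    kind: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled edge that can also carry key/value attributes."""
    kind: str
    source: str  # uuid of the source node
    target: str  # uuid of the target node
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """Hypothetical in-memory container; the real CPG lives in a database."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.uuid] = node

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)

    def nodes_of_kind(self, kind: str) -> List[Node]:
        """Return every node whose kind label matches exactly."""
        return [n for n in self.nodes.values() if n.kind == kind]


graph = PropertyGraph()
graph.add_node(Node("n1", "Function", {"name": "main"}))
graph.add_node(Node("n2", "Function", {"name": "helper"}))
# "Calls" is an illustrative edge kind, not necessarily one in the CPG schema.
graph.add_edge(Edge("Calls", source="n1", target="n2"))

# Read the "name" attribute of every Function node.
function_names = sorted(n.attributes["name"] for n in graph.nodes_of_kind("Function"))
```

The sketch only shows the data model: nodes and edges both carry a kind and a free-form attribute map, and a query amounts to filtering on those. The actual query API and the set of available kinds and attributes are described in the CPG Schema.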
See CPG Schema for detailed information about the various kinds of nodes, edges, and attributes in the CPG.
The CPG is stored in a PostgreSQL database.
General Guidelines
One of the central features of the MATE CPG is a comprehensive and accurate mapping between program representations that occur at different phases of compilation, from the source level (LLVM bitcode), through the LLVM middle-end, all the way to the binary and its embedded DWARF debugging information. Thus, the CPG contains four different representations of the input program, corresponding to the following stages of compilation and to the debugging information:
LLVM IR
LLVM middle-end (nodes and edges that start with “MI”)
Binary (nodes and edges that start with “MC” and “ASM”)
DWARF (nodes and edges that start with “DWARF”)
The LLVM IR is considered the “default” representation, since it is most similar to the source language (C/C++). Therefore, generic names like Function, Block, Instruction, and Argument refer to LLVM IR constructs. Nodes representing LLVM middle-end constructs have kinds prefixed by Machine, and binary-level node kinds are prefixed by ASM.
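The naming convention above can be turned into a small helper that guesses which representation a kind name belongs to. This is a sketch under stated assumptions, not part of MATE: the function layer_of and the table LAYER_PREFIXES are hypothetical, the prefixes are copied from the prose on this page, and kind names such as MachineFunction, ASMInst, and DWARFType are illustrative examples that should be checked against the CPG Schema.

```python
# Prefix table taken from the prose on this page; the authoritative list of
# node and edge kinds is the CPG Schema, so treat this as an assumption.
LAYER_PREFIXES = (
    ("DWARF", "DWARF"),
    ("Machine", "LLVM middle-end"),
    ("MI", "LLVM middle-end"),
    ("ASM", "binary"),
    ("MC", "binary"),
)


def layer_of(kind: str) -> str:
    """Guess the representation a node or edge kind belongs to.

    Kinds without a recognized prefix are treated as LLVM IR, which the
    page describes as the default representation.
    """
    for prefix, layer in LAYER_PREFIXES:
        if kind.startswith(prefix):
            return layer
    return "LLVM IR"


# Generic names come from the text; the prefixed names are illustrative.
examples = ["Function", "Block", "MachineFunction", "ASMInst", "DWARFType"]
layers = {kind: layer_of(kind) for kind in examples}
```

A prefix check like this is only a convenience for reading query results. It would misclassify any LLVM IR kind that happened to begin with one of the listed prefixes, so the schema remains the reference.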
Provenance
The various “layers” of the CPG are constructed by different tools.
abstract syntax tree (AST): ASTGraphWriter
call graph (CG): pointer analysis (see also LLVM middle-end passes)
control-flow graph (CFG): ASTGraphWriter for intra-procedural control-flow
inter-procedural control-flow graph (ICFG): pointer analysis
inter-procedural dataflow-graph (DFG): pointer analysis (for flows through memory and calls) and ASTGraphWriter (for intra-procedural flows)
control-dependence graph (CDG): ASTGraphWriter
points-to graph (PTG): pointer analysis
source-code to machine-code mapping: machine-code mapping
memory layout and DWARF type graph: machine-code mapping