(verifying-code)= # Verifying Code Using Symbolic Execution Proving things about Cryptol models can be quite useful; however, as discussed before, the most common use case for SAW is probably verifying code against a Cryptol model. There are three parts to this: writing the specification to verify against, loading the code to verify, and then running the verification. The specification is the most complicated part of this, so we'll take it up first. There are three symbolic execution backends in SAW; they all wrap different languages around the Crucible symbolic execution engine. They handle LLVM bitcode, Java byte code, and Rust's MIR intermediate representation. (There is also a fourth experimental backend, interconnected with the LLVM backend, that handles raw x86_64 binaries.) Each of these has its own collection of SAWScript builtins; the three sets of builtins are very similar but not quite identical. Similarly, when using the Python bindings each has its own interface, and the three are very similar but not quite identical. ## Simple Example for Getting Started :::{warning} This section is under construction! ::: (spec-languages)= ## Languages for Specifications Each backend also has its own specification language. This is necessary because all three backends execute code in a wider world than Cryptol and SAWCore. SAWCore is a pure lambda calculus; Cryptol is largely restricted to finite objects. LLVM, Java, and Rust all operate in a world that has pointers, memory allocation, mutable state, global variables, partial functions where some combinations of inputs may be invalid, and other real-world phenomena. All three of them also (and not uncoincidentally) have a broader set of types than either Cryptol or SAWCore, such as pointers. Consequently, each backend has its own language that includes these types and their values. This is the language that specifications for code in that backend must be written in, and those specifications must typecheck in that language. In each case it is, like the specification languages for other production language verification frameworks, akin to but not quite the same as a pure subset of the original source language. Unfortunately, at least for the time being, the backend specification languages are entirely implicit. The types and values of these languages appear as SAWScript program objects, but they have no concrete syntax. Instead, types and values are assembled programmatically, and then assertions and specifications are assembled programmatically out of those. Values can be lifted out of Cryptol (and/or SAWCore) as needed. In the typical case of verifying code against a Cryptol model, the specification works by lifting the results of Cryptol-level computations into the specification language and relating them to the results produced by the target code. There is thus an implicit correspondence between Cryptol types and certain types in each backend, and SAWScript operations that lift Cryptol values to specification-language values according to this correspondence. The SAWScript type of LLVM type objects in the LLVM backend is `LLVMType`; the SAWScript type of LLVM value objects is `LLVMValue`. The function `llvm_term` lifts a Cryptol-level value of SAWScript type `Term` to an LLVM value object of type `LLVMValue`. Similarly, in the JVM backend, types appear as `JavaType`, values as `JVMValue`, and the value-lifting function is `jvm_term`. In the MIR backend, types are `MIRType`, values are `MIRValue`, and the lifting function is `mir_term`. Users frequently find this embedding, these functions, and the sometimes apparently arbitrary need for their use confusing. It is important to build a robust mental model of the multiple layers involved and how the mapping between them works. Cryptol values can be lifted to LLVM, JVM, or MIR values, and SAWScript is its own layer above both. Rather than diving directly into the implicit backend languages (including the details of those mappings) without context, we'll first look in a general way at how specifications for code work. ## Specifications Given the above, it should be clear that in general the specification for an LLVM, JVM, or MIR function is more involved than just naming a corresponding Cryptol function, even if that Cryptol function is a complete behavioral specification. SAW specifications for these backends (sometimes, somewhat inaccurately, also called "Crucible specifications") each set up a call to the target function. This involves creating symbolic variables, defining the setup needed to make a call (called the _prestate_), providing (generally symbolic) arguments to the function, and then defining the expected results (called the _poststate_). Defining the prestate and poststate can each be subdivided into two parts: first, allocating memory, and second, making assertions about symbolic variables and the contents of memory (called _preconditions_ and _postconditions_ respectively). The prestate and poststate logic are divided by the argument specification. This can be thought of as the declaration of the function arguments; it can also be thought of as the point in the specification where the target function gets invoked. There is a specific builtin for this in each back end; for example, in the LLVM backend it is `llvm_execute_func`. There must be exactly one call to this builtin in each specification. Each backend provides a set of operations for creating symbolic (also called _fresh_) variables of various types. In general you want to prove that your function behaves a certain way for all valid inputs; the premise of verification via symbolic execution is that we can execute on inputs that range over many possible values. Generally, for base types symbolic values can be generated at the Cryptol level and then lifted into the code specification layer. For complex types that have no Cryptol analogues, symbolic values can be created either by constructing specification-level values from symbolic base values (so for example, one might create a struct with two integer fields from two symbolic integer values lifted into the specification layer) or with convenience functions that create complex values and fill them with fresh values. While in general fresh variables are created at the beginning of a specification (because these would be the quantifier-bound variables if the specification were written declaratively as a theorem) it is also possible to create new ones in the poststate logic. These are treated as outputs; essentially they're existentially bound in the output part of the specification. Pointers and references (other than null or uninitialized pointers, in backends where they exist) must be created by explicitly allocating memory. Allocations done in the prestate logic are presumed to have been done before the function call, and the existence of those allocations is part of the overall precondition. Allocations done in the poststate logic are presumed to be done _by_ the function call, and the resulting pointers are treated as outputs. Each distinct memory region must be allocated separately, and separately allocated memory regions are (by default, with some exceptions) presumed to be distinct. In situations where a function takes multiple pointer arguments that might or might not alias, it is usually necessary to write multiple specifications to cover the different cases. Assertions relate symbolic variables to other symbolic variables, or to constants, or to computations from a Cryptol model. Assertions in the prestate logic create obligations that must be satisfied to make the call to the function; the function's code is thus entitled to assume they hold. Assertions in the poststate logic create obligations that the function must satisfy. The return value of the function (if any) is specified in the poststate as a special kind of assertion. This is all very abstract, so as an example, consider the following somewhat silly C code: ```c #include unsigned *add(unsigned *x, unsigned *y) { unsigned *ret; ret = malloc(sizeof(*ret)); *ret = *x + *y; return ret; } ``` The function `add` takes two pointers to integers, and returns a pointer to a newly allocated integer that holds the sum. We use unsigned integers to avoid having to worry about whether the addition overflows, which would be undefined for signed integers. We can write the following LLVM specification for it, which has all of the elements discussed above: ```sawscript let add_spec = do { xval <- llvm_fresh_var "xval" (llvm_int 32); yval <- llvm_fresh_var "yval" (llvm_int 32); x <- llvm_alloc (llvm_int 32); y <- llvm_alloc (llvm_int 32); llvm_points_to x (llvm_term xval); llvm_points_to y (llvm_term yval); llvm_execute_func [x, y]; ret <- llvm_alloc (llvm_int 32); llvm_points_to ret (llvm_term {{ xval + yval }}); llvm_return ret; }; ``` The specification is always a SAWScript do-block, which will be executed top-to-bottom when we run the verification. First we create two symbolic variables `xval` and `yval` to hold the values pointed to by the pointers `x` and `y`. These are 32-bit LLVM integers; however, `llvm_fresh_var` creates them at the Cryptol level and the SAWScript `xval` and `yval` variables have type `Term`. Then we allocate memory for the prestate. On entry, we want both pointers to be valid. In this specification we presume they should be disjoint, so we do two separate allocations, one for `x` and one for `y`. This gives us two pointer values. (This will not match later calls to `add` that pass the same pointer as both arguments. If we want to allow that, we need a second spec that allocates only one pointer and therefore also has only one fresh value. Writing this spec by cutting down the one above should be a fairly straightforward exercise.) Then we assert the prestate conditions: `x` points to `xval` and `y` points to `yval`. We need to lift `xval` and `yval` to the LLVM level with `llvm_term` because, as noted above, `llvm_fresh_var` created them at the Cryptol level. Then we call the function and pass `x` and `y` as the arguments. We do not need `llvm_term` here because `x` and `y`, being pointers, are already LLVM-level values. The rest of the spec is poststate logic: first we allocate another pointer for the return value (matching the allocation within the function). Then we assert that it points to, not another fresh variable but the sum of the existing ones, whatever that is. We can write that sum in a Cryptol block that refers to `xval` and `yval` precisely because those are Cryptol-level variables with Cryptol types. We wrap the Cryptol block in `llvm_term` because the Cryptol block is, unsurprisingly, a Cryptol-level value and needs to be lifted into the LLVM world. The final line specifies that the return value of the function is the pointer `ret`. This is a special kind of poststate assertion. (You must provide a return value if the function returns a value. You must not if it doesn't. C and Java functions of return type `void` do not. Rust functions of return type `()` can be construed either way: you can specify that the return value is `()`, but you need not.) If you don't want to specify the return value exactly, you can create a fresh value for it in the poststate logic and return that, optionally restricting its value to a subset of possible values if desired. We again don't need `llvm_term` with `ret` because it's an LLVM-level value. In order to actually verify this function we also need to compile the C code: ```shell clang -g -c -emit-llvm example.c -o example.bc ``` and we need two additional lines along with the specification itself in a SAWScript file: ```sawscript bc <- llvm_load_module "example.bc"; llvm_verify bc "add" [] true add_spec z3; ``` This loads the compiler output, giving us an LLVM module handle `bc`, and them runs the verification by calling `llvm_verify`. (Note that `jvm_verify` and `mir_verify` are essentially the same, except that they work with JVM and MIR code modules and specs.) Running SAW on the resulting file gives the following output: ```console Loading file "example.saw" Verifying add... Simulating add... Checking proof obligations add... Proof succeeded! add ``` The arguments to `llvm_verify` are, first, the LLVM module handle; the name of the function; a list of what we call _override_ specifications (more on that in a moment), a flag that says whether to enable path satisfiability checking (leave this set to true for now), the specification, and a proof script. (overrides-and-compositional-verification)= ### Overrides and Compositional Verification `llvm_verify` returns a SAWScript value of type `LLVMSpec`. This represents a proved/verified specification. Here we ignore it; in a larger development you would likely capture it for later use. This function does not call any other functions. However, imagine a function that calls `add`. When verifying _that_ function, we could let the symbolic execution engine execute directly into `add`, or we could provide the specification for `add` that we just proved and have it apply that instead. Such specifications are called _override_ specifications, because they override the direct execution of the function specified. In the case of `add`, using the proved specification to override later executions probably is no faster than just executing the function directly, and might even be slower. However, for functions that do a lot of work, applying the proved specification will require a lot less computation than, essentially, redoing it every time. The use of override specifications is critical for verifying nontrivial code. This is called _compositional verification_ because it allows composing specifications for different chunks of code. If you want to use a proved specification as an override when calling `llvm_verify` (or `jvm_verify` or `mir_verify`), you can put it in the list that's empty in the example above. It is also possible to write specifications and assume that they're valid instead of prove them, and then use them as overrides. This is useful for library code, where you might not have the source or where verifying the library is a whole additional project. The LLVM function for this is `llvm_unsafe_assume_spec`; it is akin to `llvm_verify` but takes only the LLVM module handle, function name, and spec. (There are corresponding `jvm_unsafe_assume_spec` and `mir_unsafe_assume_spec` functions as well.) (One might instead think to provide a proof script that just admits all goals given to it. However, that will still run the symbolic execution, which still requires most of the specification setup and can still fail, so in practice it isn't useful.) ### Additional Restrictions on Arrays In all three backends the type construction for arrays requires a concrete SAWScript integer. Proof results are thus dependent on the array size being as indicated, and may not hold for other sizes. Note however that it is intended to be possible to pass an array of length _M_ to a function that expects a length _N_, where _M_ > _N_, such that the function will just ignore the extra space. In cases where this is not sufficient, it is also possible to make the specification a function parameterized by the length _N_ and then prove it repeatedly for different sizes. See [the SAWScript programming chapter](#sawscript). ### Summary A specification is a SAWScript do-block divided into two parts by the call to `llvm_execute_func`, `jvm_execute_func`, or `mir_execute_func`, which provides the arguments to the function being verified. The first half of the block contains prestate logic for creating fresh variables, allocating memory, and setting up the allowable start states. The second half of the block contains poststate logic for (potentially) creating more fresh variables, allocating more memory, and specifying the allowable output states. We can now go into each backend's implicit specification language and its driving SAWScript builtins in more detail. Each backend has its own types and values, and its own SAWScript functions. However, there is a good deal of commonality across the backends. ## The LLVM Backend and LLVM Specifications In the LLVM backend, types appear in SAWScript as values of SAWScript type `LLVMType`, and values have the SAWScript type `LLVMValue`. LLVM code modules have the type `LLVMModule`, and the specification do-blocks have the type `LLVMSetup ()`. The function argument declaration that divides the pre- and post- portions of a specification is done with `llvm_execute_func`. (llvm-types)= ### LLVM Types The following SAWScript functions construct LLVM types: - `llvm_int` _`n`_ produces an _`n`_-bit-wide machine integer/bitvector. Like Cryptol, SAW doesn't have separate signed and unsigned bitvector types. The `llvm_int` types correspond to the Cryptol bitvector types `[`_`n`_`]`. Lifting a Cryptol bitvector value to an LLVM value produces a value of this type. - `llvm_float` is the type of (32-bit) single-precision floating point numbers, and `llvm_double` is the type of (64-bit) double-precision floating point numbers. These correspond to the Cryptol floating point types of the proper sizes. However, SAW's support for floating point is rudimentary so these types are currently of little use. - `llvm_array` _`n`_ _`ty`_ produces an array type of element type _`ty`_ that is _`n`_ entries long. This type corresponds to the Cryptol sequence type `[`_`n`_`][`_`cty`_`]` where _`cty`_ is the Cryptol type corresponding to _`ty`_, if it exists. Lifting a Cryptol sequence value to an LLVM value produces a value of this type. (Note that Cryptol bitvectors are also sequences. A Cryptol sequence of `Bit` / `Bool` is a bitvector and lifts to `llvm_int` of some size. A Cryptol sequence of some other type lifts to an `llvm_array`.) - `llvm_struct_type` _`tys`_ produces a struct type whose elements have types _`tys`_. _`tys`_ is a list of `LLVMType`. This corresponds to a Cryptol tuple type containing the Cryptol types corresponding to _`tys`_, if they exist. - `llvm_packed_struct_type` _`tys`_ is like `llvm_struct_type` but produces a packed struct type. (That is, one that has no alignment padding between members.) This also corresponds to a Cryptol tuple type containing the Cryptol types corresponding to _`tys`_, if they exist. - `llvm_pointer` _`ty`_ is the type of a pointer to _`ty`_. This has no Cryptol equivalent. Furthermore, - `llvm_type` takes a string in LLVM's type syntax and produces the corresponding `LLVMType`. For example, `llvm_type` `"i32"` produces the same type as `llvm_int` `32`. - `llvm_alias` takes a string that's the name of a type alias and looks up the `LLVMType` for it. When your LLVM module is the result of compiling C code, all ordinary C integer types map to the `llvm_int` type of the proper size. The C `bool` type may appear as either `llvm_int` `1` or `llvm_int` `8`, depending on context: loose `bool`s are generally `llvm_int` `1`, but members of structs and arrays will usually be `llvm_int` `8`. This is due to a peculiarity in the way Clang compiles `bool` down to LLVM. When in doubt about how a `bool` is represented, check the LLVM bitcode by compiling your code with `clang -S -emit-llvm`. ### LLVM Values The primary way to produce LLVM values of base type is to lift Cryptol values with `llvm_term`. For example, `llvm_term {{ 123 : [32] }}` produces an LLVM value whose type is `llvm_int` `32`. LLVM values of compound type can be constructed with these functions: - `llvm_array_value` _`vals`_ constructs an LLVM array value; _`vals`_ is a list of values to put in the array. - `llvm_struct_value` _`vals`_ constructs an LLVM struct value; _`vals`_ is a list of values to put in the struct, in order. - `llvm_packed_struct_value` _`vals`_ is like `llvm_struct_value` but produces a packed struct. The elements of LLVM values of compound type can be extracted with these functions: - `llvm_elem` _`val`_ _`n`_ extracts element _`n`_ of the LLVM array or struct (or packed struct) _`val`_. However, it only works on pointers: _`val`_ should be a pointer that points at an array or struct, and the returned value is a pointer to the element. There is no way to extract elements from an array or struct value directly. - `llvm_field` _`val`_ _`f`_ extracts the named field _`f`_ of a struct or packed struct _`val`_. Like `llvm_elem`, it only works on pointers. The LLVM module that defines the type must have been compiled with debugging information. Sometimes also the debugging information is insufficient to identify the correct type, in which case one should fall back to `llvm_elem`. - `llvm_union` _`val`_ _`f`_ is like `llvm_field` except that it operates on a union. It too only works on pointers, and the LLVM module that defines the type must likewise have been compiled with debugging information. Note that because the address of a union is the same as the address of all its members, this function only changes the type of the pointer. One can use `llvm_cast_pointer` instead. However, `llvm_union` avoids baking in the identity of the type of the union field, which is both convenient and improves robustness. ### Binding Quantified Variables in LLVM The basic function for this in the LLVM backend is `llvm_fresh_var` _`x_` _`ty`_. _`x`_ provides an internal name for the variable; _`ty`_ names its type. This type is an LLVM type (a value of type `LLVMType`) rather than a Cryptol type. However, it must be an LLVM type that has a corresponding Cryptol type, because the result is a Cryptol-level SAWScript value. (Thus, it has type `Term` and not type `LLVMValue`.) For example: ```sawscript x <- llvm_fresh_var "x" (llvm_int 32); ``` This produces a fresh 32-bit machine integer. There is also an experimental function `llvm_fresh_cryptol_var` that takes a Cryptol type instead. It is reasonable to use the same identifier for the internal name and for the SAWScript variable, and, where applicable, to make it the same name as used for the same thing in the code under verification. The function `llvm_fresh_expanded_val` _`ty`_ creates a fresh value of LLVM-level type (thus, the type argument is an `LLVMType`) and populates it recursively with fresh variables. This can be used for struct and array types. In general for values of compound type it is possible to create fresh primitive values for the elements and create compound values with functions like `llvm_array_value`; it is also possible to use `llvm_fresh_expanded_val` and reason about the contents by using functions like `llvm_elem`. Which is preferable is in most cases purely a matter of convenience. ### Allocation and Pointers in the LLVM Backend LLVM pointer values, other than null and uninitialized ones, cannot be arbitrarily created; one must in general explicitly allocate memory to create a pointer. The pointer can then be passed as an argument to the function under verification, or used to initialize other values. Pointers allocated in the prestate logic are presumed to exist before the function is called; pointers allocated (in the `llvm_alloc` sense) in the poststate logic are expected to be allocated (in the `malloc` runtime sense) by the function. There are two chief exceptions to pointers needing to be allocated: - `llvm_null` is a null pointer. - `llvm_fresh_pointer` creates a pointer that does not point to anything. Accessing it is an error. This provides a useful simplification for functions that handle, but do not dereference, pointers. In the LLVM backend the primary allocation function is `llvm_alloc`. It takes an LLVM-level type (a value of type `LLVMType`), allocates memory to hold a value of that type, and produces an `LLVMValue` representing the pointer. Note that allocating memory and reasoning about its contents are separate steps. In most cases each `llvm_alloc` should be paired with an `llvm_points_to` assertion (see below) that says something about the value found in the allocated memory. (Otherwise the memory is treated as uninitialized, and attempts to read from it will be treated as an error. Sometimes this is what you want; for example, a pointer argument whose purpose is to return a complex value should not be read from before it's written to, and thus leaving its contents uninitialized allows detecting such mistakes.) There are multiple variants of `llvm_alloc`: - `llvm_alloc_aligned` takes an extra integer argument _`n`_ that's a power of 2, and allocates memory that's guaranteed to be aligned to an _`n`_-byte boundary. - `llvm_alloc_readonly` creates a pointer that cannot be used for writing, like `const` pointers in C. The aliasing restrictions for allocations are relaxed for regions allocated with this function: read-only regions are allowed to alias one another. - `llvm_alloc_readonly_aligned` is like both `llvm_alloc_aligned` and `llvm_alloc_readonly` at once. - `llvm_alloc_with_size` takes an extra integer argument, which must be greater than the size of the type, and allocates that much memory. This is useful for dealing with types that are headers for memory blocks. - `llvm_alloc_sym_init` is like `llvm_alloc` but also implicitly fills the allocated memory with fresh byte values. - `llvm_symbolic_alloc` takes three arguments: a boolean, which if true makes the allocation read-only like `llvm_alloc_readonly`, a power-of-two integer, which specifies the alignment like in `llvm_alloc_aligned`, and instead of a type the third argument is a size in bytes. There is also a function `llvm_cast_pointer` that takes a pointer value and an LLVM type, and yields a new pointer value with an updated pointed-to type. This is mostly useful for accessing pointers to unions in cases where `llvm_union` isn't available. Unsafe or undefined accesses constructed via pointer casts will fail. ### LLVM Global Variables The LLVM world includes global (including file-static) variables. Some of these get put in the data segment are mutable, and others get put in read-only data and cannot be changed at runtime. (Typically, globals declared in C with `const` end up in read-only data, but there are some corner cases where that isn't true. If in doubt, check in the LLVM bitcode whether the variable is marked `constant`, or check the section in the output object file.) By default, immutable global variables are always available for use, and mutable globals that a function doesn't touch can be ignored. However, each mutable global variable used by a function under verification must be explicitly enabled in its specification. This requirement arises because using a mutable global variable creates an obligation to reason about its state. Explicitly enabling mutable globals allows SAW to easily check that all variables used are enabled, and all variables enabled have been covered in both the pre-state and post-state assertions. The function `llvm_alloc_global` _`name`_ enables the mutable global _`name`_. It returns nothing (not a pointer). (The name arises because under the covers it does actually allocate: it sets up the data-segment memory for the variable.) To refer to the global, use the function `llvm_global` _`name`_; this returns a pointer to it. If you want the initializer value for a global, you can get it with `llvm_global_initializer` _`name`_. Sometimes this will be the prestate value you want, more often not. For example, consider this code: :::{code-block} c int x = 0; int f(int y) { x = x + 1; return x + y; } int g(int z) { x = x + 2; return x + z; } ::: The following specifications are rejected because they do not allocate the global `x`. (And, even if they did, they provide neither a prestate value nor a poststate value for it.) :::{code-block} sawscript m <- llvm_load_module "./test.bc"; f_spec <- llvm_verify m "f" [] true (do { y <- llvm_fresh_var "y" (llvm_int 32); llvm_execute_func [llvm_term y]; llvm_return (llvm_term {{ 1 + y : [32] }}); }) abc; g_spec <- llvm_verify m "g" [] true (do { z <- llvm_fresh_var "z" (llvm_int 32); llvm_execute_func [llvm_term z]; llvm_return (llvm_term {{ 2 + z : [32] }}); }) abc; ::: If we want to assume that `f` will only be called once and never have `g` written before it, we could write this spec: :::{code-block} sawscript m <- llvm_load_module "./test.bc"; let init_global name = do { llvm_alloc_global name; llvm_points_to (llvm_global name) (llvm_global_initializer name); }; f_spec <- llvm_verify m "f" [] true (do { y <- llvm_fresh_var "y" (llvm_int 32); init_global "x"; llvm_precond {{ y < 2^^31 - 1 }}; llvm_execute_func [llvm_term y]; llvm_return (llvm_term {{ 1 + y : [32] }}); }) abc; ::: which initializes `x` to whatever it is initialized to in the C code at the beginning of verification. (It also includes a constraint on the argument `y` to avoid undefined behavior arising from signed integer overflow.) However, it does not specify an output value for `x` so it cannot be used compositionally. We could also write this more complete spec: :::{code-block} sawscript m <- llvm_load_module "test-signed.bc"; f_spec <- llvm_verify m "f" [] true (do { x <- llvm_fresh_var "x" (llvm_int 32); y <- llvm_fresh_var "y" (llvm_int 32); llvm_alloc_global "x"; llvm_points_to (llvm_global "x") (llvm_term x); llvm_precond {{ x + y < 2^^31 - 1 }}; llvm_execute_func [llvm_term y]; llvm_points_to (llvm_global "x") (llvm_term {{ 1 + x : [32] }}); llvm_return (llvm_term {{ 1 + x + y : [32] }}); }) abc; ::: which allows `x` to have any value on entry and includes a specification of its resulting value afterwards. One can also restrict the starting value of `x` if, for example, the point of having it there is to keep track of how many calls to `f` and `g` there have been and it's not supposed to exceed some limit. Then later uses of these functions will need to require that the limit has not yet been exceeded. There is a fly in this ointment, however, which is that it is possible for immutable globals to depend on mutable globals. This does not work in SAW's model. In particular, an immutable global initialized to the address of a mutable global will not work correctly. :::{code-block} c int thingy_store; int *const thingy = &thingy_store; ::: Here `thingy` is a immutable pointer (what it points to cannot be changed) that points to a mutable global. The internal initialization of `thingy` will fail before the specification can enable it with `llvm_alloc`. To work around this problem, SAW has an experimental global allocation mode. After using `enable_experimental`, you can use one of these three `TopLevel` functions to choose, before calling `llvm_verify`, how LLVM globals are allocated: - `llvm_alloc_constant_globals` specifies the default behavior, namely that immutable globals are automatically allocated and mutable globals are not. This is the safest mode and should be used unless one of the others is needed. - `llvm_alloc_all_globals` specifies that *all* globals are to be automatically allocated, regardless of mutability. This will ensure that examples like the one above can be handled, but at the cost of expecting that the specification provide both pre-state and post-state for _all_ mutable globals. This is not fully checked and unsafe in general. See [Compositional Verification and Mutable Global Variables](#compositional-verification-and-mutable-global-variables) for discussion. - `llvm_alloc_no_globals` disables all automatic initialization of globals. All allocation must be handled explicitly by the `LLVMSetup` block. This is in turn unsafe in the sense that immutable globals can be given arbitrary (and unrealizable) prestate values other than their actual initialization value, which then can lead to apparently successful verifications not grounded in reality. Always use `llvm_global_initializer` to generate the prestate values for immutable globals. For example: :::{code-block} sawscript llvm_points_to (llvm_global "thingy") (llvm_global_initializer "thingy"); ::: Note that the value for a pointer like `thingy` in the above example, but that is _not_ immutable (for example, might be set at program startup to point to one of several possible globals), can be specified with an assertion like the following: :::{code-block} sawscript llvm_points_to (llvm_global "thingy") (llvm_global "thingy_store"); ::: ### Assertions in the LLVM Backend The basic function for LLVM assertions is `llvm_assert`. It takes one argument, which is a `Term` (Cryptol-level value); frequently that term will be a Cryptol block written with double-braces. If in the prestate logic, it generates a precondition; in the poststate logic, it generates a postcondition. The function `llvm_precond` is equivalent, except it is only allowed in the prestate logic. Similarly, `llvm_postcond` is only allowed in the poststate logic. The function `llvm_equal` is comparable to using `llvm_assert` with an equality; it takes two arguments and asserts that they are equal. However, the arguments it takes are LLVM-level values (`LLVMValue`) rather than Cryptol-level values (`Term`), so it can express equalities that would be messy with `llvm_assert`. The use of `llvm_equal` when applicable can also sometimes lead to more efficient symbolic execution. The function `llvm_return` _`val`_ asserts that the value returned by the function is _`val`_. This is only allowed in the poststate logic. The function `llvm_points_to` is the basic way to assert about a pointer. It takes two arguments that are both LLVM-level values; the first is the pointer and the second is the contents. There are some variants of `llvm_points_to`: - `llvm_conditional_points_to` _`cond`_ _`ptr`_ _`val`_ asserts that the pointer _`ptr`_ points to the value _`val`_, but only when _`cond`_ is true. _`cond`_ is a Cryptol-level value of type `Term` that can be symbolic. - `llvm_points_to_at_type` _`ptr`_ _`ty`_ _`val`_ casts the pointer _`ptr`_ to point at _`ty`_ before examining _`val`_. This can be used, for example, to read or write a prefix of an array. It can also be used in conjunction with C code that reinterprets the memory representation of one type in terms of another. One can also use `llvm_cast_pointer` instead. - `llvm_conditional_points_to_at_type` _`cond`_ _`ptr`_ _`ty`_ _`val`_ is both the previous functions rolled together. - `llvm_points_to_untyped` _`ptr`_ _`val`_ and `llvm_conditional_points_to_untyped` _`cond`_ _`ptr`_ _`val`_ disable typechecking of the pointer and value entirely. - `llvm_points_to_array_prefix` _`ptr`_ _`arr`_ _`n`_ indicates that the pointer _`ptr`_ points to the first _`n`_ elements of _`arr`_. This is experimental. ### LLVM Bitfields SAW has experimental support for specifying `struct`s with bitfields. (The material below requires using `enable_experimental`.) Consider the following example: :::{code-block} c struct s { uint8_t x:1; uint8_t y:1; }; ::: Normally, a `struct` with two `uint8_t` fields would have an overall size of two bytes. However, because the `x` and `y` fields are declared with bitfield syntax, they are instead packed together into a single byte. Because bitfields have somewhat unusual memory representations in LLVM, some special care is required to write SAW specifications involving bitfields. For this reason, there is a dedicated `llvm_points_to_bitfield` function for this purpose: - `llvm_points_to_bitfield : LLVMValue -> String -> LLVMValue -> LLVMSetup ()` The type of `llvm_points_to_bitfield` is similar that of `llvm_points_to`, except that it takes the name of a field within a bitfield as an additional argument. For example, here is how to assert that the `y` field in the `struct` example above should be `0`: :::{code-block} sawscript ss <- llvm_alloc (llvm_alias "struct.s"); llvm_points_to_bitfield ss "y" (llvm_term {{ 0 : [1] }}); ::: Note that the type of the right-hand side value (`0`, in this example) must be a bitvector whose length is equal to the size of the field within the bitfield. In this example, the `y` field was declared as `y:1`, so `y`'s value must be of type `[1]`. Note that the following specification is _not_ equivalent to the one above: :::{code-block} sawscript ss <- llvm_alloc (llvm_alias "struct.s"); llvm_points_to (llvm_field ss "y") (llvm_term {{ 0 : [1] }}); ::: `llvm_points_to` works quite differently from `llvm_points_to_bitfield` under the hood, so using `llvm_points_to` on bitfields will almost certainly not work as expected. In order to use `llvm_points_to_bitfield`, one must also use the `enable_lax_loads_and_stores` command: - `enable_lax_loads_and_stores: TopLevel ()` The `enable_lax_loads_and_stores` command relaxes some of SAW's assumptions about uninitialized memory, which is necessary to make `llvm_points_to_bitfield` work under the hood. For example, reading from uninitialized memory normally results in an error in SAW, but with `enable_lax_loads_and_stores`, such a read will instead return a symbolic value. At present, `enable_lax_loads_and_stores` only works with What4-based tactics (e.g., `w4_unint_z3`); using it with SBV-based tactics (e.g., `sbv_unint_z3`) will result in an error. Note that SAW relies on LLVM debug metadata in order to determine which struct fields reside within a bitfield. As a result, you must pass `-g` to `clang` when compiling code involving bitfields in order for SAW to be able to reason about them. ### Loading LLVM Modules To load LLVM code, simply provide the location of a valid bitcode file to the `llvm_load_module` function. - `llvm_load_module : String -> TopLevel LLVMModule` The resulting module handle can be passed to `llvm_verify` and similar functions. The LLVM bitcode parser should generally work with LLVM versions between 3.5 and 21, though it may be incomplete for some versions. Debug metadata has changed somewhat throughout that version range, so is the most likely case of incompleteness. We aim to support every version after 3.5, however, so report any parsing failures as [issues on GitHub](https://github.com/GaloisInc/saw-script/issues). #### Tips on Compiling LLVM For generating LLVM with `clang`, it can be helpful to: - Turn on debugging symbols with `-g` so that SAW can find source locations of functions, names of variables, etc. - Optimize with `-O1` so that the generated bitcode more closely matches the C/C++ source, making the results more comprehensible. - Use `-fno-threadsafe-statics` to prevent `clang` from emitting unnecessary pthread code. - Link all relevant bitcode with `llvm-link` (including, e.g., the C++ standard library when analyzing C++ code). All SAW proofs include side conditions to rule out undefined behavior, and proofs will only succeed if all of these side conditions have been discharged. However the default SAW notion of undefined behavior is with respect to the semantics of LLVM, rather than C or C++. If you want to rule out undefined behavior according to the C or C++ standards, consider compiling your code with `-fsanitize=undefined` or one of the related flags[^1] to `clang`. Generally, you'll also want to use `-fsanitize-trap=undefined`, or one of the related flags, to cause the compiled code to use `llvm.trap` to indicate the presence of undefined behavior. Otherwise, the compiled code will call a separate function, such as `__ubsan_handle_shift_out_of_bounds`, for each type of undefined behavior, and SAW currently does not have built in support for these functions (though you could manually create overrides for them in a verification script). [^1]: #### Working With C++ The distance between C++ code and LLVM is greater than between C and LLVM, so some additional considerations come into play when analyzing C++ code with SAW. The first key issue is that the C++ standard library is large and complex, and tends to be widely used by C++ applications. To analyze most C++ code, it will be necessary to link your code with a version of the `libc++` library[^2] compiled to LLVM bitcode. The `wllvm` program[^3] can be useful for this. The C++ standard library includes a number of key global variables, and any code that touches them will require that they be initialized using `llvm_alloc_global`. Many C++ names are slightly awkward to deal with in SAW. They may be mangled relative to the text that appears in the C++ source code. SAW currently only understands the mangled names. The `llvm-nm` program can be used to show the list of symbols in an LLVM bitcode file, and the `c++filt` program can be used to demangle them, which can help in identifying the symbol you want to refer to. In addition, C++ names from namespaces can sometimes include quote marks in their LLVM encoding. For example: :::{code-block} llvm %"class.quux::Foo" = type { i32, i32 } ::: This can be mentioned in SAW by saying: :::{code-block} sawscript llvm_type "%\"class.quux::Foo\"" ::: Finally, there is no support for calling constructors in specifications, so you will need to construct objects piece-by-piece using, e.g., `llvm_alloc` and `llvm_points_to`. [^2]: [^3]: #### Multiple LLVM Bitcode Files If you have multiple bitcode modules, you can join them together before loading using the `llvm-link` utility. Alternatively, you can load them each independently with a separate `llvm_load_module` command and then combine them with the `llvm_combine_modules` function. `llvm_combine_modules` _`main`_ _`others`_ links the modules in _`others`_ to the module _`main`_ and returns (in `TopLevel`) a new LLVM module handle. It is by no means a full-blown linker implementation and should only be used with groups of modules that were meant to be linked together. The modules should be listed in order from caller to callee, the same as on the command line of a Unix linker. If duplicate names appear, names in earlier modules will shadow names in later modules. ### LLVM Verification Verification of LLVM is done with the `llvm_verify` command. :::{code-block} sawscript llvm_verify : LLVMModule -> String -> [LLVMSpec] -> Bool -> LLVMSetup () -> ProofScript () -> TopLevel LLVMSpec ::: The first two arguments specify the module and function name to verify. The third argument specifies the list of already-verified specifications to use as overrides for compositional verification. The fourth argument specifies whether to do path satisfiability checking, and the fifth gives the specification of the function to be verified. Finally, the last argument gives the proof script to use for verification. The result is a proved specification that can be used to simplify verification of functions that call this one. ## The JVM Backend and JVM Specifications In the JVM backend, types appear as values of SAWSCript type `JavaType` (not `JVMType`, by historical accident), and values have the type `JVMValue`. JVM code modules are class files and thus have the type `JavaClass`. JVM specification do-blocks have the type `JVMSetup ()`. The function argument declaration that divides the pre- and post- portions of a specification is done with `jvm_execute_func`. (jvm-types)= ### JVM Types JVM types appear in SAWScript as values of type `JVMType`. The following SAWScript functions construct JVM types: - `java_bool` is the Java `bool` type. This corresponds to the Cryptol `Bit` or `Bool` type. Lifting a Cryptol `Bit` value with `jvm_term` produces a Java `bool`. - `java_byte` is the Java `byte` type, which is an 8-bit bitvector. This corresponds to the Cryptol `[8]` type. - `java_char` is the Java `char` type, which is a 16-bit bitvector. (Recall that Java is stuck with UTF-16 strings for legacy reasons.) This corresponds to the Cryptol `[16]` type. - `java_short` is the Java `short` type, which is also a 16-bit bitvector. This corresponds to the Cryptol `[16]` type. - `java_int` is the Java `int` type, which is a 32-bit bitvector. This corresponds to the Cryptol `[32]` type. - `java_long` is the Java `long` type, which is a 64-bit bitvector. This corresponds to the Cryptol `[64]` type. - `java_float` and `java_double` are the Java `float` and `double` types, which are 32-bit and 64-bit floating point numbers respectively. These correspond to the Cryptol floating point types of the proper size. However, SAW's support for floating point is rudimentary so these types are currently of little use. - `java_array` _`n`_ _`ty`_ is an array of length _`n`_ and element type _`ty`_. This type corresponds to the Cryptol sequence type `[`_`n`_`][`_`cty`_`]` where _`cty`_ is the Cryptol type corresponding to _`ty`_, assuming there is one. Lifting a Cryptol sequence value to a JVM value produces a value of this type. (Note that Cryptol bitvectors are also sequences. A Cryptol sequence of `Bit` / `Bool` is a bitvector and lifts to a Java integer type if one of the right size exists. Otherwise it fails. A Cryptol sequence of some other type lifts to an `java_array`.) - `java_class` _`cl`_ is the type of the Java class called _`cl`_. This does not correspond to any Cryptol type. It is not possible to generate a Java array of `bool` with `jvm_term` and a Cryptol value; a Cryptol sequence of `Bit` will produce a machine integer type instead. As a workaround one can allocate a fresh array with `jvm_alloc_array` and assert about its elements individually. Allocation is discussed below in the next section. ### JVM Values and Allocation The primary way to produce JVM values of base type is to lift Cryptol values with `jvm_term`. For example, `jvm_term {{ 123 : [32] }}` produces a JVM value whose type is `java_int`. All compound values in Java are instances or arrays, and all class instance or array values are pointers/references. This allows a simpler interface than in the LLVM and MIR backends: instances and arrays are allocated, and then one can assert directly about their fields and elements. There is no need to materialize instance or array values directly. - `jvm_alloc_object` _`cl`_ allocates an object instance of type _`cl`_. - `jvm_alloc_array` _`n`_ _`ty`_ allocates an array of _`n`_ elements of type _`ty`_. These functions return `JVMValue` values whose JVM-level type is `java_class` _`cl`_ and `java_array` _`n`_ _`ty`_ respectively. As in the Java language, these are implicitly pointers/references, and there are no explicit pointers or references needed. Instances or arrays allocated in the prestate logic are presumed to exist before the function is called; those allocated (in the `jvm_alloc_object` sense) in the poststate logic are expected to be allocated (in the `new` runtime sense) by the function. The Java null pointer is called `jvm_null`. ### Binding Quantified JVM Values The basic function in the JVM backend is `jvm_fresh_var` _`x`_ _`ty`_. _`x`_ is a string, which provides an internal name for the variable, and _`ty`_ is the type to generate. This type is a JVM-level type, not a Cryptol type. However, it must be a JVM type that has a corresponding Cryptol type, because the result is a Cryptol-level value of type `Term`. For example: ```sawscript x <- jvm_fresh_var "x" java_int; ``` This produces a fresh 32-bit machine integer. It is reasonable to use the same identifier for the internal name and for the SAWScript variable, and, where applicable, to make it the same name as used for the same thing in the code under verification. ### JVM Static Variables In the JVM backend, static values always exist and can be reasoned about with `jvm_static_field_is`. There are some unresolved questions about initialization for static values; see [issue #916](https://github.com/GaloisInc/saw-script/issues/916). ### JVM Assertions The basic function for JVM assertions is `jvm_assert`. It takes one argument, which is a `Term` (Cryptol-level value); frequently that term will be a Cryptol block written with double-braces. If in the prestate logic, it generates a precondition; in the poststate logic, it generates a postcondition. The function `jvm_precond` is equivalent, except it is only allowed in the prestate logic. Similarly, `jvm_postcond` is only allowed in the poststate logic. The function `jvm_equal` is comparable to using `jvm_assert` with an equality; it takes two arguments and asserts that they are equal. However, the arguments it takes are JVM-level values (`JVMValue`) rather than Cryptol-level values (`Term`), so it can express equalities that would be messy with `jvm_assert`. The use of `jvm_equal` when applicable can also sometimes lead to more efficient symbolic execution. The function `jvm_return` _`val`_ asserts that the value returned by the function is _`val`_. This is only allowed in the poststate logic. The following functions are used for reasoning about the values of class instance and array elements: - `jvm_field_is` _`obj`_ _`f`_ _`val`_ asserts that field _`f`_ of an object value _`obj`_ has value _`val`_. _`val`_ is a JVM-level value. - `jvm_static_field_is` _`f`_ _`val`_ asserts that the static field _`f`_ (which may be qualified with a class name) has value _`val`_. _`val`_ is a JVM-level value. If no class name is included, the class of the method being verified is assumed. - `jvm_elem_is` _`arr`_ _`n`_ _`val`_ asserts that the element _`n`_ of the array _`arr`_ has the value _`val`_. _`val`_ is a JVM-level value. - `jvm_array_is` _`arr`_ _`val`_ asserts that the array _`arr`_ has the contents _`val`_, which is a _Cryptol_-level value of sequence type. There are variant versions of each of these that do not take a value; instead they implicitly create a fresh value. These are convenient when the value produced is not specified or indeterminate/unpredictable. They can also be used to write partial specifications when detailed reasoning about the output value isn't currently necessary. - `jvm_modifies_field` _`obj`_ _`f`_ - `jvm_modifies_static_field` _`f`_ - `jvm_modifies_elem` _`arr`_ _`n`_ - `jvm_modifies_array` _`arr`_ ### Loading JVM Code Loading Java code is more complex than with the other backends, because of the more structured nature of Java packages. When running `saw`, three command-line options control where to look for classes: - The `-b` flag takes the path where the `java` executable lives, which is used to locate the Java standard library classes and add them to the class database. Alternatively, one can put the directory where `java` lives [on the `PATH`](path-definition), which SAW will search if `-b` is not set. - The `-j` flag takes the name of a JAR file as an argument and adds the contents of that file to the class database. - The `-c` flag takes the name of a directory as an argument and adds all class files found in that directory (and its subdirectories) to the class database. By default, the current directory is included in the class path. Most Java programs will only require setting the `-b` flag (or the `PATH`), as that is enough to bring in the standard Java libraries. Note that when searching the `PATH`, SAW makes assumptions about where the standard library classes live. These assumptions are likely to hold on JDK 7 or later, but they may not hold on older JDKs on certain operating systems. If you are using an old version of the JDK and SAW is unable to find a standard Java class, you may need to specify the location of the standard classes' JAR file with the `-j` flag (or, alternatively, with the `SAW_JDK_JAR` environment variable). Once the class path is configured, you can pass the name of a class to the `java_load_class` function. - `java_load_class : String -> TopLevel JavaClass` The resulting `JavaClass` can be passed to `jvm_verify` and related functions. Java class files from any JDK newer than version 6 should work. However, support for JDK 9 and later is experimental. Verifying code that only uses primitive data types is known to work well, but there are some as-of-yet unresolved issues in verifying code involving classes such as `String`. For more information on these issues, refer to [this GitHub issue](https://github.com/GaloisInc/crucible/issues/641). For Java, the only compilation flag that tends to be valuable is `-g` to retain information about the names of function arguments and local variables. ### JVM Verification Verification of JVM byte code is done with the `jvm_verify` command. :::{code-block} sawscript jvm_verify : JavaClass -> String -> [JVMSpec] -> Bool -> JVMSetup () -> ProofScript () -> TopLevel JVMSpec ::: The first two arguments specify the class and method name to verify. The third argument specifies the list of already-verified specifications to use as overrides for compositional verification. The fourth argument specifies whether to do path satisfiability checking, and the fifth gives the specification of the function to be verified. Finally, the last argument gives the proof script to use for verification. The result is a proved specification that can be used to simplify verification of functions that call this one. ## The MIR backend and Rust/MIR Specifications In the MIR backend, types appear as values of SAWScript type `MIRType`. There are also separate handles for Rust algebraic data types, which appear as SAWScript type `MIRAdt`. Values have the type `MIRValue`; code modules have the type `MIRModule`; and specification do-blocks have the type `MIRSetup ()`. The function argument declaration that divides the pre- and post- portions of a specification is done with `mir_execute_func`. (mir-types)= ### MIR (Rust) Types The following SAWScript functions construct MIR types: - `mir_bool` is the Rust `bool` type. This corresponds to the Cryptol `Bit` or `Bool` type. Lifting a Cryptol `Bit` value with `mir_term` produces a Rust `bool`. - `mir_char` is the Rust `char` type. It corresponds to the Cryptol `[32]` type. - `mir_i8` and `mir_u8` are the Rust `i8` and `u8` types respectively. These are 8-bit bitvectors and (both) correspond to the Cryptol `[8]` type. - `mir_i16` and `mir_u16` are the Rust `i16` and `u16` types respectively. These are 16-bit bitvectors and (both) correspond to the Cryptol `[16]` type. - `mir_i32` and `mir_u32` are the Rust `i32` and `u32` types respectively. These are 32-bit bitvectors and (both) correspond to the Cryptol `[32]` type. - `mir_i64` and `mir_u64` are the Rust `i64` and `u64` types respectively. These are 64-bit bitvectors and (both) correspond to the Cryptol `[64]` type. - `mir_i128` and `mir_u128` are the Rust `i128` and `u128` types respectively. These are 128-bit bitvectors and (both) correspond to the Cryptol `[128]` type. - `mir_isize` and `mir_usize` are the Rust `isize` and `usize` types respectively. These are pointer-sized bitvectors and, in SAW, both correspond to the Cryptol `[64]` type. For the time being, SAW treats all Rust code as 64-bit. - `mir_f32` is the type of Rust single-precision floating point values. This corresponds to the Cryptol floating point type of the proper size. However, SAW's support for floating point is rudimentary so this type is of little use currently. - `mir_f64` is the type of Rust double-precision floating point values. This corresponds to the Cryptol floating point type of the proper size. However, SAW's support for floating point is rudimentary so this type is of little use currently. - `mir_tuple` _`tys`_ is the type of a Rust tuple containing the types _`tys`_, in order. This corresponds to the Cryptol tuple type containing the corresponding Cryptol types, if they exist. - `mir_array` _`n`_ _`ty`_ is the type of an array of element type _`ty`_ that is _`n`_ entries long. This type corresponds to the Cryptol sequence type `[`_`n`_`][`_`cty`_`]` where _`cty`_ is the Cryptol type corresponding to _`ty`_, if it exists. - `mir_adt` _`adt`_ is the type of a Rust algebraic data type. The _`adt`_ argument is a value of type `MIRAdt`, which can be looked up in a loaded MIR module by name. The `MIRAdt` type is a layer of indirection that makes ADT operations on non-ADT types ill-typed in SAWScript, which helps avoid various mistakes. Rust ADTs have no corresponding Cryptol type. - `mir_ref` _`ty`_ is the type of immutable references to _`ty`_. It has no corresponding Cryptol type. - `mir_ref_mut` _`ty`_ is the type of mutable references to _`ty`_. It has no corresponding Cryptol type. - `mir_raw_pointer_const` _`ty`_ is the type of immutable raw pointers to _`ty`_. It has no corresponding Cryptol type. - `mir_raw_pointer_mut` _`ty`_ is the type of mutable raw pointers to _`ty`_. It has no corresponding Cryptol type. - `mir_slice` _`ty`_ is the type of a (reference to a) Rust slice of element type _`ty`_. Currently SAW can only handle references to slices: `&[`_`ty`_`]` in Rust terms. This has no corresponding Cryptol type. - `mir_str` is the type of a (reference to a) Rust string slice. Currently SAW can only handle references to strings: `&str` in Rust terms. This has no corresponding Cryptol type. - `mir_vec` _`mod`_ _`ty`_ is the type of a Rust vector of element type _`ty`_: `Vec<`_`ty`_`>` in Rust terms. _`mod`_ is a handle for a MIR code module (see below) that is needed to look up some internal identifiers. Lifting Cryptol bitvector values produces the corresponding sized Rust/MIR integer types; however, whether it is the signed or unsigned version is unspecified. Typechecking at the MIR/Rust level is relaxed so that the signed and unsigned versions are considered equivalent. Also, it is not possible to generate a MIR array of `bool` with `mir_term` and a Cryptol value; a Cryptol sequence of `Bit` will produce a machine integer type instead. As a workaround one can create a fresh array value with `mir_fresh_expanded_value` (described below) and assert about its elements individually. ### MIR Values The primary way to produce MIR values of base type is to lift Cryptol values with `mir_term`. For example, `mir_term {{ 123 : [32] }}` produces a MIR value whose type is `mir_i32`. MIR values of compound type are produced with the following functions: - `mir_enum_value` _`adt`_ _`name`_ _`args`_ produces a MIR value whose type is a Rust/MIR enumerated type / algebraic data type. _`adt`_ is the type; ADTs can be retrieved using `mir_find_adt`. _`name`_ is a string; it is the constructor name for the desired variant. This should be the short form of the name within the type, such as `"None"`; it does not need to be a fully-qualified identifier. _`args`_ is a list of MIR-level values to use as the arguments to the data constructor. Note that `mir_enum_value` can only be used to construct a single specific variant. If you need to construct a symbolic enum value that can range over many potential variants, use `mir_fresh_expanded_value` instead. - `mir_tuple_value` _`elems`_ produces a MIR tuple value. _`elems`_ provides the contents of (and defines the arity of) the tuple. - `mir_array_value` _`ty`_ _`elems`_ produces a MIR array value. _`ty`_ is the element type; _`elems`_ is a list of values to populate the array with. The type must match the element list... unless the element list is empty. - `mir_struct_value` _`adt`_ _`vals`_ produces a MIR struct value. _`adt`_ gives the type (as with `mir_enum_value`, use `mir_find_adt` to look up the type handle) and _`vals`_ gives values for the fields, in order. - `mir_slice_value` _`arr`_ generates a slice value from an array reference. The slice contains the whole array. This is equivalent to `&`_`arr`_`[..]` in Rust. The argument must be an allocated array reference (not an array value) whose Rust type is `&[`_`ty`_; _`n`_`]` or `&mut [`_`ty`_; _`n`_`]`. The resulting slice is immutable (`&[`_`ty`_`]`) or mutable (`&mut [`_`ty`_`]`) depending on the mutability of the base reference. - `mir_slice_range_value` _`arr`_ _`lo`_ _`hi`_ is like `mir_slice_value` but generates a slice that refers to a subrange of the array; i.e. `&`_`arr`_`[`_`lo`_`..`_`hi`_`]`. It includes the elements from index _`lo`_ up to but not including index _`hi`_. It is an error for _`lo`_ to exceed _`hi`_ or for either to index past the end of the underlying array. Also note that the index arguments have the SAWScript type `Int`, that is, they are constants. It is not possible to create a slice with a symbolic length. If this limitation prevents you from using SAW, please file an issue [on GitHub](https://github.com/GaloisInc/saw-script/issues). - `mir_str_slice_value` _`arr`_ creates a string slice (of Rust type `&str`) from an array reference whose element type is `u8`. It is otherwise the same as `mir_slice_value`. The array is expected to be a UTF-8-encoded sequence of bytes. - `mir_str_slice_range_value` _`arr`_ _`lo`_ _`hi`_ has the same relationship to `mir_str_slice_value` that `mir_slice_range_value` has to `mir_slice_value`. The elements of MIR values of compound type can be extracted with these functions: - `mir_elem_value` _`arr`_ _`ix`_ takes an array value and an index, and returns the value in the array at that index. This returns a MIR-level value (type `MIRValue`). - `mir_elem_ref` _`arr`_ _`ix`_ is like `mir_elem_value`, but takes a reference (or a raw pointer) to an array and returns a reference (or, respectively, a raw pointer) to the element in the array. The reference or raw pointer must already point to an array value. (Unlike the analogous `llvm_elem`, `mir_elem_ref` cannot be used to initialize freshly-allocated memory.) - `mir_field_value` _`str`_ _`f`_ takes a struct value _`str`_ and a field name _`f`_, which should be a string, and returns the value of the field. For "tuple structs" with numbered fields, use a string containing an integer constant, such as `"1"`. - `mir_field_ref` _`str`_ _`f`_ is like `mir_field_value`, but takes a reference (or a raw pointer) instead of an immediate struct value and returns a reference (or, respectively, a raw pointer) to the named field. The reference or raw pointer must already point to a struct value. (Unlike the analogous `llvm_elem` or `llvm_field`, `mir_field_ref` cannot be used to initialize freshly-allocated memory.) It is also possible to create an if-then-else expression at the MIR value level using the function `mir_mux_values`. It takes three arguments: the first is a Cryptol-level `Term` whose underlying type should be boolean, and the second and third are MIR-level values to produce if the control value is true and false respectively. The control value may be symbolic. ### Slice Example Consider this Rust function, which accepts an arbitrary slice as an argument: :::{code-block} rust pub fn f(s: &[u32]) -> u32 { s[0] + s[1] } ::: We can write a specification that passes a slice to the array `[1, 2, 3, 4, 5]` as an argument to `f`: :::{code-block} sawscript let f_spec_1 = do { a <- mir_alloc (mir_array 5 mir_u32); mir_points_to a (mir_term {{ [1, 2, 3, 4, 5] : [5][32] }}); mir_execute_func [mir_slice_value a]; mir_return (mir_term {{ 3 : [32] }}); }; ::: Alternatively, we can write a specification that passes a part of this array over the range `[1..3]`, i.e., ranging from second element to the fourth. Because this is a half-open range, the resulting slice has length 2: :::{code-block} sawscript let f_spec_2 = do { a <- mir_alloc (mir_array 5 mir_u32); mir_points_to a (mir_term {{ [1, 2, 3, 4, 5] : [5][32] }}); mir_execute_func [mir_slice_range_value a 1 3]; mir_return (mir_term {{ 5 : [32] }}); }; ::: Note that we are passing _references_ to arrays to `mir_slice_value` and `mir_slice_range_value`. It would be an error to pass a bare array to these functions, so the following specification would be invalid: :::{code-block} sawscript let f_fail_spec = do { let arr = mir_term {{ [1, 2, 3, 4, 5] : [5][32] }}; mir_execute_func [mir_slice_value arr]; // BAD: `arr` is not a reference mir_return (mir_term {{ 3 : [32] }}); }; ::: ### Notes on Strings One unusual requirement about `mir_str_slice_value` and `mir_str_slice_range_value` is that they require the argument to be of type `&[u8; N]`, i.e., a reference to an array of bytes. This is an artifact of the way that strings are encoded in Cryptol. The following Cryptol expressions: - `"A"` - `"123"` - `"Hello World"` have the following types: - `[1][8]` - `[3][8]` - `[11][8]` This is because Cryptol strings are syntactic shorthand for sequences of bytes. The following Cryptol expressions are wholly equivalent to the string literals above: - `[0x41]` - `[0x31, 0x32, 0x33]` - `[0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64]` These represent the strings in the extended ASCII character encoding. The Cryptol sequence type `[N][8]` corresponds to the Rust type `[u8; N]`, so the requirement to have something of type `&[u8; N]` as an argument reflects this design choice. Note that `mir_str_slice_value` and `mir_slice_value` applied to the same `u8` array reference are _not_ the same thing. The two commands represent different types of Rust values. While both commands take the same argument type, `mir_str_slice_value` will return a value of Rust type `&str` (or `&mut str`), whereas `mir_slice_value` will return a value of Rust type `&[u8]` (or `&mut [u8]`). These Rust types are checked when you pass these values as arguments to Rust functions (using `mir_execute_func`) or when you return these values (using `mir_return`), and it is an error to supply a `&str` value in a place where a `&[u8]` value is expected (and vice versa). As an example of how to write specifications involving string slices, consider this Rust function: :::{code-block} rust pub fn my_len(s: &str) -> usize { s.len() } ::: We can use `mir_str_slice_value` to write a specification for `my_len` when it is given the string `"hello"` as an argument: :::{code-block} sawscript let my_len_spec = do { s <- mir_alloc (mir_array 5 mir_u8); mir_points_to s (mir_term {{ "hello" }}); mir_execute_func [mir_str_slice_value s]; mir_return (mir_term {{ 5 : [64] }}); }; ::: Currently, Cryptol only supports characters that can be encoded in a single byte. As a result, it is not currently possible to take slices of strings with certain characters. For example, the string `"roșu"` cannot be used as a Cryptol expression, as the character `'ș'` would require 10 bits to represent instead of 8. The alternative is to use UTF-8 to encode such characters. For instance, the UTF-8 encoding of the string `"roșu"` is `"ro\200\153u"`, where `"\200\153"` is a sequence of two bytes that represents the `'ș'` character. SAW makes no attempt to ensure that string slices over a particular range aligns with UTF-8 character boundaries. For example, the following Rust code would panic: :::{code-block} rust let rosu: &str = "roșu"; let s: &str = &rosu[0..3]; println!("{:?}", s); ::: :::{code-block} console thread 'main' panicked at 'byte index 3 is not a char boundary; it is inside 'ș' (bytes 2..4) of `roșu`' ::: On the other hand, SAW will allow you define a slice of the form `mir_str_slice_range r 0 3`, where `r` is a reference to `"ro\200\153u"`. It is the responsibility of the SAW user to ensure that `mir_str_slice_range` indices align with character boundaries. ### Finding MIR Types Rust supports a greater degree of polymorphism than the languages behind the other backends; in particular Rust types can take parameters. This introduces various complications referring to type names. We collectively refer to MIR `struct`s and `enum`s together as _algebraic data types_, or ADTs for short. The function `mir_find_adt` looks up a handle for an algebraic data type by its name and type arguments for its type parameters. It takes three arguments: the first is a MIR module handle, the second is a string that names the type, and the third is a list of type arguments. The `MIRAdt` value returned can be given to the `mir_adt` function to get a `MIRType`, or passed directly to the `mir_enum_value` and `mir_struct_value` functions. Alternatively, one can use the function `mir_find_mangled_adt` to look up a particular instantiation of a type by its "mangled" name, such as `foo::Bar::_adt12345`. The code at the end corresponds to some particular set of type arguments. However, these codes are not stable (they change unpredictably at the whim of `rustc`) and generally must be discovered by grovelling manually in your `linked-mir.json` file. Thus one should use `mir_find_adt` instead whenever possible. `mir_find_mangled_adt` takes two arguments: a MIR module handle and a string containing the mangled name. For example, consider this code: :::{code-block} rust pub struct S { pub x: A, pub y: B, } pub fn f() -> S { S { x: 1, y: 2, } } pub fn g() -> S { S { x: 3, y: 4, } } ::: At the MIR level, the return types of `f` and `g` are distinct types each with their own names. For example, - `S` might give rise to `example/abcd123::S::_adt456` - `S` might give rise to `example/abcd123::S::_adt789` These can then be looked up with: - `mir_find_adt m "example::S" [mir_u8, mir_u16];` - `mir_find_adt m "example::S" [mir_u32, mir_u64];` or - `mir_find_mangled_adt m "example::S::_adt456";` - `mir_find_mangled_adt m "example::S::_adt789";` but in the second case one must discover the codes `456` and `789`. The lookup functions return a SAWScript value of type `MIRAdt`, which can then be used with `mir_adt` to get a `MIRType`, or with `mir_struct_value` and `mir_enum_value` to construct MIR-level values. #### Const Generics Rust ADTs can have _const generic_ parameters that allow the ADT to be generic over constant values. For instance, the following Rust code declares a const generic parameter `N` on the struct `S`, as well as on the functions `f` and `g` that compute `S` values: :::{code-block} rust pub struct S { pub x: [u32; N] } pub fn f(y: [u32; 1]) -> S<1> { S { x: y } } pub fn g(y: [u32; 2]) -> S<2> { S { x: y } } ::: Like with other forms of Rust generics, instantiating `S` with different constants will give rise to different identifiers in the compiled MIR code. SAW provides a `mir_const` function for specifying the values of constants used to instantiate const generic parameters: - `mir_const : MIRType -> Term -> MIRType` For instance, in order to look up `S<1>`, use `mir_const` in conjunction with `mir_find_adt` like so: :::{code-block} sawscript s_adt = mir_find_adt m "example::S" [mir_const mir_usize {{ 1 : [64] }}] ::: Unlike other forms of `MIRType`s, the type returned by `mir_const` is not a type that you can create values with. For instance, calling `mir_alloc` or `mir_fresh_var` at a type returned by `mir_const` will raise an error. `mir_const` is only useful for looking up ADTs via `mir_find_adt`. At present, `mir_const` only supports looking up constant values with the types listed [here](https://doc.rust-lang.org/1.86.0/reference/items/generics.html#r-items.generics.const.allowed-types) in the Rust Reference. Specifically, the `MIRType` argument must be one of the following, subject to the following restrictions: - A primitive integer type, i.e., `mir_u{8,16,32,64,128,size}` or `mir_i{8,16,32,64,128,size}`. The `Term` argument must be a bitvector of the corresponding size. For instance, if the `MIRType` is `mir_u8`, then the `Term` must be a bitvector of type `[8]`. - `mir_bool`. The `Term` argument must be of type `Bit`. - `mir_char`. The `Term` argument must be of type `[32]`. ### Finding MIR Values and Functions Rust also supports polymorphic functions. The Rust compiler does not generate a single generic implementation for polymorphic functions; instead it generates a separate version for every different type instantiation. Like instantiated polymorphic types, these are distinguished by hash codes, and the codes are not stable and not easy to discover. The function `mir_find_name` looks up a function by its name and a list of types used to instantiate its type parameters. (The first argument is a string, the second is a list of `MIRType`.) It returns the mangled name as a string. This string can then be passed to `mir_verify` and other related functions. ### Lifetime Values Rust ADTs and functions can have _lifetime_ parameters as well as type parameters. The following Rust code declares a lifetime parameter `'a` on the struct `S`, and also on the function `f` that computes an `S` value: :::{code-block} rust pub struct S<'a> { pub x: &'a u32, } pub fn f<'a>(y: &'a u32) -> S<'a> { S { x: y } } ::: When `mir-json` compiles a piece of Rust code that contains lifetime parameters, it will instantiate all of the lifetime parameters with a placeholder MIR type that is simply called `lifetime`. This is important to keep in mind when looking up ADTs with `mir_find_adt`, as you will also need to indicate to SAW that the lifetime parameter is instantiated with `lifetime`. In order to do so, use `mir_lifetime`. For example, here is how to look up `S` with `'a` instantiated to `lifetime`: :::{code-block} sawscript s_adt = mir_find_adt m "example::S" [mir_lifetime] ::: Note that this part of SAW's design is subject to change in the future. Ideally, users would not have to care about lifetimes at all at the MIR level; see [this issue](https://github.com/GaloisInc/mir-json/issues/58) for further discussion on this point. If that issue is fixed, then we will likely remove `mir_lifetime`, as it will no longer be necessary. ### Binding Quantified MIR Variables The basic function for this in the MIR backend is `mir_fresh_var` _`x`_ _`ty`_. _`x`_ provides an internal name for the variable; _`ty`_ gives its type. This type is a MIR type (a value of type `MIRType`) rather than a Cryptol type. However, it must be an MIR type that has a corresponding Cryptol type, because the result is a Cryptol-level SAWScript value. (Thus, it has type `Term` and not type `MIRValue`.) For example: ```sawscript x <- mir_fresh_var "x" mir_i32; ``` This produces a fresh 32-bit machine integer. There is also an experimental function `mir_fresh_cryptol_var` that takes a Cryptol type instead. It is reasonable to use the same identifier for the internal name and for the SAWScript variable, and, where applicable, to make it the same name as used for the same thing in the code under verification. The function `mir_fresh_expanded_value` _`name`_ _`ty`_ creates a fresh value of MIR-level type (thus, the type argument is an `MIRType`) and populates it recursively with fresh variables. The _`name`_ argument is used as a prefix for the names of the generated fresh variables. This can be used for struct and array types. However, types that are or contain references or raw pointers are not currently supported. In general, to reason about values of compound type, it is possible to create fresh primitive values for the elements and create compound values with functions like `mir_array_value`. It is also possible to use `mir_fresh_expanded_value` and reason about the contents by using functions like `mir_elem_value`. Which is preferable is in most cases purely a matter of convenience. ### Allocation, References, and Pointers in the MIR Backend MIR references and raw pointer values cannot be arbitrarily created; one must in general explicitly allocate memory to create a reference or pointer. The reference or pointer so created can then be passed as a function argument or used to initialize other values. Pointers allocated in the prestate logic are presumed to exist before the function is called; pointers allocated (in the `mir_alloc` sense) in the poststate logic are expected to be allocated (in the runtime sense) by the function. The main allocator functions in the MIR backend are: - `mir_alloc` _`ty`_ allocates an immutable reference. - `mir_alloc_mut` _`ty`_ allocates a mutable reference. Because Rust (and therefore MIR) distinguishes immutable references from mutable references at the type level, it's important to use the correct allocation function. The allocation cannot be used to allocate slices (including string slices). Slices are shared references to other allocations; they can be constructed directly using `mir_slice_value` and friends after allocating the underlying reference. There are also four functions for allocating raw pointers: - `mir_alloc_raw_ptr_const` _`ty`_ allocates a constant raw pointer to _`ty`_: `*const` _`ty`_ in Rust terms. - `mir_alloc_raw_ptr_mut` _`ty`_ allocates a mutable raw pointer to _`ty`_: `*mut` _`ty`_ in Rust terms. - `mir_alloc_raw_ptr_const_multi` _`n`_ _`ty`_ allocates a constant raw pointer to a sequence/array of _`n`_ _`ty`_s. - `mir_alloc_raw_ptr_mut_multi` _`n`_ _`ty`_ allocates a mutable raw pointer to a sequence/array of _`n`_ _`ty`_s. In low-level (and unsafe) Rust code, it is possible to get a raw pointer into an allocation of multiple values and do pointer arithmetic, much like C arrays. These pointers must be generated using the last two functions. A pointer thus generated can be matched against a larger allocation; a spec whose prestate allocates a multi-raw-pointer of length _`n`_ can be used as an override spec in a context where the pointer is actually longer, and a spec whose postate allocates a multi-raw-pointer of length _`n`_ will verify even if the function allocates more space than that. That is, these functions declare an allocation that contains at least _`n`_ values, and the actual allocation in the code can be larger. Note that allocating memory and reasoning about its contents are separate steps. In most cases each `mir_alloc` should be paired with an `mir_points_to` assertion (see below) that says something about the value found in the allocated memory. (Otherwise the memory is treated as uninitialized, and attempts to read from it will be treated as an error. This is sometimes necessary to model unsafe Rust code.) The following convenience functions are also available. - `mir_ref_of` _`val`_ and `mir_ref_of_mut` _`val`_ are convenience functions that combine `mir_alloc` (or `mir_alloc_mut`) with the `mir_points_to` assertion. - `mir_vec_of` _`name`_ _`ty`_ _`vals`_ is an experimental convenience function that creates a value of type `Vec<`_`ty`_`>`. [`Vec`](https://doc.rust-lang.org/std/vec/struct.Vec.html) is a commonly used data type in the Rust standard library. The _`vals`_ argument is a `MIRValue` that's a MIR array (not a list of `MIRValue`s) used to populate the `Vec`. This can be created with `mir_array_value` or lifted from a Cryptol sequence value. The _`name`_ argument is used as a prefix for the names of the internal symbolic variables inside the `Vec`. (Effectively, it is the name of the symbolic `Vec` being built.) `Vec` is just a regular struct and not a special language construct, so technically no special SAW support is needed. However, you would need to explicitly specify all the internal details and invariants of `Vec`, and that can get quite messy. This function may change in the future; there are unresolved internal design questions. There is a function `mir_cast_raw_ptr` that takes a raw pointer value and a MIR type, and yields a new pointer value with an updated pointed-to type. This is an unsafe operation needed for handling certain kinds of unsafe code. The pointer cannot actually be accessed unless it is cast back to its original type. (It is thus different from `llvm_cast_pointer`, which allows reinterpreting memory.) (mir-static-items)= ### MIR Static Variables Rust static values (global variables) can be referenced with the function `mir_static` _`name`_, which returns a reference to the static called _`name`_. (The name follows the same qualification rules as other MIR names.) This reference is mutable if and only if the static is mutable. It can be reasoned about in the prestate and poststate logic like any other reference. The initialization value for a static can be fetched with `mir_static_initializer` _`name`_. Sometimes this will be the prestate value you want. Explicit allocation like `llvm_alloc_global` is not required, even for mutable statics. Here is an example involving static items: :::{code-block} rust // statics.rs static S1: u8 = 1; static mut S2: u8 = 2; pub fn f() -> u8 { // Reading a mutable static item requires an `unsafe` block due to // aliasing and concurrency-related concerns. We have no aliasing, // and are only concerned about the behavior of this program in a // single-threaded context. let s2 = unsafe { S2 }; S1 + s2 } ::: We can write a specification for `f` like so: :::{code-block} sawscript // statics.saw let f_spec = do { mir_points_to (mir_static "statics::S2") (mir_static_initializer "statics::S2"); // Note that we do not need to initialize S1, as immutable static // items are implicitly initialized in every specification. mir_execute_func []; mir_return (mir_term {{ 3 : [8] }}); }; m <- mir_load_module "statics.linked-mir.json"; mir_verify m "statics::f" [] false f_spec z3; ::: In order to use a specification involving mutable static items for compositional verification, it is required to specify the value of all mutable static items using the `mir_points_to` command in the specification's postconditions. For further explanation, see the [Compositional Verification and Mutable Global Variables](#compositional-verification-and-mutable-global-variables) section. ### Assertions in the MIR Backend The basic function for MIR assertions is `mir_assert`. It takes one argument, which is a `Term` (Cryptol-level value); frequently that term will be a Cryptol block written with double-braces. If in the prestate logic, it generates a precondition; in the poststate logic, it generates a postcondition. The function `mir_precond` is equivalent, except it is only allowed in the prestate logic. Similarly, `mir_postcond` is only allowed in the poststate logic. The function `mir_equal` is comparable to using `mir_assert` with an equality; it takes two arguments and asserts that they are equal. However, the arguments it takes are MIR-level values (`MIRValue`) rather than Cryptol-level values (`Term`), so it can express equalities that would be messy with `mir_assert`. The use of `mir_equal` when applicable can also sometimes lead to more efficient symbolic execution. The function `mir_return` _`val`_ asserts that the value returned by the function is _`val`_. This is only allowed in the poststate logic. The function `mir_points_to` is the basic way to assert about a pointer. It takes two arguments that are both MIR-level values; the first is the reference (or raw pointer) and the second is the contents. There is a variant of `mir_points_to` for raw pointers that point at sequences of values: `mir_points_to_multi`. The first argument must be a raw pointer; the second should be a `MIRValue` of array type. The pointer must have been allocated with `mir_alloc_raw_ptr_const_multi` or `mir_alloc_raw_ptr_mut_multi` and should contain space for at least as many values as the argument array holds. The array is copied sequentially; it is not necessarily the case that the pointed-to values are the contents of a Rust array. Using a `MIRValue` allows lifting a Cryptol sequence with `mir_term`, which is often convenient. (loading-mir)= ### Loading MIR Modules To load a piece of Rust code, first compile it to a MIR JSON file, as described immediately below, and then provide the location of the JSON file to the `mir_load_module` function: - `mir_load_module : String -> TopLevel MIRModule` SAW currently supports Rust code that can be built with a [September 14, 2025 Rust nightly](https://static.rust-lang.org/dist/2025-09-14/). If you encounter a Rust feature that SAW does not support, please report it [on GitHub](https://github.com/GaloisInc/saw-script/issues). (compiling-mir)= #### Compiling MIR When verifying Rust code, SAW processes Rust's MIR (mid-level intermediate representation) language. In particular, SAW analyzes a particular form of MIR that our [`mir-json`](https://github.com/GaloisInc/mir-json) tool produces. You will need to install `mir-json` and run it on Rust code in order to produce MIR JSON files that SAW can load. You will also need to use `mir-json` to build custom versions of the Rust standard libraries that are more suited to verification purposes. If you are working from a checkout of the `saw-script` repo, you can install the `mir-json` tool and the custom Rust standard libraries by performing the following steps: 1. Make sure you have the proper versions of the [`crucible`](https://github.com/GaloisInc/crucible) and `mir-json` submodules like so: :::{code-block} console $ git submodule update deps/crucible deps/mir-json ::: 2. Navigate to the `mir-json` submodule: :::{code-block} console $ cd deps/mir-json ::: 3. Follow the instructions laid out in the [`mir-json` installation instructions](https://github.com/GaloisInc/mir-json#installation-instructions) in order to install `mir-json`. 4. Run the `mir-json-translate-libs` script in the `mir-json` submodule: :::{code-block} console $ mir-json-translate-libs ::: This will compile the custom versions of the Rust standard libraries using `mir-json`, placing the results under the `rlibs` subdirectory. 5. Finally, define a `SAW_RUST_LIBRARY_PATH` environment variable that points to the newly created `rlibs` subdirectory: :::{code-block} console $ export SAW_RUST_LIBRARY_PATH=<...>/mir-json/rlibs ::: For `cargo`-based projects, `mir-json` provides a `cargo` subcommand called `cargo saw-build` that builds a JSON file suitable for use with SAW. `cargo saw-build` integrates directly with `cargo`, so you can pass flags to it like any other `cargo` subcommand. For example: :::{code-block} console # Make sure that SAW_RUST_LIBRARY_PATH is defined, as described above $ cargo saw-build linking 11 mir files into <...>/example-364cf2df365c7055.linked-mir.json ::: Note that: - The full output of `cargo saw-build` here is omitted. The important part is the `.linked-mir.json` file that appears after `linking X mir files into`, as that is the JSON file that must be loaded with SAW. - `SAW_RUST_LIBRARY_PATH` should point to the MIR JSON files for the Rust standard library. `mir-json` also supports compiling individual `.rs` files through `mir-json`'s `saw-rustc` command. As the name suggests, it accepts all of the flags that `rustc` accepts. For example: :::{code-block} console # Make sure that SAW_RUST_LIBRARY_PATH is defined, as described above $ saw-rustc example.rs linking 11 mir files into <...>/example.linked-mir.json ::: ### MIR Verification Verification of Rust code in the MIR backend is done with the `mir_verify` command. :::{code-block} sawscript mir_verify : MIRModule -> String -> [MIRSpec] -> Bool -> MIRSetup () -> ProofScript () -> TopLevel MIRSpec ::: The first two arguments specify the module and function name to verify. The third argument specifies the list of already-verified specifications to use as overrides for compositional verification. The fourth argument specifies whether to do path satisfiability checking, and the fifth gives the specification of the function to be verified. Finally, the last argument gives the proof script to use for verification. The result is a proved specification that can be used to simplify verification of functions that call this one. The `String` supplied as an argument to `mir_verify` is expected to be a function _identifier_. This should have one of the following forms: - `/::` - `::` Where: - `` is the name of the crate in which the function is defined. (If you produced your MIR JSON file by compiling a single `.rs` file with `saw-rustc`, then the crate name is the same as the name of the file, but without the `.rs` file extension.) - `` is a hash of the crate and its dependencies. In extreme cases, it is possible for two different crates to have identical crate names, in which case the disambiguator must be used to distinguish between the two crates. In the common case, however, most crate names will correspond to exactly one disambiguator, and you are allowed to leave out the `/` part of the `String` in this case. If you supply an identifier with an ambiguous crate name and omit the disambiguator, then SAW will raise an error. - `` is the path to the function within the crate. Sometimes, this is as simple as the function name itself. In other cases, a function path may involve multiple _segments_, depending on the module hierarchy for the program being verified. For instance, a `read` function located in `core/src/ptr/mod.rs` will have the identifier: :::{code-block} text core::ptr::read ::: where `core` is the crate name and `ptr::read` is the function path, which has two segments `ptr` and `read`. There are also some special forms of segments that appear for functions defined in certain language constructs. For instance, if a function is defined in an `impl` block, then it will have `{impl}` as one of its segments, e.g., :::{code-block} text core::ptr::const_ptr::{impl}::offset ::: If you are in doubt about what the full identifier for a given function is, consult the MIR JSON file for your program. The Rust compiler generates a separate instance for each use of a polymorphic function at a different type. These instances have a compiler-generated name, so (as discussed above) the easiest way to refer to them is by using the command `mir_find_name : MIRModule -> String -> [MIRType] -> String`. Given a Rust module, the name of a polymorphic function, and a list of types, `mir_find_name` will return the name of the corresponding instantiation. It fails if the polymorphic function or the given instantiation are not found. (compositional-verification)= ## Compositional Verification As mentioned briefly above, SAW's specification scheme for code allows for compositional reasoning. That is, when proving properties of a given method or function, we can make use of properties we have already proved about its callees rather than analyzing them anew. This enables us to reason about much larger and more complex systems than otherwise possible. The `llvm_verify`, `jvm_verify`, and `mir_verify` functions return values of type `LLVMSpec`, `JVMSpec`, and `MIRSpec`, respectively. These values are opaque objects that internally contain both the information provided in the associated `LLVMSetup`, `JVMSetup`, or `MIRSetup` specification do-blocks, respectively, and the results of the verification process. Any of these `Spec` objects can be passed in via the third argument of the `..._verify` functions. For any function or method specified by one of these parameters, the simulator will not follow calls to the associated target. Instead, it will perform the following steps: - Check that all `llvm_points_to` and `llvm_precond` statements (or the corresponding JVM or MIR statements) in the specification are satisfied. - Update the simulator state and optionally construct a return value as described in the specification. More concretely, suppose we have an `add` function (simpler than the one earlier in the chapter, which was supposed to also illustrate memory allocation) and a doubling function written in terms of it: :::{code-block} c unsigned add(unsigned x, unsigned y) { return x + y; } unsigned dbl(unsigned *x) { return add(x, x); } ::: We can write a specification for `add`: :::{code-block} sawscript let add_spec = do { x <- llvm_fresh_var "x" (llvm_int 32); y <- llvm_fresh_var "y" (llvm_int 32); llvm_execute_func [llvm_term x, llvm_term y]; llvm_return (llvm_term {{ x + y }}); }; ::: And one for `dbl`, which is very similar to the one for `add`: :::{code-block} sawscript let dbl_spec = do { x <- llvm_fresh_var "x" (llvm_int 32); llvm_execute_func [llvm_term x]; llvm_return (llvm_term {{ x + x }}); }; ::: We can verify `add` and then verify `dbl` using what we proved about `add`: :::{code-block} sawscript add_theorem <- llvm_verify bc "add" [] true add_spec z3; dbl_theorem <- llvm_verify bc "dbl" [add_theorem] true dbl_spec z3; ::: When `dbl` calls `add`, `llvm_verify` uses the specification we proved instead of executing through it again. In this case, it doesn't save computational effort, since `add` is so simple. However, it illustrates the approach. Imagine we were doing something less trivial and `add` took ten minutes to verify. The compositional verification of `dbl` would, instead of taking another ten minutes to verify, be almost instantaneous. This example also illustrates a limitation of the technique: it only allows reusing the _proof effort_. The specification for `dbl` still needs to be a complete characterization of its behavior, and you will notice it does not share any logic with the specification for `add`. Reusing _specification effort_ by sharing elements of the specification is a different problem. For logic more complicated than addition, one can factor out elements of the behavioral specification into a shared Cryptol model. (Sharing elements of the call setup and theorem statement requires SAWScript programming, which is an advanced topic. See [The SAWScript Language](#sawscript).) #### Compositional Verification and Mutable Allocations A common pitfall when using compositional verification is to reuse a specification that underspecifies the value of a mutable allocation. In general, doing so can lead to unsound verification, so SAW goes to great length to check for this. Here is an example of this pitfall in an LLVM verification. Given this C code: :::{code-block} c void side_effect(uint32_t *a) { *a = 0; } uint32_t foo(uint32_t x) { uint32_t b = x; side_effect(&b); return b; } ::: And the following SAW specifications: :::{code-block} sawscript let side_effect_spec = do { a_ptr <- llvm_alloc (llvm_int 32); a_val <- llvm_fresh_var "a_val" (llvm_int 32); llvm_points_to a_ptr (llvm_term a_val); llvm_execute_func [a_ptr]; }; let foo_spec = do { x <- llvm_fresh_var "x" (llvm_int 32); llvm_execute_func [llvm_term x]; llvm_return (llvm_term x); }; ::: Should SAW be able to verify the `foo` function against `foo_spec` using compositional verification? That is, should the following be expected to work? :::{code-block} sawscript side_effect_ov <- llvm_verify m "side_effect" [] false side_effect_spec z3; llvm_verify m "foo" [side_effect_ov] false foo_spec z3; ::: A literal reading of `side_effect_spec` would suggest that the `side_effect` function allocates `a_ptr` but then does nothing with it, implying that `foo` returns its argument unchanged. This is incorrect, however, as the `side_effect` function actually changes its argument to point to `0`, so the `foo` function ought to return `0` as a result. SAW should not verify `foo` against `foo_spec`, and indeed it does not. The problem is that `side_effect_spec` underspecifies the value of `a_ptr` in its postconditions, which can lead to the potential unsoundness seen above when `side_effect_spec` is used in compositional verification. To prevent this source of unsoundness, SAW will _invalidate_ the underlying memory of any mutable pointers (i.e., those declared with `llvm_alloc`, not `llvm_alloc_global`) allocated in the preconditions of compositional override that do not have a corresponding `llvm_points_to` statement in the postconditions. Attempting to read from invalidated memory constitutes an error, as can be seen in this portion of the error message when attempting to verify `foo` against `foo_spec`: :::{code-block} console invalidate (state of memory allocated in precondition (at side.saw:3:12) not described in postcondition) ::: To fix this particular issue, add an `llvm_points_to` statement to `side_effect_spec`: :::{code-block} sawscript let side_effect_spec = do { a_ptr <- llvm_alloc (llvm_int 32); a_val <- llvm_fresh_var "a_val" (llvm_int 32); llvm_points_to a_ptr (llvm_term a_val); llvm_execute_func [a_ptr]; // This is new llvm_points_to a_ptr (llvm_term {{ 0 : [32] }}); }; ::: After making this change, SAW will reject `foo_spec` for a different reason, as it claims that `foo` returns its argument unchanged when it actually returns `0`. Note that invalidating memory itself does not constitute an error, so if the `foo` function never read the value of `b` after calling `side_effect(&b)`, then there would be no issue. It is only when a function attempts to _read_ from invalidated memory that an error is thrown. In general, it can be difficult to predict when a function will or will not read from invalidated memory, however. For this reason, it is recommended to always specify the values of mutable allocations in the postconditions of your specs, as it can avoid pitfalls like the one above. The same pitfalls apply to compositional MIR verification, with a couple of key differences. In MIR verification, mutable references are allocated using `mir_alloc_mut`. Here is a Rust version of the pitfall program above: :::{code-block} rust pub fn side_effect(a: &mut u32) { *a = 0; } pub fn foo(x: u32) -> u32 { let mut b: u32 = x; side_effect(&mut b); b } ::: :::{code-block} sawscript let side_effect_spec = do { a_ref <- mir_alloc_mut mir_u32; a_val <- mir_fresh_var "a_val" mir_u32; mir_points_to a_ref (mir_term a_val); mir_execute_func [a_ref]; }; let foo_spec = do { x <- mir_fresh_var "x" mir_u32; mir_execute_func [mir_term x]; mir_return (mir_term {{ x }}); }; ::: Just like above, if you attempted to prove `foo` against `foo_spec` using compositional verification: :::{code-block} sawscript side_effect_ov <- mir_verify m "test::side_effect" [] false side_effect_spec z3; mir_verify m "test::foo" [side_effect_ov] false foo_spec z3; ::: Then SAW would throw an error, as `side_effect_spec` underspecifies the value of `a_ref` in its postconditions. `side_effect_spec` can similarly be repaired by adding a `mir_points_to` statement involving `a_ref` in `side_effect_spec`'s postconditions. MIR verification differs slightly from LLVM verification in how it catches underspecified mutable allocations when using compositional overrides. The LLVM memory model achieves this by invalidating the underlying memory in underspecified allocations. The MIR memory model, on the other hand, does not have a direct counterpart to memory invalidation. As a result, any MIR overrides must specify the values of all mutable allocations in their postconditions, _even if the function that calls the override never uses the allocations_. To illustrate this point more finely, suppose that the `foo` function had instead been defined like this: :::{code-block} rust pub fn foo(x: u32) -> u32 { let mut b: u32 = x; side_effect(&mut b); 42 } ::: Here, it does not particularly matter what effects the `side_effect` function has on its argument, as `foo` will now return `42` regardless. Still, if you attempt to prove `foo` by using `side_effect` as a compositional override, then it is strictly required that you specify the value of `side_effect`'s argument in its postconditions, even though the answer that `foo` returns is unaffected by this. This is in contrast with LLVM verification, where one could get away without specifying `side_effect`'s argument in this example, as the invalidated memory in `b` would never be read. (compositional-verification-and-mutable-global-variables)= #### Compositional Verification and Mutable Global Variables Just like with local mutable allocations (see the previous section), specifications used in compositional overrides must specify the values of mutable global variables in their postconditions. To illustrate this using LLVM verification, here is a variant of the C program from the previous example that uses a mutable global variable `a`: :::{code-block} c uint32_t a = 42; void side_effect(void) { a = 0; } uint32_t foo(void) { side_effect(); return a; } ::: If we attempted to verify `foo` against this `foo_spec` specification using compositional verification: :::{code-block} sawscript let side_effect_spec = do { llvm_alloc_global "a"; llvm_points_to (llvm_global "a") (llvm_global_initializer "a"); llvm_execute_func []; }; let foo_spec = do { llvm_alloc_global "a"; llvm_points_to (llvm_global "a") (llvm_global_initializer "a"); llvm_execute_func []; llvm_return (llvm_global_initializer "a"); }; side_effect_ov <- llvm_verify m "side_effect" [] false side_effect_spec z3; llvm_verify m "foo" [side_effect_ov] false foo_spec z3; ::: Then SAW would reject it, as `side_effect_spec` does not specify what `a`'s value should be in its postconditions. Just as with local mutable allocations, SAW will invalidate the underlying memory in `a`, and subsequently reading from `a` in the `foo` function will throw an error. The solution is to add an `llvm_points_to` statement in the postconditions that declares that `a`'s value is set to `0`. The same concerns apply to MIR verification, where mutable global variables are referred to as `static mut` items. (See the [MIR static items](#mir-static-items) section for more information). Here is a Rust version of the program above: :::{code-block} rust static mut A: u32 = 42; pub fn side_effect() { unsafe { A = 0; } } pub fn foo() -> u32 { side_effect(); unsafe { A } } ::: :::{code-block} sawscript let side_effect_spec = do { mir_points_to (mir_static "test::A") (mir_static_initializer "test::A"); mir_execute_func []; }; let foo_spec = do { mir_points_to (mir_static "test::A") (mir_static_initializer "test::A"); mir_execute_func []; mir_return (mir_static_initializer "test::A"); }; side_effect_ov <- mir_verify m "side_effect" [] false side_effect_spec z3; mir_verify m "foo" [side_effect_ov] false foo_spec z3; ::: Just as above, we can repair this by adding a `mir_points_to` statement in `side_effect_spec`'s postconditions that specifies that `A` is set to `0`. Recall from the previous section that MIR verification is stricter than LLVM verification when it comes to specifying mutable allocations in the postconditions of compositional overrides. This is especially true for mutable static items. In MIR verification, any compositional overrides must specify the values of all mutable static items in the entire program in their postconditions, _even if the function that calls the override never uses the static items_. For example, if the `foo` function were instead defined like this: :::{code-block} rust pub fn foo() -> u32 { side_effect(); 42 } ::: Then it is still required for `side_effect_spec` to specify what `A`'s value will be in its postconditions, despite the fact that this has no effect on the value that `foo` will return. ## Using Ghost State In some cases, information relevant to verification is not directly present in the concrete state of the program being verified. This can happen for at least two reasons: - When providing specifications for external functions, for which source code is not present. The external code may read and write global state that is not directly accessible from the code being verified. - When the abstract specification of the program naturally uses a different representation for some data than the concrete implementation in the code being verified does. One solution to these problems is the use of _ghost_ state. This can be thought of as additional global state that is visible only to the verifier. Ghost state with a given name can be declared at the top level with the following function: - `declare_ghost_state : String -> TopLevel Ghost` Ghost state variables do not initially have any particular type, and can store data of any type. Given an existing ghost variable the following functions can be used to specify its value: - `llvm_ghost_value : Ghost -> Term -> LLVMSetup ()` - `jvm_ghost_value : Ghost -> Term -> JVMSetup ()` - `mir_ghost_value : Ghost -> Term -> MIRSetup ()` These can be used in either prestate or poststate, to specify the value of ghost state either before or after the execution of the function, respectively. ## Extracting Code In many simple cases (such as the `max` function to return the largest of two values), the relevant inputs and outputs are immediately apparent. The function takes two integer arguments, always uses both of them, and returns a single integer value, making no other changes to the program state. In cases like this, a direct translation to SAWCore is possible, given only an identification of which code to execute. Three functions exist to handle such simple code, one for each Crucible backend: - `llvm_extract : LLVMModule -> String -> TopLevel Term` - `jvm_extract : JavaClass -> String -> TopLevel Term` - `mir_extract : MIRModule -> String -> TopLevel Term` The structure of these extraction functions is essentially identical. The first argument describes where to look for code (in an LLVM module, Java class, or MIR module, loaded as described above). The second argument is the name of the method or function to extract. When the extraction functions complete, they return a `Term` corresponding to the value returned by the function or method as a function of its arguments. These functions currently work only for code that has specific argument and result types: - For `llvm_extract`, the extracted function must take some fixed number of integral parameters and return an integral result. - For `jvm_extract`, the extracted function's argument and result types must be scalar types (i.e., not classes or arrays). - For `mir_extract`, the extracted function's argument and result types must be a primitive integer type (e.g., `u8` or `i8`), a `bool`, a `char`, an array, or a tuple. Although it is disallowed to extract functions that use pointers, classes, or references in the extracted function's type signature, the implementation of the extracted function is allowed to allocate memory during execution. Also note the following requirements for interacting with global variables: - For `llvm_extract`, the extracted function is allowed to read from immutable global variables during execution, but it is not allowed to read or write from mutable global variables during execution. - For `jvm_extract`, the extracted function is allowed to read from or write to any class field or static field (regardless of mutability) during execution. The class and static fields will be given their initial values during extraction (unless they are overwritten during execution). - For `mir_extract`, the extracted function is allowed to read from immutable static items during execution, and it is allowed to write to mutable static items during execution. The extracted function is not allowed to read from a mutable static item during execution unless the function has written another value to the static item earlier during execution. ## A Heap-Based Example To tie all of the command descriptions from the previous sections together, consider the case of verifying the correctness of a C program that computes the dot product of two vectors, where the length and value of each vector are encapsulated together in a `struct`. The dot product can be concisely specified in Cryptol as follows: :::{code-block} cryptol dotprod : {n, a} (fin n, fin a) => [n][a] -> [n][a] -> [a] dotprod xs ys = sum (zip (*) xs ys) ::: To implement this in C, let's first consider the type of vectors: :::{code-block} c typedef struct { uint32_t *elts; uint32_t size; } vec_t; ::: This struct contains a pointer to an array of 32-bit elements, and a 32-bit value indicating how many elements that array has. We can compute the dot product of two of these vectors with the following C code (which uses the size of the shorter vector if they differ in size). :::{code-block} c uint32_t dotprod_struct(vec_t *x, vec_t *y) { uint32_t size = MIN(x->size, y->size); uint32_t res = 0; for(size_t i = 0; i < size; i++) { res += x->elts[i] * y->elts[i]; } return res; } ::: The entirety of this implementation can be found in the `examples/llvm/dotprod_struct.c` file in the `saw-script` repository. To verify this program in SAW, it will be convenient to define a couple of utility functions (which are generally useful for many heap-manipulating programs). First, combining allocation and initialization to a specific value can make many scripts more concise: :::{code-block} sawscript let alloc_init ty v = do { p <- llvm_alloc ty; llvm_points_to p v; return p; }; ::: This creates a pointer `p` pointing to enough space to store type `ty`, and then indicates that the pointer points to value `v` (which should be of that same type). A common case for allocation and initialization together is when the initial value should be entirely symbolic. :::{code-block} sawscript let ptr_to_fresh n ty = do { x <- llvm_fresh_var n ty; p <- alloc_init ty (llvm_term x); return (x, p); }; ::: This function returns the pointer just allocated along with the fresh symbolic value it points to. Given these two utility functions, the `dotprod_struct` function can be specified as follows: :::{code-block} sawscript let dotprod_spec n = do { let nt = llvm_term {{ `n : [32] }}; (xs, xsp) <- ptr_to_fresh "xs" (llvm_array n (llvm_int 32)); (ys, ysp) <- ptr_to_fresh "ys" (llvm_array n (llvm_int 32)); let xval = llvm_struct_value [ xsp, nt ]; let yval = llvm_struct_value [ ysp, nt ]; xp <- alloc_init (llvm_alias "struct.vec_t") xval; yp <- alloc_init (llvm_alias "struct.vec_t") yval; llvm_execute_func [xp, yp]; llvm_return (llvm_term {{ dotprod xs ys }}); }; ::: Any instantiation of this specification is for a specific vector length `n`, and assumes that both input vectors have that length. That length `n` automatically becomes a type variable in the subsequent Cryptol expressions, and the backtick operator is used to reify that type as a bit vector of length 32. The entire script can be found in the `dotprod_struct-crucible.saw` file alongside `dotprod_struct.c`. Running this script results in the following: :::{code-block} console Loading file "dotprod_struct.saw" Proof succeeded! dotprod_struct Registering override for `dotprod_struct` variant `dotprod_struct` Symbolic simulation completed with side conditions. Proof succeeded! dotprod_wrap ::: ## An Extended Example To tie together many of the concepts in this manual, we now present a non-trivial verification task in its entirety. All of the code for this example can be found in the `examples/salsa20` directory of [the SAWScript repository](https://github.com/GaloisInc/saw-script). ### Salsa20 Overview Salsa20 is a stream cipher developed in 2005 by Daniel J. Bernstein, built on a pseudorandom function utilizing add-rotate-XOR (ARX) operations on 32-bit words[^4]. Bernstein himself has provided several public domain implementations of the cipher, optimized for common machine architectures. For the mathematically inclined, his specification for the cipher can be found [here](http://cr.yp.to/snuffle/spec.pdf). The repository referenced above contains three implementations of the Salsa20 cipher: A reference Cryptol implementation (which we take as correct in this example), and two C implementations, one of which is from Bernstein himself. For this example, we focus on the second of these C implementations, which more closely matches the Cryptol implementation. Full verification of Bernstein's implementation is available in `examples/salsa20/djb`, for the interested. The code for this verification task can be found in the files named according to the pattern `examples/salsa20/(s|S)alsa20.*`. ### Specifications We now take on the actual verification task. This will be done in two stages: We first define some useful utility functions for constructing common patterns in the specifications for this type of program (i.e. one where the arguments to functions are modified in-place.) We then demonstrate how one might construct a specification for each of the functions in the Salsa20 implementation described above. #### Utility Functions We first define the function `alloc_init : LLVMType -> Term -> LLVMSetup LLVMValue`. `alloc_init ty v` returns a `LLVMValue` consisting of a pointer to memory allocated and initialized to a value `v` of type `ty`. `alloc_init_readonly` does the same, except the memory allocated cannot be written to. :::{code-block} sawscript import "Salsa20.cry"; let alloc_init ty v = do { p <- llvm_alloc ty; llvm_points_to p (llvm_term v); return p; }; let alloc_init_readonly ty v = do { p <- llvm_alloc_readonly ty; llvm_points_to p (llvm_term v); return p; }; ::: We now define `ptr_to_fresh : String -> LLVMType -> LLVMSetup (Term, LLVMValue)`. `ptr_to_fresh n ty` returns a pair `(x, p)` consisting of a fresh symbolic variable `x` of type `ty` and a pointer `p` to it. `n` specifies the name that SAW should use when printing `x`. `ptr_to_fresh_readonly` does the same, but returns a pointer to space that cannot be written to. :::{code-block} sawscript let ptr_to_fresh n ty = do { x <- llvm_fresh_var n ty; p <- alloc_init ty x; return (x, p); }; let ptr_to_fresh_readonly n ty = do { x <- llvm_fresh_var n ty; p <- alloc_init_readonly ty x; return (x, p); }; ::: Finally, we define `oneptr_update_func : String -> LLVMType -> Term -> LLVMSetup ()`. `oneptr_update_func n ty f` specifies the behavior of a function that takes a single pointer (with a printable name given by `n`) to memory containing a value of type `ty` and mutates the contents of that memory. The specification asserts that the contents of this memory after execution are equal to the value given by the application of `f` to the value in that memory before execution. :::{code-block} sawscript let oneptr_update_func n ty f = do { (x, p) <- ptr_to_fresh n ty; llvm_execute_func [p]; llvm_points_to p (llvm_term {{ f x }}); }; ::: #### The `quarterround` operation The C function we wish to verify has type `void s20_quarterround(uint32_t *y0, uint32_t *y1, uint32_t *y2, uint32_t *y3)`. The function's specification generates four symbolic variables and pointers to them in the precondition/setup stage. The pointers are passed to the function during symbolic execution via `llvm_execute_func`. Finally, in the postcondition/return stage, the expected values are computed using the trusted Cryptol implementation and it is asserted that the pointers do in fact point to these expected values. :::{code-block} sawscript let quarterround_setup : LLVMSetup () = do { (y0, p0) <- ptr_to_fresh "y0" (llvm_int 32); (y1, p1) <- ptr_to_fresh "y1" (llvm_int 32); (y2, p2) <- ptr_to_fresh "y2" (llvm_int 32); (y3, p3) <- ptr_to_fresh "y3" (llvm_int 32); llvm_execute_func [p0, p1, p2, p3]; let zs = {{ quarterround [y0,y1,y2,y3] }}; // from Salsa20.cry llvm_points_to p0 (llvm_term {{ zs@0 }}); llvm_points_to p1 (llvm_term {{ zs@1 }}); llvm_points_to p2 (llvm_term {{ zs@2 }}); llvm_points_to p3 (llvm_term {{ zs@3 }}); }; ::: #### Simple Updating Functions The following functions can all have their specifications given by the utility function `oneptr_update_func` implemented above, so there isn't much to say about them. :::{code-block} sawscript let rowround_setup = oneptr_update_func "y" (llvm_array 16 (llvm_int 32)) {{ rowround }}; let columnround_setup = oneptr_update_func "x" (llvm_array 16 (llvm_int 32)) {{ columnround }}; let doubleround_setup = oneptr_update_func "x" (llvm_array 16 (llvm_int 32)) {{ doubleround }}; let salsa20_setup = oneptr_update_func "seq" (llvm_array 64 (llvm_int 8)) {{ Salsa20 }}; ::: #### 32-Bit Key Expansion The next function of substantial behavior that we wish to verify has the following prototype: :::{code-block} c void s20_expand32( uint8_t *k , uint8_t n[static 16] , uint8_t keystream[static 64] ) ::: This function's specification follows a similar pattern to that of `s20_quarterround`, though for extra assurance we can make sure that the function does not write to the memory pointed to by `k` or `n` using the utility `ptr_to_fresh_readonly`, as this function should only modify `keystream`. Besides this, we see the call to the trusted Cryptol implementation specialized to `a=2`, which does 32-bit key expansion (since the Cryptol implementation can also specialize to `a=1` for 16-bit keys). This specification can easily be changed to work with 16-bit keys. :::{code-block} sawscript let salsa20_expansion_32 = do { (k, pk) <- ptr_to_fresh_readonly "k" (llvm_array 32 (llvm_int 8)); (n, pn) <- ptr_to_fresh_readonly "n" (llvm_array 16 (llvm_int 8)); pks <- llvm_alloc (llvm_array 64 (llvm_int 8)); llvm_execute_func [pk, pn, pks]; let rks = {{ Salsa20_expansion`{a=2}(k, n) }}; llvm_points_to pks (llvm_term rks); }; ::: #### 32-bit Key Encryption Finally, we write a specification for the encryption function itself, which has type :::{code-block} c enum s20_status_t s20_crypt32( uint8_t *key , uint8_t nonce[static 8] , uint32_t si , uint8_t *buf , uint32_t buflen ) ::: As before, we can ensure this function does not modify the memory pointed to by `key` or `nonce`. We take `si`, the stream index, to be 0. The specification is parameterized on a number `n`, which corresponds to `buflen`. Finally, to deal with the fact that this function returns a status code, we simply specify that we expect a success (status code 0) as the return value in the postcondition stage of the specification. :::{code-block} sawscript let s20_encrypt32 n = do { (key, pkey) <- ptr_to_fresh_readonly "key" (llvm_array 32 (llvm_int 8)); (v, pv) <- ptr_to_fresh_readonly "nonce" (llvm_array 8 (llvm_int 8)); (m, pm) <- ptr_to_fresh "buf" (llvm_array n (llvm_int 8)); llvm_execute_func [ pkey , pv , llvm_term {{ 0 : [32] }} , pm , llvm_term {{ `n : [32] }} ]; llvm_points_to pm (llvm_term {{ Salsa20_encrypt (key, v, m) }}); llvm_return (llvm_term {{ 0 : [32] }}); }; ::: ### Verifying Everything Finally, we can verify all of the functions. Notice the use of compositional verification and that path satisfiability checking is enabled for those functions with loops not bounded by explicit constants. Notice that we prove the top-level function for several sizes; this is due to the limitation that SAW can only operate on finite programs (while Salsa20 can operate on any input size.) :::{code-block} sawscript let main : TopLevel () = do { m <- llvm_load_module "salsa20.bc"; qr <- llvm_verify m "s20_quarterround" [] false quarterround_setup abc; rr <- llvm_verify m "s20_rowround" [qr] false rowround_setup abc; cr <- llvm_verify m "s20_columnround" [qr] false columnround_setup abc; dr <- llvm_verify m "s20_doubleround" [cr,rr] false doubleround_setup abc; s20 <- llvm_verify m "s20_hash" [dr] false salsa20_setup abc; s20e32 <- llvm_verify m "s20_expand32" [s20] true salsa20_expansion_32 abc; s20encrypt_63 <- llvm_verify m "s20_crypt32" [s20e32] true (s20_encrypt32 63) abc; s20encrypt_64 <- llvm_verify m "s20_crypt32" [s20e32] true (s20_encrypt32 64) abc; s20encrypt_65 <- llvm_verify m "s20_crypt32" [s20e32] true (s20_encrypt32 65) abc; print "Done!"; }; ::: [^4]: ## Verifying Cryptol FFI functions SAW has special support for verifying the correctness of Cryptol's [`foreign` functions](https://galoisinc.github.io/cryptol/master/FFI.html), implemented in a language such as C which compiles to LLVM, provided that there exists a [reference Cryptol implementation](https://galoisinc.github.io/cryptol/master/FFI.html#cryptol-implementation-of-foreign-functions) of the function as well. Since the way in which `foreign` functions are called is precisely specified by the Cryptol FFI, SAW is able to generate a `LLVMSetup ()` spec directly from the type of a Cryptol `foreign` function. This is done with the `llvm_ffi_setup` command, which is experimental and requires `enable_experimental;` to be run beforehand. - `llvm_ffi_setup : Term -> LLVMSetup ()` For instance, for the simple imported Cryptol foreign function `foreign add : [32] -> [32] -> [32]` we can obtain a `LLVMSetup` spec simply by writing :::{code-block} sawscript let add_setup = llvm_ffi_setup {{ add }}; ::: which behind the scenes expands to something like :::{code-block} sawscript let add_setup = do { in0 <- llvm_fresh_var "in0" (llvm_int 32); in1 <- llvm_fresh_var "in1" (llvm_int 32); llvm_execute_func [llvm_term in0, llvm_term in1]; llvm_return (llvm_term {{ add in0 in1 }}); }; ::: ### Polymorphism In general, Cryptol `foreign` functions can be polymorphic, with type parameters of kind `#` (number types), representing e.g. the sizes of input sequences. However, any individual `LLVMSetup ()` spec only specifies the behavior of the LLVM function on inputs of concrete sizes. We handle this by allowing the argument term of `llvm_ffi_setup` to contain any necessary type arguments in addition to the Cryptol function name, so that the resulting term is monomorphic. The user can then define a parameterized specification simply as a SAWScript function in the usual way. For example, for a function `foreign f : {n, m} (fin n, fin m) => [n][32] -> [m][32]`, we can obtain a parameterized `LLVMSetup` spec by :::{code-block} sawscript let f_setup (n : Int) (m : Int) = llvm_ffi_setup {{ f`{n, m} }}; ::: Note that the `Term` parameter that `llvm_ffi_setup` takes is restricted syntactically to the format described above (``{{ fun`{tyArg0, tyArg1, ..., tyArgN} }}``), and cannot be any arbitrary term. ### Supported types `llvm_ffi_setup` supports all Cryptol types that are supported by the Cryptol FFI, with the exception of `Integer`, `Rational`, `Z`, and `Float`. `Integer`, `Rational`, and `Z` are not supported since they are translated to `gmp` arbitrary-precision types which are hard for SAW to handle without additional overrides. There is no fundamental obstacle to supporting `Float`, and in fact `llvm_ffi_setup` itself does work with Cryptol floating point types, but the underlying functions such as `llvm_fresh_var` do not, so until that is implemented `llvm_ffi_setup` can generate a spec involving floating point types but it cannot actually be run. Note also that for the time being only the `c` calling convention is supported. Support for the recently-added `abstract` calling convention has not been written yet. See [issue #2546](https://github.com/GaloisInc/saw-script/issues/2546). ### Performing the verification The resulting `LLVMSetup ()` spec can be used with the existing `llvm_verify` function to perform the actual verification. And the `LLVMSpec` output from that can be used as an override as usual for further compositional verification. :::{code-block} sawscript f_ov <- llvm_verify mod "f" [] true (f_setup 3 5) z3; ::: As with the Cryptol FFI itself, SAW does not manage the compilation of the C source implementations of `foreign` functions to LLVM bitcode. For the verification to be meaningful, is expected that the LLVM module passed to `llvm_verify` matches the compiled dynamic library actually used with the Cryptol interpreter. Alternatively, on x86_64 Linux, SAW can perform verification directly on the `.so` ELF file with the experimental `llvm_verify_x86` command. ## Multi-Step Proofs and Loop Invariants with Cuts The LLVM backend (and so far, only the LLVM backend) has the ability to divide a function under verification into two (or more) parts and apply assertions at the middle point as well as the beginning and end. This is called a "cut", after the similar operation in proof derivations, and we call the point of division a "cutpoint". This feature is highly experimental (not to mention complicated and confusing) and should be approached with caution. At the code level, the cutpoint is a call to an empty placeholder function with a magic name. This does require modifying the code, although not in a way that has any material effect. When verifying, SAW recognizes the magic name when loading the LLVM module and transforms the code by moving the rest of the code in the real function to the placeholder function and replacing it with a call to the placeholder. Then one may write SAW specifications for the original function name (representing the whole original function) and for the placeholder function (representing only the second part of it). The magic names SAW recognizes and applies this feature to are any function names beginning with `__cutpoint__`. A simple example of a two-stage function, adapted from SAW's test suite (see `intTests/test_cutpoint`): :::{code-block} c #include extern size_t __cutpoint__add2(size_t *); size_t add2(size_t x) { ++x; __cutpoint__add2(&x); ++x; return x; } ::: This function adds two to its argument in separate steps, and we'll verify each half separately. Note that the function `__cutpoint_add2` does not actually exist in this example, and does not need to for verification. It only exists to mark the cut. (If you wanted to run this code, you'd provide an empty implementation that does nothing.) The corresponding SAW specifications are, first for `add2`: :::{code-block} sawscript let add2_spec = do { x <- llvm_fresh_var "x" (llvm_int 64); llvm_execute_func [llvm_term x]; llvm_return (llvm_term {{ x + 2 }}); }; ::: And for `__cutpoint__add2`: :::{code-block} sawscript let cutpoint_add2_spec = do { p <- llvm_alloc (llvm_int 64); x <- llvm_fresh_var "x" (llvm_int 64); llvm_points_to p (llvm_term x); llvm_execute_func [p]; llvm_return (llvm_term {{ x + 1 }}); }; ::: There are several things to note here. First, the spec for `add2` spans the whole function: the return value it specifies is the overall return value, not the value of `x` at the cutpoint. There is no direct way to address the value of `x` at the cutpoint in the spec for the original function. The spec for the cutpoint function receives that value of `x`, and can write assertions about it, but there is in turn no easy way to get at the original value of `x` as of entry to `add2`. (The only way to do so is to change the code around to preserve it in another variable, which is almost always undesirable.) The verification will check that that the value of `x` as of the cutpoint satisfies the precondition given in the cutpoint function spec. However, for complex code these limitations may make writing that precondition difficult or impossible. (We might extend or alter the behavior in the future to make it possible. As noted above the feature is experimental; it has some sharp edges, of which this is one.) Second, the spec for `cutpoint_add2` always covers the entire remainder of the function. If you use two cutpoints to divide a function into three parts, the first spec spans the whole function, the second spans the second and third part, and the third spans only the third part. There is (currently) no way to flip this around so the first spec spans just the first part of the function and the last spans the whole function. Third, the value of x is passed to the cutpoint through a pointer. This is necessary to make the propagation of the values through the transformed code work properly. It is also necessary to pass the values of all live variables, via pointers, to the cutpoint function, in the order they appear in the LLVM stack frame. Determining which variables these are and what order they appear in may require disassembling the LLVM bitcode (with `llvm-dis` or SAW's `:llvmdis` REPL command) and inspecting it. As of this writing, if any of this is not correct, SAW crashes deep inside the logic that performs the code transformation, and, alas, with a fairly mysterious message about being unable to find register values. Now, to verify this function you do the following, assuming that `m` is the LLVM module: :::{code-block} sawscript cp <- llvm_verify m "__cutpoint__add2#add2" [] true cutpoint_add2_spec z3; llvm_verify m "add2" [cp] true add2_spec z3; ::: That is: first you verify the cutpoint function against its spec, then you use it as an override when verifying the original function. Note the extra bit in the name in the first `llvm_verify` call: you must provide the name of the parent function after a `#`. SAW needs this to find the transformed code. ### Applying Cuts Under Conditionals If you place a cutpoint in a block that is conditionally executed (e.g. inside an if whose control is not constant) the cutpoint function still continues for the entire remainder of the function. Thus, the spec for the cutpoint function must include the behavior of all the code that occurs after it, not just the contents of the conditional block. However, it does not need to reason about the other branch of the conditional at all. Similarly, if you place cutpoints in both sides of a conditional, both cutpoint specs must cover the entire remainder of the function. However, in some situations this can allow being more specific about what happens than is easily done without the cutpoints. This can sometimes be helpful for handling functions with diamond-shaped control flow. You cannot, unfortunately, splice the execution back together after the conditionals have ended. Note that while the spec for the whole function must still give some overall return value, that characterizes what happens after both sides of the conditional, it need not necessarily be fully specific about it. ### Using Cut with Loops While dividing a large or complicated sequential function into multiple parts can be useful in its own right, the real purpose of the cut feature is to allow reasoning about loops by providing loop invariants. When you place a cutpoint in the condition of a loop, the cutpoint function becomes the rest of the original function... which is to say, it executes the loop body and then comes back to the loop condition. In particular, the loop is converted into recursion and the recursive call becomes another call to the cutpoint function. This makes the precondition of the cutpoint function a loop invariant. It also means you can escape SAW's restriction about loops being concretely bounded. The trick is to first assume the spec for the cutpoint function, then use that as an override when proving it. The proof will check that the loop body correctly assumes and then restores the loop invariant, and the override using the assumed version will let SAW conclude that the loop invariant remains true after the recursive call without having to execute through it again. However, beware: this technique only provides partial correctness. It does not allow reasoning about the termination of the loop. It allows you to prove that _if_ the loop terminates, the original function returns according to the spec. If the loop doesn't terminate, it doesn't terminate. Here is an example, also adapted from the test suite: :::{code-block} c #include extern size_t __cutpoint__inv(size_t*, size_t*, size_t*) __attribute__((noduplicate)); size_t count_n(size_t n) { size_t c = 0; for (size_t i = 0; __cutpoint__inv(&n, &c, &i), i < n; ++i) { ++c; } return c; } ::: This rather naively counts up to `n`. SAW can't handle this function by default, or at least not in general: you can prove it works for any given value of `n`, but proving a general theorem about it for all `n` requires it to try the loop all `n` times. (Because `n` is a bitvector, that's at least not infinitely many times; however, because it's a 64-bit bitvector, it is too many to check in any effective amount of time.) Instead we can write this spec for the cutpoint: :::{code-block} sawscript let ptr_to_fresh n ty = do { p <- llvm_alloc ty; x <- llvm_fresh_var n ty; llvm_points_to p (llvm_term x); return (p, x); }; let inv_spec = do { (pn, n) <- ptr_to_fresh "n" (llvm_int 64); (pc, c) <- ptr_to_fresh "c" (llvm_int 64); (pi, i) <- ptr_to_fresh "i" (llvm_int 64); llvm_precond {{ 0 <= i /\ i <= n }}; llvm_execute_func [pn, pc, pi]; llvm_return (llvm_term {{ c + (n - i) }}); }; ::: This is about what one would expect: it takes all three of the live variables as pointer arguments, and the loop invariant is the loop condition extended to the case where the loop terminates. The return value seems slightly unexpected; surely the return value should be just `n`? If you try that, it doesn't work. But that's because the loop invariant isn't strong enough; it doesn't say anything about the relationship of `c` and `i`. What's here is true even if they are out of sync: when the loop terminates (if it terminates), `i` is equal to `n` and we return `c`, whatever it is, which is what the C code does. One can also write this spec: :::{code-block} sawscript let inv_spec = do { (pn, n) <- ptr_to_fresh "n" (llvm_int 64); (pc, c) <- ptr_to_fresh "c" (llvm_int 64); (pi, i) <- ptr_to_fresh "i" (llvm_int 64); llvm_precond {{ 0 <= i /\ i <= n /\ c == i }}; llvm_execute_func [pn, pc, pi]; llvm_return (llvm_term {{ n }}); }; ::: and this will also verify. However, in either case this spec for the original function will still verify successfully: :::{code-block} sawscript let count_n_spec = do { n <- llvm_fresh_var "n" (llvm_int 64); llvm_execute_func [llvm_term n]; llvm_return (llvm_term n); }; ::: Verifying this function only reaches the loop when `c` and `n` are equal, and that's sufficient for SAW to show that, because `i` is equal to `n` when the loop terminates (if it terminates), `c` is also equal to `n`. Choosing the correct loop invariant is a sometimes subtle art. (It is not in any way SAW-specific and much has been written about it elsewhere.) Note however that as with the sequential usage, the whole rest of the function after the loop belongs to the cutpoint function; if there is code after the loop, the cutpoint spec needs to account for it. This is how you actually run the verification: :::{code-block} sawscript inv <- llvm_unsafe_assume_spec m "__cutpoint__inv#count_n" inv_spec; llvm_verify m "__cutpoint__inv#count_n" [inv] false inv_spec abc; llvm_verify m "count_n" [inv] false count_n_spec abc; ::: As described above, first you assume the spec for the cutpoint function, then you verify it using the assumed version as an override for the recursive call. (This is a proof by induction: the code leading up to the loop is the base case that establishes the inductive hypothesis, which is the loop invariant; then the loop body is the inductive case. Then for any finite number of iterations the loop invariant holds, and so it holds when the loop terminates... if it terminates.) Notice that the cutpoint function is labeled `noduplicate`; this prevents LLVM from doing optimizations that might break SAW's ability to do the cut transform.