About mir-json
We are interested in verifying code written in Rust, but Rust is an extremely rich programming language that has many distinct language features. To make the process of verifying Rust simpler, SAW targets an intermediate language used in the Rust compiler called MIR (short for “Mid-level IR”). MIR takes the variety of different features and syntactic extensions in Rust and boils them down to a minimal subset that is easier for a computer program to analyze.
The process of extracting MIR code that is suitable for SAW’s needs is somewhat
tricky, so we wrote a suite of tools called mir-json to automate this
process. mir-json provides a plugin for tools like rustc and cargo that
lets you compile Rust code as you normally would and produce an additional
.json file as output. This .json file contains the MIR code that was
produced internally by the Rust compiler, with some additional minor tweaks to
make it suitable for SAW’s needs.
mir-json is not a single tool but rather a suite of related tools that
leverage the same underlying plugin. For SAW purposes, the two mir-json tools
that are most relevant are:
saw-rustc: A thin wrapper aroundrustc(the Rust compiler), which is suitable for individual.rsfiles.cargo-saw-build: A thin wrapper around thecargo buildcommand, which is suitable forcargo-based Rust projects.
Most of the examples in this tutorial involve self-contained examples, which
will use saw-rustc. Later in the tutorial, we will examine a Salsa20 case
study that involves a cargo-based project, which will use cargo-saw-build.
A first example with saw-rustc
Let’s try out saw-rustc on a small example file, which we’ll name
first-example.rs:
pub fn id_u8(x: u8) -> u8 {
x
}
This is the identity function, but specialized to the type u8. We can compile
this with saw-rustc like so:
$ saw-rustc first-example.rs
<snip>
note: Emitting MIR for first_example/abef32c5::id_u8
linking 1 mir files into first-example.linked-mir.json
<snip>
saw-rustc prints out some additional information that rustc alone does not
print, and we have displayed the parts of this information that are most
interesting. In particular:
saw-rustcnotes that is isEmitting MIR for first_example/abef32c5::id_u8, wherefirst_example/abef32c5::id_u8is the full identifier used to uniquely refer to theid_u8function. It’s entirely possible that theabef32c5bit will look different on your machine; we’ll talk more about identifiers in the “Identifiers” section.Once
saw-rustcproduced a MIR JSON file namedfirst-example.linked-mir.json. This is an important bit of information, as SAW will ingest this JSON file.
If you’d like, you can inspect the first-example.linked-mir.json file with
JSON tools (e.g., jq), but it is not
important to understand everything that is going on there. This is
machine-generated JSON, and as such, it is not meant to be particularly
readable by human eyes.
The SAW_RUST_LIBRARY_PATH environment variable
Rust has a large set of standard libraries that ship with the compiler, and
parts of the standard library are quite low-level and tricky. SAW’s primary
objective is to provide a tool that can analyze code in a tractable way. For
this reason, SAW sometimes needs to invoke simplified versions of Rust standard
library functions that are more reasonable for an SMT-based tool like SAW to
handle. These simplified functions are equivalent in behavior, but avoid using
problematic code patterns (e.g., gnarly pointer arithmetic or the
transmute
function).
If you are only ever compiling self-contained pieces of code with saw-rustc,
there is a good chance that you can get away without needing to use SAW’s
custom version of the Rust standard libraries. However, if you ever need to
build something more complicated than that (e.g., the Salsa20 case study later
in this tutorial), then you will need the custom libraries. For this reason,
it is worthwhile to teach SAW the location of the custom libraries now.
At present, the best way to obtain the custom version of the Rust standard libraries is to perform the following steps:
Clone the
mir-jsonrepo like so:$ git clone https://github.com/GaloisInc/mir-json $ cd mir-json
Follow the instructions laid out in the
mir-jsoninstallation instructions in order to installmir-json.Run the
mir-json-translate-libsutility:$ mir-json-translate-libsThis will compile the custom versions of the Rust standard libraries using
mir-json, placing the results under therlibssubdirectory.Finally, define a
SAW_RUST_LIBRARY_PATHenvironment variable that points to the newly createdrlibssubdirectory:$ export SAW_RUST_LIBRARY_PATH=<...>/mir-json/rlibs
An upcoming release of SAW will include these custom libraries pre-built, which
will greatly simplify the steps above. Either way, you will need to set the
SAW_RUST_LIBRARY_PATH environment variable to point to the location of the
custom libraries.
A note about generics
The id_u8 function above is likely not how most Rust programmers would define
the identity function. Instead, it would seem more natural to define it
generically, that is, by parameterizing the function by a type parameter:
pub fn id<A>(x: A) -> A {
x
}
If you compile this with saw-rustc, however, the resulting JSON file will
lack a definition for id! We can see this by using jq:
$ saw-rustc generics-take-1.rs
<snip>
$ jq . generics-take-1.linked-mir.json
{
"fns": [],
"adts": [],
"statics": [],
"vtables": [],
"traits": [],
"intrinsics": [],
"tys": [],
"roots": []
}
What is going on here? This is the result of an important design choice that SAW makes: SAW only supports verifying monomorphic functions. To be more precise, SAW’s approach to symbolic simulation requires all of the code being simulated to have fixed types without any type parameters.
In order to verify a function using generics in your Rust code, you must provide a separate, monomorphic function that calls into the generic function. For example, you can rewrite the example above like so:
pub fn id<A>(x: A) -> A {
x
}
pub fn id_u8(x: u8) -> u8 {
id(x)
}
If you compile this version with saw-rustc, you’ll see:
$ saw-rustc generics-take-2.rs
<snip>
note: Emitting MIR for generics_take_2/8b1bf337::id_u8
note: Emitting MIR for generics_take_2/8b1bf337::id::_instaddce72e1232152c[0]
linking 1 mir files into generics-take-2.linked-mir.json
<snip>
This time, the resulting JSON file contains a definition for id_u8. The
reason that this works is because when id_u8 calls id, the Rust compile
will generate a specialized version of id where A is instantiated with the
type u8. This specialized version of id is named
id::_instaddce72e1232152c[0] in the output above. (You don’t have to remember
this name, thankfully!)
Although the resulting JSON file contains a definition for id_u8, it does
not contain a definition for the generic id function. As a result, SAW will
only be able to verify the id_u8 function from this file. If you are ever in
doubt about which functions are accessible for verification with SAW, you can
check this with jq like so:
$ jq '.intrinsics | map(.name)' generics-take-2.linked-mir.json
[
"generics_take_2/8b1bf337::id_u8",
"generics_take_2/8b1bf337::id::_instaddce72e1232152c[0]"
]
Here, “intrinsics” are monomorphic functions that are visible to SAW. Note that
saw-rustc will optimize away all definitions that are not accessible from one
of these intrinsic functions. This explains why the original program that only
defined a generic id function resulted in a definition-less JSON file, as
that program did not contain monomorphic functions (and therefore no
intrinsics).
Generally speaking, we prefer to verify functions that are defined directly in
the Rust source code, such as id_u8, as these functions’ names are more
stable than the specialized versions of functions that the compiler generates,
such as id::_instaddce72e1232152c[0]. Do note that SAW is capable of
verifying both types of functions, however. (We will see an example of
verifying an autogenerated function in the Salsa20 case study later in this
tutorial.)
Identifiers
When you compile a function named id_u8, saw-rustc will expand it to a much
longer name such as first_example/abef32c5::id_u8. This longer name is called
an identifier, and it provides a globally unique name for that function. In
the small examples we’ve seen up to this point, there hasn’t been any risk of
name collisions, but you could imagine compiling this code alongside another
file (or crate) that also defines an id_u8 function. If that happens, then it
is essential that we can tell apart all of the different id_u8 functions, and
identifiers provide us the mechanism for doing so.
Let’s take a closer look at what goes into an identifier. In general, an identifier will look like the following:
<crate name>/<disambiguator>::<function path>
<crate name> is the name of the crate in which the function is defined. All
of the examples we’ve seen up to this point have been defined in standalone
files, and as a result, the crate name has been the same as the file name, but
without the .rs extension and with all hyphens replaced with underscores
(e.g., first-example.rs is given the crate name first_example). In
cargo-based projects, the crate name will likely differ from the file name.
<disambiguator> is a hash of the crate and its dependencies. In extreme
cases, it is possible for two different crates to have identical crate names,
in which case the disambiguator must be used to distinguish between the two
crates. In the common case, however, most crate names will correspond to
exactly one disambiguator. (More on this in a bit.)
<function path> is the path to the function within the crate. Sometimes, this
is as simple as the function name itself. In other cases, a function path may
involve multiple segments, depending on the module hierarchy for the program
being verified. For instance, a read function located in
core/src/ptr/mod.rs will have the identifier:
core/<disambiguator>::ptr::read
Where core is the crate name and ptr::read is the function path, which
has two segments ptr and read. There are also some special forms of
segments that appear for functions defined in certain language constructs.
For instance, if a function is defined in an impl block, then it will have
{impl} as one of its segments, e.g.,
core/<disambiguator>::ptr::const_ptr::{impl}::offset
The most cumbersome part of writing an identifier is the disambiguator, as it is extremely sensitive to changes in the code (not to mention hard to remember and type). Luckily, the vast majority of crate names correspond to exactly one disambiguator, and we can exploit this fact to abbreviate identifiers that live in such crates. For instance, we can abbreviate this identifier:
core/<disambiguator>::ptr::read
To simply:
core::ptr::read
We will adopt the latter, shorter notation throughout the rest of the tutorial. SAW also understands this shorthand, so we will also use this notation when passing identifiers to SAW commands.