About mir-json
We are interested in verifying code written in Rust, but Rust is an extremely rich programming language that has many distinct language features. To make the process of verifying Rust simpler, SAW targets an intermediate language used in the Rust compiler called MIR (short for “Mid-level IR”). MIR takes the variety of different features and syntactic extensions in Rust and boils them down to a minimal subset that is easier for a computer program to analyze.
The process of extracting MIR code that is suitable for SAW’s needs is somewhat
tricky, so we wrote a suite of tools called mir-json
to automate this
process. mir-json
provides a plugin for tools like rustc
and cargo
that
lets you compile Rust code as you normally would and produce an additional
.json
file as output. This .json
file contains the MIR code that was
produced internally by the Rust compiler, with some additional minor tweaks to
make it suitable for SAW’s needs.
mir-json
is not a single tool but rather a suite of related tools that
leverage the same underlying plugin. For SAW purposes, the two mir-json
tools
that are most relevant are:
saw-rustc
: A thin wrapper aroundrustc
(the Rust compiler), which is suitable for individual.rs
files.cargo-saw-build
: A thin wrapper around thecargo build
command, which is suitable forcargo
-based Rust projects.
Most of the examples in this tutorial involve self-contained examples, which
will use saw-rustc
. Later in the tutorial, we will examine a Salsa20 case
study that involves a cargo
-based project, which will use cargo-saw-build
.
A first example with saw-rustc
Let’s try out saw-rustc
on a small example file, which we’ll name
first-example.rs
:
pub fn id_u8(x: u8) -> u8 {
x
}
This is the identity function, but specialized to the type u8
. We can compile
this with saw-rustc
like so:
$ saw-rustc first-example.rs
<snip>
note: Emitting MIR for first_example/abef32c5::id_u8
linking 1 mir files into first-example.linked-mir.json
<snip>
saw-rustc
prints out some additional information that rustc
alone does not
print, and we have displayed the parts of this information that are most
interesting. In particular:
saw-rustc
notes that is isEmitting MIR for first_example/abef32c5::id_u8
, wherefirst_example/abef32c5::id_u8
is the full identifier used to uniquely refer to theid_u8
function. It’s entirely possible that theabef32c5
bit will look different on your machine; we’ll talk more about identifiers in the “Identifiers” section.Once
saw-rustc
produced a MIR JSON file namedfirst-example.linked-mir.json
. This is an important bit of information, as SAW will ingest this JSON file.
If you’d like, you can inspect the first-example.linked-mir.json
file with
JSON tools (e.g., jq
), but it is not
important to understand everything that is going on there. This is
machine-generated JSON, and as such, it is not meant to be particularly
readable by human eyes.
The SAW_RUST_LIBRARY_PATH
environment variable
Rust has a large set of standard libraries that ship with the compiler, and
parts of the standard library are quite low-level and tricky. SAW’s primary
objective is to provide a tool that can analyze code in a tractable way. For
this reason, SAW sometimes needs to invoke simplified versions of Rust standard
library functions that are more reasonable for an SMT-based tool like SAW to
handle. These simplified functions are equivalent in behavior, but avoid using
problematic code patterns (e.g., gnarly pointer arithmetic or the
transmute
function).
If you are only ever compiling self-contained pieces of code with saw-rustc
,
there is a good chance that you can get away without needing to use SAW’s
custom version of the Rust standard libraries. However, if you ever need to
build something more complicated than that (e.g., the Salsa20 case study later
in this tutorial), then you will need the custom libraries. For this reason,
it is worthwhile to teach SAW the location of the custom libraries now.
At present, the best way to obtain the custom version of the Rust standard libraries is to perform the following steps:
Clone the
mir-json
repo like so:$ git clone https://github.com/GaloisInc/mir-json $ cd mir-json
Follow the instructions laid out in the
mir-json
installation instructions in order to installmir-json
.Run the
translate_libs.sh
script:$ ./translate_libs.sh
This will compile the custom versions of the Rust standard libraries using
mir-json
, placing the results under therlibs
subdirectory.Finally, define a
SAW_RUST_LIBRARY_PATH
environment variable that points to the newly createdrlibs
subdirectory:$ export SAW_RUST_LIBRARY_PATH=<...>/mir-json/rlibs
An upcoming release of SAW will include these custom libraries pre-built, which
will greatly simplify the steps above. Either way, you will need to set the
SAW_RUST_LIBRARY_PATH
environment variable to point to the location of the
custom libraries.
A note about generics
The id_u8
function above is likely not how most Rust programmers would define
the identity function. Instead, it would seem more natural to define it
generically, that is, by parameterizing the function by a type parameter:
pub fn id<A>(x: A) -> A {
x
}
If you compile this with saw-rustc
, however, the resulting JSON file will
lack a definition for id
! We can see this by using jq
:
$ saw-rustc generics-take-1.rs
<snip>
$ jq . generics-take-1.linked-mir.json
{
"fns": [],
"adts": [],
"statics": [],
"vtables": [],
"traits": [],
"intrinsics": [],
"tys": [],
"roots": []
}
What is going on here? This is the result of an important design choice that SAW makes: SAW only supports verifying monomorphic functions. To be more precise, SAW’s approach to symbolic simulation requires all of the code being simulated to have fixed types without any type parameters.
In order to verify a function using generics in your Rust code, you must provide a separate, monomorphic function that calls into the generic function. For example, you can rewrite the example above like so:
pub fn id<A>(x: A) -> A {
x
}
pub fn id_u8(x: u8) -> u8 {
id(x)
}
If you compile this version with saw-rustc
, you’ll see:
$ saw-rustc generics-take-2.rs
<snip>
note: Emitting MIR for generics_take_2/8b1bf337::id_u8
note: Emitting MIR for generics_take_2/8b1bf337::id::_instaddce72e1232152c[0]
linking 1 mir files into generics-take-2.linked-mir.json
<snip>
This time, the resulting JSON file contains a definition for id_u8
. The
reason that this works is because when id_u8
calls id
, the Rust compile
will generate a specialized version of id
where A
is instantiated with the
type u8
. This specialized version of id
is named
id::_instaddce72e1232152c[0]
in the output above. (You don’t have to remember
this name, thankfully!)
Although the resulting JSON file contains a definition for id_u8
, it does
not contain a definition for the generic id
function. As a result, SAW will
only be able to verify the id_u8
function from this file. If you are ever in
doubt about which functions are accessible for verification with SAW, you can
check this with jq
like so:
$ jq '.intrinsics | map(.name)' generics-take-2.linked-mir.json
[
"generics_take_2/8b1bf337::id_u8",
"generics_take_2/8b1bf337::id::_instaddce72e1232152c[0]"
]
Here, “intrinsics” are monomorphic functions that are visible to SAW. Note that
saw-rustc
will optimize away all definitions that are not accessible from one
of these intrinsic functions. This explains why the original program that only
defined a generic id
function resulted in a definition-less JSON file, as
that program did not contain monomorphic functions (and therefore no
intrinsics).
Generally speaking, we prefer to verify functions that are defined directly in
the Rust source code, such as id_u8
, as these functions’ names are more
stable than the specialized versions of functions that the compiler generates,
such as id::_instaddce72e1232152c[0]
. Do note that SAW is capable of
verifying both types of functions, however. (We will see an example of
verifying an autogenerated function in the Salsa20 case study later in this
tutorial.)
Identifiers
When you compile a function named id_u8
, saw-rustc
will expand it to a much
longer name such as first_example/abef32c5::id_u8
. This longer name is called
an identifier, and it provides a globally unique name for that function. In
the small examples we’ve seen up to this point, there hasn’t been any risk of
name collisions, but you could imagine compiling this code alongside another
file (or crate) that also defines an id_u8
function. If that happens, then it
is essential that we can tell apart all of the different id_u8
functions, and
identifiers provide us the mechanism for doing so.
Let’s take a closer look at what goes into an identifier. In general, an identifier will look like the following:
<crate name>/<disambiguator>::<function path>
<crate name>
is the name of the crate in which the function is defined. All
of the examples we’ve seen up to this point have been defined in standalone
files, and as a result, the crate name has been the same as the file name, but
without the .rs
extension and with all hyphens replaced with underscores
(e.g., first-example.rs
is given the crate name first_example
). In
cargo
-based projects, the crate name will likely differ from the file name.
<disambiguator>
is a hash of the crate and its dependencies. In extreme
cases, it is possible for two different crates to have identical crate names,
in which case the disambiguator must be used to distinguish between the two
crates. In the common case, however, most crate names will correspond to
exactly one disambiguator. (More on this in a bit.)
<function path>
is the path to the function within the crate. Sometimes, this
is as simple as the function name itself. In other cases, a function path may
involve multiple segments, depending on the module hierarchy for the program
being verified. For instance, a read
function located in
core/src/ptr/mod.rs
will have the identifier:
core/<disambiguator>::ptr::read
Where core
is the crate name and ptr::read
is the function path, which
has two segments ptr
and read
. There are also some special forms of
segments that appear for functions defined in certain language constructs.
For instance, if a function is defined in an impl
block, then it will have
{impl}
as one of its segments, e.g.,
core/<disambiguator>::ptr::const_ptr::{impl}::offset
The most cumbersome part of writing an identifier is the disambiguator, as it is extremely sensitive to changes in the code (not to mention hard to remember and type). Luckily, the vast majority of crate names correspond to exactly one disambiguator, and we can exploit this fact to abbreviate identifiers that live in such crates. For instance, we can abbreviate this identifier:
core/<disambiguator>::ptr::read
To simply:
core::ptr::read
We will adopt the latter, shorter notation throughout the rest of the tutorial. SAW also understands this shorthand, so we will also use this notation when passing identifiers to SAW commands.