What is polonius?

This is a core library that models the borrow check. It implements the analysis described in this blogpost.

Why the name "Polonius"?

The name comes from the famous quote "Neither borrower nor lender be", which comes from the character Polonius in Shakespeare's Hamlet.

Current status

Polonius has been provisionally integrated into rustc. You can run a nightly rustc with the -Zpolonius flag to try it out. However, it is not really ready for widespread use.

Our current roadmap is as follows:

Complete the analysis: Polonius represents only a portion of the full borrow checker analysis. We would like to move as much as possible from the handwritten Rust code in rustc into the datalog-based approach of polonius.
Optimize: Naively implementing the polonius rules can be quite slow, so it's important that we optimize the rules and produce a more optimized version. This will also likely require some special case optimizations to help with specific cases, such as large static constants.

After those two steps are done, and presuming all goes well, we expect to replace the existing rustc borrow checker with the polonius crate. The crate will continue to exist.

Want to help?

Check out the polonius working group of the compiler team.

Want to run the code?

One of the goals with this repo is to experiment and compare different implementations of the same algorithm. You can run the analysis by using cargo run and you can choose the analysis with -a. So for example to run against an example extract from clap, you might do:

> cargo +nightly run --release -- -a DatafrogOpt inputs/clap-rs/app-parser-{{impl}}-add_defaults/
    Finished release [optimized] target(s) in 0.05 secs
     Running `target/release/borrow-check 'inputs/clap-rs/app-parser-{{impl}}-add_defaults/'`
--------------------------------------------------
Directory: inputs/clap-rs/app-parser-{{impl}}-add_defaults/
Time: 3.856s

You could also try -a Naive to get the naive rules (more readable, slower) -- these are the exact rules described in the blogpost. You can also use -a LocationInsensitive to use a location insensitive analysis (faster, but may yield spurious errors).

By default, cargo run just prints timing. If you also want to see the results, try --show-tuples (which will show errors) and maybe -v (to show more intermediate computations). You can supply --help to get more docs.

How to generate your own inputs

To run the borrow checker on an input, you first need to generate the input facts. For that, you will need to run rustc with the -Znll-facts option:

> rustc -Znll-facts inputs/issue-47680/issue-47680.rs

Or, for generating the input facts of a crate using the #![feature(nll)] flag:

> cargo rustc -- -Znll-facts

This will generate a nll-facts directory with one subdirectory per function:

> ls -F nll-facts
{{impl}}-maybe_next/  main/

You can then run on these directories.

Rules

These chapters document and explain the Polonius rules, primarily in datalog form.

First, we'll describe the atoms, and the relations they are stored in. Then, we'll look at the polonius computation in more detail. It's a pipeline consisting of multiple steps and analyses:

Initialization analysis will compute move and initialization errors, as well as the initialization and uninitialization data used by the next step.
Liveness analysis will compute which origins are live at which points in the control flow graph, used by the next step.
Loan analysis (the core of "borrow checking") will compute illegal access errors, and illegal subset relationships errors. This is currently done with different variants (with different datalog rules) which will be described in that section.

Atoms

Polonius defines the following atoms. To Polonius, these are opaque identifiers that identify particular things within the input program (literally they are newtype'd integers). Their meaning and relationships come entirely from the input relations.

Example

We'll use this snippet of Rust code to illustrate the various kinds of atoms that can exist.


#![allow(unused)]
fn main() {
let x = (vec![22], vec![44]);
let y = &x.1;
let z = x.0;
drop(y);
}

Variables

A variable represents a user variable defined by the Rust source code. In our snippet, x, y, and z are variables. Other kinds of variables include parameters.

Path

A path indicates a path through memory to a memory location -- these roughly correspond to places in MIR, although we only support a subset of the full places (that is, every MIR place maps to a Path, but sometimes a single Path maps back to multiple MIR places).

Each path begins with a variable (e.g., x) but can be extended with fields (e.g., x.1), with an "index" (e.g., x[]) or with a deref *x. Note that the index paths (x[]) don't track the actual index that was accessed, since the borrow check treats all indices as equivalent.

The grammar for paths would thus look something like this:

Path = Variable
     | Path "." Field // field access
     | Path "[" "]"   // index
     | "*" Path

Each path has a distinct atom associated with it. So there would be an atom P1 for the path x and another atom P2 for the path x.0. These atoms are related to one another through the path_parent relation.

Node

Nodes are, well, nodes in the control-flow graph. They are related to one another by the cfg_edge relation.

For each statement (resp. terminator) S in the MIR, there are actually two associated nodes. One represents the "start" of S -- before S has begun executing -- the other is called the "mid node" -- which represents the point where S "takes effect". Each start node has exactly one successor, the mid node.

Loans

A loan represents some borrow that occurs in the source. Each loan has an associated path that was borrowed along with a mutability. So, in our example, there would be a single loan, for the &x.1 expression.

Origins

An origin is what it typically called in Rust a lifetime. In Polonius, an origin refers to the set of loans from which a reference may have been created.

Input relations

Polonius computes its analyses starting from "input facts", which can be seen as a little database of information about a piece of Rust code (most often: a function).

In this analogy, the database is the AllFacts struct, which contains all the data in tables (or relations), here as a handful of Vecs of rows. The table rows are these "facts": this terminology comes from Datalog, which Polonius uses to do its computations (and the reason for the rustc flag outputting this data being named -Znll-facts, and the files themselves *.facts).

In order to be used in various contexts (mainly: in-memory from rustc, and from on-disk test files in the Polonius repository) this structure is generic over the types of facts, only requiring them to be Atoms. The goal is to use interned values, represented as numbers, in the polonius computations.

These generic types of facts are the concepts Polonius manipulates: abstract origins containing loans at points in the CFG (in the liveness computation, and move/overwrite analysis, there are also: variables and paths), and the relations are their semantics (including the specific relationships between the different facts). More details about these atoms can be found in their dedicated chapter.

Let's start with the simplest relation: the representation of the Control Flow Graph, in the cfg_edge relation.

1. `cfg_edge`

cfg_edge(point1, point2): as its name suggests, this relation stores that there's a CFG edge between the point point1 and the point point2.

For each MIR statement location, 2 Polonius points are generated: the "Start" points and the "Mid" points (some of the other Polonius inputs will later be recorded at each of the points). These 2 points are linked by an edge recorded in this relation.

Then, another edge will be recorded, linking this MIR statement to its successor statement(s): from the mid point of the current location to the start point of the successor location. Even though it's encoded differently in MIR, this will similarly apply when the successor location is in another block, linking the mid point of the current location to the start point of the successor block's starting location.

For example, for this MIR (edited from the example for clarity, and to only show the parts related to the CFG):


#![allow(unused)]
fn main() {
bb0: {
  ...          // bb0[0]
  ...          // bb0[1]
  goto -> bb3; // bb0[2]
}

...

bb3: {
  ... // bb3[0]
}

}

we will record these input facts (as mentioned before, they'll be interned) in the cfg_edge relation, shown here as pseudo Rust:


#![allow(unused)]
fn main() {
cfg_edge = vec![
  // statement at location bb0[0]:
  (bb0-0-start, bb0-0-mid),
  (bb0-0-mid, bb0-1-start),
  
  // statement at location bb0[1]:
  (bb0-1-start, bb0-1-mid),
  (bb0-1-mid, bb0-2-start),
  
  // terminator at location bb0[2]:
  (bb0-2-start, bb0-2-mid),
  (bb0-2-mid, bb3-0-start),
];
}

2. `loan_issued_at`

loan_issued_at(origin, loan, point): this relation stores that the loan loan was "issued" at the given point point, creating a reference with the origin origin. The origin origin may refer to data from loan loan from the point point and onwards (this is usually the point after a borrow rvalue). The origin in which the loan is issued is called the "issuing origin" (but has been called borrow_region historically, so you may still encounter this term in Polonius or rustc).

For every borrow expression, a loan will be created and there will be a fact stored in this relation to link this loan to the origin of the borrow expression.

For example, with:


#![allow(unused)]
fn main() {
let mut a = 0;
let r = &mut a; // this creates the loan L0
//      ^ let's call this 'a
}

there will be a loan_issued_at fact linking L0 to 'a at this point. This loan will flow along the CFG and the subset relationships between origins, and the computation will require that its terms are respected or it will generate an illegal access error.

3. `placeholder` (and `universal_region`)

placeholder(origin, loan): stores that the origin is a placeholder origin, with its associated placeholder loan loan (universal_region(origin) currently still exists, describing the same thing about origin, without the loan, and is being phased out). These origins have been historically called different things, mostly in rustc, like "universal region" and "free region", but represent origins that are not defined in the MIR body we're checking. They are parts of the caller of this function: its loans are unknown to the current function and it cannot make assumptions about the origin (besides the relationships it may have with different placeholder origins, as we'll see below for the known_placeholder_subset relation). For computations where a loan from these placeholders can be useful (e.g. the illegal subset relationships errors), the associated placeholder loan can be used.

Those are the default placeholder origins ('static) and the ones defined on functions which are generic over a lifetime. For example, with


#![allow(unused)]
fn main() {
fn my_function<'a, 'b>(x: &'a u32, y: &'b u32) {
    ...
}
}

the placeholder relation will also contain facts for 'a, and 'b.

4. `loan_killed_at`

loan_killed_at(loan, point): this relation stores that a prefix of the path borrowed in loan loan is assigned/overwritten at the point point. This indicates that the path borrowed by the loan has changed in some way that the loan no longer needs to be tracked. (In particular, mutations to the path that was borrowed no longer invalidate the loan)

For example, with:


#![allow(unused)]
fn main() {
let mut a = 1;
let mut b = 2;
let mut q = &mut a;
let r = &mut *q; // loan L0 of `*q`
// `q` can't be used here, one has to go through `r`
q = &mut b; // killed(L0)
// `q` and `r` can be used here
}

the loan L0 will be "killed" by the assignment, and this fact stored in the loan_killed_at relation. When we compute which loans origins contain along the CFG, the loan_killed_at points will stop this loan's propagation to the next CFG point.

5. `subset_base`

subset_base(origin1, origin2, point): this relation stores that the origin origin1 outlives origin origin2 at the point point.

This is the standard Rust syntax 'a: 'b where the lifetime 'a outlives the lifetime 'b. From the point of view of origins as sets of loans, this is seen as a subset-relation: with all the loans in 'a flowing into 'b, 'a contains a subset of the loans 'b contains.

The type system defines subtyping rules for references, which will create "outlives" facts to relate the reference type to the referent type as a subset.

(The _base suffix comes from the fact this relation is not transitive, and will be the base of the transitive closure computation)

For example:


#![allow(unused)]
fn main() {
let a: u32 = 1;
let b: &u32 = &a;
//            ^ let's call this 'a
//     ^ and let's call this 'b
}

To be valid, this last expression requires that the type &'a u32 is a subtype of &'b u32. This requires 'a: 'b and the subset_base relation will contain this basic fact that 'a outlives / is a subset of / flows into 'b at this point.

6. `origin_live_at`

origin_live_at(origin, point): this relation stores that the origin origin appears in a live variable at the point point.

These facts are created by the liveness computation, and its facts and relations will be described later in a lot more detail. In the meantime, its implementation is in liveness.rs here.

7. `loan_invalidated_at`

loan_invalidated_at(point, loan): this relation stores that a loan loan is invalidated by some action taking place at the point point.

Loans have terms which must be respected: ensuring shared loans are only used to read and not write or mutate, or that a mutable loan is the only way to access a referent. An illegal access of the path borrowed by the loan is said to invalidate the terms of the loan, and this fact will be recorded in the loan_invalidated_at relation. Any such action on a live loan will be an error.

Since the goal of the borrow checking analysis is to find these possible errors, this relation is important to the computation. Any loans it contains, and in turn, any origin containing those loans, are key facts the computation tracks.

8. `known_placeholder_subset`

known_placeholder_subset(origin1, origin2): this relation store the relationship between two placeholder origins, that the origin1 placeholder origin is a subset of the origin2 placeholder origin. They can be declared by the user on function declarations, or inferred via implied bounds.

For example, the function:


#![allow(unused)]
fn main() {
fn foo<'a, 'b: 'a, 'c>(x: &'c &'a u32) {
    ...
}
}

would have two known_placeholder_subset entries:

one for the user-supplied subset 'b: 'a
one for the 'a: 'c implied bound from the x parameter

Note that the transitive subset 'b: 'c resulting from these two entries is not necessarily included explicitly in this relation. Polonius will infer all the transitive subsets to do its illegal subset relationships errors analysis: if the function analysis finds that two placeholders are related, and this was not declared in the known subsets, that will be an error.

Initialization analysis

These rules are not yet fully merged.

Liveness analysis

These rules are not yet described.

Loan analysis

Loan analysis is the heart of the borrow checker, and will compute:

illegal access errors: an action on a loan, that is illegal to perform
illegal subset relations errors: missing relationships between placeholder origins

This is done in multiple variants, whose goals are different: performance, readability, tests and validation.

Broadly speaking, the goals of the analysis are 1) to track loans:

from the point and origin in which they are issued, to the points where they are invalidated
flowing from origin to origin at a single point, via their subset relationships
flowing from point to point in the CFG, according to the origins' liveness (stopping at points where a loan is killed)

And 2) to track undeclared relationships between placeholder origins.

Any live loan which is invalidated will be an illegal access error, any placeholder which flows into another placeholder unexpectedly will be an illegal subset relation error.

Inputs

The input relations will be described below, but the dedicated page will have more information about them.

// Indicates that the `loan` was "issued" at the given `point`, creating a
// reference with the `origin`. Effectively, `origin` may refer to data from
// `loan` starting at `point` (this is usually the point *after* a borrow rvalue).
.decl loan_issued_at(Origin:origin, Loan:loan, Point:point)
.input loan_issued_at

// When some prefix of the path borrowed at `loan` is assigned at `point`.
// Indicates that the path borrowed by the `loan` has changed in some way that the
// loan no longer needs to be tracked. (In particular, mutations to the path that
// was borrowed no longer invalidate the loan)
.decl loan_killed_at(Loan:loan, Point:point)
.input loan_killed_at

// Indicates that the `loan` is invalidated by some action
// taking place at `point`; if any origin that references this loan is live,
// this is an error.
.decl loan_invalidated_at(Loan:loan, Point:point)
.input loan_invalidated_at

// When we require `origin1@point: origin2@point`.
// Indicates that `origin1 <= origin2` -- i.e., the set of loans in `origin1`
// are a subset of those in `origin2`.
.decl subset_base(Origin1:origin, Origin2:origin, Point:point)
.input subset_base

// Describes a placeholder `origin`, with its associated placeholder `loan`.
.decl placeholder(Origin:origin, Loan:loan)
.input placeholder

// These reflect the `'a: 'b` relations that are either declared by the user on
// function declarations or which are inferred via implied bounds.
// For example: `fn foo<'a, 'b: 'a, 'c>(x: &'c &'a u32)` would have two entries:
// - one for the user-supplied subset `'b: 'a`
// - and one for the `'a: 'c` implied bound from the `x` parameter,
// (note that the transitive relation `'b: 'c` is not necessarily included
// explicitly, but rather inferred by polonius).
.decl known_placeholder_subset(Origin1:origin, Origin2:origin)
.input known_placeholder_subset

The datalog rules below are considered the "naive" implementation, as it computes the whole transitive closure of the subset relation, but are easy to describe and explain. They are implemented using the datafrog engine in the Naive variant.

Some trivial differences exist with the implementation:

the use of the ; alternative operator in the rules
some API limitations about joins in the implementation, sometimes requiring intermediate steps per join (and these can sometimes be shared between different rules)

Subsets between origins

The rules below compute the complete graph of subsets between origins: starting from the non-transitive subsets, we close over this relation at a given point in the CFG (regardless of liveness). Liveness is then taken into account to propagate these transitive subsets along the CFG: if an origin flows into another at a given point, and they both are live at the successor points (reminder: placeholder origins are considered live at all points), the relationship is propagated to the successor points.

.decl subset(Origin1:origin, Origin2:origin, Point:point)

// R1: the initial subsets are the non-transitive `subset_base` static input
subset(Origin1, Origin2, Point) :-
  subset_base(Origin1, Origin2, Point).

// R2: compute the subset transitive closure, at a given point
subset(Origin1, Origin3, Point) :-
  subset(Origin1, Origin2, Point),
  subset(Origin2, Origin3, Point).

// R3: propagate subsets along the CFG, according to liveness
subset(Origin1, Origin2, TargetPoint) :-
  subset(Origin1, Origin2, SourcePoint),
  cfg_edge(SourcePoint, TargetPoint),
  (origin_live_on_entry(Origin1, TargetPoint); placeholder(Origin1, _)),
  (origin_live_on_entry(Origin2, TargetPoint); placeholder(Origin2, _)).

The origins contain loans

The rules below compute what loans are contained in which origins, at given points of the CFG: starting from the "issuing point and origin", a loan is propagated via the subsets computed above, at a given point in the CFG. Liveness is then taken into account to propagate these loans along the CFG: if a loan is contained in an origin at a given point, and that the origin is live at the successor points, the loan is propagated to the successor points. A subtlety here is that there are points in the CFG where a loan can be killed, and that will stop propagation. Rule 6 uses both liveness and kill points to decide whether the loan should be propagated further in the CFG.

.decl origin_contains_loan_on_entry(Origin:origin, Loan:loan, Point:point)

// R4: the issuing origins are the ones initially containing loans
origin_contains_loan_on_entry(Origin, Loan, Point) :-
  loan_issued_at(Origin, Loan, Point).

// R5: propagate loans within origins, at a given point, according to subsets
origin_contains_loan_on_entry(Origin2, Loan, Point) :-
  origin_contains_loan_on_entry(Origin1, Loan, Point),
  subset(Origin1, Origin2, Point).

// R6: propagate loans along the CFG, according to liveness
origin_contains_loan_on_entry(Origin, Loan, TargetPoint) :-
  origin_contains_loan_on_entry(Origin, Loan, SourcePoint),
  !loan_killed_at(Loan, SourcePoint),
  cfg_edge(SourcePoint, TargetPoint),
  (origin_live_on_entry(Origin, TargetPoint); placeholder(Origin, _)).

Loan liveness, and illegal access errors

With the information computed above, we can compute illegal accesses errors. It is an error to invalidate a loan that is live at a given point. A loan is live at a point if it is contained in an origin that is live at that point.

.decl loan_live_at(Loan:loan, Point:point)

// R7: compute whether a loan is live at a given point, i.e. whether it is
// contained in a live origin at this point
loan_live_at(Loan, Point) :-
  origin_contains_loan_on_entry(Origin, Loan, Point),
  (origin_live_on_entry(Origin, Point); placeholder(Origin, _)).

.decl errors(Loan:loan, Point:point)

// R8: compute illegal access errors, i.e. an invalidation of a live loan
errors(Loan, Point) :-
  loan_invalidated_at(Loan, Point),
  loan_live_at(Loan, Point).

Placeholder subsets, and illegal subset relations errors

These errors can be computed differently depending on the variant, but the goal is the same: if the analysis detects that a placeholder origin ultimately flows into another placeholder origin, that relationship needs to be declared or it is an error.

The Naive rules variant computes the complete subset transitive closure and can more easily detect whether one of these facts links two placeholder origins. The LocationInsensitive rules variant does not compute transitive subsets at all, and uses loan propagation to detect these errors (if a placeholder loan flows into a placeholder origin). The Opt optimized rules variant only computes the transitive closure of some origins according to their liveness and possible contribution to any error (mostly the ones dying along an edge, and the origins they can reach), and tracks the transitive subsets of placeholders explicitly.

.decl subset_errors(Origin1:origin, Origin2:origin, Point:point)

// R9: compute illegal subset relations errors, i.e. the undeclared subsets
// between two placeholder origins.
subset_errors(Origin1, Origin2, Point) :-
  subset(Origin1, Origin2, Point),
  placeholder_origin(Origin1),
  placeholder_origin(Origin2),
  !known_placeholder_subset(Origin1, Origin2).

Location Insensitive analysis

The rules above document the Naive variant of loan analysis, as it is conceptually simple and describes all the important parts computed by the Polonius model. This variant is "naive" in the sense that to stay clear and simple, the rules compute more things than strictly required. In particular, it computes the complete transitive subsets of all origins, as well as the loans contained by each origin at every point of the CFG.

In practice, different "grades" of borrow-checking can be useful: each with different levels of precision in what it accepts and with different computational complexity requirements. The lowest of such grades, the LocationInsensitive variant, trades off precision for speed by ignoring both the location where subsets happen, and the origins' contents at the CFG points. The idea is: if an analysis would find no error when ignoring path- and flow-sensitivity, then the full analysis would find no error either. If it does find potential errors, then the full analysis will find a subset of these location-insensitive errors.

This can be used as a quick pre-pass: if there are no errors, a full, expensive, analysis does not need to run, otherwise, only the loans where potential errors occur would need to be fully checked to remove false positives.

The inputs are the same as the Naive variant, but ignore the CFG points from the subsets. Subsets are not tracked, and are used to approximate loan propagation inside origins (regardless of liveness and location-sensitivity) in origin_contains_loan:

.decl subset(Origin1:origin, Origin2:origin)

// R1: the subsets are the non-transitive `subset_base` static input,
// with their location stripped.
subset(Origin1, Origin2) :-
  subset_base(Origin1, Origin2, _).

.decl origin_contains_loan(Origin:origin, Loan:loan)

// R2: the issuing origins are the ones initially containing loans.
origin_contains_loan(Origin, Loan) :-
  loan_issued_at(Origin, Loan, _).

// R3: the placeholder origins also contain their placeholder loan.
origin_contains_loan(Origin, Loan) :-
  placeholder_loan(Origin, Loan).

// R4: propagate the loans from the origins to their subsets.
origin_contains_loan(Origin2, Loan) :-
  origin_contains_loan(Origin1, Loan),
  subset(Origin1, Origin2).

.decl loan_live_at(Loan:loan, Point:point)

// R5a: Approximate loan liveness. If an origin is live at a given
// point, and it contains a loan *anywhere* in the CFG, that loan is
// considered live at that point.
loan_live_at(Loan, Point) :-
  origin_contains_loan(Origin, Loan),
  (origin_live_on_entry(Origin, Point); placeholder_origin(Origin)).

.decl potential_errors(Loan:loan, Point:point)

// R5b: Compute potential illegal access errors, i.e. invalidations
// of live loans.
potential_errors(Loan, Point) :-
  loan_invalidated_at(Loan, Point),
  loan_live_at(Loan, Point).

Note: rules "5a" and "5b" above are named to match the implementation which computes potential_errors as a single "rule 5" without materializing the loan_live_at intermediate relation of "rule 5a".

Illegal subset relation errors (which are by definition about "subsets") can still be computed by propagating the placeholder loans, and detecting when they unexpectedly flow into another placeholder origin: one where this specific relationship between the two placeholders was not declared.

.decl potential_subset_errors(Origin1:origin, Origin2:origin)

// R6: compute potential illegal subset relations errors, i.e. the
// placeholder loans which ultimately flowed into another placeholder
// origin unexpectedly.
potential_subset_errors(Origin1, Origin2) :-
  placeholder(Origin1, Loan1),
  placeholder(Origin2, _),
  origin_contains_loan(Origin2, Loan1),
  !placeholder_known_to_contain(Origin2, Loan1).

This requires a simple input equivalent to the transitive closure of known_placeholder_subset, tracking the placeholder loans a given placeholder origin is known to contain instead, and is computed like so:

.decl placeholder_known_to_contain(Origin:origin, Loan:loan)

placeholder_known_to_contain(Origin, Loan) :-
  placeholder(Origin, Loan).

placeholder_known_to_contain(Origin2, Loan1) :-
  placeholder_known_to_contain(Origin1, Loan1),
  known_placeholder_subset(Origin1, Origin2).

To be continued

In the current implementation, this quick LocationInsensitive filter is used as a pre-pass to another optimized variant, as part of the Hybrid algorithm.

A more detailed description of the rules in this Opt variant will be added later but it computes the same data as the Naive variant described above, more efficiently, by limiting where the subset transitive closure is computed: some origins are short-lived, or part of a subsection of the subset graph into which no loan ever flows, and therefore don't contribute to errors or loan propagation. There's no need to track these specific cases.

In the meantime, the implementation documents the relations and rules it uses in its computation.

Testing Polonius

Rust UI Tests with Polonius Compare Mode

There is a mode of the Rust test suite that compares Polonius' output to the current NLL one. You can invoke it by using --compare-mode polonius. For example, the following will run the UI tests:

$ ./x.py test -i --stage 1 --compare-mode polonius src/test/ui

Polonius' Own Unit Test

(Not yet written, but this section should describe how to use polonius-parser to generate input for unit tests.)