# Intro

## `formality_core`: the Formality system

a-mir-formality is built on the formality core system, defined by the `formality_core` crate. Formality core is a mildly opinionated series of macros, derives, and types that let you write high-level Rust code in a way that resembles standard type theory notation. Its primary purpose is to be used by a-mir-formality, but it can be used for other projects.
# Defining your language

The very first thing you do to define your own language is to use the `formality_core::declare_language!` macro. You use it like so:

```rust
formality_core::declare_language! {
    pub mod rust {
        const NAME = "Rust";
        type Kind = crate::grammar::ParameterKind;
        type Parameter = crate::grammar::Parameter;
        const BINDING_OPEN = '<';
        const BINDING_CLOSE = '>';
        const KEYWORDS = [
            "mut",
            "struct",
            "enum",
            "union",
            "const",
            "true",
            "false",
            "static",
        ];
    }
}
```
The `declare_language` macro will create a module with the name you specify (here, `rust`). You have to tell it a few things:

- The `NAME` of your language, a string used for debugging.
- An enum defining the kinds of variables in your language; see the variables chapter for more information. For Rust, the kinds are types, lifetimes, and constants.
- An enum defining a parameter, which is the term that can be used to represent the value of a variable; see the variables chapter for more information.
- Two characters, `BINDING_OPEN` and `BINDING_CLOSE`, defining the opening and closing characters for binders, e.g., `<` and `>`, to use when parsing.
## Contents of the language module

The language module you create has various items in it:

- A `struct FormalityLang` that defines your language. Some of the contents of `formality_core` (notably the traits that involve bound variables) are parameterized by this struct.
## Specifying the language for a crate

That module will contain a language struct named `FormalityLang`. Other parts of the formality system (e.g., auto-derives and the like) will need to know the current language you are defining, and they expect to find it at `crate::FormalityLang`. Best practice is to add a `use` at the top of the crate defining your language. For example, the `formality_types` crate has:

```rust
use crate::rust::FormalityLang;
```

and other crates like `formality_rust` have:

```rust
// Defines the language used by derive(term) and friends.
use formality_types::rust::FormalityLang;
```
# Defining terms with the `#[term]` macro

There are two or three key things to understand. The first is the `#[term]` macro. This is a procedural macro that you can attach to a `struct` or `enum` declaration that represents a piece of Rust syntax or part of the trait checking rules. It auto-derives a bunch of traits and functionality:

- rules for parsing from a string
- a `Debug` impl to serialize back out
- folding and substitution
- upcasting and downcasting impls for convenient type conversion

For some types, we opt not to use `#[term]`, and instead implement the traits by hand. There are also derives so you can derive some of the traits but not all.
## Using `#[term]`

Let's do a simple example. If we had a really simple language, we might define expressions like this:

```rust
#[term]
enum Expr {
    #[cast]
    Variable(Variable),

    #[grammar($v0 + $v1)]
    Add(Box<Expr>, Box<Expr>),
}

#[term($name)]
struct Variable {
    name: String,
}
```
The `#[term]` macro says that these are terms and that we should generate all the boilerplate automatically. Note that it will generate code that references `crate::FormalityLang`, so be sure to define your language appropriately.

The `#[term]` macro also accepts some internal annotations:

- `#[cast]` can be applied to an enum variant with a single argument. It says that this variant represents an "is-a" relationship, and hence that we should generate upcast/downcast impls to allow conversion. In this case, a variable is a kind of expression -- i.e., wrapping a variable up into an expression doesn't carry any semantic meaning -- so we want variables to be upcastable to expressions (and expressions to be downcastable to variables). The `#[cast]` attribute will therefore generate an impl `Variable: Upcast<Expr>` that lets you convert a variable to an expression, and a downcast impl `Expr: Downcast<Variable>` that lets you try to convert from an expression to a variable. Downcasting is fallible: the downcast will only succeed if this is an `Expr::Variable`. If this is an `Expr::Add`, the downcast returns `None`.
  - There is a special-case version of `#[cast]` called `#[variable]`. It indicates that the variant represents a (type) variable -- we are not using it here because this is an expression variable. Variables are treated specially in folding/parsing to allow for substitution, binders, etc.
- `#[grammar]` tells the parser and pretty printer how to parse this variant. The `$v0` and `$v1` mean "recursively parse the first and second arguments". This will parse a `Box<Expr>`, which of course is implemented to just parse an `Expr`.

If you are annotating a struct, `#[term]` just accepts the grammar directly, so `#[term($name)] struct Variable` means "to parse a variable, just parse the `name` field (a string)".
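The upcast/downcast relationship that `#[cast]` generates can be illustrated with plain Rust conversions. This is a hand-rolled sketch of the pattern only, not the code `#[term]` actually emits (the real traits are formality_core's `Upcast` and `Downcast`; here `From` and a hypothetical `downcast_variable` method stand in for them):

```rust
// Sketch of the "is-a" pattern behind `#[cast]` (illustrative, not generated code).
#[derive(Clone, Debug, PartialEq)]
struct Variable {
    name: String,
}

#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Variable(Variable),
    Add(Box<Expr>, Box<Expr>),
}

// Upcast: infallible conversion, Variable -> Expr.
impl From<Variable> for Expr {
    fn from(v: Variable) -> Expr {
        Expr::Variable(v)
    }
}

impl Expr {
    // Downcast: fallible conversion, Expr -> Option<Variable>.
    fn downcast_variable(&self) -> Option<Variable> {
        match self {
            Expr::Variable(v) => Some(v.clone()),
            _ => None,
        }
    }
}

fn main() {
    let v = Variable { name: "x".to_string() };
    let e: Expr = v.clone().into(); // upcast always succeeds
    assert_eq!(e.downcast_variable(), Some(v)); // this Expr *is* a Variable
    let sum = Expr::Add(Box::new(e.clone()), Box::new(e));
    assert_eq!(sum.downcast_variable(), None); // an Add is not a Variable
}
```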
We could also define types and an environment, perhaps something like:

```rust
#[term]
enum Type {
    Integer, // Default grammar is just the word `integer`
    String,  // Default grammar is just the word `string`
}

#[term] // Default grammar is just `env($bindings)`
struct Env {
    bindings: Set<(Variable, Type)>,
}
```

You can see that the `#[term]` macro will generate some default parsing rules if you don't say anything.

We can then write code like:

```rust
let env: Env = term("env({(x, integer)})");
```

This will parse the string, panicking if the string cannot be parsed or if it is ambiguous (can be parsed in multiple ways). This is super useful in tests.

These terms are just Rust types, so you can define methods in the usual way. For example, this `Env::get` method will search for a variable named `v`:

```rust
impl Env {
    pub fn get(&self, v: &Variable) -> Option<&Type> {
        self.bindings
            .iter()
            .filter(|b| &b.0 == v)
            .map(|b| &b.1)
            .next()
    }
}
```
# Parsing

Formality's `#[term]` and `#[grammar]` attributes let you specify the grammar that will be used to parse your structs/enums.

For structs, there is a single grammar, specified as part of the term:

```rust
#[term( /* grammar here */ )]
struct MyStruct { }
```

For enums, the grammar is placed in `#[grammar]` attributes on each variant:

```rust
#[term]
enum MyEnum {
    #[grammar( /* grammar here */ )]
    Foo(...),
}
```
## Ambiguity and greedy parsing

When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the longest match. However, we don't just consider the number of characters; we look for a reduction prefix:

- When parsing, we track the list of things we had to parse. If there are two variants at the same precedence level, but one of them had to parse strictly more things than the other, and in the same way, we'll prefer the longer one. So for example if one variant parsed a `Ty` and the other parsed a `Ty Ty`, we'd take the `Ty Ty`.
- When considering whether a reduction is "significant", we take casts into account. See `ActiveVariant::mark_as_cast_variant` for a more detailed explanation and set of examples.
## Precedence and left-recursive grammars

We support left-recursive grammars like this one from the `parse-torture-tests`:

```rust
#[term]
pub enum Path {
    #[cast]
    Id(Id),

    #[grammar($v0 . $v1)]
    Field(Arc<Path>, Id),

    #[grammar($v0 [ $v1 ])]
    Index(Arc<Path>, Arc<Path>),
}

formality_core::id!(Id);
```

We also support ambiguous grammars. For example, you can code up arithmetic expressions like this:

```rust
#[term]
pub enum Expr {
    #[cast]
    Id(Id),

    #[grammar($v0 + $v1)]
    #[precedence(1)]
    Add(Arc<Expr>, Arc<Expr>),

    #[grammar($v0 * $v1)]
    #[precedence(2)]
    Mul(Arc<Expr>, Arc<Expr>),
}

formality_core::id!(Id);
```
When specifying the `#[precedence]` of a variant, the default is left-associativity, which can be written more explicitly as `#[precedence(L, left)]`. If you prefer, you can specify right-associativity (`#[precedence(L, right)]`) or non-associativity (`#[precedence(L, none)]`). This affects how things at the same level are parsed:

- `1 + 1 + 1` when left-associative is `(1 + 1) + 1`
- `1 + 1 + 1` when right-associative is `1 + (1 + 1)`
- `1 + 1 + 1` when non-associative is an error
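To see why the grouping matters, here is a small illustration (plain Rust, unrelated to the parser internals) that folds a chain of subtractions both ways; subtraction is used because, unlike `+`, it gives different answers under each grouping:

```rust
// Left-associative grouping: ((10 - 2) - 3)
fn fold_left(xs: &[i64]) -> i64 {
    xs[1..].iter().fold(xs[0], |acc, x| acc - x)
}

// Right-associative grouping: (10 - (2 - 3))
fn fold_right(xs: &[i64]) -> i64 {
    xs[..xs.len() - 1]
        .iter()
        .rev()
        .fold(xs[xs.len() - 1], |acc, x| x - acc)
}

fn main() {
    let xs = [10, 2, 3];
    assert_eq!(fold_left(&xs), 5);   // (10 - 2) - 3
    assert_eq!(fold_right(&xs), 11); // 10 - (2 - 3)
}
```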
## Symbols

A grammar consists of a series of symbols. Each symbol matches some text in the input string. Symbols come in two varieties:

- Most things are terminals or tokens: they just match themselves.
  - For example, the `*` in `#[grammar($v0 * $v1)]` is a terminal, and it means to parse a `*` from the input.
  - Delimiters are accepted but must be matched, e.g., `( /* tokens */ )` or `[ /* tokens */ ]`.
- Things beginning with `$` are nonterminals -- they parse the contents of a field. The grammar for a field is generally determined from its type.
  - If fields have names, then `$field` should name the field.
  - For positional fields (e.g., the two arguments of `Mul(Expr, Expr)`), use `$v0`, `$v1`, etc.
  - Exception: `$$` is treated as the terminal `$`.
- Nonterminals have various modes:
  - `$field` -- just parse the field's type
  - `$*field` -- the field must be a collection of `T` (e.g., `Vec<T>`, `Set<T>`); parse any number of `T` instances. Something like `[ $*field ]` would parse `[f1 f2 f3]`, assuming `f1`, `f2`, and `f3` are valid values for `field`.
  - `$,field` -- similar to the above, but uses a comma-separated list (with optional trailing comma). So `[ $,field ]` will parse something like `[f1, f2, f3]`.
  - `$?field` -- will parse `field` and use the `Default::default()` value if not present
  - `$<field>` -- parse `<E1, E2, E3>`, where `field` is a collection of `E`
  - `$<?field>` -- same, but accept an empty string as an empty vector
  - `$(field)` -- parse `(E1, E2, E3)`, where `field` is a collection of `E`
  - `$(?field)` -- same, but accept an empty string as an empty vector
  - `$[field]` -- parse `[E1, E2, E3]`, where `field` is a collection of `E`
  - `$[?field]` -- same, but accept an empty string as an empty vector
  - `${field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`
  - `${?field}` -- same, but accept an empty string as an empty vector
  - `$:guard <nonterminal>` -- parses `<nonterminal>`, but only if the keyword `guard` is present. For example, `$:where $,where_clauses` would parse `where WhereClause1, WhereClause2, WhereClause3` but would also accept nothing (in which case, you would get an empty vector).
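As a rough illustration of what a comma-separated mode like `$,field` accepts, here is a minimal hand-written sketch. It is not the actual formality_core parser; it just shows the accepted shapes (comma separators, optional trailing comma, empty input yielding an empty collection):

```rust
// Illustrative sketch of `$,field`-style input handling.
fn parse_comma_list(input: &str) -> Vec<String> {
    input
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty()) // tolerate an optional trailing comma
        .map(String::from)
        .collect()
}

fn main() {
    assert_eq!(parse_comma_list("f1, f2, f3"), ["f1", "f2", "f3"]);
    assert_eq!(parse_comma_list("f1, f2, f3,"), ["f1", "f2", "f3"]); // trailing comma
    assert!(parse_comma_list("").is_empty()); // empty input -> empty vector
}
```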
## Greediness

Parsing is generally greedy. So `$*x` and `$,x`, for example, consume as many entries as they can. Typically this works best if `x` begins with some symbol that indicates whether it is present.
## Default grammar

If no grammar is supplied, the default grammar is determined as follows:

- If a `#[cast]` or `#[variable]` annotation is present, then the default grammar is just `$v0`.
- Otherwise, the default grammar is the name of the type (for structs) or variant (for enums), followed by `()`, with the values for the fields in order. So `Mul(Expr, Expr)` would have a default grammar `mul($v0, $v1)`.
## Customizing the parse

If you prefer, you can customize the parse by annotating your term with `#[customize(parse)]`. In the Rust case, for example, the parsing of `RigidTy` is customized (as is the debug impl):

```rust
#[term((rigid $name $*parameters))]
#[customize(parse, debug)]
pub struct RigidTy {
    pub name: RigidName,
    pub parameters: Parameters,
}
```

You must then supply an impl of `Parse` yourself. Because `Parse` is a trait alias, you actually want to implement `CoreParse<L>` for your language type `L`. Inside the code you will want to instantiate a `Parser` and then invoke `parse_variant` for every variant, finally invoking `finish`.

In the Rust code, the impl for `RigidTy` looks as follows:

```rust
// Implement custom parsing for rigid types.
impl CoreParse<Rust> for RigidTy {
    fn parse<'t>(scope: &Scope<Rust>, text: &'t str) -> ParseResult<'t, Self> {
        Parser::multi_variant(scope, text, "RigidTy", |parser| {
            // Parse a `ScalarId` (and upcast it to `RigidTy`) with the highest
            // precedence. If someone writes `u8`, we always interpret it as a
            // scalar-id.
            parser.parse_variant_cast::<ScalarId>(Precedence::default());

            // Parse something like `Id<...>` as an ADT.
            parser.parse_variant("Adt", Precedence::default(), |p| {
                // Don't accept scalar-ids as Adt names.
                p.reject_nonterminal::<ScalarId>()?;

                let name: AdtId = p.nonterminal()?;
                let parameters: Vec<Parameter> = parse_parameters(p)?;
                Ok(RigidTy {
                    name: name.upcast(),
                    parameters,
                })
            });

            // Parse `&`
            parser.parse_variant("Ref", Precedence::default(), |p| {
                p.expect_char('&')?;
                let lt: Lt = p.nonterminal()?;
                let ty: Ty = p.nonterminal()?;
                Ok(RigidTy {
                    name: RigidName::Ref(RefKind::Shared),
                    parameters: seq![lt.upcast(), ty.upcast()],
                })
            });

            parser.parse_variant("RefMut", Precedence::default(), |p| {
                p.expect_char('&')?;
                p.expect_keyword("mut")?;
                let lt: Lt = p.nonterminal()?;
                let ty: Ty = p.nonterminal()?;
                Ok(RigidTy {
                    name: RigidName::Ref(RefKind::Mut),
                    parameters: seq![lt.upcast(), ty.upcast()],
                })
            });

            parser.parse_variant("Tuple", Precedence::default(), |p| {
                p.expect_char('(')?;
                p.reject_custom_keywords(&["alias", "rigid", "predicate"])?;
                let types: Vec<Ty> = p.comma_nonterminal()?;
                p.expect_char(')')?;
                let name = RigidName::Tuple(types.len());
                Ok(RigidTy {
                    name,
                    parameters: types.upcast(),
                })
            });
        })
    }
}
```
## Customizing the debug

By default, the `#[term]` macro will generate a `Debug` impl that is guided by the `#[grammar]` attributes on your type (see the parsing section for more details). But sometimes you want custom logic. You can include a `#[customize(debug)]` declaration to allow that. Most of the time, when you do this, you will also want to customize parsing, as `RigidTy` does:

```rust
#[term((rigid $name $*parameters))]
#[customize(parse, debug)]
pub struct RigidTy {
    pub name: RigidName,
    pub parameters: Parameters,
}
```

Now you simply implement `Debug` in the usual way. Here is the `RigidTy` impl:

```rust
impl Debug for RigidTy {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let RigidTy { name, parameters } = self;
        match name {
            RigidName::AdtId(name) => {
                write!(
                    f,
                    "{:?}{:?}",
                    name,
                    PrettyParameters::new("<", ">", parameters)
                )
            }
            RigidName::ScalarId(s) if parameters.is_empty() => {
                write!(f, "{:?}", s)
            }
            RigidName::Ref(RefKind::Shared) if parameters.len() == 2 => {
                write!(f, "&{:?} {:?}", parameters[0], parameters[1])
            }
            RigidName::Ref(RefKind::Mut) if parameters.len() == 2 => {
                write!(f, "&mut {:?} {:?}", parameters[0], parameters[1])
            }
            RigidName::Tuple(arity) if parameters.len() == *arity => {
                if *arity != 0 {
                    write!(f, "{:?}", PrettyParameters::new("(", ")", parameters))
                } else {
                    // PrettyParameters would skip the separators
                    // for 0 arity
                    write!(f, "()")
                }
            }
            _ => {
                write!(f, "{:?}{:?}", name, PrettyParameters::angle(parameters))
            }
        }
    }
}
```
## Constructors

Unless you include `#[customize(constructors)]`, the `#[term]` macro automatically creates constructors as follows:

- For a `struct`, it defines a `new` method that takes an `impl Upcast<T>` for each of your fields.
- For an `enum`, it defines a method per variant that has fields (with the name converted to snake-case).
  - If the name of the variant is a Rust keyword like `Struct`, the method will be called `struct_`.
  - We do not generate constructors for variants with no arguments.
# Variables
# Collections

When using formality, it's best to use the following collection types:

- For sequences, use the standard `Vec` type.
- `formality_core::Set<T>` -- equivalent to `BTreeSet` but shorter. We use a `BTreeSet` because it has deterministic ordering for all operations.
- `formality_core::Map<K, V>` -- equivalent to `BTreeMap` but shorter. We use a `BTreeMap` because it has deterministic ordering for all operations.
## Macros

We also define macros:

- `seq![...]` -- equivalent to `vec![...]` but permits flattening with `..` notation, as described below
- `set![...]` -- like `seq!`, but produces a `Set`

In these macros you can "flatten" things that support `IntoIterator`, so `set![..a, ..b]` will effectively perform a set union of `a` and `b`.
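The `..`-flattening can be understood as building the collection and then calling `extend` for each flattened operand. The following is a sketch of that effective behavior using `std::collections::BTreeSet` directly (the real `set!` macro lives in formality_core):

```rust
use std::collections::BTreeSet;

fn main() {
    let a = BTreeSet::from([1, 2, 3]);
    let b = BTreeSet::from([3, 4]);

    // Roughly what `set![..a, ..b]` does:
    let mut union = BTreeSet::new();
    union.extend(a); // any `IntoIterator` works here
    union.extend(b);

    // The result is the set union of `a` and `b`.
    assert_eq!(union, BTreeSet::from([1, 2, 3, 4]));
}
```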
## Casting between collections and tuples

It is possible to upcast from various tuple types to produce collections:

- A `Vec<E1>` can be upcast to a `Vec<E2>` if `E1: Upcast<E2>`.
- A `Set<E1>` can be upcast to a `Set<E2>` if `E1: Upcast<E2>`.
- Tuples of elements (e.g., `(E1, E2)` or `(E1, E2, E3)`) can be upcast to sets, up to a fixed arity.
- Sets and vectors can be downcast to `()` and `(E, C)`, where `()` succeeds only for empty collections, and `(E, C)` extracts the first element `E` and a collection `C` with all remaining elements (note that elements in sets are always ordered, so the first element is well defined there). This is useful when writing judgment rules that operate over sequences and sets.
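The `(E, C)` downcast on sets amounts to a first/rest split, which can be sketched in plain Rust (the real implementation is in formality_core's cast machinery; `split_first` here is a hypothetical helper):

```rust
use std::collections::BTreeSet;

// Sketch of the `(E, C)` downcast: smallest element plus the rest.
// Returns `None` for an empty set, mirroring the `()` case.
fn split_first(set: &BTreeSet<i32>) -> Option<(i32, BTreeSet<i32>)> {
    let first = *set.iter().next()?; // BTreeSet iterates in sorted order
    let rest = set.iter().copied().filter(|&x| x != first).collect();
    Some((first, rest))
}

fn main() {
    let s = BTreeSet::from([3, 1, 2]);
    let (first, rest) = split_first(&s).unwrap();
    assert_eq!(first, 1); // well defined: sets are ordered
    assert_eq!(rest, BTreeSet::from([2, 3]));
    assert!(split_first(&BTreeSet::new()).is_none()); // empty set: no split
}
```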
# Judgment functions and inference rules

The next thing is the `judgment_fn!` macro. This lets you write a judgment using inference rules. A "judgment" just means some kind of predicate that the computer can judge to hold or not hold. Inference rules are those rules you may have seen in papers and the like:

```
premise1
premise2
premise3
------------------
conclusion
```

i.e., the conclusion is judged to be true if all the premises are true.

Judgments in type system papers can look all kinds of ways. For example, a common type system judgment would be the following:

```
Γ ⊢ E : T
```

This can be read as: in the environment Γ, the expression E has type T. You might have rules like these:

```
Γ[X] = ty                    // lookup variable in the environment
--------------- "variable"
Γ ⊢ X : ty

Γ ⊢ E1 : T                   // must have the same type
Γ ⊢ E2 : T
--------------- "add"
Γ ⊢ E1 + E2 : T
```

In a-mir-formality, you might write those rules like so:

```rust
judgment_fn! {
    pub fn has_type(
        env: Env,
        expr: Expr,
    ) => Type {
        (
            (env.get(&name) => ty)
            ---------------
            (has_type(env, name: Variable) => ty)
        )

        (
            (has_type(env, left) => ty_left)
            (has_type(env, right) => ty_right)
            (if ty_left == ty_right)
            ---------------
            (has_type(env, Expr::Add(left, right)) => ty_left)
        )
    }
}
```
Unlike mathematical papers, where judgments can look like whatever you want, judgments in a-mir-formality always have a fixed form that distinguishes inputs and outputs:

```
judgment_name(input1, input2, input3) => output
```

In this case, `has_type(env, expr) => ty` is the equivalent of `Γ ⊢ E : T`. Note that, by convention, we tend to use more English-style names, so `env` and not `Γ`, and `expr` and not `E`. Of course nothing stops you from using single letters; it's just a bit harder to read.
When we write the `judgment_fn!`, it is going to desugar into an actual Rust function that looks like this:

```rust
pub fn has_type(arg0: impl Upcast<Env>, arg1: impl Upcast<Expr>) -> Set<Type> {
    let arg0: Env = arg0.upcast();
    let arg1: Expr = arg1.upcast();
    ...
}
```

Some things to note. First, the function arguments (`arg0`, `arg1`) implicitly accept anything that "upcasts" (infallibly converts) into the desired types. `Upcast` is a trait defined within a-mir-formality and implemented by the `#[term]` macro automatically.

Second, the function always returns a `Set`. This is because there can be multiple rules, and more than one of them may match. The generated code is going to exhaustively search to find all the ways that the rules could match. At a high level the code looks like this (at least if we ignore the possibility of cycles; we'll come back to that later):

```rust
pub fn has_type(arg0: impl Upcast<Env>, arg1: impl Upcast<Expr>) -> Set<Type> {
    let arg0: Env = arg0.upcast();
    let arg1: Expr = arg1.upcast();

    let mut results = set![]; // we define this macro

    if /* inference rule 1 matches */ {
        results.push(/* result from inference rule 1 */);
    }

    if /* inference rule 2 matches */ {
        results.push(/* result from inference rule 2 */);
    }

    // ...

    if /* inference rule N matches */ {
        results.push(/* result from inference rule N */);
    }

    results
}
```
So how do we test if a particular inference rule matches? Let's look more closely at the code we would generate for this inference rule:

```
(
    (env.get(&name) => ty)
    ---------------
    (has_type(env, name: Variable) => ty)
)
```

The first part of the final line, `has_type(env, name: Variable)`, defines patterns that are matched against the arguments (`arg0`, `arg1`) from the judgment. Patterns can either be trivial bindings (like `env`) or more complex (like `name: Variable` or `Expr::Add(left, right)`). In the latter case, they don't have to match the type of the argument precisely. Instead, we use the `Downcast` trait combined with pattern matching. So this inference rule would compile to something like...

```rust
// Simple variable bindings just clone...
let env = arg0.clone();

// More complex patterns are downcasted and tested...
if let Some(name) = arg1.downcast::<Variable>() {
    ... // rule successfully matched! See below.
}
```
Once we've matched the arguments, we start trying to execute the inference rule conditions. We have one: `env.get(&name) => ty`. What does that do? A condition written like `$expr => $pat` basically becomes a for loop, so you get...

```rust
let env = arg0.clone();
if let Some(name) = arg1.downcast::<Variable>() {
    for ty in env.get(&name) {
        ... // other conditions, if any
    }
}
```
Once we run out of conditions, we can generate the final result, which comes from the `=> $expr` in the conclusion of the rule. In this case, something like this:

```rust
let env = arg0.clone();
if let Some(name) = arg1.downcast::<Variable>() {
    for ty in env.get(&name) {
        results.push(ty);
    }
}
```
Thus each inference rule is converted into a little block of code that may push results onto the final set.

The second inference rule (for addition) looks like...

```rust
// given this...
//
// (
//     (has_type(env, left) => ty_left)
//     (has_type(env, right) => ty_right)
//     (if ty_left == ty_right)
//     ---------------
//     (has_type(env, Expr::Add(left, right)) => ty_left)
// )
//
// we generate this...

let env = arg0.clone();
if let Some(Expr::Add(left, right)) = arg1.downcast() {
    for ty_left in has_type(env, left) {
        for ty_right in has_type(env, right) {
            if ty_left == ty_right {
                results.push(ty_left);
            }
        }
    }
}
```
If you want to see a real judgment, take a look at the one for proving where clauses.
## Handling cycles

Judgment functions must be inductive, which means that cycles are considered failures. We have a tabling implementation, which means we detect cycles and try to handle them intelligently. Basically, we track a stack and, if a cycle is detected, we return an empty set of results. But we remember that the cycle happened. Then, once we are done, we'll have computed some intermediate set of results `R[0]`, and we execute again. This time, when we get the cycle, we return `R[0]` instead of an empty set. This will compute some new set of results, `R[1]`. So then we try again. We keep doing this until the new set of results `R[i]` is equal to the previous set of results `R[i-1]`. At that point, we have reached a fixed point, so we stop. Of course, it could be that you get an infinitely growing set of results, and execution never terminates. This means your rules are broken. Don't do that.
# FAQ and troubleshooting

## Why am I getting errors about undefined references to `crate::FormalityLang`?

The various derive macros need to know what language you are working in. To figure this out, they reference `crate::FormalityLang`, which you must define. See the chapter on defining your language for more details.