Parsing

Formality's #[term] and #[grammar] attributes let you specify the grammar that will be used to parse your structs/enums.

For structs, there is a single grammar, specified as part of the term:

#![allow(unused)]
fn main() {
#[term( /* grammar here */ )]
struct MyStruct { }
}

For enums, the grammar is placed in #[grammar] attributes on each variant:

#![allow(unused)]
fn main() {
#[term]
enum MyEnum {
    #[grammar( /* grammar here */ )]
    Foo(...),
}
}

Ambiguity and greedy parsing

When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the longest match. However, we don't just consider the number of characters, we look for a reduction prefix:

  • When parsing, we track the list of things we had to parse. If there are two variants at the same precedence level, but one of them had to parse strictly more things than the other and in the same way, we'll prefer the longer one. So for example if one variant parsed a Ty and the other parsed a Ty Ty, we'd take the Ty Ty.
    • When considering whether a reduction is "significant", we take casts into account. See ActiveVariant::mark_as_cast_variant for a more detailed explanation and set of examples.

Precedence and left-recursive grammars

We support left-recursive grammars like this one from the parse-torture-tests:

#![allow(unused)]
fn main() {
#[term]
pub enum Path {
    #[cast]
    Id(Id),

    #[grammar($v0 . $v1)]
    Field(Arc<Path>, Id),

    #[grammar($v0 [ $v1 ])]
    Index(Arc<Path>, Arc<Path>),
}

formality_core::id!(Id);
}

We also support ambiguous grammars. For example, you can code up arithmetic expressions like this:

#![allow(unused)]
fn main() {
#[term]
pub enum Expr {
    #[cast]
    Id(Id),

    #[grammar($v0 + $v1)]
    #[precedence(1)]
    Add(Arc<Expr>, Arc<Expr>),

    #[grammar($v0 * $v1)]
    #[precedence(2)]
    Mul(Arc<Expr>, Arc<Expr>),
}

formality_core::id!(Id);
}

When specifying the #[precedence] of a variant, the default is left-associativity, which can be written more explicitly as #[precedence(L, left)]. If you prefer, you can specify right-associativity (#[precedence(L, right)]) or non-associativity #[precedence(L, none)]. This affects how things of the same level are parsed:

  • 1 + 1 + 1 when left-associative is (1 + 1) + 1
  • 1 + 1 + 1 when right-associative is 1 + (1 + 1)
  • 1 + 1 + 1 when none-associative is an error.

Symbols

A grammar consists of a series of symbols. Each symbol matches some text in the input string. Symbols come in two varieties:

  • Most things are terminals or tokens: this means they just match themselves:
    • For example, the * in #[grammar($v0 * $v1)] is a terminal, and it means to parse a * from the input.
    • Delimeters are accepted but must be matched, e.g., ( /* tokens */ ) or [ /* tokens */ ].
  • Things beginning with $ are nonterminals -- they parse the contents of a field. The grammar for a field is generally determined from its type.
    • If fields have names, then $field should name the field.
    • For position fields (e.g., the T and U in Mul(Expr, Expr)), use $v0, $v1, etc.
    • Exception: $$ is treated as the terminal '$'.
  • Nonterminals have various modes:
    • $field -- just parse the field's type
    • $*field -- the field must be a collection of T (e.g., Vec<T>, Set<T>) -- parse any number of T instances. Something like [ $*field ] would parse [f1 f2 f3], assuming f1, f2, and f3 are valid values for field.
    • $,field -- similar to the above, but uses a comma separated list (with optional trailing comma). So [ $,field ] will parse something like [f1, f2, f3].
    • $?field -- will parse field and use Default::default() value if not present.
    • $<field> -- parse <E1, E2, E3>, where field is a collection of E
    • $<?field> -- parse <E1, E2, E3>, where field is a collection of E, but accept empty string as empty vector
    • $(field) -- parse (E1, E2, E3), where field is a collection of E
    • $(?field) -- parse (E1, E2, E3), where field is a collection of E, but accept empty string as empty vector
    • $[field] -- parse [E1, E2, E3], where field is a collection of E
    • $[?field] -- parse [E1, E2, E3], where field is a collection of E, but accept empty string as empty vector
    • ${field} -- parse {E1, E2, E3}, where field is a collection of E
    • ${?field} -- parse {E1, E2, E3}, where field is a collection of E, but accept empty string as empty vector
    • $:guard <nonterminal> -- parses <nonterminal> but only if the keyword guard is present. For example, $:where $,where_clauses would parse where WhereClause1, WhereClause2, WhereClause3 but would also accept nothing (in which case, you would get an empty vector).

Greediness

Parsing is generally greedy. So $*x and $,x, for example, consume as many entries as they can. Typically this works best if x begins with some symbol that indicates whether it is present.

Default grammar

If no grammar is supplied, the default grammar is determined as follows:

  • If a #[cast] or #[variable] annotation is present, then the default grammar is just $v0.
  • Otherwise, the default grammar is the name of the type (for structs) or variant (for enums), followed by (), with the values for the fields in order. So Mul(Expr, Expr) would have a default grammar mul($v0, $v1).

Customizing the parse

If you prefer, you can customize the parse by annotating your term with #[customize(parse)]. In the Rust case, for example, the parsing of RigidTy is customized (as is the debug impl):

#![allow(unused)]
fn main() {
#[term((rigid $name $*parameters))]
#[customize(parse, debug)]
pub struct RigidTy {
    pub name: RigidName,
    pub parameters: Parameters,
}
}

You must then supply an impl of Parse yourself. Because Parse is a trait alias, you actually want to implement CoreParse<L> for your language type L. Inside the code you will want to instantiate a Parser and then invoke parse_variant for every variant, finally invoking finish.

In the Rust code, the impl for RigidTy looks as follows:

#![allow(unused)]
fn main() {
// Implement custom parsing for rigid types.
impl CoreParse<Rust> for RigidTy {
    fn parse<'t>(scope: &Scope<Rust>, text: &'t str) -> ParseResult<'t, Self> {
        Parser::multi_variant(scope, text, "RigidTy", |parser| {
            // Parse a `ScalarId` (and upcast it to `RigidTy`) with the highest
            // precedence. If someone writes `u8`, we always interpret it as a
            // scalar-id.
            parser.parse_variant_cast::<ScalarId>(Precedence::default());

            // Parse something like `Id<...>` as an ADT.
            parser.parse_variant("Adt", Precedence::default(), |p| {
                // Don't accept scalar-ids as Adt names.
                p.reject_nonterminal::<ScalarId>()?;

                let name: AdtId = p.nonterminal()?;
                let parameters: Vec<Parameter> = parse_parameters(p)?;
                Ok(RigidTy {
                    name: name.upcast(),
                    parameters,
                })
            });

            // Parse `&`
            parser.parse_variant("Ref", Precedence::default(), |p| {
                p.expect_char('&')?;
                let lt: Lt = p.nonterminal()?;
                let ty: Ty = p.nonterminal()?;
                Ok(RigidTy {
                    name: RigidName::Ref(RefKind::Shared),
                    parameters: seq![lt.upcast(), ty.upcast()],
                })
            });

            parser.parse_variant("RefMut", Precedence::default(), |p| {
                p.expect_char('&')?;
                p.expect_keyword("mut")?;
                let lt: Lt = p.nonterminal()?;
                let ty: Ty = p.nonterminal()?;
                Ok(RigidTy {
                    name: RigidName::Ref(RefKind::Mut),
                    parameters: seq![lt.upcast(), ty.upcast()],
                })
            });

            parser.parse_variant("Tuple", Precedence::default(), |p| {
                p.expect_char('(')?;
                p.reject_custom_keywords(&["alias", "rigid", "predicate"])?;
                let types: Vec<Ty> = p.comma_nonterminal()?;
                p.expect_char(')')?;
                let name = RigidName::Tuple(types.len());
                Ok(RigidTy {
                    name,
                    parameters: types.upcast(),
                })
            });
        })
    }
}
}