Commit 8298556

nikomatsakis (Niko Matsakis) authored and committed
introduce the idea of "commit points"
1 parent 57126c9 commit 8298556

7 files changed: +215 -22 lines changed

book/src/formality_core/parse.md

Lines changed: 56 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -19,7 +19,18 @@ enum MyEnum {
1919
}
2020
```
2121

22-
### Ambiguity and greedy parsing
22+
### Succeeding, failing, and _almost_ succeeding
23+
24+
When you attempt to parse something, you'll get back a `Result`: either the parse succeeded (`Ok`), or it didn't (`Err`). But we actually distinguish three outcomes:
25+
26+
- Success: we parsed a value successfully. We generally implement a **greedy** parse, which means we will attempt to consume as many things as we can. As a simple example, imagine you are parsing a list of numbers. If the input is `"1, 2, 3"`, we could choose to parse just `[1, 2]` (or indeed just `[1]`), but we will instead parse the full list.
27+
- For you parsing nerds, this is analogous to the commonly used rule to prefer shifts over reduces in LR parsers.
28+
- Failure: we tried to parse the value, but it clearly did not correspond to the thing we are looking for. This usually means that the first token was not a valid first token. This will give a not-very-helpful error message like "expected an `Expr`" (assuming we are parsing an `Expr`).
29+
- _Almost_ succeeded: this is a special case of failure where we got part-way through parsing, consuming some tokens, but then encountered an error. So for example if we had an input like `"1 / / 3"`, we might give an error like "expected an `Expr`, found `/`". Exactly how many tokens we have to consume before we consider something to have 'almost' succeeded depends on the thing we are parsing (see the discussion on _commit points_ below).
30+
31+
Both failure and 'almost' succeeding correspond to a return value of `Err`. The difference is in the errors contained in the result. If there is a single error and it occurs at the start of the input (possibly after skipping whitespace), that is considered **failure**. Otherwise the parse "almost" succeeded. The distinction between failure and "almost" succeeding helps us to give better error messages, but it is also important for "optional" parsing or when parsing repeated items.
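
As an aside, the classification rule in the paragraph above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the formality-core implementation: `ParseError`, `Outcome`, and `classify` are hypothetical names, and the sketch assumes each error records the input that remained when it occurred.

```rust
/// Hypothetical, simplified error type (not the formality-core one).
/// `remaining` is the suffix of the input at which the error occurred.
#[derive(Debug, Clone, PartialEq)]
struct ParseError<'t> {
    remaining: &'t str,
    message: String,
}

/// The two flavors of `Err` described in the text.
#[derive(Debug, PartialEq)]
enum Outcome {
    Failure,
    AlmostSucceeded,
}

/// Classify the errors from a parse that returned `Err`.
/// `errors` is assumed to be non-empty, and every `remaining` is assumed
/// to be a suffix of `start`.
fn classify(start: &str, errors: &[ParseError<'_>]) -> Outcome {
    // Skipping leading whitespace does not count as consuming a token.
    let start_len = start.trim_start().len();
    match errors {
        // A single error located at the start of the input: outright failure.
        [only] if only.remaining.trim_start().len() == start_len => Outcome::Failure,
        // Anything else means we got part-way through: "almost" succeeded.
        _ => Outcome::AlmostSucceeded,
    }
}
```

With the input `"1 / / 3"`, a single error whose `remaining` is the whole input classifies as `Failure`, while an error located at the second `/` classifies as `AlmostSucceeded`.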
32+
33+
### Resolving ambiguity, greedy parsing
2334

2435
When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the **longest match**. However, we don't just consider the number of characters, we look for a **reduction prefix**:
2536

@@ -53,11 +64,10 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
5364
- Most things are _terminals_ or _tokens_: this means they just match themselves:
5465
- For example, the `*` in `#[grammar($v0 * $v1)]` is a terminal, and it means to parse a `*` from the input.
5566
- Delimiters are accepted but must be matched, e.g., `( /* tokens */ )` or `[ /* tokens */ ]`.
56-
- Things beginning with `$` are _nonterminals_ -- they parse the contents of a field. The grammar for a field is generally determined from its type.
67+
- The `$` character is used to introduce special matches. Generally these are _nonterminals_, which means they parse the contents of a field, where the grammar for a field is determined by its type.
5768
- If fields have names, then `$field` should name the field.
5869
- For positional fields (e.g., the two `Expr` fields in `Mul(Expr, Expr)`), use `$v0`, `$v1`, etc.
59-
- Exception: `$$` is treated as the terminal `'$'`.
60-
- Nonterminals have various modes:
70+
- Valid uses of `$` are as follows:
6171
- `$field` -- just parse the field's type
6272
- `$*field` -- the field must be a collection of `T` (e.g., `Vec<T>`, `Set<T>`) -- parse any number of `T` instances. Something like `[ $*field ]` would parse `[f1 f2 f3]`, assuming `f1`, `f2`, and `f3` are valid values for `field`.
6373
- `$,field` -- similar to the above, but uses a comma separated list (with optional trailing comma). So `[ $,field ]` will parse something like `[f1, f2, f3]`.
@@ -71,10 +81,50 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
7181
- `${field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`
7282
- `${?field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`, but accept empty string as empty vector
7383
- `$:guard <nonterminal>` -- parses `<nonterminal>` but only if the keyword `guard` is present. For example, `$:where $,where_clauses` would parse `where WhereClause1, WhereClause2, WhereClause3` but would also accept nothing (in which case, you would get an empty vector).
84+
- `$!` -- marks a commit point, see the section on commit points and greedy parsing below
85+
- `$$` -- parse the terminal `$`
86+
87+
### Commit points and greedy parsing
88+
89+
When you parse an optional (e.g., `$?field`) or repeated (e.g., `$*field`) nonterminal, it raises an interesting question. We will attempt to parse the given field, but how do we treat an error? It could mean that the field is not present, but it also could mean a syntax error on the part of the user. To resolve this, we make use of the distinction between failure and _almost_ succeeding that we introduced earlier:
90+
91+
- If parsing `field` outright **fails**, that means that the field was not present, and so the parse can continue with the field having its `Default::default()` value.
92+
- If parsing `field` **almost succeeds**, then we assume it was present, but there is a syntax error, and so parsing fails.
93+
94+
The default rule is that parsing "almost" succeeds if it consumes at least one token. So e.g. if you had...
95+
96+
```rust
97+
#[term]
98+
enum Projection {
99+
#[grammar(. $v0)]
100+
Field(Id),
101+
}
102+
```
74103

75-
### Greediness
104+
...and you tried to parse `".#"`, that would "almost" succeed, because it would consume the `.` but then fail to find an identifier.
105+
106+
Sometimes this rule is not quite right. For example, maybe the `Projection` type is embedded in another type like
107+
108+
```rust
109+
#[term($*projections . #)]
110+
struct ProjectionsThenHash {
111+
projections: Vec<Projection>,
112+
}
113+
```
114+
115+
For `ProjectionsThenHash`, we would consider `".#"` to be a valid parse -- it starts out with no projections and then parses `.#`. But if you try this, you will get an error because the `.#` is considered to be an "almost success" of a projection.
116+
117+
You can control this by indicating a "commit point" with `$!`. If `$!` is present, the parse is considered a failure unless the commit point has been reached. For our grammar above, modifying `Projection` to have a commit point _after_ the identifier will let `ProjectionsThenHash` parse as expected:
118+
119+
```rust
120+
#[term]
121+
enum Projection {
122+
#[grammar(. $v0 $!)]
123+
Field(Id),
124+
}
125+
```
76126

77-
Parsing is generally greedy. So `$*x` and `$,x`, for example, consume as many entries as they can. Typically this works best if `x` begins with some symbol that indicates whether it is present.
127+
See the `parser_torture_tests::commit_points` code for an example of this in action.
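
For readers who want to see the rule in isolation, here is a hand-written toy model of the `Projection` / `ProjectionsThenHash` example. It is only a sketch: every function and type below is hypothetical, it does not use the formality-core API or the generated parsers, and it ignores whitespace. The `commit_after_id` flag stands in for the presence of `$!` after the identifier.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct ParseError {
    /// Byte offset into the input where the error occurred.
    offset: usize,
    /// Whether the variant had reached its commit point when it failed.
    committed: bool,
}

/// Models `. $v0` (commit_after_id = false) or `. $v0 $!` (commit_after_id = true).
/// Returns the identifier and the offset just past it.
fn parse_projection(
    input: &str,
    pos: usize,
    commit_after_id: bool,
) -> Result<(String, usize), ParseError> {
    // With a `$!` marker the variant starts out uncommitted; without one it
    // starts out committed. In this grammar no error can occur after `$!`.
    let committed = !commit_after_id;
    if !input[pos..].starts_with('.') {
        return Err(ParseError { offset: pos, committed });
    }
    let after_dot = pos + 1;
    let id_len = input[after_dot..]
        .chars()
        .take_while(|c| c.is_ascii_alphabetic())
        .count();
    if id_len == 0 {
        return Err(ParseError { offset: after_dot, committed });
    }
    let end = after_dot + id_len;
    Ok((input[after_dot..end].to_string(), end))
}

/// Models `$*projections`: keep parsing until a projection fails outright.
fn parse_projections(
    input: &str,
    mut pos: usize,
    commit_after_id: bool,
) -> Result<(Vec<String>, usize), ParseError> {
    let mut out = Vec::new();
    loop {
        match parse_projection(input, pos, commit_after_id) {
            Ok((id, next)) => {
                out.push(id);
                pos = next;
            }
            // "Almost" succeeded: consumed tokens while committed, so propagate.
            Err(e) if e.committed && e.offset > pos => return Err(e),
            // Outright failure: there is no further projection here.
            Err(_) => return Ok((out, pos)),
        }
    }
}

/// Models `#[term($*projections . #)]`.
fn parse_projections_then_hash(
    input: &str,
    commit_after_id: bool,
) -> Result<Vec<String>, ParseError> {
    let (projections, pos) = parse_projections(input, 0, commit_after_id)?;
    if &input[pos..] == ".#" {
        Ok(projections)
    } else {
        Err(ParseError { offset: pos, committed: true })
    }
}
```

In this model, `parse_projections_then_hash(".#", false)` returns an error because the `.` is consumed by a committed `Projection`, whereas `parse_projections_then_hash(".#", true)` returns an empty list of projections, mirroring the behavior described above.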
78128

79129
### Default grammar
80130

crates/formality-core/src/parse/parser.rs

Lines changed: 42 additions & 5 deletions
@@ -153,6 +153,9 @@ where
153153
current_text: &'t str,
154154
reductions: Vec<&'static str>,
155155
is_cast_variant: bool,
156+
157+
/// A variant is 'committed' when it has parsed enough tokens that any parse errors should propagate out (see `set_committed`)
158+
is_committed: bool,
156159
}
157160

158161
impl<'s, 't, T, L> Parser<'s, 't, T, L>
@@ -289,11 +292,13 @@ where
289292
// We only record failures where we actually consumed any tokens.
290293
// This is part of our error reporting and recovery mechanism.
291294
// Note that we expect (loosely) an LL(1) grammar.
292-
self.failures.extend(
293-
errs.into_iter()
294-
.filter(|e| e.consumed_any_since(self.start_text))
295-
.inspect(|e| tracing::trace!("error: {e:?}")),
296-
);
295+
if active_variant.is_committed {
296+
self.failures.extend(
297+
errs.into_iter()
298+
.filter(|e| e.consumed_any_since(self.start_text))
299+
.inspect(|e| tracing::trace!("error: {e:?}")),
300+
);
301+
}
297302
}
298303
}
299304
}
@@ -380,8 +385,30 @@ where
380385
current_text: start_text,
381386
reductions: vec![],
382387
is_cast_variant: false,
388+
is_committed: true,
383389
}
384390
}
391+
392+
/// A variant is "committed" when it has parsed enough tokens
393+
/// for us to be reasonably sure this is what the user meant to type.
394+
/// At that point, any parse errors will propagate out.
395+
/// This is important for optional or repeated nonterminals.
396+
///
397+
/// By default, variants start with committed set to true.
398+
/// You can clear it to false explicitly and set it back to true later
399+
/// once you've seen enough parsing.
400+
///
401+
/// Regardless of the value of this flag, any error that occurs before
402+
/// we have consumed any tokens at all will be considered uncommitted.
403+
///
404+
/// With auto-generated parsers, this flag is used to implement the `$!`
405+
/// marker. If that marker is present, we set committed to false initially,
406+
/// and then set it to true when we encounter a `$!`.
407+
pub fn set_committed(&mut self, value: bool) {
408+
tracing::trace!("set_committed({})", value);
409+
self.is_committed = value;
410+
}
411+
385412
fn current_state(&self) -> CurrentState {
386413
// Determine whether we are in Left or Right position -- Left means
387414
// that we have not yet consumed any tokens. Right means that we have.
@@ -542,6 +569,7 @@ where
542569
reductions: vec![],
543570
scope: self.scope,
544571
is_cast_variant: false,
572+
is_committed: true,
545573
};
546574

547575
match op(&mut this) {
@@ -601,6 +629,7 @@ where
601629
current_text: self.current_text,
602630
reductions: vec![],
603631
is_cast_variant: false,
632+
is_committed: true,
604633
};
605634
let result = op(&mut av);
606635
self.current_text = av.current_text;
@@ -746,8 +775,16 @@ where
746775
// If no errors consumed anything, then self.text
747776
// must not have advanced.
748777
assert_eq!(skip_whitespace(text0), self.current_text);
778+
tracing::trace!(
779+
"opt_nonterminal({}): parsing did not consume tokens or did not commit",
780+
std::any::type_name::<T>()
781+
);
749782
Ok(None)
750783
} else {
784+
tracing::trace!(
785+
"opt_nonterminal({}): 'almost' succeeded with parse",
786+
std::any::type_name::<T>()
787+
);
751788
Err(errs)
752789
}
753790
}

crates/formality-macros/src/debug.rs

Lines changed: 2 additions & 0 deletions
@@ -194,6 +194,8 @@ fn debug_variant_with_attr(
194194
stream.extend(match op {
195195
spec::FormalitySpecSymbol::Field { name, mode } => debug_field_with_mode(name, mode),
196196

197+
spec::FormalitySpecSymbol::CommitPoint => quote!(),
198+
197199
spec::FormalitySpecSymbol::Keyword { ident } => {
198200
let literal = as_literal(ident);
199201
quote_spanned!(ident.span() =>

crates/formality-macros/src/parse.rs

Lines changed: 20 additions & 6 deletions
@@ -177,14 +177,23 @@ fn parse_variant_with_attr(
177177
spec: &FormalitySpec,
178178
mut stream: TokenStream,
179179
) -> syn::Result<TokenStream> {
180+
// If the user added a commit point, then clear the committed flag initially.
181+
// It will be set back to true once we reach that commit point.
182+
if spec
183+
.symbols
184+
.iter()
185+
.any(|s| matches!(s, spec::FormalitySpecSymbol::CommitPoint))
186+
{
187+
stream.extend(quote!(__p.set_committed(false);));
188+
}
189+
180190
for symbol in &spec.symbols {
181191
stream.extend(match symbol {
182192
spec::FormalitySpecSymbol::Field { name, mode } => {
183193
let initializer = parse_field_mode(name.span(), mode);
184-
quote_spanned! {
185-
name.span() =>
194+
quote_spanned!(name.span() =>
186195
let #name = #initializer;
187-
}
196+
)
188197
}
189198

190199
spec::FormalitySpecSymbol::Keyword { ident } => {
@@ -194,17 +203,22 @@ fn parse_variant_with_attr(
194203
)
195204
}
196205

206+
spec::FormalitySpecSymbol::CommitPoint => {
207+
quote!(
208+
let () = __p.set_committed(true);
209+
)
210+
}
211+
197212
spec::FormalitySpecSymbol::Char { punct } => {
198213
let literal = Literal::character(punct.as_char());
199-
quote_spanned!(
200-
punct.span() =>
214+
quote_spanned!(punct.span() =>
201215
__p.expect_char(#literal)?;
202216
)
203217
}
204218

205219
spec::FormalitySpecSymbol::Delimeter { text } => {
206220
let literal = Literal::character(*text);
207-
quote!(
221+
quote_spanned!(literal.span() =>
208222
__p.expect_char(#literal)?;
209223
)
210224
}

crates/formality-macros/src/spec.rs

Lines changed: 12 additions & 5 deletions
@@ -21,10 +21,13 @@ pub enum FormalitySpecSymbol {
2121
/// `$foo` or `$foo*` -- indicates we should parse the type of the given field.
2222
Field { name: Ident, mode: FieldMode },
2323

24+
/// `$!` -- indicates where a parse is considered to 'almost' succeed
25+
CommitPoint,
26+
2427
/// `foo` -- indicates we should parse the given keyword.
2528
Keyword { ident: Ident },
2629

27-
/// `<` -- indicates we should parse the given char. We currently ignoring the spacing rules.
30+
/// `<` -- indicates we should parse the given char. We currently ignore the spacing rules.
2831
Char { punct: Punct },
2932

3033
/// Specific delimeter (e.g., `(`) we should parse.
@@ -36,10 +39,8 @@ pub enum FieldMode {
3639
/// $x -- just parse `x`
3740
Single,
3841

39-
Guarded {
40-
guard: Ident,
41-
mode: Arc<FieldMode>,
42-
},
42+
/// $:ident $nt -- try to parse `ident` and, if present, parse `$nt`
43+
Guarded { guard: Ident, mode: Arc<FieldMode> },
4344

4445
/// $<x> -- `x` is a `Vec<E>`, parse `<E0,...,En>`
4546
/// $[x] -- `x` is a `Vec<E>`, parse `[E0,...,En]`
@@ -151,6 +152,12 @@ fn parse_variable_binding(
151152
Ok(FormalitySpecSymbol::Char { punct })
152153
}
153154

155+
// $!
156+
TokenTree::Punct(punct) if punct.as_char() == '!' => {
157+
tokens.next();
158+
Ok(FormalitySpecSymbol::CommitPoint)
159+
}
160+
154161
// $,x
155162
TokenTree::Punct(punct) if punct.as_char() == ',' => {
156163
tokens.next();
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
1+
//! Test to demonstrate the value of commit points.
2+
//! We distinguish three states when parsing a nonterminal `X`:
3+
//!
4+
//! * Failed completely -- there is no `X` here
5+
//! * Almost succeeded -- there is an `X` here, but it has syntax errors
6+
//! * Succeeded -- there is an `X` here
7+
//!
8+
//! Distinguishing the first two is an art, not a science.
9+
//!
10+
//! A typical parser combinator just distinguishes the first and the last
11+
//! and doesn't have a concept of "almost succeeded", but this makes for
12+
//! significantly worse error messages and less predictable parsing.
13+
//!
14+
//! By default, we say that a parse "almost" succeeds if it consumes
15+
//! any tokens at all. This corresponds to LL(1) grammars. But sometimes
16+
//! it's not good enough!
17+
//!
18+
//! In this test, the `Place` grammar consumes as many `. <id>` projections
19+
//! as it can. But if we consider consuming `.` alone to be enough for a
20+
//! projection to "almost" succeed, we can't parse `$expr.let`. Note that `let`
21+
//! is a keyword, so that is not parsable as a `Place`.
22+
use formality_core::{term, test};
23+
use std::sync::Arc;
24+
25+
#[term]
26+
pub enum Expr {
27+
#[cast]
28+
Place(Place),
29+
30+
#[grammar($v0 . let)]
31+
Let(Arc<Expr>),
32+
}
33+
34+
#[term($var $*projections)]
35+
pub struct Place {
36+
var: Id,
37+
projections: Vec<Projection>,
38+
}
39+
40+
#[term]
41+
pub enum Projection {
42+
#[grammar(. $v0 $!)]
43+
Field(Id),
44+
}
45+
46+
formality_core::id!(Id);
47+
48+
/// Check that we can parse `a.b` as a `Field`
49+
#[test]
50+
fn test_parse_field() {
51+
let e: Expr = crate::ptt::term("a.b");
52+
expect_test::expect![[r#"
53+
Place(
54+
Place {
55+
var: a,
56+
projections: [
57+
Field(
58+
b,
59+
),
60+
],
61+
},
62+
)
63+
"#]]
64+
.assert_debug_eq(&e);
65+
}
66+
67+
/// Check that we can parse `a.let` as a `Let`
68+
#[test]
69+
fn test_parse_let() {
70+
let e: Expr = crate::ptt::term("a.let");
71+
expect_test::expect![[r#"
72+
Let(
73+
Place(
74+
Place {
75+
var: a,
76+
projections: [],
77+
},
78+
),
79+
)
80+
"#]]
81+
.assert_debug_eq(&e);
82+
}

tests/parser-torture-tests/main.rs

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
11
mod ambiguity;
2+
mod commit_points;
23
mod grammar;
34
mod left_associative;
45
mod none_associative;
