introduce the idea of "commit points"

nikomatsakis · Niko Matsakis · commit 82985560ce4e · 2025-02-24T20:48:00.000Z
diff --git a/book/src/formality_core/parse.md b/book/src/formality_core/parse.md
@@ -19,7 +19,18 @@ enum MyEnum {
 }
 ```
 
-### Ambiguity and greedy parsing
+### Succeeding, failing, and _almost_ succeeding
+
+When you attempt to parse something, you'll get back a `Result`: either the parse succeeded (`Ok`), or it didn't (`Err`). But we actually distinguish three outcomes:
+
+- Success: we parsed a value successfully. We generally implement a **greedy** parse, which means we will attempt to consume as many things as we can. As a simple example, imagine you are parsing a list of numbers. If the input is `"1, 2, 3"`, we could choose to parse just `[1, 2]` (or indeed just `[1]`), but we will instead parse the full list.
+  - For you parsing nerds, this is analogous to the commonly used rule to prefer shifts over reduces in LR parsers.
+- Failure: we tried to parse the value, but it clearly did not correspond to the thing we are looking for. This usually means that the first token was not a valid first token. This will give a not-very-helpful error message like "expected an `Expr`" (assuming we are parsing an `Expr`).
+- _Almost_ succeeded: this is a special case of failure where we got part-way through parsing, consuming some tokens, but then encountered an error. So for example if we had an input like `"1 / / 3"`, we might give an error like "expected an `Expr`, found `/`". Exactly how many tokens we have to consume before we consider something to have 'almost' succeeded depends on the thing we are parsing (see the discussion on _commit points_ below).
+
+Both failure and 'almost' succeeding correspond to a return value of `Err`. The difference is in the errors contained in the result. If there is a single error and it occurs at the start of the input (possibly after skipping whitespace), that is considered **failure**. Otherwise the parse "almost" succeeded. The distinction between failure and "almost" succeeding helps us to give better error messages, but it is also important for "optional" parsing or when parsing repeated items.
+
+### Resolving ambiguity, greedy parsing
 
 When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the **longest match**. However, we don't just consider the number of characters, we look for a **reduction prefix**:
 
@@ -53,11 +64,10 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
 - Most things are _terminals_ or _tokens_: this means they just match themselves:
   - For example, the `*` in `#[grammar($v0 * $v1)]` is a terminal, and it means to parse a `*` from the input.
   - Delimeters are accepted but must be matched, e.g., `( /* tokens */ )` or `[ /* tokens */ ]`.
-- Things beginning with `$` are _nonterminals_ -- they parse the contents of a field. The grammar for a field is generally determined from its type.
+- The `$` character is used to introduce special matches. Generally these are _nonterminals_, which means they parse the contents of a field, where the grammar for a field is determined by its type.
   - If fields have names, then `$field` should name the field.
   - For position fields (e.g., the T and U in `Mul(Expr, Expr)`), use `$v0`, `$v1`, etc.
-  - Exception: `$$` is treated as the terminal `'$'`.
-- Nonterminals have various modes:
+- Valid uses of `$` are as follows:
   - `$field` -- just parse the field's type
   - `$*field` -- the field must be a collection of `T` (e.g., `Vec<T>`, `Set<T>`) -- parse any number of `T` instances. Something like `[ $*field ]` would parse `[f1 f2 f3]`, assuming `f1`, `f2`, and `f3` are valid values for `field`.
   - `$,field` -- similar to the above, but uses a comma separated list (with optional trailing comma). So `[ $,field ]` will parse something like `[f1, f2, f3]`.
@@ -71,10 +81,50 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
   - `${field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`
   - `${?field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`, but accept empty string as empty vector
   - `$:guard <nonterminal>` -- parses `<nonterminal>` but only if the keyword `guard` is present. For example, `$:where $,where_clauses` would parse `where WhereClause1, WhereClause2, WhereClause3` but would also accept nothing (in which case, you would get an empty vector).
+  - `$!` -- marks a commit point, see the section on commit points below
+  - `$$` -- parse the terminal `$`
+
+### Commit points and greedy parsing
+
+When you parse an optional (e.g., `$?field`) or repeated (e.g., `$*field`) nonterminal, it raises an interesting question. We will attempt to parse the given field, but how do we treat an error? It could mean that the field is not present, but it also could mean a syntax error on the part of the user. To resolve this, we make use of the distinction between failure and _almost_ succeeding that we introduced earlier:
+
+- If parsing `field` outright **fails**, that means that the field was not present, and so the parse can continue with the field having its `Default::default()` value.
+- If parsing `field` **almost succeeds**, then we assume it was present, but there is a syntax error, and so parsing fails.
+
+The default rule is that parsing "almost" succeeds if it consumes at least one token. So e.g. if you had...
+
+```rust
+#[term]
+enum Projection {
+  #[grammar(. $v0)]
+  Field(Id),
+}
+```
 
-### Greediness
+...and you tried to parse `".#"`, that would "almost" succeed, because it would consume the `.` but then fail to find an identifier.
+
+Sometimes this rule is not quite right. For example, maybe the `Projection` type is embedded in another type like
+
+```rust
+#[term($*projections . #)]
+struct ProjectionsThenHash {
+  projections: Vec<Projection>,
+}
+```
+
+For `ProjectionsThenHash`, we would consider `".#"` to be a valid parse -- it starts out with no projections and then parses `.#`. But if you try this, you will get an error because the `.#` is considered to be an "almost success" of a projection.
+
+You can control this by indicating a "commit point" with `$!`. If `$!` is present, the parse is considered a failure unless the commit point has been reached. For our grammar above, modifying `Projection` to have a commit point _after_ the identifier will let `ProjectionsThenHash` parse as expected:
+
+```rust
+#[term]
+enum Projection {
+  #[grammar(. $v0 $!)]
+  Field(Id),
+}
+```
 
-Parsing is generally greedy. So `$*x` and `$,x`, for example, consume as many entries as they can. Typically this works best if `x` begins with some symbol that indicates whether it is present.
+See the `parser_torture_tests::commit_points` code for an example of this in action.
 
 ### Default grammar
 
diff --git a/crates/formality-core/src/parse/parser.rs b/crates/formality-core/src/parse/parser.rs
@@ -153,6 +153,9 @@ where
     current_text: &'t str,
     reductions: Vec<&'static str>,
     is_cast_variant: bool,
+
+    /// True once this variant has parsed enough for its errors to be reported (see `set_committed`).
+    is_committed: bool,
 }
 
 impl<'s, 't, T, L> Parser<'s, 't, T, L>
@@ -289,11 +292,13 @@ where
                 // We only record failures where we actually consumed any tokens.
                 // This is part of our error reporting and recovery mechanism.
                 // Note that we expect (loosely) an LL(1) grammar.
-                self.failures.extend(
-                    errs.into_iter()
-                        .filter(|e| e.consumed_any_since(self.start_text))
-                        .inspect(|e| tracing::trace!("error: {e:?}")),
-                );
+                if active_variant.is_committed {
+                    self.failures.extend(
+                        errs.into_iter()
+                            .filter(|e| e.consumed_any_since(self.start_text))
+                            .inspect(|e| tracing::trace!("error: {e:?}")),
+                    );
+                }
             }
         }
     }
@@ -380,8 +385,30 @@ where
             current_text: start_text,
             reductions: vec![],
             is_cast_variant: false,
+            is_committed: true,
         }
     }
+
+    /// A variant is "committed" when it has parsed enough tokens
+    /// for us to be reasonably sure this is what the user meant to type.
+    /// At that point, any parse errors will propagate out.
+    /// This is important for optional or repeated nonterminals.
+    ///
+    /// By default, variants start with committed set to true.
+    /// You can clear it to false explicitly and set it back to true later
+    /// once you've seen enough parsing.
+    ///
+    /// Regardless of the value of this flag, any error that occurs before
+    /// we have consumed any tokens at all will be considered uncommitted.
+    ///
+    /// With auto-generated parsers, this flag is used to implement the `$!`
+    /// marker. If that marker is present, we set committed to false initially,
+    /// and then set it to true when we encounter a `$!`.
+    pub fn set_committed(&mut self, value: bool) {
+        tracing::trace!("set_committed({})", value);
+        self.is_committed = value;
+    }
+
     fn current_state(&self) -> CurrentState {
         // Determine whether we are in Left or Right position -- Left means
         // that we have not yet consumed any tokens. Right means that we have.
@@ -542,6 +569,7 @@ where
             reductions: vec![],
             scope: self.scope,
             is_cast_variant: false,
+            is_committed: true,
         };
 
         match op(&mut this) {
@@ -601,6 +629,7 @@ where
             current_text: self.current_text,
             reductions: vec![],
             is_cast_variant: false,
+            is_committed: true,
         };
         let result = op(&mut av);
         self.current_text = av.current_text;
@@ -746,8 +775,16 @@ where
                     // If no errors consumed anything, then self.text
                     // must not have advanced.
                     assert_eq!(skip_whitespace(text0), self.current_text);
+                    tracing::trace!(
+                        "opt_nonterminal({}): parsing did not consume tokens or did not commit",
+                        std::any::type_name::<T>()
+                    );
                     Ok(None)
                 } else {
+                    tracing::trace!(
+                        "opt_nonterminal({}): 'almost' succeeded with parse",
+                        std::any::type_name::<T>()
+                    );
                     Err(errs)
                 }
             }
diff --git a/crates/formality-macros/src/debug.rs b/crates/formality-macros/src/debug.rs
@@ -194,6 +194,8 @@ fn debug_variant_with_attr(
         stream.extend(match op {
             spec::FormalitySpecSymbol::Field { name, mode } => debug_field_with_mode(name, mode),
 
+            spec::FormalitySpecSymbol::CommitPoint => quote!(),
+
             spec::FormalitySpecSymbol::Keyword { ident } => {
                 let literal = as_literal(ident);
                 quote_spanned!(ident.span() =>
diff --git a/crates/formality-macros/src/parse.rs b/crates/formality-macros/src/parse.rs
@@ -177,14 +177,23 @@ fn parse_variant_with_attr(
     spec: &FormalitySpec,
     mut stream: TokenStream,
 ) -> syn::Result<TokenStream> {
+    // If the user added a commit point, then clear the committed flag initially.
+    // It will be set back to true once we reach that commit point.
+    if spec
+        .symbols
+        .iter()
+        .any(|s| matches!(s, spec::FormalitySpecSymbol::CommitPoint))
+    {
+        stream.extend(quote!(__p.set_committed(false);));
+    }
+
     for symbol in &spec.symbols {
         stream.extend(match symbol {
             spec::FormalitySpecSymbol::Field { name, mode } => {
                 let initializer = parse_field_mode(name.span(), mode);
-                quote_spanned! {
-                    name.span() =>
+                quote_spanned!(name.span() =>
                     let #name = #initializer;
-                }
+                )
             }
 
             spec::FormalitySpecSymbol::Keyword { ident } => {
@@ -194,17 +203,22 @@ fn parse_variant_with_attr(
                 )
             }
 
+            spec::FormalitySpecSymbol::CommitPoint => {
+                quote!(
+                    let () = __p.set_committed(true);
+                )
+            }
+
             spec::FormalitySpecSymbol::Char { punct } => {
                 let literal = Literal::character(punct.as_char());
-                quote_spanned!(
-                    punct.span() =>
+                quote_spanned!(punct.span() =>
                     __p.expect_char(#literal)?;
                 )
             }
 
             spec::FormalitySpecSymbol::Delimeter { text } => {
                 let literal = Literal::character(*text);
-                quote!(
+                quote_spanned!(literal.span() =>
                     __p.expect_char(#literal)?;
                 )
             }
diff --git a/crates/formality-macros/src/spec.rs b/crates/formality-macros/src/spec.rs
@@ -21,10 +21,13 @@ pub enum FormalitySpecSymbol {
     /// `$foo` or `$foo*` -- indicates we should parse the type of the given field.
     Field { name: Ident, mode: FieldMode },
 
+    /// `$!` -- indicates where a parse is considered to 'almost' succeed
+    CommitPoint,
+
     /// `foo` -- indicates we should parse the given keyword.
     Keyword { ident: Ident },
 
-    /// `<` -- indicates we should parse the given char. We currently ignoring the spacing rules.
+    /// `<` -- indicates we should parse the given char. We currently ignore the spacing rules.
     Char { punct: Punct },
 
     /// Specific delimeter (e.g., `(`) we should parse.
@@ -36,10 +39,8 @@ pub enum FieldMode {
     /// $x -- just parse `x`
     Single,
 
-    Guarded {
-        guard: Ident,
-        mode: Arc<FieldMode>,
-    },
+    /// $:ident $nt -- try to parse `ident` and, if present, parse `$nt`
+    Guarded { guard: Ident, mode: Arc<FieldMode> },
 
     /// $<x> -- `x` is a `Vec<E>`, parse `<E0,...,En>`
     /// $[x] -- `x` is a `Vec<E>`, parse `[E0,...,En]`
@@ -151,6 +152,12 @@ fn parse_variable_binding(
             Ok(FormalitySpecSymbol::Char { punct })
         }
 
+        // $!
+        TokenTree::Punct(punct) if punct.as_char() == '!' => {
+            tokens.next();
+            Ok(FormalitySpecSymbol::CommitPoint)
+        }
+
         // $,x
         TokenTree::Punct(punct) if punct.as_char() == ',' => {
             tokens.next();
diff --git a/tests/parser-torture-tests/commit_points.rs b/tests/parser-torture-tests/commit_points.rs
@@ -0,0 +1,82 @@
+//! Test to demonstrate the value of commit points.
+//! We distinguish three states when parsing a nonterminal `X`:
+//!
+//! * Failed completely -- there is no `X` here
+//! * Almost succeeded -- there is an `X` here, but it has syntax errors
+//! * Succeeded -- there is an `X` here
+//!
+//! Distinguishing the first two is an art, not a science.
+//!
+//! A typical parser combinator just distinguishes the first and the last
+//! and doesn't have a concept of "almost succeeded", but this makes for
+//! significantly worse error messages and less predictable parsing.
+//!
+//! By default, we say that a parse "almost" succeeds if it consumes
+//! any tokens at all. This corresponds to LL(1) grammars. But sometimes
+//! it's not good enough!
+//!
+//! In this test, the `Place` grammar consumes as many `. <id>` projections
+//! as it can. But if we consider consuming `.` alone to be enough for a
+//! projection to "almost" succeed, we can't parse `$expr.let`. Note that `let`
+//! is a keyword, so that is not parsable as a `Place`.
+use formality_core::{term, test};
+use std::sync::Arc;
+
+#[term]
+pub enum Expr {
+    #[cast]
+    Place(Place),
+
+    #[grammar($v0 . let)]
+    Let(Arc<Expr>),
+}
+
+#[term($var $*projections)]
+pub struct Place {
+    var: Id,
+    projections: Vec<Projection>,
+}
+
+#[term]
+pub enum Projection {
+    #[grammar(. $v0 $!)]
+    Field(Id),
+}
+
+formality_core::id!(Id);
+
+/// Check that we can parse `a.b` as a `Field`
+#[test]
+fn test_parse_field() {
+    let e: Expr = crate::ptt::term("a.b");
+    expect_test::expect![[r#"
+        Place(
+            Place {
+                var: a,
+                projections: [
+                    Field(
+                        b,
+                    ),
+                ],
+            },
+        )
+    "#]]
+    .assert_debug_eq(&e);
+}
+
+/// Check that we can parse `a.let` as a `Let`
+#[test]
+fn test_parse_let() {
+    let e: Expr = crate::ptt::term("a.let");
+    expect_test::expect![[r#"
+        Let(
+            Place(
+                Place {
+                    var: a,
+                    projections: [],
+                },
+            ),
+        )
+    "#]]
+    .assert_debug_eq(&e);
+}
diff --git a/tests/parser-torture-tests/main.rs b/tests/parser-torture-tests/main.rs
@@ -1,4 +1,5 @@
 mod ambiguity;
+mod commit_points;
 mod grammar;
 mod left_associative;
 mod none_associative;
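
The interaction between `$*projections` and `$!` described in the book changes above can be sketched in isolation. What follows is a minimal hand-written model, not the code that `formality-macros` generates: `Outcome`, `parse_projection`, `parse_projections_then_hash`, and the `commit_after_id` flag are all hypothetical names invented for illustration, identifiers are assumed to be ASCII letters only, and whitespace is not skipped.

```rust
/// Result of trying to parse one `. <id>` projection from the front of the input.
/// (Hypothetical type; the real parser encodes this in the errors of a `Result`.)
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    /// Parsed a projection: the identifier and the remaining input.
    Success(&'a str, &'a str),
    /// No projection here; an optional or repeated parse may carry on.
    Failure,
    /// Looked like a projection but contained a syntax error.
    AlmostSucceeded(String),
}

/// Models `. $v0` when `commit_after_id` is false and `. $v0 $!` when it is true.
fn parse_projection(input: &str, commit_after_id: bool) -> Outcome<'_> {
    // No leading `.` means no token was consumed: always a plain failure.
    let Some(rest) = input.strip_prefix('.') else {
        return Outcome::Failure;
    };
    // Assumption: identifiers are runs of ASCII letters, so char count == byte count.
    let len = rest.chars().take_while(|c| c.is_ascii_alphabetic()).count();
    if len == 0 {
        // We consumed `.` but found no identifier. Under the default rule this
        // "almost" succeeded; with a commit point after the identifier we have
        // not committed yet, so it is a plain failure.
        return if commit_after_id {
            Outcome::Failure
        } else {
            Outcome::AlmostSucceeded(format!("expected identifier, found {:?}", rest))
        };
    }
    let (id, rest) = rest.split_at(len);
    Outcome::Success(id, rest)
}

/// Models `#[term($*projections . #)]`: zero or more projections, then `.#`.
fn parse_projections_then_hash(
    mut input: &str,
    commit_after_id: bool,
) -> Result<Vec<String>, String> {
    let mut projections = Vec::new();
    loop {
        match parse_projection(input, commit_after_id) {
            Outcome::Success(id, rest) => {
                projections.push(id.to_string());
                input = rest;
            }
            // Failure: the repetition ends and we go on to the trailing `.#`.
            Outcome::Failure => break,
            // "Almost" success: treat it as a syntax error and stop.
            Outcome::AlmostSucceeded(err) => return Err(err),
        }
    }
    if input == ".#" {
        Ok(projections)
    } else {
        Err(format!("expected `.#`, found {:?}", input))
    }
}
```

In this model, `parse_projections_then_hash(".#", false)` returns an error, because the `.` is consumed and the projection "almost" succeeds. With `commit_after_id` set to `true`, the same input yields an empty list of projections, mirroring the `ProjectionsThenHash` example in the book text.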
