LL(2) parsing error: parser commits to subrule and is not able to get out #2095

jitsedesmet · 2025-03-17T14:58:41Z

jitsedesmet
Mar 17, 2025

Hi! This library has been amazing so far. I've been trying to create a round tripping parser that includes round tripping in the syntax.
I am in need of a rule that parses everything that is otherwise skipped.

I came across the issue where the following grammar works in ANTLR4, but not in Chevrotain:

grammar Test;

compilationUnit
  : gramB gramD gramF EOF
  ;
gramB
  : gramC ( gramD gramE )*;
gramD
  : 'D'? ;
gramC: 'C' ;
gramE: 'E' ;
gramF: 'F' ;

I expect the following inputs to be in the language (and with ANTLR4 they are): CDF and CDEDF (these do not work in Chevrotain), as well as CF and CDEF (these do work in Chevrotain).
When using Chevrotain, it looks like the parser gets stuck in the gramB rule, forgetting that a parse of the gramD rule would also allow it to continue in the compilationUnit rule.
I am fairly sure the grammar above is LL(2).
I wonder whether this is a mistake on my part, or whether this is a genuine bug. I have no issue with attempting a PR myself. Maybe you could give some pointers :D

(Issue is also present when the gramB rule uses gramC ( gramD gramE )?;)

Chevrotain code I used:

/* eslint-disable require-unicode-regexp */
import type { ParserMethod } from 'chevrotain';
import { createToken, EmbeddedActionsParser, Lexer } from 'chevrotain';
import { describe, it } from 'vitest';

export const lexC = createToken({ name: 'lexC', pattern: /c/i, label: 'c' });
export const lexD = createToken({ name: 'lexD', pattern: /d/i, label: 'd' });
export const lexE = createToken({ name: 'lexE', pattern: /e/i, label: 'e' });
export const lexF = createToken({ name: 'lexF', pattern: /f/i, label: 'f' });
const allTokens = [ lexC, lexD, lexE, lexF ];

const lexer: Lexer = new Lexer(allTokens, {
  positionTracking: 'onlyStart',
  recoveryEnabled: false,
  safeMode: true,
});

class MyParser extends EmbeddedActionsParser {
  public readonly gramB: ParserMethod<Parameters<() => void>, ReturnType<() => 'gramB'>>;
  public readonly gramD: ParserMethod<Parameters<() => void>, ReturnType<() => 'gramD'>>;
  public readonly gramMain: ParserMethod<Parameters<() => void>, ReturnType<() => 'gramMain'>>;

  public constructor() {
    super(allTokens);

    this.gramD = this.RULE('gramD', () => {
      this.OPTION(() => this.CONSUME(lexD));
      return <const> 'gramD';
    });

    this.gramB = this.RULE('gramB', () => {
      this.CONSUME(lexC);
      this.MANY(() => {
        this.SUBRULE(this.gramD, undefined);
        this.CONSUME(lexE);
      });
      return <const> 'gramB';
    });

    this.gramMain = this.RULE('main', () => {
      this.SUBRULE(this.gramB, undefined);
      this.SUBRULE(this.gramD, undefined);
      this.CONSUME(lexF);
      return <const> 'gramMain';
    });

    this.performSelfAnalysis();
  }
}

describe('bugTest', () => {
  const parser = new MyParser();
  function parse(query: string): string {
    const tokens = lexer.tokenize(query);
    if (tokens.errors.length > 0) {
      throw new Error(tokens.errors[0].message);
    }
    parser.input = tokens.tokens;
    const res = parser.gramMain();
    if (parser.errors.length > 0) {
      console.log(tokens.tokens);
      throw new Error(`Parse error on line ${parser.errors.map(x => x.token.startLine).join(', ')}
${parser.errors.map(x => `${x.token.startLine}: ${x.message}`).join('\n')}
${parser.errors.map(x => x.stack).join('\n')}`);
    }
    return res;
  }

  it('bug recreation', ({ expect }) => {
    // WORKS: 'CF' and 'CDEF'
    // DOESN'T WORK, but should?: 'CDF' 'CDEDF'
    const res = parse(`CDF`);
    expect(res).toEqual('gramMain');
  });
});

Source files: ANTLR4 and Chevrotain.

Using the IntelliJ plugin for ANTLR 4, I get the following out of ANTLR4:
a parse tree, and what I think is a confirmation that it is LL(2).

msujew · 2025-03-17T15:08:27Z

msujew
Mar 17, 2025
Collaborator

Hey @jitsedesmet,

unlike ANTLR4, Chevrotain does not look into the outer context when constructing the lookahead table. I.e. when constructing the lookahead for ( gramD gramE )*, it will only really take ( gramD gramE )* into consideration, and not whether there's a potential gramD gramF EOF afterwards (given that this is not known in the context of gramB on its own). This is not a bug, but by design.

Note that the LL(2) statement only holds true for ANTLR in particular - Chevrotain would say that your grammar is LL(1) but with the outer context issue mentioned above.

Solutions to this consist mainly of using GATE (and maybe BACKTRACK, depending on the complexity of the grammar) in your MANY call.
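
To make the local decision concrete, here is a minimal hand-written sketch in plain TypeScript. It is not Chevrotain's implementation, and parseLocal and its helpers are made-up names. It assumes the loop ( gramD gramE )* is entered based on the next token only, without consulting what the caller expects after gramB:

```typescript
// Simplified stand-in for the behaviour described above; NOT Chevrotain's code.
type Tok = "C" | "D" | "E" | "F";

interface Result { ok: boolean; error?: string }

function parseLocal(input: string): Result {
  const toks = input.toUpperCase().split("") as Tok[];
  let pos = 0;
  const la = (): Tok | undefined => toks[pos];
  const consume = (t: Tok): void => {
    if (la() !== t) {
      throw new Error(`expected ${t} at index ${pos}, found ${la() ?? "EOF"}`);
    }
    pos++;
  };
  // gramD: 'D'?
  const gramD = (): void => { if (la() === "D") consume("D"); };
  // gramB: 'C' ( gramD 'E' )*
  const gramB = (): void => {
    consume("C");
    // Local decision: the loop body can start with D or E, so either token enters it.
    // Whatever follows gramB in the caller (gramD 'F') is not consulted here.
    while (la() === "D" || la() === "E") {
      gramD();
      consume("E");
    }
  };
  try {
    // main: gramB gramD 'F'
    gramB();
    gramD();
    consume("F");
    if (pos !== toks.length) throw new Error(`unexpected input at index ${pos}`);
    return { ok: true };
  } catch (e) {
    return { ok: false, error: (e as Error).message };
  }
}
```

Under these assumptions CF and CDEF are accepted, while CDF fails with "expected E at index 2, found F" and CDEDF fails the same way at index 4: the loop has already consumed the D that the caller needed.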

jitsedesmet · 2025-03-17T15:51:22Z

jitsedesmet
Mar 17, 2025
Author

Hi @msujew, Thank you for the swift reply!

It does indeed look like adding a gate resolves the issue (EDIT: previous one was wrong):

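this.gramB = this.RULE('gramB', () => {
  this.CONSUME(lexC);
  this.MANY({
    // LA(1) is the next unconsumed token, so an iteration (gramD lexE)
    // starts with either E, or D followed by E.
    GATE: () => this.LA(1).tokenType === lexE ||
      (this.LA(1).tokenType === lexD && this.LA(2).tokenType === lexE),
    DEF: () => {
      this.SUBRULE(this.gramD, undefined);
      this.CONSUME(lexE);
    },
  });
  return <const> 'gramB';
});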

Using backtracking or gates is not ideal in this use case because the optional rule (gramD) that I use in my grammar is called before every CONSUME. That means that this case will happen often (and using gates often might make debugging a nightmare).
Is it possible that the lookaheadStrategy can solve my issue? Or does it also lack access to the outer context?
I've also seen that there is a huge warning that providing your own implementation can be devastating for the execution time? 😬

msujew · 2025-03-17T15:55:32Z

msujew
Mar 17, 2025
Collaborator

In theory you can use it. I'm currently using it for chevrotain-allstar, but as outlined in TypeFox/chevrotain-allstar#1, it also (currently) is incapable of taking the outer context into account.

providing your own implementation can be devastating for the execution time?

Depending on how optimized your solution is :)

jitsedesmet · 2025-03-18T12:43:58Z

jitsedesmet
Mar 18, 2025
Author

Those are 2 very interesting resources, thank you!
As I understand it now, the issue arises when a subrule contains an optional part or repetition that starts with a token (D), and the caller of that subrule follows it with that same token (again D).

Luckily, that should mean that in my use case of parsing the otherwise ignored tokens, I can resolve the issue by changing the semantics of my consumptions. Instead of requiring a consumption to consume the ignored tokens before it, it should consume the ignored tokens after it. That way my parser will not commit to a certain subrule if it cannot be handled (also reducing the complexity to LL(1) again, I think).

I wonder whether we could add some kind of warning of this behavior to the Chevrotain documentation?
Maybe a small note to the page about LL(k) grammars?
Just because I do feel like not taking the outer context into account is something you would definitely want to know before writing a grammar that is LL(>1)?

Anyway, thank you very much!

Test code verifying that swapping to the end solves the issue:

this.gramB = this.RULE('gramB', () => {
  this.CONSUME(lexC);
  this.MANY({
    // No GATE needed here: an iteration now starts with E,
    // which cannot be confused with the D that may follow gramB.
    DEF: () => {
      this.CONSUME(lexE);
      this.SUBRULE(this.gramD, undefined);
    },
  });
  return <const> 'gramB';
});

And the main rule is still the same:

this.gramMain = this.RULE('main', () => {
  this.SUBRULE(this.gramB, undefined);
  this.SUBRULE(this.gramD, undefined);
  this.CONSUME(lexF);
  return <const> 'gramMain';
});

The grammar works as intended, now parsing what previously didn't work (CDF, CEDEF) and still parsing what worked (CF, CEDF).
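
As a quick cross-check outside Chevrotain, the reordered grammar can be sketched as a hand-written recognizer that only ever looks at the next token. This is plain TypeScript with made-up names (parseTrailing, accept), and it is an assumption that it mirrors the decisions the rules above make:

```typescript
// Hand-written recognizer for: main = gramB gramD 'F', gramB = 'C' ( 'E' gramD )*, gramD = 'D'?
// Illustrative only; it does not use Chevrotain.
function parseTrailing(input: string): boolean {
  const toks = input.toUpperCase().split("");
  let pos = 0;
  // Consume the next token if it is `t`; otherwise leave the position untouched.
  const accept = (t: string): boolean => {
    if (toks[pos] !== t) return false;
    pos++;
    return true;
  };
  if (!accept("C")) return false;
  // One token decides: only E enters the loop, so a following D is left for the caller.
  while (accept("E")) {
    accept("D"); // gramD inside the loop (greedy)
  }
  accept("D"); // gramD in the main rule
  return accept("F") && pos === toks.length;
}
```

With this sketch CDF, CEDEF, CF and CEDF are all accepted. Note that the accepted language is not the same as before the swap: CDEF, for example, is rejected here, because the D now has to come after the E.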

bd82 · 2025-03-19T18:54:47Z

bd82
Mar 19, 2025
Maintainer

@jitsedesmet you wrote:

I wonder whether we could add some kind of warning of this behavior to the Chevrotain documentation?
Maybe a small note to the page about LL(k) grammars?
Just because I do feel like not taking the outer context into account is something you would definitely want to know before writing a grammar that is LL(>1)?

My understanding is that taking the outer context into account is a special ANTLR feature (ANTLR being an LL(*) / adaptive LL(*) parser generator), and not a common LL(k) capability.

Perhaps a runtime validation identifying these scenarios and outputting a useful error / warning message would be the most useful approach. If you want to try implementing such a validation a good place to start would be here:

Left Recursion Detection

2 replies

jitsedesmet Mar 20, 2025
Author

Hi @bd82,

It is entirely possible 😅, I am not at all an expert in parsers (sorry).
I just thought that during parsing Chevrotain knows it cannot complete a subrule, since it has a lookahead of k (for (gramD gramE)* in the case above).
My reasoning was thus that during lookahead it knows it will fail, so it can start traversing the call stack back up and see whether any other rule in the call stack could 'finish the job'. Of course, in case no rule can finish the job, the parser should execute the deepest rule it was at (to get a valid error message).
All in all, this was just what I thought would be possible (disregarding any possible performance regressions).

I'd be happy to contribute back and look at the runtime validation.
So what we want is a check validateNoOuterContextDependence that does the following:
For some topRule, traverse the call tree from right to left, keeping a list of possible_to_parse_tokens.
For each node that requires you to commit (like MANY above), check if any of its possible first tokens are in the possible_to_parse_tokens list. If this is the case, there is an outer context dependence(?).
Simple case that shows the complexities:
ruleX{ ('a' 'b')* } ruleY{ 'a' }
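
A rough sketch of that check, in plain TypeScript on a made-up mini grammar model (Prod, first, check and Overlap are hypothetical names, not part of Chevrotain's API). It assumes rule calls are already inlined, the grammar is not recursive, and only one token of lookahead is compared:

```typescript
// Hypothetical mini grammar model; rule calls are assumed to be inlined already.
type Prod =
  | { kind: "tok"; name: string }
  | { kind: "seq"; items: Prod[] }
  | { kind: "opt"; label: string; body: Prod }
  | { kind: "many"; label: string; body: Prod };

interface Overlap { label: string; shared: string[] }

const union = (a: string[], b: string[]): string[] =>
  a.concat(b.filter((t) => a.indexOf(t) < 0));

// Tokens that can start `p`, and whether `p` can match nothing at all.
function first(p: Prod): { toks: string[]; nullable: boolean } {
  switch (p.kind) {
    case "tok":
      return { toks: [p.name], nullable: false };
    case "seq": {
      let toks: string[] = [];
      for (const item of p.items) {
        const f = first(item);
        toks = union(toks, f.toks);
        if (!f.nullable) return { toks, nullable: false };
      }
      return { toks, nullable: true };
    }
    case "opt":
    case "many":
      return { toks: first(p.body).toks, nullable: true };
  }
}

// Walks right to left. `follow` holds the tokens that may come after `p`
// (the possible_to_parse_tokens list); returns the tokens that may appear where `p` begins.
function check(p: Prod, follow: string[], out: Overlap[]): string[] {
  switch (p.kind) {
    case "tok":
      return [p.name];
    case "seq": {
      let next = follow;
      for (const item of p.items.slice().reverse()) next = check(item, next, out);
      return next;
    }
    case "opt":
    case "many": {
      const entry = first(p.body).toks;
      const shared = entry.filter((t) => follow.indexOf(t) >= 0);
      if (shared.length > 0) out.push({ label: p.label, shared });
      // Inside a repetition, the body may also be followed by another iteration.
      check(p.body, p.kind === "many" ? union(entry, follow) : follow, out);
      return union(entry, follow);
    }
  }
}

// compilationUnit with gramB and gramD inlined: C ( D? E )* D? F
const t = (name: string): Prod => ({ kind: "tok", name });
const gramD = (label: string): Prod => ({ kind: "opt", label, body: t("D") });
const original: Prod = { kind: "seq", items: [
  t("C"),
  { kind: "many", label: "gramB loop", body: { kind: "seq", items: [gramD("gramD in loop"), t("E")] } },
  gramD("gramD in main"),
  t("F"),
] };
const overlaps: Overlap[] = [];
check(original, ["EOF"], overlaps);
```

For the grammar from the first post this reports a single overlap: the loop in gramB shares D with what may follow it. A first-token overlap is only a hint, though. It can also flag grammars that still parse correctly because the greedy choice happens to be the right one, so a real validation would probably need to compare up to k tokens.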

Thinking about that algorithm makes me think that implementing the outer context support might be easier?
The current behaviour puts you in a new context when the next token matches the first token of the new context.
However, during k-lookahead you would see that the next token matches but that, within your k-lookahead, not all tokens match.
-> You save the context reference in a deepFailed and thus don't touch your 'toBeParsedTokens' stack.
-> You then start popping your function stack and hope a possible alternative path arises. If no alternative path arises, either because you are about to pop your starting rule or because your current context forced you to make a decision, then you execute the deepFailed context.
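
The first half of that idea (declining to enter when the k-token lookahead does not fully match) can be sketched as follows. This is a hand-written illustration in plain TypeScript with made-up names (startsWithAny, parseWithK); it hard-codes the token paths of the loop body and leaves out the deepFailed bookkeeping and the stack popping entirely:

```typescript
// True if the input at `pos` matches the first k tokens of at least one path.
function startsWithAny(toks: string[], pos: number, paths: string[][], k: number): boolean {
  return paths.some((path) =>
    path.slice(0, k).every((t, i) => toks[pos + i] === t));
}

// Recognizer for: main = gramB gramD 'F', gramB = 'C' ( gramD 'E' )*, gramD = 'D'?
function parseWithK(input: string, k: number): boolean {
  const toks = input.toUpperCase().split("");
  // Token paths through one iteration of ( gramD 'E' ): "D E" or just "E".
  const loopPaths = [["D", "E"], ["E"]];
  let pos = 0;
  const accept = (t: string): boolean => {
    if (toks[pos] !== t) return false;
    pos++;
    return true;
  };
  if (!accept("C")) return false;
  while (startsWithAny(toks, pos, loopPaths, k)) {
    accept("D");
    if (!accept("E")) return false; // committed to the loop body, but it cannot finish
  }
  accept("D"); // gramD in the main rule
  return accept("F") && pos === toks.length;
}
```

With k = 1 this rejects CDF, because the loop is entered on the D alone. With k = 2 the loop is skipped for CDF, the D is left to the main rule, and CDF, CDEDF, CF and CDEF are all accepted.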

Since I don't know anything about the codebase, I am fine trying to implement either technique.

Wrapping up with a clear question:
What would you prefer me looking into, the validateNoOuterContextDependence or the support of the outer context?
Any additional feedback?

bd82 Apr 25, 2025
Maintainer

Changing the way the lookahead runtime works could have significant impact on performance
and I lack the time to benchmark and optimize this aspect right now.

Which only leaves the possible validation / detection of the possible ambiguity.
