Tokenizer misidentifies Regex as Division after nested parentheses due to flat state tracking #2132

@gh-markt

The Reader class in tokenizer.ts uses single integer variables (this.paren and this.curly) to track the index of the most recent opening token. isRegexStart() uses these indices to decide whether a forward slash (/) starts a Regular Expression literal or is a Division operator, by inspecting the token that precedes the group that has just closed.

However, because these variables are simply overwritten when a nested group is encountered—and never restored when that group closes—the tokenizer loses track of the outer context.

The Issue:

  1. When an outer ( is encountered, this.paren is set to its index.
  2. When a nested ( is encountered, this.paren is overwritten with the new index. The reference to the outer ( is lost.
  3. When the nested group closes ), no state restoration occurs.
  4. When the outer group closes ), this.paren still refers to the inner start index.
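The four steps above can be reproduced with a minimal sketch. FlatTracker is a hypothetical stand-in for the flat tracking described here, not the actual Reader code; it records only the index of the most recent (.

```typescript
// Hypothetical, simplified model of flat tracking.
// FlatTracker is an illustrative name, not a class from tokenizer.ts.
class FlatTracker {
    paren = -1;
    private count = 0;

    push(value: string): void {
        if (value === '(') {
            // Overwrites any outer '(' index; nothing restores it on ')'.
            this.paren = this.count;
        }
        this.count++;
    }
}

function parenAfter(tokens: string[]): number {
    const tracker = new FlatTracker();
    for (const tok of tokens) {
        tracker.push(tok);
    }
    return tracker.paren;
}

// index:          0     1    2    3    4    5
const nested = ['if', '(', 'f', '(', ')', ')'];

// After the outer ')' at index 5, paren is 3 (the inner opener),
// even though the matching opener is at index 1.
const flatParen = parenAfter(nested);
```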

Reproduction:
Consider the following valid JavaScript. The / following the if condition should be parsed as the start of a Regex literal.

// The condition includes a nested grouping (function call)
if (isValid(x)) /abc/.test(x);

Expected Behavior:
The tokenizer sees the ) closing the if statement. It looks back at the matching (. It sees the if keyword preceding it. It determines that /abc/ is a Regex.

Actual Behavior:

  1. this.paren is initially set to the index of the ( after if.
  2. this.paren is overwritten by the index of the ( after isValid.
  3. When the tokenizer reaches the /, it looks back using the current value of this.paren (the inner parenthesis).
  4. It checks the token preceding that inner index: isValid (an Identifier).
  5. Standard grammar rules suggest that an identifier followed by a parenthesized group implies a function call, and a slash following that implies division (e.g. fn() / 2).
  6. The tokenizer incorrectly identifies /abc/ as a series of division operators and identifiers, likely causing a parse error later.
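To make the lookback concrete, here is a rough sketch of the decision on the reproduction's token sequence. The function slashIsRegexAfterParen and the keyword list are assumptions made for illustration (based on the expected behavior described above), not code copied from isRegexStart.

```typescript
// Hypothetical model: null stands in for identifiers and literals,
// since only punctuators and keywords matter for the lookback.
type Entry = string | null;

// Assumed rule: a slash after ')' starts a regex only if a control
// keyword precedes the matching '('. The list is an assumption.
const CONTROL_KEYWORDS = ['if', 'while', 'for', 'with'];

function slashIsRegexAfterParen(values: Entry[], paren: number): boolean {
    const before = values[paren - 1];
    return typeof before === 'string' && CONTROL_KEYWORDS.indexOf(before) !== -1;
}

// if (isValid(x))   ...up to the final ')'
// index:                       0     1    2     3    4     5    6
const reproValues: Entry[] = ['if', '(', null, '(', null, ')', ')'];

const staleParen = 3;   // what flat tracking leaves behind (inner '(')
const correctParen = 1; // the '(' that actually matches the final ')'

// With the stale index the entry before '(' is an identifier (null),
// so the slash is treated as division.
const withStale = slashIsRegexAfterParen(reproValues, staleParen);

// With the correct index the entry before '(' is 'if', so it is a regex.
const withCorrect = slashIsRegexAfterParen(reproValues, correctParen);
```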

Proposed Fix:
We should add stacks to the Reader class to maintain the history of open delimiters. We can keep this.paren and this.curly as the properties used by isRegexStart, but they should be updated by popping from these stacks.

1. Update Reader properties and constructor:

class Reader {
    readonly values: ReaderEntry[];
    curly: number;
    paren: number;
    
    // Add stacks to track nesting history
    curlyStack: number[];
    parenStack: number[];

    constructor() {
        this.values = [];
        this.curly = this.paren = -1;
        this.curlyStack = [];
        this.parenStack = [];
    }
    // ...

2. Update Reader.push to manage the stack:

    push(token): void {
        if (token.type === Token.Punctuator || token.type === Token.Keyword) {
            if (token.value === '{') {
                this.curlyStack.push(this.values.length);
            } else if (token.value === '(') {
                this.parenStack.push(this.values.length);
            } else if (token.value === '}') {
                // Pop the stack to restore context to the matching opener
                const index = this.curlyStack.pop();
                this.curly = (index !== undefined) ? index : -1;
            } else if (token.value === ')') {
                // Pop the stack to restore context to the matching opener
                const index = this.parenStack.pop();
                this.paren = (index !== undefined) ? index : -1;
            }
            this.values.push(token.value);
        } else if (token.type === Token.Template && !token.tail) {
            this.values.push(null);
            // Template head/middle acts as a curly brace opener
            this.curlyStack.push(this.values.length - 1);
        } else {
            this.values.push(null);
        }
    }
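As a quick sanity check of the pop-on-close idea, the following is a stripped-down sketch that tracks parentheses only. ParenReader and its keepValue parameter are hypothetical simplifications of the proposed Reader.push (plain strings instead of token objects), so treat it as an illustration rather than a drop-in test.

```typescript
// Simplified stand-in for the proposed Reader: parentheses only.
// keepValue models "Punctuator or Keyword"; other tokens are stored as null.
class ParenReader {
    readonly values: (string | null)[] = [];
    paren = -1;
    private parenStack: number[] = [];

    push(value: string, keepValue: boolean): void {
        if (keepValue) {
            if (value === '(') {
                this.parenStack.push(this.values.length);
            } else if (value === ')') {
                // Restore context to the matching opener, or -1 if unmatched.
                const index = this.parenStack.pop();
                this.paren = index !== undefined ? index : -1;
            }
            this.values.push(value);
        } else {
            this.values.push(null);
        }
    }
}

// if (isValid(x))
const input: [string, boolean][] = [
    ['if', true], ['(', true], ['isValid', false], ['(', true],
    ['x', false], [')', true], [')', true],
];

const reader = new ParenReader();
const parenAfterClose: number[] = [];
for (const [value, keepValue] of input) {
    reader.push(value, keepValue);
    if (value === ')') {
        parenAfterClose.push(reader.paren);
    }
}
// paren points at the inner '(' after the first ')', then at the outer '('
// after the second, so values[paren - 1] is the 'if' keyword.
const precedingEntry = reader.values[reader.paren - 1];
```

Note that in this scheme paren is only meaningful immediately after a closing ), which is the only point at which the lookback described in this issue consults it.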

This handles two other edge cases not considered by the existing code:

1. Curly Brackets (curlyStack): Object Literals vs. Code Blocks
The fundamental ambiguity the Reader tries to resolve is whether a closing } marks the end of an Object Literal (an expression value) or a Code Block (a statement).

  • The Scenario: function fn() { return { a: 1 }; } /regex/
  • The Ambiguity:
    • If the / follows an Object Literal (e.g., x = { a: 1 } / 2), it is a Division operator.
    • If the / follows a Code Block (e.g., if (x) { ... } /regex/), it is the start of a Regex literal.
  • The Failure:
    • Without a stack, when the inner object literal { a: 1 } closes, the curly variable points to the inner {.
    • When the function body closes next, the curly variable still points to the inner { (because it was never restored).
    • The tokenizer looks back from the inner {, sees return (or =), and incorrectly concludes: "This was an Object Literal. The next token / must be Division."
    • Result: Syntax Error on valid code.
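The curly-brace failure can be illustrated by computing which opener each } actually matches. The helper matchOpeners below is a hypothetical utility written for this issue, and the token list is a hand-written approximation of a function whose body returns an object literal.

```typescript
// Returns, for each closing delimiter index, the index of its matching
// opener (or -1 if the closer is unmatched).
function matchOpeners(
    tokens: string[],
    open: string,
    close: string
): Record<number, number> {
    const stack: number[] = [];
    const matches: Record<number, number> = {};
    tokens.forEach((tok, i) => {
        if (tok === open) {
            stack.push(i);
        } else if (tok === close) {
            const opener = stack.pop();
            matches[i] = opener !== undefined ? opener : -1;
        }
    });
    return matches;
}

// function fn() { return { a: 1 }; }
// index:          0          1     2    3    4    5         6    7    8    9    10   11   12
const tokens = ['function', 'fn', '(', ')', '{', 'return', '{', 'a', ':', '1', '}', ';', '}'];

const curlyMatches = matchOpeners(tokens, '{', '}');
const innerOpener = curlyMatches[10]; // 6: the object literal's '{'
const outerOpener = curlyMatches[12]; // 4: the function body's '{'
```

Flat tracking would still report 6 after the final }, so the lookback would start from the token before the object literal (return) instead of the token before the function body (the closing ) of the parameter list).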

2. Template Literals (${): The Implicit Bracket
Template expressions (e.g., `value: ${expr}`) introduce an interpolation scope that behaves syntactically like a parenthesized group or block.

  • The Issue: The token sequence ${ acts as an opening delimiter, but it is closed by a standard } Punctuator.
  • The Failure:
    • If we do not push the ${ position onto the curlyStack, the subsequent } could blindly pop the parent scope's entry from the stack.
    • This corrupts the state for the remainder of the file. The tokenizer will think it has closed a block that it hasn't, or will underflow the stack.
  • Relevance: Inside an interpolation, we effectively restart the expression parser. Consider a = `${ {a:1} / 2 }`. We must correctly identify that the inner { matches the inner } so we can determine that / is a division operator inside the template.

The second edge case is more a consequence of using a stack for curly bracket management in the first place, but that stack is essential to maintaining correct tokenizer state.
