Reduce generated parser backtracking via ATN-style RuleRef expansion #713

Merged
gfx merged 5 commits into main from claude/reduce-backtracking-kyBvr
Mar 28, 2026

gfx commented Mar 28, 2026

Summary

  • Reduce SQLite grammar backtracking sites from 298 → 253 (-15%) by expanding multi-token RuleRefs during SLL prediction
  • Add ATN-style try_expand_opaque: when the prediction engine encounters opaque RuleRefs, enter the referenced rules and compute FIRST sets at the decision point's lookahead level to build a flat Dispatch
  • Modify sll_advance_inner to enter nullable multi-token RuleRefs (e.g., with_clause?) via return stack instead of treating them as single-token consumers
  • Add return stack infrastructure (SllReturn, push_return, pop_return) to SllConfig for tracking continuation points during rule expansion
  • Copy 18 complex SQL queries from benchmark/sqlite_parse/queries.sql to driver_sqlite_test.wado for better parser regression coverage
  • Document failed expansion approaches and their failure modes in package-gale/CLAUDE.md
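The return stack described above can be pictured with a minimal Python sketch. The names `SllReturn`, `SllConfig`, `push_return`, and `pop_return` come from this PR. The fields (`rule`, `alt`, `pos`), the immutable-tuple representation, and the rule names in the usage note are assumptions made for illustration, not the Wado implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SllReturn:
    """Continuation point: where to resume in the caller (fields are assumed)."""
    rule: str
    alt: int
    pos: int


@dataclass(frozen=True)
class SllConfig:
    """One prediction configuration; return_stack stays empty unless a rule was entered."""
    rule: str
    alt: int
    pos: int
    return_stack: Tuple[SllReturn, ...] = ()


def push_return(cfg: SllConfig, callee: str, callee_alt: int) -> SllConfig:
    """Enter `callee`, remembering the position just past the RuleRef in the caller."""
    frame = SllReturn(cfg.rule, cfg.alt, cfg.pos + 1)
    return SllConfig(callee, callee_alt, 0, cfg.return_stack + (frame,))


def pop_return(cfg: SllConfig) -> Optional[SllConfig]:
    """Resume the caller once the entered rule is exhausted; None if nothing to resume."""
    if not cfg.return_stack:
        return None
    frame = cfg.return_stack[-1]
    return SllConfig(frame.rule, frame.alt, frame.pos, cfg.return_stack[:-1])
```

For example, `push_return(SllConfig("select_stmt", 0, 0), "with_clause", 0)` yields a config positioned at the start of `with_clause` with one frame on its stack, and `pop_return` on that config resumes `select_stmt` at position 1. A config that never enters a rule carries only the empty tuple, which is the sense in which the infrastructure costs nothing when unused.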

Key design decisions

Flat Dispatch only: Expanded configs are never passed to build_sll_node. Instead, FIRST sets are computed manually and grouped by alt_index. This avoids the three bugs seen in the earlier approach: Consume corruption, depth-mixed Dispatch, and dedup false resolution.

All-Leaf guard: try_expand_opaque only returns a Dispatch when every branch resolves to a single alternative (Leaf). Ambiguous branches cause the entire expansion to be rejected, falling back to the original Backtrack.
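A minimal Python sketch of the flat Dispatch plus All-Leaf guard, assuming the FIRST sets have already been computed per alternative at the decision point's lookahead level. The function name `try_flat_dispatch`, the input shape, and the token names in the usage note are hypothetical; the real `try_expand_opaque` in `parser_gen.wado` may be structured differently.

```python
from typing import Dict, Optional, Set


def try_flat_dispatch(first_by_alt: Dict[int, Set[str]]) -> Optional[Dict[str, int]]:
    """Build a token -> alt_index table, or return None if any token is ambiguous.

    `first_by_alt` maps each alternative index to its FIRST set at the
    decision point's lookahead level (assumed input shape).
    """
    dispatch: Dict[str, int] = {}
    for alt_index, tokens in first_by_alt.items():
        for token in tokens:
            if token in dispatch and dispatch[token] != alt_index:
                # Two alternatives share this token, so the branch would not
                # be a Leaf: reject the whole expansion and let the caller
                # keep the original Backtrack.
                return None
            dispatch[token] = alt_index
    return dispatch
```

With disjoint FIRST sets such as `{0: {"TABLE"}, 1: {"INDEX"}}`, the sketch returns `{"TABLE": 0, "INDEX": 1}`. With overlapping sets such as `{0: {"WITH", "SELECT"}, 1: {"WITH", "INSERT"}}`, it returns `None`, which corresponds to falling back to Backtrack.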

Nullable entering: sll_advance_inner enters nullable Repeat(Optional/Star, RuleRef) elements via return stack (depth < 1, alt count <= 8) so that the prediction engine sees tokens at the correct input depth.

Test plan

  • All 1301 tests pass (mise run on-task-done)
  • SQLite driver tests: 111 passed, 0 failed (including 18 new benchmark queries)
  • No correctness regressions (CTE, JOIN, subquery, recursive CTE all pass)
  • Small grammars (JSON, sexpression, calculator) unchanged

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C

claude added 5 commits March 28, 2026 02:49
…expansion

Adds SllReturn/return_stack to SllConfig and sll_expand_rule_ref helper,
enabling the prediction engine to enter multi-token RuleRef alternatives.
Expansion is currently disabled (guarded) pending resolution of the Consume
node correctness issue discovered during testing.

Key findings:
- Return stack infrastructure has zero overhead when unused (empty arrays)
- RuleRef expansion reduces SQLite backtracking by 31% (298→205 sites)
- But Consume nodes from expanded rules incorrectly emit parse code at the
  decision point, causing test failures
- Next step: limit expansion results to Leaf-only (approach B)

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
Add return stack to SllConfig, enabling the prediction engine to expand
multi-token RuleRefs by entering referenced rules during SLL advancement.
This tracks continuation points so the engine can return to the caller
after advancing through a sub-rule.

Infrastructure added (disabled, zero runtime overhead):
- SllReturn struct and return_stack field on SllConfig
- push_return/pop_return helpers for stack management
- sll_expand_rule_ref: expands multi-token RuleRefs with depth/alt guards
- try_expand_opaque: attempts to resolve opaque prediction groups
- strip_all_consume: removes Consume nodes from expanded prediction trees

The expansion is currently disabled (try_expand_opaque is not called)
because dispatching on tokens from inside expanded sub-rules can produce
incorrect prediction branches. Specifically:
- Consume nodes from sub-rules incorrectly consume tokens at the decision point
- Dispatch branches mix tokens from different rule depths
- Rules sharing prefixes (e.g., with_clause) create false disambiguation

The infrastructure is ready for activation once a correct dispatch strategy
is implemented (e.g., computing FIRST sets at the decision point level
rather than at the expanded position level).

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
…approaches

- Copy 18 complex SQL queries from benchmark/sqlite_parse/queries.sql to
  driver_sqlite_test.wado for better parser regression coverage (JOINs,
  recursive CTEs, correlated subqueries, CASE, set operations, etc.)
- Remove dead code from parser_gen.wado: sll_expand_rule_ref,
  try_expand_opaque, strip_all_consume (not called, caused correctness
  bugs when active)
- Keep zero-overhead return stack infrastructure (SllReturn, push_return,
  pop_return, return-stack-aware sll_config_first/sll_advance_inner)
- Document the RuleRef expansion approach and its 3 failure modes in
  package-gale/CLAUDE.md to prevent repeating the same mistakes

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
Implement try_expand_opaque: when the SLL prediction engine encounters
opaque multi-token RuleRefs that would produce a Backtrack node, expand
them by entering the referenced rules and computing FIRST sets at the
decision point's lookahead level.

Key design: build a flat Dispatch manually from expanded FIRST sets,
never passing expanded configs to build_sll_node. This avoids the 3
bugs from the previous approach (Consume corruption, depth-mixed
Dispatch, dedup false resolution).

Safety guards:
- Rule diversity check: skip if all opaque alts reference the same rule
- Alt count limit (<=8): prevent combinatorial explosion
- Nullable-start guard: skip rules starting with nullable elements
  (e.g., with_clause?) to prevent depth mismatch in sll_advance
- FIRST pre-filter: skip rule alternatives that can't match the token
- Coverage verification: reject if any original alt is lost
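
Three of the guards above (rule diversity, alt count limit, coverage verification) can be sketched in Python as follows. The helper names `may_expand` and `covers_all`, the data shapes, and the rule names in the usage note are hypothetical. The sketch also assumes the alt count limit applies to the number of alternatives in each referenced rule, which the commit message does not spell out.

```python
from typing import Dict, List

MAX_ALTS = 8  # alt count limit quoted in the commit message


def may_expand(opaque_alts: Dict[int, str], rule_alt_counts: Dict[str, int]) -> bool:
    """Pre-expansion guards.

    opaque_alts: alt_index -> name of the rule that alternative references.
    rule_alt_counts: rule name -> number of alternatives in that rule.
    Both shapes are assumptions for illustration.
    """
    rules = set(opaque_alts.values())
    if len(rules) < 2:
        # Rule diversity check: every opaque alt references the same rule.
        return False
    # Alt count limit: refuse rules that would fan out too widely.
    return all(rule_alt_counts[rule] <= MAX_ALTS for rule in rules)


def covers_all(dispatch: Dict[str, int], original_alts: List[int]) -> bool:
    """Coverage verification: every original alt must still be reachable."""
    return set(original_alts) <= set(dispatch.values())
```

For instance, `may_expand({0: "create_table_stmt", 1: "create_index_stmt"}, {"create_table_stmt": 2, "create_index_stmt": 1})` passes both guards, while a dispatch table that only reaches alt 0 fails `covers_all(dispatch, [0, 1])` and would be rejected.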

Results for SQLite grammar: 298 → 275 backtracking sites (-8%).
Primarily resolves CREATE (5→0) and DROP (4→0) groups where
alternatives start with different terminal sequences.

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
Modify sll_advance_inner to enter nullable elements containing
multi-token RuleRefs (e.g., with_clause?) via the return stack,
instead of treating them as single-token consumers. This fixes
the depth mismatch that caused try_expand_opaque to skip rules
starting with nullable elements.

When sll_advance encounters a nullable Repeat(Optional/Star, RuleRef):
- If the RuleRef is single-token: advance past it (unchanged)
- If multi-token and return_stack depth < 1: push continuation,
  enter the rule's alternatives, advance inside
- Otherwise: fall back to pos+1 (legacy behavior)

Guards: return_stack depth < 1, alt count <= 8, FIRST pre-filter.
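
The three-way decision above can be sketched in Python. The function name `advance_nullable_ref`, its parameters, and the tuple-shaped results are hypothetical. The sketch also assumes that when the FIRST pre-filter leaves no alternative to enter, the engine uses the legacy `pos+1` fallback; the commit message does not state this.

```python
from typing import List, Set, Tuple, Union

MAX_RETURN_DEPTH = 1  # "return_stack depth < 1" from the commit message
MAX_ALTS = 8          # "alt count <= 8"

# ("advance", next_pos) or ("enter", alt_index, continuation_pos)
Step = Union[Tuple[str, int], Tuple[str, int, int]]


def advance_nullable_ref(
    token: str,
    pos: int,
    return_depth: int,
    is_single_token: bool,
    alt_firsts: List[Set[str]],
) -> List[Step]:
    """Decide how to advance past a nullable `rule?` or `rule*` element at `pos`.

    alt_firsts holds the FIRST set of each alternative of the referenced
    rule (assumed to be precomputed).
    """
    if is_single_token:
        return [("advance", pos + 1)]  # unchanged behavior
    if return_depth < MAX_RETURN_DEPTH and len(alt_firsts) <= MAX_ALTS:
        # FIRST pre-filter: only enter alternatives that can start with `token`.
        entered: List[Step] = [
            ("enter", i, pos + 1)
            for i, first in enumerate(alt_firsts)
            if token in first
        ]
        if entered:
            return entered
    return [("advance", pos + 1)]  # legacy fallback
```

Each `("enter", alt_index, continuation_pos)` result stands for pushing `continuation_pos` onto the return stack and advancing inside the chosen alternative, so the tokens seen afterward belong to the correct input depth.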

Results for SQLite: 298 → 253 backtracking sites (-15%).
All 1301 tests pass. No correctness regressions.

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
gfx changed the title from "Reduce backtracking in package-gale parser" to "Reduce generated parser backtracking via ATN-style RuleRef expansion" Mar 28, 2026
gfx merged commit b2e0f13 into main Mar 28, 2026
9 of 10 checks passed
gfx deleted the claude/reduce-backtracking-kyBvr branch March 28, 2026 06:18
2 participants