Reduce generated parser backtracking via ATN-style RuleRef expansion #713
Merged
…expansion

Adds SllReturn/return_stack to SllConfig and the sll_expand_rule_ref helper, enabling the prediction engine to enter multi-token RuleRef alternatives. Expansion is currently disabled (guarded) pending resolution of the Consume node correctness issue discovered during testing.

Key findings:
- Return stack infrastructure has zero overhead when unused (empty arrays)
- RuleRef expansion reduces SQLite backtracking by 31% (298→205 sites)
- But Consume nodes from expanded rules incorrectly emit parse code at the decision point, causing test failures
- Next step: limit expansion results to Leaf-only (approach B)

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
Add return stack to SllConfig, enabling the prediction engine to expand multi-token RuleRefs by entering referenced rules during SLL advancement. This tracks continuation points so the engine can return to the caller after advancing through a sub-rule.

Infrastructure added (disabled, zero runtime overhead):
- SllReturn struct and return_stack field on SllConfig
- push_return/pop_return helpers for stack management
- sll_expand_rule_ref: expands multi-token RuleRefs with depth/alt guards
- try_expand_opaque: attempts to resolve opaque prediction groups
- strip_all_consume: removes Consume nodes from expanded prediction trees

The expansion is currently disabled (try_expand_opaque is not called) because dispatching on tokens from inside expanded sub-rules can produce incorrect prediction branches. Specifically:
- Consume nodes from sub-rules incorrectly consume tokens at the decision point
- Dispatch branches mix tokens from different rule depths
- Rules sharing prefixes (e.g., with_clause) create false disambiguation

The infrastructure is ready for activation once a correct dispatch strategy is implemented (e.g., computing FIRST sets at the decision point level rather than at the expanded position level).

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
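The continuation tracking described in this commit can be pictured with a small Python sketch. It is an illustration only, not the project's code: the names SllReturn, SllConfig, return_stack, push_return and pop_return come from the commit message, but every field (`rule`, `alt`, `pos`) and both function signatures are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SllReturn:
    # Hypothetical fields: where to resume in the caller after the sub-rule ends.
    rule: str
    alt: int
    pos: int


@dataclass(frozen=True)
class SllConfig:
    # Hypothetical fields: current rule, alternative, and position in it.
    rule: str
    alt: int
    pos: int
    # Empty by default, so configs that never enter a sub-rule carry nothing extra.
    return_stack: Tuple[SllReturn, ...] = ()


def push_return(cfg: SllConfig, callee_rule: str, callee_alt: int) -> SllConfig:
    """Enter a referenced rule, remembering the element after the RuleRef."""
    frame = SllReturn(cfg.rule, cfg.alt, cfg.pos + 1)
    return SllConfig(callee_rule, callee_alt, 0, cfg.return_stack + (frame,))


def pop_return(cfg: SllConfig) -> Optional[SllConfig]:
    """Resume in the caller, or return None if there is no pending continuation."""
    if not cfg.return_stack:
        return None
    *rest, top = cfg.return_stack
    return SllConfig(top.rule, top.alt, top.pos, tuple(rest))
```

In this sketch, entering `with_clause` from position 2 of a caller and then popping resumes the caller at position 3 with an empty stack again.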
…approaches

- Copy 18 complex SQL queries from benchmark/sqlite_parse/queries.sql to driver_sqlite_test.wado for better parser regression coverage (JOINs, recursive CTEs, correlated subqueries, CASE, set operations, etc.)
- Remove dead code from parser_gen.wado: sll_expand_rule_ref, try_expand_opaque, strip_all_consume (not called, caused correctness bugs when active)
- Keep zero-overhead return stack infrastructure (SllReturn, push_return, pop_return, return-stack-aware sll_config_first/sll_advance_inner)
- Document the RuleRef expansion approach and its 3 failure modes in package-gale/CLAUDE.md to prevent repeating the same mistakes

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
Implement try_expand_opaque: when the SLL prediction engine encounters opaque multi-token RuleRefs that would produce a Backtrack node, expand them by entering the referenced rules and computing FIRST sets at the decision point's lookahead level.

Key design: build a flat Dispatch manually from expanded FIRST sets, never passing expanded configs to build_sll_node. This avoids the 3 bugs from the previous approach (Consume corruption, depth-mixed Dispatch, dedup false resolution).

Safety guards:
- Rule diversity check: skip if all opaque alts reference the same rule
- Alt count limit (<=8): prevent combinatorial explosion
- Nullable-start guard: skip rules starting with nullable elements (e.g., with_clause?) to prevent depth mismatch in sll_advance
- FIRST pre-filter: skip rule alternatives that can't match the token
- Coverage verification: reject if any original alt is lost

Results for SQLite grammar: 298 → 275 backtracking sites (-8%). Primarily resolves CREATE (5→0) and DROP (4→0) groups where alternatives start with different terminal sequences.

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
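A minimal Python sketch of the flat-Dispatch idea follows, assuming the FIRST sets at the decision point's lookahead level have already been computed per alternative. The function name `build_flat_dispatch`, its signature, and the token names are hypothetical. It models only the alt count limit, the coverage check, and the all-Leaf condition mentioned in the PR summary, not the real try_expand_opaque.

```python
from typing import Dict, Optional, Set


def build_flat_dispatch(
    first_sets: Dict[int, Set[str]], max_alts: int = 8
) -> Optional[Dict[str, int]]:
    """Map each lookahead token to exactly one alternative, or give up (None)."""
    # Alt count limit: refuse large groups to avoid combinatorial blow-up.
    if len(first_sets) > max_alts:
        return None

    # Group alternatives by the token that can start them.
    by_token: Dict[str, Set[int]] = {}
    for alt, tokens in first_sets.items():
        for tok in tokens:
            by_token.setdefault(tok, set()).add(alt)

    # All-Leaf condition: every token must select a single alternative.
    if any(len(alts) != 1 for alts in by_token.values()):
        return None

    dispatch = {tok: next(iter(alts)) for tok, alts in by_token.items()}

    # Coverage check: every original alternative must remain reachable.
    if set(dispatch.values()) != set(first_sets):
        return None
    return dispatch
```

Returning None here stands in for falling back to the original Backtrack node. In the sketch, a group such as `{0: {"TABLE"}, 1: {"INDEX", "UNIQUE"}}` yields a dispatch table, while any token shared between two alternatives rejects the whole expansion.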
Modify sll_advance_inner to enter nullable elements containing multi-token RuleRefs (e.g., with_clause?) via the return stack, instead of treating them as single-token consumers. This fixes the depth mismatch that caused try_expand_opaque to skip rules starting with nullable elements.

When sll_advance encounters a nullable Repeat(Optional/Star, RuleRef):
- If the RuleRef is single-token: advance past it (unchanged)
- If multi-token and return_stack depth < 1: push continuation, enter the rule's alternatives, advance inside
- Otherwise: fall back to pos+1 (legacy behavior)

Guards: return_stack depth < 1, alt count <= 8, FIRST pre-filter.

Results for SQLite: 298 → 253 backtracking sites (-15%). All 1301 tests pass. No correctness regressions.

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C
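The three-way decision above can be sketched in Python as follows. This is a simplified model under stated assumptions: `Action`, `nullable_ref_action` and `prefilter_alts` are hypothetical names, and the real sll_advance_inner operates on grammar nodes and configs rather than on plain flags and token sets.

```python
from enum import Enum
from typing import Dict, List, Set


class Action(Enum):
    ADVANCE_PAST = "advance past the single-token RuleRef"
    ENTER_RULE = "push continuation and advance inside the rule"
    FALLBACK = "legacy pos+1"


def nullable_ref_action(
    is_single_token: bool,
    stack_depth: int,
    alt_count: int,
    max_depth: int = 1,
    max_alts: int = 8,
) -> Action:
    """Choose how to advance over a nullable Repeat(Optional/Star, RuleRef)."""
    if is_single_token:
        return Action.ADVANCE_PAST
    # Guards from the commit message: return_stack depth < 1, alt count <= 8.
    if stack_depth < max_depth and alt_count <= max_alts:
        return Action.ENTER_RULE
    return Action.FALLBACK


def prefilter_alts(alt_first: Dict[int, Set[str]], token: str) -> List[int]:
    """FIRST pre-filter: keep only alternatives that can start with `token`."""
    return sorted(alt for alt, first in alt_first.items() if token in first)
```

With the default limits, a multi-token rule is entered only from the top level (empty return stack); a second nested entry falls back to the legacy behavior.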
Summary
- Implement `try_expand_opaque`: when the prediction engine encounters opaque RuleRefs, enter the referenced rules and compute FIRST sets at the decision point's lookahead level to build a flat Dispatch
- Modify `sll_advance_inner` to enter nullable multi-token RuleRefs (e.g., `with_clause?`) via the return stack instead of treating them as single-token consumers
- Add return stack (`SllReturn`, `push_return`, `pop_return`) to `SllConfig` for tracking continuation points during rule expansion
- Copy 18 complex SQL queries from `benchmark/sqlite_parse/queries.sql` to `driver_sqlite_test.wado` for better parser regression coverage
- Document the RuleRef expansion approach and its 3 failure modes in `package-gale/CLAUDE.md`

Key design decisions
- Flat Dispatch only: Expanded configs are never passed to `build_sll_node`. FIRST sets are computed manually and grouped by alt_index to avoid the Consume corruption, depth-mixed Dispatch, and dedup false resolution bugs.
- All-Leaf guard: `try_expand_opaque` only returns a Dispatch when every branch resolves to a single alternative (Leaf). Any ambiguous branch causes the entire expansion to be rejected, falling back to the original Backtrack.
- Nullable entering: `sll_advance_inner` enters nullable `Repeat(Optional/Star, RuleRef)` elements via the return stack (depth < 1, alt count <= 8) so that the prediction engine sees tokens at the correct input depth.

Test plan
- `mise run on-task-done`

https://claude.ai/code/session_01ACVN5Rr7waUZWXtv8MFN2C