This guide explains how the tokenizer, parser, and AST processing work, and how to extend them.

Architecture overview
Meta tokens: higher-level constructs like Comment, DefineStatement, SayStatement, etc. (these can overlap and effectively form a tree overlay). Note that meta tokens are not strictly necessary to parse the document, but are there to help.
Tokens → AST (Parser)
The parser iterates through the token stream and applies rule objects that each have test() and parse() methods.
Rules validate grammar, consume tokens, and build typed AST nodes.
AST → Program model
The AST is processed into an RpyProgram, which contains symbols, scopes, references, and diagnostics.
This model powers features like go-to-definition and find-references.
Core files you’ll interact with:
parser/parser-test.ts — a simple wrapper that runs tokenization and parsing for the active document.
Tokenizer
The tokenizer is mature and used in production today. It’s conceptually similar to VS Code’s highlighting approach (Oniguruma-style), but implemented from scratch in TypeScript.
Emits a strictly ordered sequence of tokens.
Atomic tokens don’t overlap.
Meta tokens can overlap and describe higher-level constructs, effectively giving a rudimentary AST shape before verification (see the illustrative sketch at the end of this section).
There’s a debug command to visualize tokens; by default it's bound to Ctrl+Alt+Shift+T (you may need to bind it manually).
The tokenizer rules are auto-generated from the syntax highlighting rules we use for VS Code’s highlighter.
VS Code also has the internal Ctrl+Alt+Shift+I command to visualize the syntax highlighting tokens, which can be useful to check since we use the same regular expressions.
If you want to know exactly which tokens exist, check tokenizer/renpy-tokens.ts and related files in that folder.
Also check out the syntax highlighter guide to learn more about the tokenizer: #447
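To make the atomic/meta token distinction above concrete, here is a purely illustrative sketch of what a token might carry; the field and type names are assumptions, not the actual definitions from tokenizer/renpy-tokens.ts.

```typescript
// Purely illustrative token shape; the field names are assumptions,
// not the real definitions from tokenizer/renpy-tokens.ts.
interface DebugToken {
    tokenType: string;   // e.g. a keyword or operator, or a meta construct like a define statement
    startOffset: number; // character offset where the token begins
    endOffset: number;   // character offset where the token ends
    isMeta: boolean;     // meta tokens may overlap each other; atomic tokens never do
}

// For a line like `define e = Character("Eileen")`, the stream would conceptually
// contain atomic tokens for the keyword, variable name, `=`, call, and string,
// plus one overlapping meta token spanning the whole define statement.
```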
Parser
The parser is rule-driven and intentionally modular.
Each rule has:
test() — quick predicate to check if the rule applies at the current token.
parse() — consumes tokens and returns a specific AST node or null on error.
A central loop walks a list of rules and executes the first rule whose test() returns true.
Example: DefineStatementRule expects the define keyword, optionally an integer, an assignment, then an end-of-line.
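As a rough illustration of the rule pattern and the central loop, here is a minimal, self-contained sketch; the interfaces and helper names are assumptions for illustration and do not match the real classes in src/parser/renpy-grammar-rules.ts exactly.

```typescript
// Minimal sketch of the rule pattern (hypothetical types; the real base
// classes and token stream live in src/parser/renpy-grammar-rules.ts).
interface AstNode {}

interface TokenStream {
    peekIsKeyword(keyword: string): boolean; // look ahead without consuming
    // ...token-consuming helpers omitted
}

interface GrammarRule {
    test(stream: TokenStream): boolean;          // cheap applicability check
    parse(stream: TokenStream): AstNode | null;  // consume tokens, build a node (or null on error)
}

// Central loop: run the first rule whose test() succeeds at the current token.
function parseNext(stream: TokenStream, rules: GrammarRule[]): AstNode | null {
    for (const rule of rules) {
        if (rule.test(stream)) {
            return rule.parse(stream);
        }
    }
    return null; // no rule applies; the caller handles error recovery
}
```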
We maintain a formal grammar in grammars/renpy.grammar.ebnf. That’s the source of truth we mirror in the rules.
Note: The current grammar is incomplete and in some cases inaccurate, which means we may accept or reject constructs that the official Ren’Py parser handles differently.
Error handling has been improved recently, but it’s still evolving. Comments and some edge cases may need more work; while extending the parser, start with valid syntax and then expand coverage.
AST and semantic processing
Every parse() returns a typed AST node (see src/parser/ast-nodes.ts). Nodes can override a visit(program: RpyProgram) method to:
Declare symbols (labels, defines, characters).
Record references (e.g., a jump to a label creates a reference to the label’s symbol).
Emit diagnostics when constraints aren’t met (e.g., an undefined label).
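For example, a definition-style node might declare its symbol during the walk. This is a hedged sketch with hypothetical API names; the real node base class lives in src/parser/ast-nodes.ts.

```typescript
// Sketch only: the scope/program API names here are assumptions.
interface RpyScope {
    declareSymbol(name: string, definition: unknown): void;
}
interface RpyProgram {
    globalScope: RpyScope;
}

class LabelStatementNode {
    constructor(public labelName: string) {}

    // Register the label in the global scope so jump/call sites can resolve it later.
    visit(program: RpyProgram): void {
        program.globalScope.declareSymbol(this.labelName, this);
    }
}
```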
The RpyProgram is the semantic model. It holds:
A global scope (and more scopes as needed, though Ren’Py/Python semantics often push things toward module/global).
A symbol table with definition locations and a list of references per symbol.
Example usage pattern (as seen in parser-test.ts):
Tokenize and parse the document to an AST.
ast.visit(program) builds the semantic model.
Resolve a symbol: program.globalScope.resolve("e") → returns a RpySymbol with definitionLocation and references.
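In code, the pattern looks roughly like this; the function and class signatures below are assumptions based on the description above, not the actual API (see src/parser/parser-test.ts for the real wiring).

```typescript
// Hypothetical signatures for the steps above; the real entry points live in
// src/parser/parser-test.ts and the tokenizer/parser modules.
declare function tokenizeDocument(text: string): unknown[];
declare function parseDocument(tokens: unknown[]): { visit(program: RpyProgram): void };
declare class RpyProgram {
    globalScope: {
        resolve(name: string): { definitionLocation: unknown; references: unknown[] } | undefined;
    };
}

const tokens = tokenizeDocument('define e = Character("Eileen")');
const ast = parseDocument(tokens);

const program = new RpyProgram();
ast.visit(program); // builds the semantic model

const symbol = program.globalScope.resolve("e");
console.log(symbol?.definitionLocation, symbol?.references);
```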
Implementing a new statement (example: jump)
Here’s the pattern I follow when implementing a new statement.
Define/verify grammar
Confirm the EBNF in grammars/renpy.grammar.ebnf covers the statement (e.g., jump).
Cross-check with Ren’Py’s reference implementation if needed.
Implement the parser rule
Add JumpStatementRule to src/parser/renpy-grammar-rules.ts.
test() should look for the jump keyword and whatever follows according to the grammar (e.g., a label name).
parse() should consume the tokens and build a JumpStatementNode that contains a LabelNameNode.
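A hedged sketch of what that rule could look like, using hypothetical token-stream helpers rather than the real API in src/parser/renpy-grammar-rules.ts:

```typescript
// Sketch only: the helper names on the token stream are assumptions.
interface TokenStream {
    peekIsKeyword(keyword: string): boolean;
    requireKeyword(keyword: string): boolean;
    requireName(): string | null; // consume and return an identifier, or null
    requireEndOfLine(): boolean;
}

class LabelNameNode {
    constructor(public name: string) {}
}

class JumpStatementNode {
    constructor(public target: LabelNameNode) {}
}

class JumpStatementRule {
    // Cheap check: does a jump statement start at the current token?
    test(stream: TokenStream): boolean {
        return stream.peekIsKeyword("jump");
    }

    // Consume `jump <label_name> <end-of-line>` and build the node, or fail with null.
    parse(stream: TokenStream): JumpStatementNode | null {
        if (!stream.requireKeyword("jump")) return null;
        const name = stream.requireName();
        if (name === null || !stream.requireEndOfLine()) return null;
        return new JumpStatementNode(new LabelNameNode(name));
    }
}
```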
Register the rule where it belongs
Add the rule to the statement group list in renpy-grammar-rules.ts so it’s considered at the appropriate point in the top-level loop.
Add AST nodes and processing
In src/parser/ast-nodes.ts, add JumpStatementNode (or a shared call/jump node if that fits better).
Override process(program) to resolve the label symbol and add a reference; if the label isn’t defined (yet), emit a diagnostic.
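The processing step could look roughly like this, again with hypothetical names for the semantic-model API:

```typescript
// Sketch only: the RpyProgram/scope API here is an approximation.
interface RpySymbol {
    references: JumpStatementNode[];
}
interface RpyProgram {
    globalScope: { resolve(name: string): RpySymbol | undefined };
    addDiagnostic(node: unknown, message: string): void;
}

class JumpStatementNode {
    constructor(public labelName: string) {}

    // Resolve the jump target and record this statement as a reference.
    process(program: RpyProgram): void {
        const symbol = program.globalScope.resolve(this.labelName);
        if (symbol === undefined) {
            program.addDiagnostic(this, `Label '${this.labelName}' is not defined.`);
            return;
        }
        symbol.references.push(this);
    }
}
```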
Validate with a unit test
Put a sample jump start in a .rpy test file (e.g., parser_test.rpy).
Run the parser test and confirm the label symbol collects the jump site as a reference after ast.process(program).
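A minimal check might look like the following sketch; parseSource and RpyProgram here are assumed stand-ins for the real tokenize/parse/process flow in src/parser/parser-test.ts.

```typescript
// Sketch only: parseSource and RpyProgram are hypothetical stand-ins.
declare function parseSource(source: string): { process(program: RpyProgram): void };
declare class RpyProgram {
    globalScope: { resolve(name: string): { references: unknown[] } | undefined };
}

const source = [
    "label start:",
    '    "Hello."',
    "",
    "label loop:",
    "    jump start",
].join("\n");

const ast = parseSource(source);
const program = new RpyProgram();
ast.process(program);

// The `jump start` line should now be recorded as a reference on the `start` symbol.
const startSymbol = program.globalScope.resolve("start");
console.log(startSymbol?.references.length); // expected: 1
```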
The same steps work for most statements: start from EBNF → write a rule → register it → add AST nodes → wire references/diagnostics → validate.
LSP server (why and how it fits)
I’m building an LSP server so parsing/indexing runs out-of-process:
Keeps the editor responsive while the server indexes the whole project.
Makes the parsing engine reusable by any LSP client, not just VS Code.
The parser/tokenizer themselves avoid heavy VS Code types; the main dependency you’ll see is the text document API, which I can abstract further if needed.
If you’re consuming this outside the VS Code extension, the LSP server is the cleanest integration point.
Current status and limitations
Tokenizer: stable and production-tested.
Parser: rapidly improving; error handling and recovery are better but not perfect.
Current progress is mostly blocked by the EBNF grammar not yet covering all Ren’Py features.
We can parse basic source files, but coverage of other constructs is still missing.
Raw Python source is currently parsed using AI-generated parser rules. These are likely incorrect in places, but it may already be possible to parse some Python source.
Comments and some edge cases: still being expanded.
Symbol references: infrastructure is in place; some categories may still be incomplete.
Grammar coverage: substantial, but newer features are being added as we go.
AST: walking the AST and processing the resulting data into something usable still needs a proper implementation.
I have made some attempts to make use of the current data; for example, see registerHighlightProvider in src/semantics.ts.
Developer workflow tips
Use src/parser/parser-test.ts to run tokenization/parsing on the active document and print debug output.
Try out/extend parser_test.rpy to exercise specific constructs.
Use the token debug command (bind to Ctrl+Alt+Shift+T) to inspect tokens and their metadata.
When a rule doesn’t fire, check the tokenizer output and ensure the rule is registered in the correct statement group.
Make sure the rules are defined in the correct order. Prioritize rules that have clear grammar and place generic/fallback patterns last.
Emit diagnostics from AST visit() when you need semantic context (e.g., undefined names, duplicates).
You can also dump the AST using logCatMessage(LogLevel.Debug, LogCategory.Parser, ast.toString());
Contributing checklist
Start with the EBNF in grammars/renpy.grammar.ebnf.
Implement or extend a rule in src/parser/renpy-grammar-rules.ts.
Add/adjust AST nodes in src/parser/ast-nodes.ts.
Update process(program) logic to declare symbols, add references, and emit diagnostics.
Validate with src/parser/parser-test.ts and sample .rpy code.
Iterate on error handling and edge cases.
If you need pointers on specific files or abstractions, I’m happy to guide—my goal is to keep the core modular and approachable while we expand grammar coverage and semantic analysis.