langlang is a parser generator that aspires to provide a toolbox for interacting with computer languages in a more general way.
The core piece, and the only thing described in the roadmap, is a top-down parser generator based on Parsing Expression Grammars. It is rather feature rich and strives to be as minimal and elegant as possible, without losing sight of performance.
To shape the core piece, and to enrich the tooling around extracting and manipulating the meaning of text and structured data, other algorithms and tools will be written along the way. The Rambling section contains some of the ideas about what could be built. Once these ideas take enough shape, they become implementation designs and get implemented.
- [X] Straightforward Grammar definition language
- [X] Full unicode support, you can match emoji
- [X] Matches data structures besides strings
- [X] Full support for left recursive productions
- [X] Operator precedence and associativity definition
- [X] Non-left recursive rules are compiled with zero precedence
- [X] Relative import system with disk and in-memory abstractions
- [ ] Indentation based syntax matching (Relative spaces)
- [ ] Transform matched values (Semantic action expressions)
- [-] Error reporting
  - [X] Report correct error position (Farthest failure position)
  - [X] Position tracking and reporting (location spans)
  - [X] Option tracking on choice operations
  - [X] Throw custom error labels
  - [ ] Automatic label insertion
- [-] Error recovery
  - [X] Manual Recovery Expression definition
  - [X] Associate manually defined labels to manually defined REs
  - [ ] Automatic recovery expression generation
- [ ] Pipe multiple grammars through the command line
- [ ] parameterized rules?
- [ ] higher-order rules?
- [ ] packages?
- [X] ABNF (RFC-5234)
- [ ] C
- [ ] C++
- [ ] CPP
- [X] CSV
- [ ] CSS
- [ ] Go
- [ ] Java
- [ ] JavaScript
- [X] JSON
- [ ] JSX
- [ ] HTML (tolerant to failures – invalid)
- [ ] Python (relative space)
- [X] PEG
- [ ] Uniform Resource Identifier (URI): Generic Syntax (RFC-3986)
- [ ] Rust
- [ ] TypeScript
- [ ] XML
Matches of an input against a grammar produce an output tree. The
Virtual Machine uses a unified stack with three frame types:
Backtracking, Call, and Capture. Captured nodes are stored in a
shared arena (nodeArena) rather than a separate capture stack.
Stack frame types              nodeArena, shared, grows as frames
┌─────────────────────┐        are pushed
│ Capture(id=Expr)    │        ┌───┬───┬───┬───┬───┐
│ nodesStart=0        │   ───▶ │ 0 │ 1 │ 2 │   │   │ ───▶
│ nodesEnd=3          │        └───┴───┴───┴───┴───┘
├─────────────────────┤
│ Backtracking        │
│ cursor=5, pc=42     │        and shrinks as frames are popped
├─────────────────────┤        off the stack
│ Call                │        ┌───┬───┐
│ pc=100              │   ◀─── │   │   │ ───◀
└─────────────────────┘        └───┴───┘
The output tree contains four node types:
- NodeType_String: a terminal match storing a byte range (start/end offsets into the input)
- NodeType_Sequence: an ordered list of child nodes
- NodeType_Node: a named rule match with a single child
- NodeType_Error: a recovery point containing error metadata
Example: parsing "hello" with grammar Greeting <- 'hello':
NodeType_Node (name="Greeting", start=0, end=5)
└── NodeType_String (start=0, end=5)
The string node has the offset within the input:
input: "hello"
▲───▲
0 5
Example: parsing "ab" with grammar Pair <- 'a' 'b':
NodeType_Node (name="Pair")
└── NodeType_Sequence
├── NodeType_String "a" (0,1)
└── NodeType_String "b" (1,2)
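As a rough sketch, the four node types and the `Pair` example above could be modeled like this (hypothetical Rust types; names and fields are illustrative, not langlang's actual definitions):

```rust
// Hypothetical sketch of the output tree's four node types.
#[derive(Debug, PartialEq)]
enum Tree {
    // terminal match: byte offsets into the input
    String { start: usize, end: usize },
    // ordered list of child nodes
    Sequence { children: Vec<Tree> },
    // named rule match wrapping a single child
    Node { name: &'static str, start: usize, end: usize, child: Box<Tree> },
    // recovery point carrying error metadata
    Error { label: &'static str, child: Option<Box<Tree>> },
}

// The tree produced by parsing "ab" with `Pair <- 'a' 'b'`
fn pair_tree() -> Tree {
    Tree::Node {
        name: "Pair",
        start: 0,
        end: 2,
        child: Box::new(Tree::Sequence {
            children: vec![
                Tree::String { start: 0, end: 1 }, // "a"
                Tree::String { start: 1, end: 2 }, // "b"
            ],
        }),
    }
}

fn main() {
    match pair_tree() {
        Tree::Node { name, start, end, child } => {
            assert_eq!(name, "Pair");
            assert_eq!((start, end), (0, 2));
            assert!(matches!(*child, Tree::Sequence { .. }));
        }
        _ => panic!("expected a named node"),
    }
}
```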
The AddCaptures step of the grammar compiler wraps each definition’s
expression in a CaptureNode named after the definition. For
non-syntactic rules (those containing calls to other rules via
identifiers), inner terminal expressions are also wrapped with unnamed
captures so that terminal matches are grouped appropriately. Purely
syntactic rules (containing only terminals like literals, charsets,
ranges, and .) don’t need inner captures since the entire match is
captured as one unit. Certain rules like Spacing, Space, EOF,
and EOL (and their dependencies) can be skipped via the
`--disable-capture-spaces` CLI flag, to avoid capturing whitespace.
The VM provides several capture-related instructions:
- `CapBegin`: pushes a `Capture` frame onto the stack with an ID and the current cursor position. The frame tracks a range (`nodesStart`/`nodesEnd`) into the shared node arena.
- `CapEnd`: pops the capture frame and creates a node from its captured children. If no children were captured but the cursor moved, it creates a `NodeType_String`. With one child, it reuses that node. With multiple children, it creates a `NodeType_Sequence`. If the capture has a name (non-zero ID), the result is wrapped in a `NodeType_Node`.
Matching "ab" against: Pair <- 'a' 'b'
1. CapBegin(Pair)
Stack:
┌──────────────┐
│Cap(Pair) │
│start=0,end=0 │
└──────────────┘
nodeArena: []
2. match 'a', CapTerm
Stack:
┌──────────────┐
│Cap(Pair) │
│start=0,end=1 │
└──────────────┘
nodeArena: [S0]
3. match 'b', CapTerm
Stack:
┌──────────────┐
│Cap(Pair) │
│start=0,end=2 │
└──────────────┘
nodeArena: [S0, S1]
4. CapEnd(Pair)
- pops frame, sees 2 children, forms a Sequence
- wraps in Node(Pair)
- pushes result to parent (or nodes[])
Output:
Node(Pair)
└── Seq[S0, S1]
├── String(0,1) "a"
└── String(1,2) "b"
- `CapTerm`: creates a terminal string node when the matched size is known at compile time.
- `CapNonTerm`: creates a named terminal node when the matched size is known at compile time.
- `CapTermBeginOffset`/`CapNonTermBeginOffset`/`CapEndOffset`: used when the matched size isn't known at compile time (e.g., for `Span` operations). These track the start cursor and compute the offset dynamically.
- `CapCommit`: pops the top frame and propagates its captures to the parent frame (or to the top-level `nodes` list). Used for ordered choice (/) on success.
- `CapBackCommit`: like `CapCommit` but also restores the cursor. Used for predicates.
- `CapPartialCommit`: collects captures from the current frame into its parent without popping, then resets the frame's capture range. Used within `ZeroOrMore` (*) loops to accumulate iterations.
- `CapReturn`: pops the capture frame and propagates captures. Emitted instead of `Return` when captures are enabled.
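The `CapEnd` folding rules (no children but cursor moved → string, one child → reuse, many → sequence, named → wrap in a node) can be sketched as follows; the types and names are illustrative, not langlang's actual implementation:

```rust
#[derive(Debug, PartialEq)]
enum Tree {
    String { start: usize, end: usize },
    Sequence(Vec<Tree>),
    Node { name: String, child: Box<Tree> },
}

// Sketch of CapEnd: fold the children captured between CapBegin and
// CapEnd into a single node. `id` is 0 for unnamed captures.
fn cap_end(id: u32, name: &str, start: usize, cursor: usize,
           mut children: Vec<Tree>) -> Option<Tree> {
    let result = match children.len() {
        // no children, but the cursor moved: the whole span is a terminal
        0 if cursor > start => Tree::String { start, end: cursor },
        0 => return None,              // nothing captured at all
        1 => children.pop().unwrap(),  // reuse the single child
        _ => Tree::Sequence(children), // multiple children form a sequence
    };
    // a named capture (non-zero ID) wraps the result in a rule node
    if id != 0 {
        Some(Tree::Node { name: name.to_string(), child: Box::new(result) })
    } else {
        Some(result)
    }
}

fn main() {
    // "ab" against Pair <- 'a' 'b': two terminals form a Sequence in Node(Pair)
    let out = cap_end(1, "Pair", 0, 2, vec![
        Tree::String { start: 0, end: 1 },
        Tree::String { start: 1, end: 2 },
    ]).unwrap();
    match &out {
        Tree::Node { name, child } => {
            assert_eq!(name, "Pair");
            assert!(matches!(**child, Tree::Sequence(_)));
        }
        _ => panic!("expected a named node"),
    }
    // "hello" against Greeting <- 'hello': no inner captures, cursor moved
    let out = cap_end(2, "Greeting", 0, 5, vec![]).unwrap();
    assert!(matches!(out, Tree::Node { .. }));
}
```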
When the Fail path is taken, the VM pops frames from the stack. For
each popped frame, it truncates the node arena back to that frame’s
starting position (nodesStart), effectively discarding any captures
made within that frame. This ensures that backtracking properly
undoes captured values.
Matching "ac" against: Pair <- 'a' 'b' / 'a' 'c'

1. Try first choice
   Choice(L2)
   ┌──────────────┐
   │Backtrack     │
   │cursor=0,pc=L2│
   └──────────────┘
   nodeArena: []

2. match 'a' ok
   CapTerm
   ┌──────────────┐
   │Backtrack     │
   │cursor=0,pc=L2│
   └──────────────┘
   nodeArena: [S0] (captured)

3. match 'b' FAIL!
   Backtrack
   ┌──────────────┐
   │(popped)      │
   └──────────────┘
   nodeArena: [] (truncated!)

4. jump to L2 (2nd choice)
   cursor restored to 0
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=0 │
   └──────────────┘
   nodeArena: []

5. match 'a','c' ok
   CapTerm, CapTerm
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=2 │
   └──────────────┘
   nodeArena: [S0,S1]

6. CapEnd
   Output:
   Node(Pair)
   └── Seq["a","c"]
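The truncation step can be sketched in a few lines (illustrative names, not the actual VM code):

```rust
// Sketch of the fail path's arena rollback: popping a frame discards
// every capture made while that frame was on the stack.
struct Frame {
    nodes_start: usize, // where this frame's captures begin in the arena
}

fn fail(stack: &mut Vec<Frame>, arena: &mut Vec<&'static str>) {
    if let Some(frame) = stack.pop() {
        // roll the shared arena back to the frame's starting position
        arena.truncate(frame.nodes_start);
    }
}

fn main() {
    // 'a' matched and was captured under the first choice...
    let mut stack = vec![Frame { nodes_start: 0 }];
    let mut arena = vec!["S0"];
    // ...then 'b' fails: the frame is popped and the arena truncated
    fail(&mut stack, &mut arena);
    assert!(arena.is_empty());
    assert!(stack.is_empty());
}
```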
The NOT (!) and AND (&) predicates don’t produce captures
since they don’t consume input. When entering a NOT predicate, the
VM sets a predicate flag that suppresses capture operations.
The Optional (?) operator emits code wrapped with Choice/CapCommit
so that on success, captures are propagated.
The ZeroOrMore (*) operator uses CapPartialCommit to accumulate
captures across iterations without creating a new frame per iteration.
Matching "aaa" against: As <- 'a'*
1. CapBegin(As) 2. match 'a', CapTerm 3. CapPartialCommit
Choice(L1) (collect to parent)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Backtrack │ │Backtrack │ │Backtrack │
│start=0,end=0 │ │start=0,end=1 │ │start=1,end=1 │◀─reset
└──────────────┘ └──────────────┘ └──────────────┘
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Cap(As) │ │Cap(As) │ │Cap(As) │
│start=0,end=0 │ │start=0,end=0 │ │start=0,end=1 │◀─extended
└──────────────┘ └──────────────┘ └──────────────┘
nodeArena: [] nodeArena: [S0] nodeArena: [S0]
4. repeat 2-3 twice 5. match 'a' FAIL 6. CapCommit, CapEnd
(no more input)
┌──────────────┐ ┌──────────────┐ Output:
│Backtrack │ │(popped) │ Node(As)
│start=3,end=3 │ └──────────────┘ └── Seq["a","a","a"]
└──────────────┘
┌──────────────┐ ┌──────────────┐
│Cap(As) │ │Cap(As) │
│start=0,end=3 │ │start=0,end=3 │
└──────────────┘ └──────────────┘
nodeArena:[S0,S1,S2] nodeArena:[S0,S1,S2]
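A minimal sketch of `CapPartialCommit`'s bookkeeping, assuming each frame records its capture range as `nodes_start`/`nodes_end` (illustrative, not the actual implementation):

```rust
// Sketch of CapPartialCommit: the top frame's captures are handed to its
// parent, then the top frame's range is reset so the next loop iteration
// starts fresh -- all without pushing or popping frames.
struct CapFrame {
    nodes_start: usize,
    nodes_end: usize,
}

fn cap_partial_commit(frames: &mut [CapFrame]) {
    let n = frames.len();
    let end = frames[n - 1].nodes_end;
    // the parent now owns everything captured so far
    frames[n - 2].nodes_end = end;
    // reset the top frame's range for the next iteration
    frames[n - 1].nodes_start = end;
}

fn main() {
    // after matching the first 'a' in "aaa" against As <- 'a'*:
    // Cap(As) holds [0..0], the loop's frame holds [0..1] (one capture)
    let mut frames = vec![
        CapFrame { nodes_start: 0, nodes_end: 0 }, // Cap(As)
        CapFrame { nodes_start: 0, nodes_end: 1 }, // loop's Backtrack frame
    ];
    cap_partial_commit(&mut frames);
    assert_eq!(frames[0].nodes_end, 1);   // parent extended
    assert_eq!(frames[1].nodes_start, 1); // top frame reset
}
```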
Error recovery expressions (defined by error labels in the grammar)
produce NodeType_Error nodes instead of regular nodes. These
contain an error label ID, message ID, and optionally a child node
with the recovered content.
Read along to see how one can also transform captured values after they match and before they’re returned.
In the original definition of Parsing Expression Grammars, backtracking resets the input cursor to where it was before trying a different parsing expression. If there is no match, backtracking fails and the cursor is left at the position it was at the beginning of the last Ordered Choice.
To improve error reporting, a heuristic called the Farthest Failure Position introduces a new register to the Virtual Machine:
The ffp is the farthest cursor position reached before any failure.
Every time the fail path is taken, if the current cursor exceeds
ffp, the ffp register is updated to the cursor's value. This value
is immune to backtracking and provides a more accurate error
position.
For example, consider the following grammar:
Pair <- 'a' 'b' 'c'
Let’s match the input "ab" against it, we expect to observe the
following steps:
- match ‘a’ ok, cursor=1, ffp=-1
- match ‘b’ ok, cursor=2, ffp=-1
- match ‘c’ er, cursor=2, ffp=2 <- updated on failure
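The steps above can be sketched with a tiny farthest-failure tracker; the `Vm` struct and `match_char` helper are illustrative, not langlang's actual API:

```rust
// Sketch of farthest-failure-position tracking.
struct Vm {
    cursor: usize,
    ffp: isize, // -1 means "no failure seen yet"
}

impl Vm {
    fn new() -> Self {
        Vm { cursor: 0, ffp: -1 }
    }

    fn match_char(&mut self, input: &str, expected: char) -> bool {
        if input[self.cursor..].starts_with(expected) {
            self.cursor += expected.len_utf8();
            true
        } else {
            // fail path: remember the farthest point reached
            if self.cursor as isize > self.ffp {
                self.ffp = self.cursor as isize;
            }
            false
        }
    }
}

fn main() {
    // Pair <- 'a' 'b' 'c' against "ab"
    let mut vm = Vm::new();
    assert!(vm.match_char("ab", 'a'));  // cursor=1, ffp=-1
    assert!(vm.match_char("ab", 'b'));  // cursor=2, ffp=-1
    assert!(!vm.match_char("ab", 'c')); // cursor=2, ffp=2
    assert_eq!(vm.ffp, 2);
}
```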
When the VM returns an error, it uses ffp to report where parsing
actually got stuck, rather than where backtracking left the cursor:
Input: "ab"
          ▲
          2 <---- ffp points to the last index, not at the start
When showFails is enabled, the VM collects “expected” values at the
FFP. These are the characters or ranges that would have allowed
parsing to continue:
Let’s consider the grammar
Pair <- 'a' ('b' / 'c')
When matching "ax" against it, we expect the following steps to
happen:
- match ‘a’ ok
- match ‘b’, fail (collect ‘b’)
- backtrack
- match ‘c’, fail (collect ‘c’)
- fail
When parsing finally fails, the error message lists the characters it unsuccessfully attempted to match:
Error: "Expected 'b', 'c' but got 'x'"
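A sketch of how "expected" values could be collected at the FFP (the `Reporter` shape is hypothetical; only the clear-on-new-farthest-failure behavior comes from the text above):

```rust
use std::collections::BTreeSet;

// Sketch of expected-set collection: on each failure at the farthest
// position, record what would have allowed parsing to continue.
struct Reporter {
    ffp: usize,
    expected: BTreeSet<char>,
}

impl Reporter {
    fn fail(&mut self, cursor: usize, wanted: char) {
        if cursor > self.ffp {
            // a new farthest failure invalidates earlier candidates
            self.ffp = cursor;
            self.expected.clear();
        }
        if cursor == self.ffp {
            self.expected.insert(wanted);
        }
    }

    fn message(&self, got: char) -> String {
        let opts: Vec<String> =
            self.expected.iter().map(|c| format!("'{}'", c)).collect();
        format!("Expected {} but got '{}'", opts.join(", "), got)
    }
}

fn main() {
    // matching "ax" against Pair <- 'a' ('b' / 'c')
    let mut r = Reporter { ffp: 0, expected: BTreeSet::new() };
    r.fail(1, 'b'); // 'b' failed at cursor=1
    r.fail(1, 'c'); // after backtracking, 'c' failed at cursor=1
    assert_eq!(r.message('x'), "Expected 'b', 'c' but got 'x'");
}
```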
The Throw Operator (^) allows grammar authors to signal that no
more alternatives should be attempted and parsing should fail
immediately. This is useful when you know that after matching certain
tokens, no other choice could possibly succeed.
The opThrow instruction works as follows:
opThrow(labelID):
if predicate:
goto fail ◀── within !/& predicates, just backtrack
if recovery[labelID] exists:
call recovery[labelID] ◀── try to recover and continue
else:
return error ◀── stop parsing immediately
A Throw Operator is generally most useful right after matching tokens that uniquely identify a construct. Consider:
IfStatement <- IF LEFTP Expression RIGHTP Body
AllStatements <- IfStatement / ForStatement / WhileStatement ...
These inputs would unnecessarily trigger all alternatives in
AllStatements:
'if x', 'if (', 'if (x'
With the Throw Operator, parsing stops early on failure:
IfStatement <- IF LEFTP^ Expression^ RIGHTP^ Body
Matching "if x" against: IfStatement
1. match 'if' ok 2. match '(' with ^
cursor=2 cursor=3
Without ^: With ^:
→ backtrack → throw error
→ try ForStatement → "Expected '(' at position 3"
→ try WhileStatement
→ ... (wasteful)
The Throw Operator accepts an optional custom message:
IfStatement <- IF LEFTP^ Expression^"missing expression" RIGHTP^ Body
The syntax expr^label is compiled as (expr / ⇑label):
choice L1 <-- try expr first
<expr code>
commit L2 <-- expr succeeded, skip throw
L1: throw label <-- expr failed, throw error
L2: ...
When a label has an associated Recovery Expression, the throw
instruction calls it instead of returning an error. This enables
incremental parsing where errors are captured as NodeType_Error
nodes in the output tree.
Consider the following grammar:
Expr <- Term ('+' Term^MissingTerm)*
Term <- [0-9]+
MissingTerm <- (!Term .)* Term?
Let’s try to match the input "1++2+3" against it:
- match ‘1’ ok
- match ‘+’ ok
- match `Term` FAIL: throw `MissingTerm`
- call `MissingTerm` recovery expression
- skip past ‘+’ and match `Term` (also captures an `Error` node)
- match ‘+’ ok
- match ‘3’ ok
And this is the expected output:
Expr (1..7)
└── Sequence<5> (1..7)
├── Term (1..2)
│ └── "1" (1..2)
├── "+" (2..3)
├── Error<MissingTerm> (3..5)
│ └── Sequence<2> (3..5)
│ ├── (3..4)
│ │ └── "+" (3..4)
│ └── Term (4..5)
│ └── "2" (4..5)
├── "+" (5..6)
└── Term (6..7)
└── "3" (6..7)
By default, the parser fails at the first error (as Parsing Expression Grammars originally do). An incremental parsing mode is also included, trading annotation costs for precision.
When parsing is done incrementally, the Throw Operator won’t interrupt parsing right away. It will instead add a special node to the tree returned by the parser storing information about the error. The parser will then execute the Recovery Expression associated with the (implicitly created) label behind the Throw Operator, which should consume the input until where the matching of another expression can be attempted.
The default Recovery Expression for the label behind an instance of the Throw Operator is the following:
Annotation costs come from the
What would it take to build a parser generator tool capable of self-hosting? It can already take a stream of characters, transform it into a tree, and then traverse that tree.
There's now the need to emit code that could then be interpreted by the same virtual machine used to run the parsing and the translating. Besides choosing a format for outputting the emitted code, it will also be necessary to provide auxiliary tooling for introspecting the output stream: reading from arbitrary positions within it, and knowing where the writing cursor is (e.g., for creating labels).
So, before being capable of self-hosting, this tool has to be able to describe an entire compiler. A first good exercise would be to try and implement something similar to what is described in the article “PEG-based transformer provides front-, middle and back-end stages in a simple compiler” (Piumarta, 2010). langlang isn’t very far from achieving that. There are two main missing steps:
- semantic actions, to allow transformations to be performed on the values produced upon matches;
- a modular output stream that can encode values in different formats.

The backtracking of the parser is the reason why it's complicated to allow side effects in semantic actions. Our options are to either build an output stream that is aware of the backtracking, or to apply the output rules after matching, with a classic visitor style.
The initial format of the output stream can be text based, and the proof that it’d work is the ability to compile grammars into their correct text representation of the bytecode that the virtual machine can interpret. There’s some
- [X] Parse input file with language grammar and get an AST
- [X] Generate the tree traversal grammar off the original grammar
- [ ] Traverse tree grammar and output code (ugly print)
- [ ] Merge overrides to the default settings of the code emitter
- [ ] Command line usage
Usage: langlang print [OPTIONS] INPUT_FILE

OPTIONS
  [-g|--grammar] GRAMMAR       Grammar file used to parse INPUT_FILE
  [-o|--output] OUTPUT         Writes output into OUTPUT file
  --cfg-tab-width NUMBER       Configure indentation size
  --cfg-max-line-width NUMBER  Maximum number of characters in a single line
  --cfg-add-trailing-comma     Add trailing comma to the end of the last list entry
- [X] Parse both files and get their AST
- [ ] Apply tree diff algorithm
- [ ] Display results
- [ ] Command line usage
langlang diff file_v1.py file_v2.py
langlang diff file.py file.js
- [ ] Undo/Redo
- [ ] LSP Server
- [ ] CRDT Storage
- [ ] AST diff
- [ ] Language Server Protocol
- [ ] Text Editing
- [ ] Rendering Engine
- [ ] Configuration Language
Suppose we can parse a .ln file with a given grammar lang.peg.
That'd give us an AST as output. One option is to write the
translator as a tree traversal for that AST that emits code. That
would take one such traversal per language that needs to be
supported, doubling the burden on the user's side, since there was
already the need of putting together the language grammar.
In order to automate some of the process, one could maybe take the
lang.peg file as input and produce a lang.translator.peg, in
which rules that output trees would be translated into rules that
could also take structured data as input. Take the following
rules as an example:
Program <- _ Statement+ EndOfFile
Statement <- IfStm / WhileStm / AssignStm / Expression
AssignStm <- Identifier EQ Expression
IfStm <- IF PAROP Expression PARCL Body
WhileStm <- WHILE PAROP Expression PARCL Body
Body <- Statement / (CUROP Statement* CURCL)
# (...)
IF <- 'if' _
WHILE <- 'while' _
EQ <- 'eq' _
PAROP <- '(' _
PARCL <- ')' _
CUROP <- '{' _
CURCL <- '}' _
# (...)
The following output would be generated:
Program <- { "Program" _ Statement+ EndOfFile }
Statement <- { "Statement" IfStm / WhileStm / AssignStm / Expression }
AssignStm <- { "AssignStm" Identifier EQ Expression }
IfStm <- { "IfStm" IF PAROP Expression PARCL Body }
WhileStm <- { "WhileStm" WHILE PAROP Expression PARCL Body }
Body <- { "Body" Statement / (CUROP Statement* CURCL) }
# (...)
IF <- { "IF" Atom }
WHILE <- { "WHILE" Atom }
EQ <- { "EQ" Atom }
PAROP <- { "PAROP" Atom }
PARCL <- { "PARCL" Atom }
CUROP <- { "CUROP" Atom }
CURCL <- { "CURCL" Atom }
# (...)
Atom <- !{ .* } .
With that, we’d know how to traverse any tree returned by the
original lang.peg. We could then build a general traversal that
walks down the tree, printing out what was matched.
There is one type of information that is not available in the original grammar though. The specifics of each language! For example, in Python, default values for named arguments aren’t supposed to have spaces surrounding the equal sign e.g.:
def complex(real, imag=0.0):
    return # (...)

But that's not the same as in JavaScript:

function multiply(a, b = 1) {
  return a * b;
}

To the same extent, minification rules for Python would be different from most other languages as well, given its indentation-based definition of scopes.
The good news is that most of these differences, if not all, can be encoded as options available for all languages, leaving the user with a much smaller burden of defining only the overrides for each language that demands options that differ from the defaults in the code emitter.
In langlang, modules are recursive containers for other modules and for grammars.
┌────────┐
│ Module │
├────────┤
│ Rule1  │
│ Rule2  │
│ RuleN  │
└────────┘
use std::collections::HashMap;

type Offset = usize;
type SymbolName = String;

struct Module {
    filename: String,
    // module in which this module was declared
    parent: Option<Box<Module>>,
    // modules declared within this module
    modules: Vec<Module>,
    // symbols provided by this module
    symbols: HashMap<SymbolName, Offset>,
    // symbols used in this module but declared somewhere else
    missing: HashMap<SymbolName, Offset>,
}

$ mkdir -p ./lib/base # directory structure for user defined grammars
$ edit ./lib/base/rule.langlang # write a grammar
$ llsh lib::base::rule https://url.with.test/case # a file lib/base/rule.binlang will be created
$ llsh -i. lib::base::rule https://url.with.test/case # previous example worked because `-i./` is implicit
$ llsh -i./lib base::rule https://url.with.test/case # full name differs depending on where the root starts
$ MODULE_SEARCH_PATH=./lib llsh base::rule https://url.with.test/case # search path can be extended via env var

When a symbol is requested, a lookup in the symbol table is issued
and, if it is present there, its address is returned. If it is
not, then the BinaryLoader looks for it within the bytecode
cache, and if it's not there, it will go through each search path
and try to find it in the file system.
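The lookup order could be sketched like this; the `BinaryLoader` shape and the `Found` result type are hypothetical, and only the order of the three lookups comes from the text above:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

// Illustrative sketch: symbol table first, then bytecode cache,
// then each search path on disk.
struct BinaryLoader {
    symbols: HashMap<String, usize>, // resolved symbols -> address
    cache: HashMap<String, Vec<u8>>, // loaded-but-unlinked bytecode
    search_paths: Vec<PathBuf>,      // e.g. ".", "./lib", $MODULE_SEARCH_PATH
}

enum Found {
    Address(usize),
    Cached,
    OnDisk(PathBuf),
    Missing,
}

impl BinaryLoader {
    fn lookup(&self, symbol: &str) -> Found {
        if let Some(addr) = self.symbols.get(symbol) {
            return Found::Address(*addr);
        }
        if self.cache.contains_key(symbol) {
            return Found::Cached;
        }
        // "lib::base::rule" -> "lib/base/rule.langlang" under each search path
        let rel: PathBuf = symbol.split("::").collect::<Vec<_>>().join("/").into();
        for base in &self.search_paths {
            let candidate = base.join(&rel).with_extension("langlang");
            if candidate.exists() {
                return Found::OnDisk(candidate);
            }
        }
        Found::Missing
    }
}

fn main() {
    let mut symbols = HashMap::new();
    symbols.insert("lib::base::rule".to_string(), 42usize);
    let loader = BinaryLoader {
        symbols,
        cache: HashMap::new(),
        search_paths: vec![PathBuf::from("."), PathBuf::from("./lib")],
    };
    // already linked: resolved straight from the symbol table
    assert!(matches!(loader.lookup("lib::base::rule"), Found::Address(42)));
    // never seen and not on disk
    assert!(matches!(loader.lookup("lib::nope"), Found::Missing));
}
```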
# from stdin
echo https://clarete.li/langlang | llsh lib::rfc3986
# from a file
llsh lib::rfc5234 ~/lib/rfc3986.abnf
# from a URL
llsh lib::json https://jsonplaceholder.typicode.com/users
# interactive
llsh lib::peg
>> S <- 'a' / 'b'

state = <pc, s, e, c>
<pc, s, e, c> – Char a –> <pc+1, s, e, a:c>
<pc, s, e, c> – Choice i –> <pc+1, s, (pc+i,s,c):e, c>
<pc, s, e> – Throw f –> Fail<f, s, e>    (a Success state becomes a failure)
- inside choice `p / throw(label)`, when `p` fails:
  - log error tuple `(location(), label)`
  - run the expression within `R(label)`
- inside predicate `!(p / throw(label))`:
  - when `p` succeeds: return `label`, fail
  - when `p` fails: `R` is empty for predicates, so `throw` doesn't do
    anything; `label` is discarded and the operation succeeds
Once an expression fails to be parsed and throw is called, a
lookup for `label` is made within `R`. If a recovery expression is
found, it's executed with the goal of moving the parser's input
cursor to right before the first symbol of the next parsing
expression.
Follow Sets
An Expression e has a FOLLOW set of symbols that can be
intuitively described as the list of possible characters to be
matched after matching e.
- Base Case

  G <- (E / ⇑l) "x"

  The symbol `x` would be the only element of the FOLLOW set of symbols of `E`.

- Recursive Case

  G <- (E / ⇑l) (A / B)
  A <- "x" / "y"
  B <- "z" / "k"

  The FOLLOW set of `E` in this case is x, y, z, k, since any of these symbols could appear right after parsing `E`.
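Under the simplifying assumption that every alternative is a single literal, the union behind the recursive case can be sketched as follows (a toy representation, not how langlang computes FOLLOW sets):

```rust
use std::collections::BTreeSet;

// Toy grammar fragment: each rule is a list of single-literal alternatives,
// so FIRST of the rule is just the set of the literals' first characters.
fn first(alternatives: &[&str]) -> BTreeSet<char> {
    alternatives.iter().filter_map(|a| a.chars().next()).collect()
}

fn main() {
    // G <- (E / ⇑l) (A / B)   A <- "x" / "y"   B <- "z" / "k"
    let a = first(&["x", "y"]);
    let b = first(&["z", "k"]);
    // FOLLOW(E) = FIRST(A) ∪ FIRST(B): any of these can follow E
    let follow_e: BTreeSet<char> = a.union(&b).cloned().collect();
    let expected: BTreeSet<char> = ['k', 'x', 'y', 'z'].into_iter().collect();
    assert_eq!(follow_e, expected);
}
```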