
Introduction

langlang is a parser generator and aspires to provide a toolbox for interacting with computer languages in a more general way.

The core piece, and the only thing described in the roadmap so far, is the top-down parser generator (based on Parsing Expression Grammars). It is rather feature rich and strives to be as minimal and elegant as possible, without losing sight of performance.

To shape the core piece, and to enrich the tooling around extracting and manipulating the meaning of text and structured data, other algorithms and tools will be written along the way. The Rambling section contains some of the ideas about what could be built; once these ideas take enough shape, they become implementation designs and get implemented.

Roadmap

Features

  • [X] Straightforward Grammar definition language
  • [X] Full Unicode support; you can even match emojis
  • [X] Matches data structures besides strings
  • [X] Full support for left recursive productions
  • [X] Operator precedence and associativity definition
  • [X] Non-left recursive rules are compiled with zero precedence
  • [X] Relative import system with disk and in-memory abstractions
  • [ ] Indentation based syntax matching (Relative spaces)
  • [ ] Transform matched values (Semantic action expressions)
  • [-] Error reporting
    • [X] Report correct error position (Farthest failure position)
    • [X] Position tracking and reporting (location spans)
    • [X] Option tracking on choice operations
    • [X] Throw custom error labels
    • [ ] Automatic label insertion
  • [-] Error recovery
    • [X] Manual Recovery Expression definition
    • [X] Associate manually defined labels to manually defined REs
    • [ ] Automatic recovery expression generation
  • [ ] Pipe multiple grammars through the command line
  • [ ] parameterized rules?
  • [ ] higher-order rules?
  • [ ] packages?

Standard Library

  • [X] ABNF (RFC-5234)
  • [ ] C
  • [ ] C++
  • [ ] CPP (the C preprocessor)
  • [X] CSV
  • [ ] CSS
  • [ ] Go
  • [ ] Java
  • [ ] JavaScript
  • [X] JSON
  • [ ] JSX
  • [ ] HTML (tolerant to failures and invalid input)
  • [ ] Python (relative space)
  • [X] PEG
  • [ ] Uniform Resource Identifier (URI): Generic Syntax (RFC-3986)
  • [ ] Rust
  • [ ] TypeScript
  • [ ] XML

Implemented Design

Capturing Values

Matches of an input against a grammar produce an output tree. The Virtual Machine uses a unified stack with three frame types: Backtracking, Call, and Capture. Captured nodes are stored in a shared arena (nodeArena) rather than a separate capture stack.

Stack frame types            nodeArena, shared, grows as frames are pushed
┌─────────────────────┐             ┌───┬───┬───┬───┬───┐
│ Capture(id=Expr)    │         ───▶│ 0 │ 1 │ 2 │   │   │───▶
│ nodesStart=0        │             └───┴───┴───┴───┴───┘
│ nodesEnd=3          │
├─────────────────────┤
│ Backtracking        │
│ cursor=5, pc=42     │      and shrinks as frames are popped off the stack
├─────────────────────┤             ┌───┬───┐
│ Call                │         ◀───│   │   │───◀
│ pc=100              │             └───┴───┘
└─────────────────────┘

Tree Node Types

The output tree contains four node types:

  • NodeType_String: a terminal match storing a byte range (start/end offsets into the input)
  • NodeType_Sequence: an ordered list of child nodes
  • NodeType_Node: a named rule match with a single child
  • NodeType_Error: a recovery point containing error metadata

Example: parsing "hello" with grammar Greeting <- 'hello':

NodeType_Node (name="Greeting", start=0, end=5)
└── NodeType_String (start=0, end=5)

The string node stores byte offsets into the input:

input: "hello"
        ▲───▲
        0   5

Example: parsing "ab" with grammar Pair <- 'a' 'b':

NodeType_Node (name="Pair")
└── NodeType_Sequence
    ├── NodeType_String "a" (0,1)
    └── NodeType_String "b" (1,2)
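The four node types can be modeled with a few plain data classes. This Python sketch is illustrative only (the class and field names are assumptions, not the actual implementation); it builds the Pair tree from the example above:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StringNode:            # NodeType_String: byte range into the input
    start: int
    end: int

@dataclass
class SequenceNode:          # NodeType_Sequence: ordered list of child nodes
    children: List[object]

@dataclass
class NamedNode:             # NodeType_Node: named rule match with one child
    name: str
    child: object

@dataclass
class ErrorNode:             # NodeType_Error: recovery point + error metadata
    label: str
    child: Optional[object] = None

# Parsing "ab" with grammar Pair <- 'a' 'b'
pair = NamedNode("Pair", SequenceNode([StringNode(0, 1), StringNode(1, 2)]))
```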

Grammar Compilation and Capture Insertion

The AddCaptures step of the grammar compiler wraps each definition's expression in a CaptureNode named after the definition. For non-syntactic rules (those containing calls to other rules via identifiers), inner terminal expressions are also wrapped with unnamed captures so that terminal matches are grouped appropriately. Purely syntactic rules (containing only terminals like literals, charsets, ranges, and .) don't need inner captures, since the entire match is captured as one unit. Certain rules like Spacing, Space, EOF, and EOL (and their dependencies) can be skipped via the `--disable-capture-spaces` CLI flag, to avoid capturing whitespace.

Capture Instructions

The VM provides several capture-related instructions:

  • CapBegin: pushes a Capture frame onto the stack with an ID and the current cursor position. The frame tracks a range (nodesStart/nodesEnd) into the shared node arena.
  • CapEnd: pops the capture frame and creates a node from its captured children. If no children were captured but the cursor moved, it creates a NodeType_String. With one child, it reuses that node. With multiple children, it creates a NodeType_Sequence. If the capture has a name (non-zero ID), the result is wrapped in a NodeType_Node.
Example trace, matching "ab" against Pair <- 'a' 'b':

1. CapBegin(Pair)
   Stack:
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=0 │
   └──────────────┘
   nodeArena: []

2. match 'a', CapTerm
   Stack:
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=1 │
   └──────────────┘
   nodeArena: [S0]

3. match 'b', CapTerm
   Stack:
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=2 │
   └──────────────┘
   nodeArena: [S0, S1]

4. CapEnd(Pair)
   - pops frame, sees 2 children, forms a Sequence
   - wraps in Node(Pair)
   - pushes result to parent (or nodes[])

   Output:
   Node(Pair)
   └── Seq[S0, S1]
       ├── String(0,1) "a"
       └── String(1,2) "b"
  • CapTerm: creates a terminal string node when the matched size is known at compile time.
  • CapNonTerm: creates a named terminal node when the matched size is known at compile time.
  • CapTermBeginOffset / CapNonTermBeginOffset / CapEndOffset: used when the matched size isn’t known at compile time (e.g., for Span operations). These track the start cursor and compute the offset dynamically.
  • CapCommit: pops the top frame and propagates its captures to the parent frame (or to the top-level nodes list). Used for ordered choice (/) on success.
  • CapBackCommit: like CapCommit but also restores the cursor. Used for predicates.
  • CapPartialCommit: collects captures from the current frame into its parent without popping, then resets the frame’s capture range. Used within ZeroOrMore (*) loops to accumulate iterations.
  • CapReturn: pops the capture frame and propagates captures. Emitted instead of Return when captures are enabled.
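The CapBegin/CapEnd discipline over the shared arena can be sketched in a few lines of Python. This is an illustrative model with made-up names, and it omits details such as the empty-children NodeType_String case:

```python
class Captures:
    def __init__(self):
        self.arena = []      # shared node arena
        self.frames = []     # capture frames: (name, nodes_start)

    def cap_begin(self, name):
        # Remember where this frame's captures start in the arena
        self.frames.append((name, len(self.arena)))

    def cap_term(self, start, end):
        # Terminal match with a size known at compile time
        self.arena.append(("String", start, end))

    def cap_end(self):
        name, nodes_start = self.frames.pop()
        children = self.arena[nodes_start:]
        del self.arena[nodes_start:]
        if len(children) == 1:
            result = children[0]                # one child: reuse the node
        else:
            result = ("Sequence", children)     # many children: sequence
        node = ("Node", name, result) if name else result
        self.arena.append(node)                 # propagate to parent frame
        return node

# Matching "ab" against Pair <- 'a' 'b'
caps = Captures()
caps.cap_begin("Pair")
caps.cap_term(0, 1)      # matched 'a'
caps.cap_term(1, 2)      # matched 'b'
node = caps.cap_end()    # two children: Sequence wrapped in Node(Pair)
```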

Backtracking and Capture Cleanup

When the Fail path is taken, the VM pops frames from the stack. For each popped frame, it truncates the node arena back to that frame’s starting position (nodesStart), effectively discarding any captures made within that frame. This ensures that backtracking properly undoes captured values.

Matching "ac" against: Pair <- 'a' 'b' / 'a' 'c'

1. Try first choice
   Choice(L2)
   ┌──────────────┐
   │Backtrack     │
   │cursor=0,pc=L2│
   └──────────────┘
   nodeArena: []



2. match 'a' ok
   CapTerm
   ┌──────────────┐
   │Backtrack     │
   │cursor=0,pc=L2│
   └──────────────┘
   nodeArena: [S0] (captured)

3. match 'b' FAIL!
   Backtrack
   ┌──────────────┐
   │(popped)      │
   └──────────────┘
   nodeArena: [] (truncated!)

4. jump to L2 (2nd choice)
   cursor restored to 0
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=0 │
   └──────────────┘
   nodeArena: []

5. match 'a','c' ok
   CapTerm, CapTerm
   ┌──────────────┐
   │Cap(Pair)     │
   │start=0,end=2 │
   └──────────────┘
   nodeArena: [S0,S1]

6. CapEnd
   Output:
   Node(Pair)
   └── Seq["a","c"]
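The arena-truncation behavior on the fail path can be sketched as follows (illustrative names, not the VM's actual code):

```python
arena = []   # shared node arena
stack = []   # unified stack (only a Backtracking frame in this sketch)

def push_backtrack(cursor, pc):
    # A Choice instruction records the cursor, the alternative's address,
    # and where the arena currently ends
    stack.append({"cursor": cursor, "pc": pc, "nodes_start": len(arena)})

def fail():
    # Pop the frame and truncate the arena back to its starting position,
    # discarding any captures made within the frame
    frame = stack.pop()
    del arena[frame["nodes_start"]:]
    return frame["cursor"], frame["pc"]

push_backtrack(0, "L2")           # Choice: remember the second alternative
arena.append(("String", 0, 1))    # matched 'a', captured
cursor, pc = fail()               # matching 'b' failed: captures discarded
```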

Predicates

The NOT (!) and AND (&) predicates don’t produce captures since they don’t consume input. When entering a NOT predicate, the VM sets a predicate flag that suppresses capture operations.

Optional and ZeroOrMore

The Optional (?) operator emits code wrapped with Choice/CapCommit so that on success, captures are propagated.

The ZeroOrMore (*) operator uses CapPartialCommit to accumulate captures across iterations without creating a new frame per iteration.

Matching "aaa" against: As <- 'a'*

1. CapBegin(As)            2. match 'a', CapTerm   3. CapPartialCommit
   Choice(L1)                                      (collect to parent)
   ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
   │Backtrack     │        │Backtrack     │        │Backtrack     │
   │start=0,end=0 │        │start=0,end=1 │        │start=1,end=1 │◀─reset
   └──────────────┘        └──────────────┘        └──────────────┘
   ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
   │Cap(As)       │        │Cap(As)       │        │Cap(As)       │
   │start=0,end=0 │        │start=0,end=0 │        │start=0,end=1 │◀─extended
   └──────────────┘        └──────────────┘        └──────────────┘
   nodeArena: []           nodeArena: [S0]         nodeArena: [S0]

4. repeat 2-3 twice        5. match 'a' FAIL       6. CapCommit, CapEnd
                              (no more input)
   ┌──────────────┐        ┌──────────────┐        Output:
   │Backtrack     │        │(popped)      │        Node(As)
   │start=3,end=3 │        └──────────────┘        └── Seq["a","a","a"]
   └──────────────┘
   ┌──────────────┐        ┌──────────────┐
   │Cap(As)       │        │Cap(As)       │
   │start=0,end=3 │        │start=0,end=3 │
   └──────────────┘        └──────────────┘
   nodeArena:[S0,S1,S2]    nodeArena:[S0,S1,S2]

Error Recovery

Error recovery expressions (defined by error labels in the grammar) produce NodeType_Error nodes instead of regular nodes. These contain an error label ID, message ID, and optionally a child node with the recovered content.

Read on to see how captured values can also be transformed after they match and before they're returned.

Error Handling

Farthest Failure Position (FFP)

In the original definition of Parsing Expression Grammars, backtracking resets the input cursor to where it was before trying a different parsing expression. If there is no match, backtracking fails and the cursor is left at the position it was at the beginning of the last Ordered Choice.

To improve error reporting, a heuristic called the Farthest Failure Position introduces a new register to the Virtual Machine:

The ffp is the farthest cursor position reached before any failure

Every time the fail path is taken, if the current cursor exceeds ffp, the ffp register is updated with the cursor position. This value is immune to backtracking, so it provides a more accurate error position.

For example, consider the following grammar:

Pair <- 'a' 'b' 'c'

Let's match the input "ab" against it; we expect to observe the following steps:

  1. match ‘a’ ok, cursor=1, ffp=-1
  2. match ‘b’ ok, cursor=2, ffp=-1
  3. match ‘c’ er, cursor=2, ffp=2 <- updated on failure

When the VM returns an error, it uses ffp to report where parsing actually got stuck, rather than where backtracking left the cursor:

Input: "ab"
        12  <---- ffp points to the last index, not at the start

Expected Values

When showFails is enabled, the VM collects “expected” values at the FFP. These are the characters or ranges that would have allowed parsing to continue:

Let’s consider the grammar

Pair <- 'a' ('b' / 'c')

When matching "ax" against it, we expect the following steps to happen:

  1. match ‘a’ ok
  2. match ‘b’, fail (collect ‘b’)
  3. backtrack
  4. match ‘c’, fail (collect ‘c’)
  5. fail

When parsing finally fails, it displays the characters it unsuccessfully attempted to match:

Error: "Expected 'b', 'c' but got 'x'"
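Both the ffp register and the expected set can be sketched with a toy matcher. This is an illustrative Python model (names are assumptions, not the VM's API):

```python
class Matcher:
    def __init__(self, input_):
        self.input = input_
        self.ffp = -1        # farthest failure position
        self.expected = []   # characters collected at the ffp

    def char(self, cursor, c):
        """Try to match character c at cursor; return new cursor or None."""
        if cursor < len(self.input) and self.input[cursor] == c:
            return cursor + 1                  # success: advance
        if cursor > self.ffp:                  # farther failure: reset the set
            self.ffp, self.expected = cursor, []
        if cursor == self.ffp:
            self.expected.append(c)            # what would have matched here
        return None                            # fail

# Matching "ax" against Pair <- 'a' ('b' / 'c')
m = Matcher("ax")
cursor = m.char(0, "a")        # ok, cursor=1
m.char(cursor, "b")            # fail at 1, collect 'b'
m.char(cursor, "c")            # backtracked to 1, fail, collect 'c'
```

When the whole match fails, ffp and expected hold exactly what the error message needs: "Expected 'b', 'c' but got 'x'".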

The Throw Operator

The Throw Operator (^) allows grammar authors to signal that no more alternatives should be attempted and parsing should fail immediately. This is useful when you know that after matching certain tokens, no other choice could possibly succeed.

The opThrow instruction works as follows:

opThrow(labelID):
  if predicate:
    goto fail              ◀── within !/& predicates, just backtrack
  if recovery[labelID] exists:
    call recovery[labelID] ◀── try to recover and continue
  else:
    return error           ◀── stop parsing immediately

When to Use Throw

The general place where a Throw Operator is useful is right after matching tokens that uniquely identify a construct. Consider:

IfStatement <- IF LEFTP Expression RIGHTP Body
AllStatements <- IfStatement / ForStatement / WhileStatement ...

These inputs would unnecessarily trigger all alternatives in AllStatements:

'if x', 'if (', 'if (x'

With the Throw Operator, parsing stops early on failure:

IfStatement <- IF LEFTP^ Expression^ RIGHTP^ Body
Matching "if x" against: IfStatement

1. match 'if' ok          2. match '(' with ^
   cursor=2                  cursor=3

   Without ^:               With ^:
   → backtrack              → throw error
   → try ForStatement       → "Expected '(' at position 3"
   → try WhileStatement
   → ... (wasteful)

Custom Error Messages

The Throw Operator accepts an optional custom message:

IfStatement <- IF LEFTP^ Expression^"missing expression" RIGHTP^ Body

Compilation of Throw

The syntax expr^label is compiled as (expr / ⇑label):

    choice L1     <-- try expr first
    <expr code>
    commit L2     <-- expr succeeded, skip throw
L1: throw label   <-- expr failed, throw error
L2: ...
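The desugaring and the opThrow semantics can be sketched together in Python. Everything here is hypothetical (function names, the recovery table, the LEFTP token) and only illustrates the control flow described above:

```python
class ParseError(Exception):
    def __init__(self, label, cursor):
        super().__init__(f"{label} at position {cursor}")
        self.label, self.cursor = label, cursor

recovery = {}   # label -> recovery expression (a parsing function)

def throw(label, cursor, input_):
    if label in recovery:
        return recovery[label](cursor, input_)   # try to recover and continue
    raise ParseError(label, cursor)              # stop parsing immediately

def with_throw(expr, label):
    """Compile expr^label as (expr / throw(label))."""
    def parse(cursor, input_):
        result = expr(cursor, input_)            # choice: try expr first
        if result is not None:
            return result                        # commit: skip the throw
        return throw(label, cursor, input_)      # expr failed: throw label
    return parse

# A hypothetical LEFTP token with a label attached: LEFTP^MissingLeftParen
leftp = with_throw(lambda cur, s: cur + 1 if s[cur:cur + 1] == "(" else None,
                   "MissingLeftParen")
```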

Error Recovery Expressions

When a label has an associated Recovery Expression, the throw instruction calls it instead of returning an error. This enables incremental parsing where errors are captured as NodeType_Error nodes in the output tree.

Consider the following grammar:

Expr        <- Term ('+' Term^MissingTerm)*
Term        <- [0-9]+
MissingTerm <- (!Term .)* Term?

Let’s try to match the input "1++2+3" against it:

  1. match ‘1’ ok
  2. match ‘+’ ok
  3. match Term FAIL: throw MissingTerm
  4. call MissingTerm recovery expr
  5. skip past ‘+’ Term (also captures Error node)
  6. match ‘+’ ok
  7. match ‘3’ ok

And this is the expected output:

Expr (1..7)
└── Sequence<5> (1..7)
    ├── Term (1..2)
    │   └── "1" (1..2)
    ├── "+" (2..3)
    ├── Error<MissingTerm> (3..5)
    │   └── Sequence<2> (3..5)
    │       ├──  (3..4)
    │       │   └── "+" (3..4)
    │       └── Term (4..5)
    │           └── "2" (4..5)
    ├── "+" (5..6)
    └── Term (6..7)
        └── "3" (6..7)
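Assuming 0-based offsets, the behavior of the MissingTerm recovery expression can be sketched like this (a purely illustrative Python model of `MissingTerm <- (!Term .)* Term?`):

```python
def recover_missing_term(cursor, input_):
    """Skip until a Term can start, then consume it if present,
    yielding a NodeType_Error-style node for the recovered span."""
    start = cursor
    while cursor < len(input_) and not input_[cursor].isdigit():
        cursor += 1          # (!Term .)* : skip anything that isn't a Term
    while cursor < len(input_) and input_[cursor].isdigit():
        cursor += 1          # Term? : consume the digits, if any
    node = ("Error", "MissingTerm", (start, cursor))
    return cursor, node

# In "1++2+3", Term fails at offset 2 (the second '+'); recovery skips
# the '+' and consumes '2', so parsing can continue at '+3'
cursor, node = recover_missing_term(2, "1++2+3")
```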

Rambling

Incremental Parsing

By default, the parser fails at the first error (as Parsing Expression Grammars originally do). An incremental parsing mode is also included, trading annotation costs for precision.

When parsing is done incrementally, the Throw Operator won't interrupt parsing right away. Instead, it adds a special node to the tree returned by the parser, storing information about the error. The parser then executes the Recovery Expression associated with the (implicitly created) label behind the Throw Operator, which should consume input up to a point where matching another expression can be attempted.

The default Recovery Expression of a label of an instance of the Throw Operator is the following:


Annotation costs come from the

Self Hosting

What would it take to build a parser generator tool capable of self-hosting? It can already take a stream of characters, transform it into a tree, and then it can take the tree and traverse it.

The next need is emitting code that can be interpreted by the same virtual machine used to run the parsing and the translating. Besides choosing a format for the emitted code, it will also be necessary to provide auxiliary tooling for introspecting the output stream: reading from arbitrary positions within it, and knowing where the writing cursor is (e.g. for creating labels).

So, before being capable of self-hosting, this tool has to be able to describe an entire compiler. A first good exercise would be to try and implement something similar to what is described in the article “PEG-based transformer provides front-, middle and back-end stages in a simple compiler” (Piumarta, 2010). langlang isn’t very far from achieving that. There are two main missing steps:

  1. semantic actions to allow transformations to be performed on the values produced upon matches.
  2. a modular output stream, that can encode values in different formats. The backtracking of the parser is the reason why it’s complicated to allow side effects on semantic actions. Our options to deal with that are to either build an output stream that is aware of the backtracking, or to apply the output rules after matching with a classic visitor style.

The initial format of the output stream can be text based, and the proof that it’d work is the ability to compile grammars into their correct text representation of the bytecode that the virtual machine can interpret. There’s some

Tools

Pretty Print / Minifier

  • [X] Parse input file with language grammar and get an AST
  • [X] Generate the tree traversal grammar off the original grammar
  • [ ] Traverse tree grammar and output code (ugly print)
  • [ ] Merge overrides to the default settings of the code emitter
  • [ ] Command line usage
    Usage: langlang print [OPTIONS] INPUT_FILE
    
    OPTIONS
    
    [-g|--grammar] GRAMMAR
        Grammar file used to parse INPUT_FILE
    
    [-o|--output] OUTPUT
        Writes output into OUTPUT file
    
    --cfg-tab-width NUMBER
        Configure indentation size
    
    --cfg-max-line-width NUMBER
        Maximum number of characters in a single line
    
    --cfg-add-trailing-comma
        Add trailing comma to the end of last list entry
        

Diff

  • [X] Parse both files and get their AST
  • [ ] Apply tree diff algorithm
  • [ ] Display results
  • [ ] Command line usage
    langlang diff file_v1.py file_v2.py
    langlang diff file.py file.js
        

Object Database

  • [ ] Undo/Redo
  • [ ] LSP Server
  • [ ] CRDT Storage
  • [ ] AST diff

Editor

  • [ ] Language Server Protocol
  • [ ] Text Editing
  • [ ] Rendering Engine
  • [ ] Configuration Language

Pretty Print / Minifier

Suppose we can parse a .ln file with a given grammar lang.peg. That'd give us an AST as output. One option is to write the translator as a tree traversal over that AST that emits code. That would require one such traversal per supported language, doubling the burden on the user's side, since they already had to put the language grammar together.

In order to automate some of the process, one could maybe take the lang.peg file as input and produce a lang.translator.peg, in which rules that output trees would be translated into rules that could also take structured data as input. Take the following rules as an example:

Program    <- _ Statement+ EndOfFile
Statement  <- IfStm / WhileStm / AssignStm / Expression
AssignStm  <- Identifier EQ Expression
IfStm      <- IF PAROP Expression PARCL Body
WhileStm   <- WHILE PAROP Expression PARCL Body
Body       <- Statement / (CUROP Statement* CURCL)
# (...)
IF         <- 'if'    _
WHILE      <- 'while' _
EQ         <- 'eq'    _
PAROP      <- '('     _
PARCL      <- ')'     _
CUROP      <- '{'     _
CURCL      <- '}'     _
# (...)

The following output would be generated:

Program    <- { "Program" _ Statement+ EndOfFile }
Statement  <- { "Statement" IfStm / WhileStm / AssignStm / Expression }
AssignStm  <- { "AssignStm" Identifier EQ Expression  }
IfStm      <- { "IfStm" IF PAROP Expression PARCL Body }
WhileStm   <- { "WhileStm" WHILE PAROP Expression PARCL Body }
Body       <- { "Body" Statement / (CUROP Statement* CURCL) }
# (...)
IF         <- { "IF" Atom }
WHILE      <- { "WHILE" Atom }
EQ         <- { "EQ" Atom }
PAROP      <- { "PAROP" Atom }
PARCL      <- { "PARCL" Atom }
CUROP      <- { "CUROP" Atom }
CURCL      <- { "CURCL" Atom }
# (...)
Atom       <- !{ .* } .

With that, we’d know how to traverse any tree returned by the original lang.peg. We could then build a general traversal that walks down the tree, printing out what was matched.
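The rewrite could be sketched as a toy string transformation. This is purely illustrative (the real tool would operate on the parsed grammar, not on strings, and `syntactic` detection would be automatic):

```python
def translator_rule(name, body, syntactic=False):
    """Rewrite `Name <- body` into `Name <- { "Name" body }`.

    Purely syntactic rules (tokens like IF or PAROP) match a single
    Atom instead of re-traversing their body.
    """
    inner = "Atom" if syntactic else body
    return f'{name} <- {{ "{name}" {inner} }}'

# Non-syntactic rule: keep the body, wrap it in a tree matcher
rule1 = translator_rule("AssignStm", "Identifier EQ Expression")

# Syntactic rule: the whole token was captured as one unit
rule2 = translator_rule("IF", "'if' _", syntactic=True)
```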

There is one type of information that is not available in the original grammar, though: the specifics of each language! For example, in Python, default values for keyword arguments aren't supposed to have spaces surrounding the equal sign, e.g.:

def complex(real, imag=0.0):
    return # (...)

But that’s not the same as in JavaScript:

function multiply(a, b = 1) {
  return a * b;
}

Similarly, minification rules for Python differ from those of most other languages, given its indentation-based definition of scopes.

The good news is that most of these differences, if not all, can be encoded as options available for all languages, leaving the user with a much smaller burden of defining only the overrides for each language that demands options that differ from the defaults in the code emitter.

Semantic Actions

Modules

In langlang, modules are recursive containers for other modules and for grammars.

┌──────────┐
│ Module   │
│  Rule1   │
│  Rule2   │
│  RuleN   │
└──────────┘

type Offset = usize;
type SymbolName = String;

struct Module {
  filename: String,
  // module in which this module was declared
  parent: Option<Box<Module>>,
  // modules declared within this module
  modules: Vec<Module>,
  // symbols provided by this module
  symbols: HashMap<SymbolName, Offset>,
  // symbols used in this module but declared somewhere else
  missing: HashMap<SymbolName, Offset>,
}
$ mkdir -p ./lib/base                                    # directory structure for user defined grammars
$ edit ./lib/base/rule.langlang                          # write a grammar
$ llsh lib::base::rule https://url.with.test/case        # a file lib/base/rule.binlang will be created
$ llsh -i. lib::base::rule https://url.with.test/case    # the previous example worked because `-i.` is implicit
$ llsh -i./lib base::rule https://url.with.test/case     # full name differs depending on where the root starts
$ MODULE_SEARCH_PATH=./lib llsh base::rule https://url.with.test/case # search path can be extended via env var

When a symbol is requested, a lookup in the symbol table is issued and, if the symbol is present there, its address is returned. If it is not, the BinaryLoader looks for it in the bytecode cache and, failing that, goes through each search path trying to find it in the file system.
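That lookup order can be sketched as follows. The function and parameter names are assumptions, and the on-disk extension is taken from the shell examples above:

```python
import os

def resolve(symbol, symbol_table, bytecode_cache, search_paths):
    """Resolve a symbol: table first, then cache, then the file system.
    Returns an address (or, illustratively, a path to load from disk)."""
    if symbol in symbol_table:
        return symbol_table[symbol]        # already loaded: return address
    if symbol in bytecode_cache:
        return bytecode_cache[symbol]      # compiled but not yet loaded
    for root in search_paths:              # e.g. "." and $MODULE_SEARCH_PATH
        # lib::base::rule -> <root>/lib/base/rule.langlang
        path = os.path.join(root, *symbol.split("::")) + ".langlang"
        if os.path.exists(path):
            return path
    raise LookupError(f"symbol not found: {symbol}")
```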

Shell

# from stdin
echo https://clarete.li/langlang | llsh lib::rfc3986

# from a file
llsh lib::rfc5234 ~/lib/rfc3986.abnf

# from a URL
llsh lib::json https://jsonplaceholder.typicode.com/users

# interactive
llsh lib::peg
>> S <- 'a' / 'b'

Matching

Literal Strings

Left Recursion

Captures

state = <pc, s, e, c>

<pc, s, e, c> -- Char a   --> <pc+1, s, e, a:c>
<pc, s, e, c> -- Choice i --> <pc+1, s, (pc+i, s, c):e, c>

where pc is the program counter, s the input cursor, e the stack, and c the list of captured values.

Error Handling

Success

Throw f: <pc, s, e> ----------> Fail<f, s, e>

  • inside choice
    p / throw(label)

    when p fails:
      -> log the error tuple (location(), label)
      -> run the expression within R(label)

  • inside predicate
    !(p / throw(label))

    when p succeeds:
      -> the inner expression matches, so the predicate returns fail
    when p fails:
      -> R is empty for predicates, so the throw doesn't do anything: the
         label is discarded and the operation succeeds

Once an expression fails to be parsed and throw is called, a lookup for label is made within R. If a recovery expression is found, it's executed with the goal of moving the parser's input cursor to right before the first symbol of the next parsing expression.

Follow Sets

An expression e has a FOLLOW set of symbols that can be intuitively described as the set of possible characters to be matched right after matching e.

  1. Base Case
    G <- (E / ⇑l) "x"
        

    The symbol x would be the only element of the FOLLOW set of symbols of E.

  2. Recursive Case
    G <- (E / ⇑l) (A / B)
    A <- "x" / "y"
    B <- "z" / "k"
        

    The FOLLOW set of E in this case is x, y, z, k, since any of these symbols could appear right after parsing E.
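For this toy grammar, the FOLLOW set of E is the union of the FIRST sets of the expressions that may appear after it. A small Python sketch (the representation of grammars as dicts of alternatives is an assumption for illustration):

```python
def first(alternatives, grammar):
    """FIRST set of a choice whose alternatives are terminal strings
    or references to other rules."""
    out = set()
    for alt in alternatives:
        if alt in grammar:
            out |= first(grammar[alt], grammar)   # nonterminal: recurse
        else:
            out.add(alt[0])                       # terminal: first character
    return out

# G <- (E / ⇑l) (A / B);  A <- "x" / "y";  B <- "z" / "k"
grammar = {"A": ["x", "y"], "B": ["z", "k"]}
follow_E = first(["A", "B"], grammar)   # what may appear right after E
```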