Learn a parser from examples. Give it a corpus of concrete syntax trees (parsed by some other parser), and it synthesizes a grammar that reproduces the same parse structure.
Built on faster-miniKanren and Chez Scheme.
You provide a corpus of CSTs like this:

```scheme
(list
 '(expr (num "43") (*token* plus "+") (num "2"))
 '(expr (num "1") (*token* minus "-") (num "99"))
 '(expr (num "7"))
 '(expr (num "100") (*token* plus "+") (num "200")))
```

It learns a grammar:
```
plus -> (lit +)
minus -> (lit -)
expr -> (alt (sym num) (cat (sym num) (cat (alt (sym plus) (sym minus)) (sym num))))
num -> (rep (set (48 . 57)))
```
That grammar can then parse inputs it never saw during learning (`999+1`, `0-0`) and reject
invalid ones (`abc`, `1+`). The regex language is relational — the same grammar works for
matching, parsing with CST output, and string generation.
It has been tested on a JSON corpus (81 trees from JSONTestSuite, 11 nonterminal types).
It correctly learns rep-based repetition rules for arrays, objects, and strings — patterns
that require actual search to discover, not just pattern matching over the examples.
The patterns it learns, however, are badly overfit and not a correct grammar for the language: they just barely cover the input corpus. Making this good requires counterexample-guided refinement — several iterations of using the learned grammar to generate inputs, checking them against the reference grammar, and learning from the resulting counterexamples. We currently don't do this.
Requires Chez Scheme.

```sh
# Run the demo (covers everything, ~30 seconds)
chez --script demo.scm

# Run all tests (266 total)
chez --script test-phog.scm    # 174 tests — main suite
chez --script root.scm         # 35 tests — regex primitives
chez --script test-parseo.scm  # 32 tests — structured parsing
chez --script test-verify.scm  # 25 tests — structural verification
```

The regex language provides `lit`, `cat`, `alt`, `opt`, `rep`, and `sym`, plus `set` for character classes. `sym` references a
named rule, making grammars recursive. `matcho` does relational matching; `parseo` is like
`matcho` but produces structured output.
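To make the combinators concrete, here is a hedged, language-neutral sketch (Python rather than the project's Scheme) of plausible matching semantics for `lit`, `cat`, `alt`, `opt`, `rep`, `sym`, and `set`, run over the learned arithmetic grammar shown above. The `RULES` dict stands in for the rule registry, and `rep` is treated as one-or-more — both are assumptions, not the actual `matcho` implementation.

```python
RULES = {}

def match(rx, s, i=0):
    """Yield every index j such that rx matches s[i:j]."""
    op, *args = rx
    if op == "lit":                       # literal string
        if s.startswith(args[0], i):
            yield i + len(args[0])
    elif op == "set":                     # character class: inclusive code range
        lo, hi = args[0]
        if i < len(s) and lo <= ord(s[i]) <= hi:
            yield i + 1
    elif op == "cat":                     # sequence of two regexes
        for j in match(args[0], s, i):
            yield from match(args[1], s, j)
    elif op == "alt":                     # choice: try both branches
        yield from match(args[0], s, i)
        yield from match(args[1], s, i)
    elif op == "opt":                     # zero or one
        yield i
        yield from match(args[0], s, i)
    elif op == "rep":                     # one or more (assumed semantics)
        for j in match(args[0], s, i):
            yield j
            if j > i:                     # only continue if we consumed input
                yield from match(rx, s, j)
    elif op == "sym":                     # reference to a named rule
        yield from match(RULES[args[0]], s, i)

def accepts(name, s):
    return any(j == len(s) for j in match(RULES[name], s, 0))

# The learned arithmetic grammar from the example above:
RULES["num"] = ("rep", ("set", (48, 57)))
RULES["plus"] = ("lit", "+")
RULES["minus"] = ("lit", "-")
RULES["expr"] = ("alt", ("sym", "num"),
                 ("cat", ("sym", "num"),
                  ("cat", ("alt", ("sym", "plus"), ("sym", "minus")),
                   ("sym", "num"))))

print(accepts("expr", "999+1"))  # True
print(accepts("expr", "1+"))     # False
```

The generator-of-positions style mirrors how a relational matcher enumerates all ways a regex can consume a prefix, which is what lets `cat` and `rep` backtrack through split points.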
Given a corpus:

- Extract statistics. Walk all trees, build a PHOG table (per-nonterminal, per-context probability distributions over child types) and a pattern frequency table.
- Synthesize rules. For each nonterminal type, collect the distinct child-type patterns from all instances, then:
  - Tiered parallel search (primary path): launch miniKanren searches for a regex that matches all observed child-type patterns at the symbol level. Tier 1 is a single thread per symbol (~2s). Tier 2 splits the search space by top-level constructor (`cat`/`alt`/`rep`/`opt`/`sym`), one thread each; the first result wins.
  - Anti-unification (currently broken and disabled!): deterministic O(n·m) — group patterns by structural skeleton, anti-unify column-wise within groups, and combine groups with `alt`. Produces the least-general generalization without search.
  - Character-level synthesis for leaf nonterminals: tries character-class patterns (`(rep (digit))`, `(rep (lower))`, etc.) before falling back to enumerating literals.
- Order by probability. PHOG statistics reorder `alt` branches so more frequent productions come first.
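The anti-unification step can be sketched as follows — a hypothetical Python rendering, since this path is currently disabled. Patterns are grouped by a simple skeleton (here just arity), generalized column-wise into the least-general per-column regex, and the groups combined with `alt`; the real phog.scm code may group and generalize differently.

```python
from collections import defaultdict

def cat_all(parts):
    # Right-fold a list of regexes into nested binary cat nodes.
    out = parts[-1]
    for p in reversed(parts[:-1]):
        out = ("cat", p, out)
    return out

def alt_all(parts):
    # Right-fold a list of regexes into nested binary alt nodes.
    out = parts[-1]
    for p in reversed(parts[:-1]):
        out = ("alt", p, out)
    return out

def anti_unify(patterns):
    """patterns: tuples of child-type symbols, e.g. ('num', 'plus', 'num')."""
    groups = defaultdict(list)
    for p in patterns:
        groups[len(p)].append(p)          # skeleton = arity, in this sketch
    results = []
    for arity, group in sorted(groups.items()):
        cols = []
        for i in range(arity):
            seen = sorted({p[i] for p in group})
            # Column-wise generalization: one symbol stays a sym,
            # several become an alt over the observed symbols.
            cols.append(alt_all([("sym", s) for s in seen]))
        results.append(cat_all(cols))
    return alt_all(results)

rx = anti_unify([("num",), ("num", "plus", "num"), ("num", "minus", "num")])
```

On the arithmetic corpus this reproduces the shape of the learned `expr` rule — an `alt` of a bare `num` and a `num (plus|minus) num` sequence — without any search, which is the point of the deterministic path.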
During parsing, `sym-matcho` uses the PHOG table to bias search at `alt` choice points. It builds
a 5-element context (node-type, position, parent-type, grandparent-type, left-sibling-type), looks
up smoothed probabilities via Witten-Bell backoff, and allocates proportionally more search budget
to higher-probability branches via `mplus-biased`. This makes parsing significantly faster for
ambiguous grammars.
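A minimal sketch of that biasing scheme, assuming a simple count table — the context tuple shape follows the document, but the class, counts, and method names here are illustrative, not phog.scm's actual API:

```python
from collections import defaultdict

class PhogTable:
    def __init__(self):
        # counts[context][branch] = how often this alt branch won in the corpus
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, context, branch):
        # context: tuple like (node, position, parent, grandparent, left_sib).
        # Record the observation under every prefix so backoff has data.
        for k in range(len(context) + 1):
            self.counts[context[:k]][branch] += 1

    def prob(self, context, branch):
        # Witten-Bell: interpolate the observed estimate with the estimate
        # from the next-shorter context; the empty context backs off to uniform.
        table = self.counts[context]
        n = sum(table.values())                 # observations in this context
        t = len(table)                          # distinct branches seen
        if not context:
            lower = 1.0 / max(t, 1)
        else:
            lower = self.prob(context[:-1], branch)
        if n == 0:
            return lower
        return (table[branch] + t * lower) / (n + t)

    def budgets(self, context, branches, total_budget):
        # Allocate search budget proportionally to smoothed probability,
        # as mplus-biased is described to do.
        ps = [self.prob(context, b) for b in branches]
        z = sum(ps)
        return {b: max(1, round(total_budget * p / z))
                for b, p in zip(branches, ps)}

phog = PhogTable()
ctx = ("expr", 1, "stmt", "block", "num")
for _ in range(8):
    phog.observe(ctx, "plus-branch")
for _ in range(2):
    phog.observe(ctx, "minus-branch")
b = phog.budgets(ctx, ["plus-branch", "minus-branch"], 100)
```

The Witten-Bell weight `t / (n + t)` gives more mass to the shorter context exactly when the current context has seen many distinct outcomes relative to its total count, which keeps rarely-seen 5-element contexts from dominating.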
| File | What |
|---|---|
| `gram.scm` | Regex primitives, `matcho`, `parseo`, rule registry. Loads mk. |
| `tree.scm` | CST representation, traversal, corpus queries. |
| `phog.scm` | PHOG statistics, weighted search, anti-unification, tiered parallel search, grammar learning pipeline. |
| `mk.scm` | faster-miniKanren (unchanged). |
| `mk-vicare.scm` | Trie/intmap for Chez Scheme (loaded by mk.scm). |
| `corpus-json.scm` | 81 JSON CSTs from JSONTestSuite. |
| `demo.scm` | Runnable demo of all capabilities. |
| `run-json.scm` | JSON corpus pipeline (learning works, verification hangs — see below). |
| `test-phog.scm` | Main test suite (174 tests). |
| `root.scm` | Regex primitive tests (35 tests). |
| `test-parseo.scm` | Structured parsing tests (32 tests). |
| `test-verify.scm` | Structural verification tests (25 tests). |
| `try-search.scm` | Standalone search experiments (development artifact). |
Three kinds of nodes:

```scheme
;; Nonterminal: (symbol children...)
'(expr (num "43") (*token* plus "+") (num "2"))

;; Token: (*token* name text)
'(*token* plus "+")

;; Anonymous string leaf (literal text like punctuation, whitespace)
"+"
```

`cst->string` on any node reconstructs the original source text.
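The reconstruction is a straightforward fold over the three node kinds. A sketch in Python (tuples standing in for Scheme lists; the function name mirrors `cst->string` but this is not the project's code):

```python
def cst_to_string(node):
    if isinstance(node, str):          # anonymous string leaf: literal text
        return node
    if node[0] == "*token*":           # ("*token*", name, text): take the text
        return node[2]
    # Nonterminal ("symbol", child...): concatenate the children in order
    return "".join(cst_to_string(c) for c in node[1:])

tree = ("expr", ("num", "43"), ("*token*", "plus", "+"), ("num", "2"))
print(cst_to_string(tree))  # 43+2
```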
- Verification is broken for `rep`-based rules: the learned rules are correct at the symbol level, but `verify-grammar` (which parses character-by-character via `parseo`) hangs for rep-of-alt rules. `rep` tries every possible way to split the input string, and each split explores every `alt` branch character by character; short inputs (under 15 characters) parse fine, but longer ones don't terminate. The root cause is *anonymous* structural tokens (braces, commas, whitespace) being defined as `(rep (any))`, which matches anything and causes `parseo` to spin. Inlining the actual observed literal texts would likely make parsing tractable, but needs careful design.
- We need better mechanisms for distinguishing nonterminals, terminals, keywords (which always match the same text), and trivia tokens during learning. Treating everything as a nonterminal is very inefficient and blows up the search space.
- No incrementality: `learn-grammar` rebuilds everything from scratch every time. There is no way to update a grammar with new examples.