luizberti/GrammarStudio

GrammarStudio

Learn a parser from examples. Give it a corpus of concrete syntax trees (parsed by some other parser), and it synthesizes a grammar that reproduces the same parse structure.

Built on faster-miniKanren and Chez Scheme.

What it does

You provide a corpus of CSTs like this:

(list
  '(expr (num "43") (*token* plus "+") (num "2"))
  '(expr (num "1") (*token* minus "-") (num "99"))
  '(expr (num "7"))
  '(expr (num "100") (*token* plus "+") (num "200")))

It learns a grammar:

plus  -> (lit +)
minus -> (lit -)
expr  -> (alt (sym num) (cat (sym num) (cat (alt (sym plus) (sym minus)) (sym num))))
num   -> (rep (set (48 . 57)))

That grammar can then parse inputs it never saw during learning (999+1, 0-0) and reject invalid ones (abc, 1+). The regex language is relational — the same grammar works for matching, parsing with CST output, and string generation.

It has been tested on a JSON corpus (81 trees from JSONTestSuite, 11 nonterminal types). It correctly learns rep-based repetition rules for arrays, objects, and strings — patterns that require actual search to discover, not just pattern matching over the examples.

The learned patterns, however, are not yet a correct grammar for the language: they just barely cover the input corpus. Making them good requires counter-example-guided refinement: several iterations of using the learned grammar to generate inputs, checking them against the reference grammar, and learning from the resulting counter-examples. We currently don't do this.

Quick start

Requires Chez Scheme.

# Run the demo (covers everything, ~30 seconds)
chez --script demo.scm

# Run all tests (266 total)
chez --script test-phog.scm   # 174 tests — main suite
chez --script root.scm         # 35 tests — regex primitives
chez --script test-parseo.scm  # 32 tests — structured parsing
chez --script test-verify.scm  # 25 tests — structural verification

How it works

Grammar IR

The IR has the regex constructors lit, cat, alt, opt, rep, and sym, plus set for character classes. sym references a named rule, which is what makes grammars recursive. matcho does relational matching; parseo is like matcho but also produces structured CST output.
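To make the IR concrete, here is a language-agnostic sketch in Python (the real implementation is relational Scheme; names, the tagged-tuple encoding, and the set encoding as ("set", lo, hi) rather than a dotted pair are illustrative, not the project's API):

```python
# Illustrative sketch: a functional matcher over an IR shaped like the
# constructors described above. Constructors are tagged tuples; `rules`
# maps rule names for `sym` lookups, which is what allows recursion.

def match_positions(rx, s, i, rules):
    """Return the set of indices j such that rx matches s[i:j]."""
    tag = rx[0]
    if tag == "lit":                      # literal string
        t = rx[1]
        return {i + len(t)} if s.startswith(t, i) else set()
    if tag == "set":                      # character class, e.g. ("set", 48, 57)
        lo, hi = rx[1], rx[2]
        return {i + 1} if i < len(s) and lo <= ord(s[i]) <= hi else set()
    if tag == "cat":                      # sequence
        out = {i}
        for part in rx[1:]:
            out = {k for j in out for k in match_positions(part, s, j, rules)}
        return out
    if tag == "alt":                      # alternation
        return {j for part in rx[1:] for j in match_positions(part, s, i, rules)}
    if tag == "opt":                      # zero or one
        return {i} | match_positions(rx[1], s, i, rules)
    if tag == "rep":                      # one or more
        out, frontier = set(), {i}
        while frontier:
            nxt = {k for j in frontier for k in match_positions(rx[1], s, j, rules)}
            frontier = nxt - out          # stop once no new positions appear
            out |= nxt
        return out
    if tag == "sym":                      # named (possibly recursive) rule
        return match_positions(rules[rx[1]], s, i, rules)
    raise ValueError(tag)

def matches(rx, s, rules=None):
    return len(s) in match_positions(rx, s, 0, rules or {})
```

Returning a set of end positions rather than a single boolean is one simple way to handle the nondeterminism of alt and rep; the relational matcho gets the same effect (and bidirectionality) from miniKanren's search instead.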

Grammar Learning

Given a corpus:

  1. Extract statistics. Walk all trees, build a PHOG table (per-nonterminal, per-context probability distributions over child types) and a pattern frequency table.
  2. Synthesize rules. For each nonterminal type, collect the distinct child-type patterns from all instances, then:
    • Tiered parallel search (primary path): launch miniKanren searches for a regex that matches all observed child-type patterns at the symbol level. Tier 1 is a single thread per symbol (~2s). Tier 2 splits the search space by top-level constructor (cat/alt/rep/opt/sym), one thread each, first result wins.
    • Anti-unification (currently broken and disabled!): deterministic O(n*m) — group patterns by structural skeleton, anti-unify column-wise within groups, combine groups with alt. Produces the least-general generalization without search.
    • Character-level synthesis for leaf nonterminals: tries character class patterns ((rep (digit)), (rep (lower)), etc.) before falling back to enumerating literals.
  3. Order by probability. PHOG statistics reorder alt branches so more frequent productions come first.
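The anti-unification path in step 2 can be sketched as follows (hedged Python, not the project's Scheme; here the "structural skeleton" is simplified to pattern length, and all names are illustrative):

```python
# Sketch of anti-unification over child-type patterns: group same-shape
# patterns, generalize column by column, and join groups with `alt`.
from collections import defaultdict

def lgg_column(symbols):
    """Least-general generalization of one column of child types."""
    distinct = sorted(set(symbols))
    if len(distinct) == 1:
        return ("sym", distinct[0])       # every pattern agrees: keep the symbol
    return ("alt",) + tuple(("sym", s) for s in distinct)

def anti_unify(patterns):
    """Generalize a set of child-type patterns into one regex IR term."""
    groups = defaultdict(list)
    for p in patterns:                    # skeleton = pattern length in this sketch
        groups[len(p)].append(p)
    rules = []
    for length, group in sorted(groups.items()):
        cols = [lgg_column(col) for col in zip(*group)]
        rules.append(cols[0] if length == 1 else ("cat",) + tuple(cols))
    return rules[0] if len(rules) == 1 else ("alt",) + tuple(rules)
```

On the expr corpus from the introduction, the patterns (num plus num), (num minus num), and (num) generalize to an alt over (sym num) and a num-(minus|plus)-num concatenation, i.e. the same shape the search finds, but without any search.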

PHOG-weighted parsing

During parsing, sym-matcho uses the PHOG table to bias search at alt choice points. It builds a 5-element context (node-type, position, parent-type, grandparent-type, left-sibling-type), looks up smoothed probabilities via Witten-Bell backoff, and allocates proportionally more search budget to higher-probability branches via mplus-biased. This makes parsing significantly faster for ambiguous grammars.
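The Witten-Bell backoff step can be sketched like this (illustrative Python; how the project composes the 5-element context and feeds the result to mplus-biased is as described above, but this function is an assumption, not its code):

```python
# Witten-Bell smoothing over nested contexts: each level holds counts of
# which alt branch was chosen under that context; the unseen mass
# T/(N+T) is pushed to the next, shorter context, bottoming out in a
# uniform distribution over all branches.

def witten_bell(contexts, branch, n_branches):
    """contexts: list of {branch: count} dicts, most specific first."""
    if not contexts:
        return 1.0 / n_branches           # uniform base case
    counts = contexts[0]
    n = sum(counts.values())              # N: total observations in this context
    t = len(counts)                       # T: distinct branch types seen
    if n == 0:
        return witten_bell(contexts[1:], branch, n_branches)
    backoff = witten_bell(contexts[1:], branch, n_branches)
    # P = c(branch)/(N+T) + T/(N+T) * P_backoff
    return (counts.get(branch, 0) + t * backoff) / (n + t)
```

For example, with a single context that saw branch a three times and branch b once, a gets (3 + 2·0.5)/6 ≈ 0.67 of the search budget and b gets the rest; a context with no observations defers entirely to the shorter context.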

Project Structure

File             What
gram.scm         Regex primitives, matcho, parseo, rule registry. Loads mk.
tree.scm         CST representation, traversal, corpus queries.
phog.scm         PHOG statistics, weighted search, anti-unification, tiered parallel search, grammar learning pipeline.
mk.scm           faster-miniKanren (unchanged).
mk-vicare.scm    Trie/intmap for Chez Scheme (loaded by mk.scm).
corpus-json.scm  81 JSON CSTs from JSONTestSuite.
demo.scm         Runnable demo of all capabilities.
run-json.scm     JSON corpus pipeline (learning works, verification hangs — see Caveats).
test-phog.scm    Main test suite (174 tests).
root.scm         Regex primitive tests (35 tests).
test-parseo.scm  Structured parsing tests (32 tests).
test-verify.scm  Structural verification tests (25 tests).
try-search.scm   Standalone search experiments (development artifact).

CST format

Three kinds of nodes:

;; Nonterminal: (symbol children...)
'(expr (num "43") (*token* plus "+") (num "2"))

;; Token: (*token* name text)
'(*token* plus "+")

;; Anonymous string leaf (literal text like punctuation, whitespace)
"+"

cst->string on any node reconstructs the original source text.
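The reconstruction itself is just an in-order concatenation of leaf text, which can be sketched in a few lines (illustrative Python over tuples, not the Scheme implementation):

```python
# Sketch of cst->string: walk the node and concatenate leaf text.

def cst_to_string(node):
    if isinstance(node, str):                       # anonymous string leaf
        return node
    head, rest = node[0], node[1:]
    if head == "*token*":                           # ("*token*", name, text)
        return rest[1]
    return "".join(cst_to_string(c) for c in rest)  # (symbol, children...)
```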

Caveats

  • Verification is broken for rep-based rules: the learned rules are correct at the symbol level, but verify-grammar (which parses character-by-character via parseo) hangs on rep-of-alt rules. rep tries every possible way to split the input string, and each split explores every alt branch character by character. Short inputs (under 15 characters) parse fine, but longer ones don't terminate.

    The root cause is *anonymous* structural tokens (braces, commas, whitespace) being defined as (rep (any)), which matches anything and causes parseo to spin. Inlining the actual observed literal texts would likely make parsing tractable, but this needs careful design.

  • We need better mechanisms for distinguishing nonterminals, terminals, keywords (which always match the same text), and trivia tokens when learning. Treating everything as a nonterminal is very inefficient and blows up the search space.

  • No incrementality: learn-grammar rebuilds everything from scratch every time. There is no way to update a grammar with new examples.

About

Sketch-based grammar synthesizer, basically lightning in a bottle ⚡
