luizberti/GrammarStudio

GrammarStudio

Learn a parser from examples. Give it a corpus of concrete syntax trees (parsed by some other parser), and it synthesizes a grammar that reproduces the same parse structure.

Built on faster-miniKanren and Chez Scheme.

What it does

You provide a corpus of CSTs like this:

(list
  '(expr (num "43") (*token* plus "+") (num "2"))
  '(expr (num "1") (*token* minus "-") (num "99"))
  '(expr (num "7"))
  '(expr (num "100") (*token* plus "+") (num "200")))

It learns a grammar:

plus  -> (lit +)
minus -> (lit -)
expr  -> (alt (sym num) (cat (sym num) (cat (alt (sym plus) (sym minus)) (sym num))))
num   -> (rep (set (48 . 57)))

That grammar can then parse inputs it never saw during learning (999+1, 0-0) and reject invalid ones (abc, 1+). The regex language is relational — the same grammar works for matching, parsing with CST output, and string generation.

It has been tested on a JSON corpus (81 trees from JSONTestSuite, 11 nonterminal types). It correctly learns rep-based repetition rules for arrays, objects, and strings — patterns that require actual search to discover, not just pattern matching over the examples.

The learned patterns, however, are not yet a correct grammar for the language: they just barely cover the input corpus. Making them good requires counter-example-guided refinement: several iterations of using the learned grammar to generate inputs, checking them against the reference grammar, and learning from the resulting counter-examples. We currently don't do this.

Quick start

Requires Chez Scheme.

# Run the demo (covers everything, ~30 seconds)
chez --script demo.scm

# Run all tests (266 total)
chez --script test-phog.scm   # 174 tests — main suite
chez --script root.scm         # 35 tests — regex primitives
chez --script test-parseo.scm  # 32 tests — structured parsing
chez --script test-verify.scm  # 25 tests — structural verification

How it works

Grammar IR

The IR has the regex constructors lit, cat, alt, opt, rep, and sym, plus set for character classes. sym references a named rule, which is what makes grammars recursive. matcho does relational matching; parseo is like matcho but also produces structured CST output.
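To make the IR concrete, here is a language-agnostic sketch in Python (the real implementation is relational Scheme; names, the tagged-tuple encoding, and the set encoding as ("set", lo, hi) rather than a dotted pair are illustrative, not the project's API):

```python
# Illustrative sketch: a functional matcher over an IR shaped like the
# constructors described above. Constructors are tagged tuples; `rules`
# maps rule names for `sym` lookups, which is what allows recursion.

def match_positions(rx, s, i, rules):
    """Return the set of indices j such that rx matches s[i:j]."""
    tag = rx[0]
    if tag == "lit":                      # literal string
        t = rx[1]
        return {i + len(t)} if s.startswith(t, i) else set()
    if tag == "set":                      # character class, e.g. ("set", 48, 57)
        lo, hi = rx[1], rx[2]
        return {i + 1} if i < len(s) and lo <= ord(s[i]) <= hi else set()
    if tag == "cat":                      # sequence
        out = {i}
        for part in rx[1:]:
            out = {k for j in out for k in match_positions(part, s, j, rules)}
        return out
    if tag == "alt":                      # alternation
        return {j for part in rx[1:] for j in match_positions(part, s, i, rules)}
    if tag == "opt":                      # zero or one
        return {i} | match_positions(rx[1], s, i, rules)
    if tag == "rep":                      # one or more
        out, frontier = set(), {i}
        while frontier:
            nxt = {k for j in frontier for k in match_positions(rx[1], s, j, rules)}
            frontier = nxt - out          # stop once no new positions appear
            out |= nxt
        return out
    if tag == "sym":                      # named (possibly recursive) rule
        return match_positions(rules[rx[1]], s, i, rules)
    raise ValueError(tag)

def matches(rx, s, rules=None):
    return len(s) in match_positions(rx, s, 0, rules or {})
```

Returning a set of end positions rather than a single boolean is one simple way to handle the nondeterminism of alt and rep; the relational matcho gets the same effect (and bidirectionality) from miniKanren's search instead.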

Grammar Learning

Given a corpus:

  1. Extract statistics. Walk all trees, build a PHOG table (per-nonterminal, per-context probability distributions over child types) and a pattern frequency table.
  2. Synthesize rules. For each nonterminal type, collect the distinct child-type patterns from all instances, then:
    • Tiered parallel search (primary path): launch miniKanren searches for a regex that matches all observed child-type patterns at the symbol level. Tier 1 is a single thread per symbol (~2s). Tier 2 splits the search space by top-level constructor (cat/alt/rep/opt/sym), one thread each, first result wins.
    • Anti-unification (currently broken and disabled!): deterministic O(n*m) — group patterns by structural skeleton, anti-unify column-wise within groups, combine groups with alt. Produces the least-general generalization without search.
    • Character-level synthesis for leaf nonterminals: tries character class patterns ((rep (digit)), (rep (lower)), etc.) before falling back to enumerating literals.
  3. Order by probability. PHOG statistics reorder alt branches so more frequent productions come first.
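The anti-unification path in step 2 can be sketched as follows (hedged Python, not the project's Scheme; here the "structural skeleton" is simplified to pattern length, and all names are illustrative):

```python
# Sketch of anti-unification over child-type patterns: group same-shape
# patterns, generalize column by column, and join groups with `alt`.
from collections import defaultdict

def lgg_column(symbols):
    """Least-general generalization of one column of child types."""
    distinct = sorted(set(symbols))
    if len(distinct) == 1:
        return ("sym", distinct[0])       # every pattern agrees: keep the symbol
    return ("alt",) + tuple(("sym", s) for s in distinct)

def anti_unify(patterns):
    """Generalize a set of child-type patterns into one regex IR term."""
    groups = defaultdict(list)
    for p in patterns:                    # skeleton = pattern length in this sketch
        groups[len(p)].append(p)
    rules = []
    for length, group in sorted(groups.items()):
        cols = [lgg_column(col) for col in zip(*group)]
        rules.append(cols[0] if length == 1 else ("cat",) + tuple(cols))
    return rules[0] if len(rules) == 1 else ("alt",) + tuple(rules)
```

On the expr corpus from the introduction, the patterns (num plus num), (num minus num), and (num) generalize to an alt over (sym num) and a num-(minus|plus)-num concatenation, i.e. the same shape the search finds, but without any search.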

PHOG-weighted parsing

During parsing, sym-matcho uses the PHOG table to bias search at alt choice points. It builds a 5-element context (node-type, position, parent-type, grandparent-type, left-sibling-type), looks up smoothed probabilities via Witten-Bell backoff, and allocates proportionally more search budget to higher-probability branches via mplus-biased. This makes parsing significantly faster for ambiguous grammars.
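The Witten-Bell backoff step can be sketched like this (illustrative Python; how the project composes the 5-element context and feeds the result to mplus-biased is as described above, but this function is an assumption, not its code):

```python
# Witten-Bell smoothing over nested contexts: each level holds counts of
# which alt branch was chosen under that context; the unseen mass
# T/(N+T) is pushed to the next, shorter context, bottoming out in a
# uniform distribution over all branches.

def witten_bell(contexts, branch, n_branches):
    """contexts: list of {branch: count} dicts, most specific first."""
    if not contexts:
        return 1.0 / n_branches           # uniform base case
    counts = contexts[0]
    n = sum(counts.values())              # N: total observations in this context
    t = len(counts)                       # T: distinct branch types seen
    if n == 0:
        return witten_bell(contexts[1:], branch, n_branches)
    backoff = witten_bell(contexts[1:], branch, n_branches)
    # P = c(branch)/(N+T) + T/(N+T) * P_backoff
    return (counts.get(branch, 0) + t * backoff) / (n + t)
```

For example, with a single context that saw branch a three times and branch b once, a gets (3 + 2·0.5)/6 ≈ 0.67 of the search budget and b gets the rest; a context with no observations defers entirely to the shorter context.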

Project Structure

File             What
gram.scm         Regex primitives, matcho, parseo, rule registry. Loads mk.
tree.scm         CST representation, traversal, corpus queries.
phog.scm         PHOG statistics, weighted search, anti-unification, tiered parallel search, grammar learning pipeline.
mk.scm           faster-miniKanren (unchanged).
mk-vicare.scm    Trie/intmap for Chez Scheme (loaded by mk.scm).
corpus-json.scm  81 JSON CSTs from JSONTestSuite.
demo.scm         Runnable demo of all capabilities.
run-json.scm     JSON corpus pipeline (learning works, verification hangs — see Caveats).
test-phog.scm    Main test suite (174 tests).
root.scm         Regex primitive tests (35 tests).
test-parseo.scm  Structured parsing tests (32 tests).
test-verify.scm  Structural verification tests (25 tests).
try-search.scm   Standalone search experiments (development artifact).

CST format

Three kinds of nodes:

;; Nonterminal: (symbol children...)
'(expr (num "43") (*token* plus "+") (num "2"))

;; Token: (*token* name text)
'(*token* plus "+")

;; Anonymous string leaf (literal text like punctuation, whitespace)
"+"

cst->string on any node reconstructs the original source text.
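The reconstruction itself is just an in-order concatenation of leaf text, which can be sketched in a few lines (illustrative Python over tuples, not the Scheme implementation):

```python
# Sketch of cst->string: walk the node and concatenate leaf text.

def cst_to_string(node):
    if isinstance(node, str):                       # anonymous string leaf
        return node
    head, rest = node[0], node[1:]
    if head == "*token*":                           # ("*token*", name, text)
        return rest[1]
    return "".join(cst_to_string(c) for c in rest)  # (symbol, children...)
```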

Caveats

  • Verification is broken for rep-based rules: the learned rules are correct at the symbol level, but verify-grammar (which parses character-by-character via parseo) hangs on rep-of-alt rules. rep tries every possible way to split the input string, and each split explores every alt branch character by character. Short inputs (under 15 characters) parse fine, but longer ones don't terminate.

    The root cause is *anonymous* structural tokens (braces, commas, whitespace) being defined as (rep (any)), which matches anything and causes parseo to spin. Inlining the actual observed literal texts would likely make parsing tractable, but this needs careful design.

  • We need better mechanisms for distinguishing nonterminals, terminals, keywords (which always match the same text), and trivia tokens when learning. Treating everything as a nonterminal is very inefficient and blows up the search space.

  • No incrementality: learn-grammar rebuilds everything from scratch every time. There is no way to update a grammar with new examples.

About

Sketch-based grammar synthesizer, basically lightning in a bottle ⚡
