- 3-letter primitives: `lit`, `cat`, `alt`, `opt`, `rep`, `sym`, `set`
- `set` byte-range primitive: `(set (lo . hi) ...)` for character class matching (like re2/rust regex crate)
- Named classes: `digit`, `lower`, `upper`, `alpha`, `alnum`, `blank`, `any`
- `matcho`: relational regex matching with cycle detection for recursive grammars
- `parseo`: like `matcho` but threads a reducer function, produces structured output at `sym` boundaries
- `make-tree-reducer` builds CST nodes, `make-identity-reducer` returns matched text
- Rule registry: `define-rule`, `lookup-rule`, `clear-rules!`
- `full-matcho`: can synthesize regex from positive/negative examples (from `root.scm` era)
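
To make the primitives above concrete, here is a minimal, non-relational sketch of a backtracking matcher. It is written in Python rather than the project's Scheme, and it is not the actual `matcho`: regexes are encoded as nested tuples, `rep` is assumed to mean zero or more, `cat`/`alt` are assumed to be binary, and `sym` and the named classes are left out.

```python
from typing import Iterator

# Regexes are nested tuples mirroring the s-expressions above, e.g.
# ("rep", ("set", (48, 57))) stands for (rep (set (48 . 57))).

def match(rx, s: str, i: int = 0) -> Iterator[int]:
    """Yield every end index j such that rx matches s[i:j]."""
    op = rx[0]
    if op == "lit":
        if s.startswith(rx[1], i):
            yield i + len(rx[1])
    elif op == "set":
        # Each range is an inclusive (lo, hi) pair of code points.
        if i < len(s) and any(lo <= ord(s[i]) <= hi for lo, hi in rx[1:]):
            yield i + 1
    elif op == "cat":
        for j in match(rx[1], s, i):
            yield from match(rx[2], s, j)
    elif op == "alt":
        yield from match(rx[1], s, i)
        yield from match(rx[2], s, i)
    elif op == "opt":
        yield i
        yield from match(rx[1], s, i)
    elif op == "rep":
        yield i  # zero repetitions (assumed semantics)
        for j in match(rx[1], s, i):
            if j > i:  # skip empty iterations so the loop terminates
                yield from match(rx, s, j)
    else:
        raise ValueError(f"unknown constructor: {op!r}")

def full_match(rx, s: str) -> bool:
    """True when rx can consume the whole string."""
    return any(j == len(s) for j in match(rx, s))
```

With `num = ("rep", ("set", (48, 57)))`, `full_match(num, "43")` succeeds and `full_match(num, "4a")` fails.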
- Three-way node distinction: nonterminal `(symbol children...)`, token `(*token* name text)`, anonymous string leaf `"text"`
- `cst->string` reconstructs original source from any CST node
- `cst-child-types` extracts type-symbol sequence from a node's children
- `cst-walk` traverses corpus with full 5-element context (node-type, position, parent-type, grandparent-type, left-sibling-type)
- Corpus queries: `cst-collect-types`, `cst-collect-instances`, `cst-collect-child-patterns`, `cst-collect-token-types`
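
The three-way node distinction can be illustrated with a small Python sketch. The tuple encoding and the function names `cst_to_string` and `cst_child_types` are stand-ins chosen for this example, not the project's Scheme definitions; anonymous string leaves are assumed to report the type `*anonymous*`.

```python
def cst_to_string(node) -> str:
    """Rebuild source text from a CST node."""
    if isinstance(node, str):          # anonymous string leaf
        return node
    if node[0] == "*token*":           # ("*token*", name, text)
        return node[2]
    return "".join(cst_to_string(child) for child in node[1:])

def cst_child_types(node) -> list:
    """Type-symbol sequence of a nonterminal node's children."""
    types = []
    for child in node[1:]:
        if isinstance(child, str):
            types.append("*anonymous*")
        elif child[0] == "*token*":
            types.append(child[1])
        else:
            types.append(child[0])
    return types

node = ("expr", ("num", "43"), ("*token*", "plus", "+"), ("num", "2"))
print(cst_to_string(node))
print(cst_child_types(node))
```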
- `extract-phog-stats` builds two tables from a corpus:
  - PHOG table: nonterminal → (context → ((production . count) ...))
  - Pattern table: nonterminal → ((child-pattern . count) ...)
- `phog-lookup` with Witten-Bell smoothing and 5-level backoff chain (full context → drop features right-to-left → global merge)
- Per-nonterminal context functions via `*context-fns*` registry; default uses all 5 features
- `condp-static`: weighted `conde` that sorts branches by descending weight and chains via `mplus`
- `condp-budgeted`: like `condp-static` but chains with `mplus-biased` for proportional search budget allocation (weight/min-weight ratio per branch)
- `mplus-biased`: budget-based interleaving that gives the left stream `k` steps per 1 step of the right; follows the `case-inf` protocol (no mk.scm changes needed)
- `regex-weight-from-probs`: extracts the weight of a regex branch from a context-specific `phog-lookup` probability distribution
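
As a rough illustration of Witten-Bell smoothing with a right-to-left backoff chain, the following Python sketch may help. The table layout (a dict from context tuples to production counts), the uniform base case, and the name `witten_bell` are assumptions for this example; the real `phog-lookup` may differ in both data layout and details.

```python
def witten_bell(counts_by_ctx, context, production, vocab_size):
    """Estimate P(production | context), backing off by dropping the
    right-most context feature until the global context () is reached."""
    if context is None:
        # Below the global level: uniform over the known productions.
        return 1.0 / vocab_size
    shorter = context[:-1] if context else None
    backoff = witten_bell(counts_by_ctx, shorter, production, vocab_size)
    counts = counts_by_ctx.get(context)
    if not counts:
        return backoff            # unseen context: defer to the shorter one
    n = sum(counts.values())      # observations in this context
    t = len(counts)               # distinct productions in this context
    return (counts.get(production, 0) + t * backoff) / (n + t)

table = {(): {"a": 3, "b": 1}, ("expr",): {"a": 2}}
print(witten_bell(table, ("expr",), "a", 2))
print(witten_bell(table, ("expr",), "b", 2))
```

The unseen production `b` still receives non-zero mass in the `("expr",)` context, which is the point of the smoothing.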
- `learn-grammar`: end-to-end; registers token types as literal rules, runs bulk parallel search, then synthesizes each nonterminal
- `learn-single-rule`: dispatches between empty nodes, token-like nodes, and nonterminals
- Tiered parallel search (primary path for multi-pattern nonterminals):
  - `bulk-search-rules` launches all multi-pattern symbols in parallel
  - Tier 1: one engine-bounded thread per symbol (~2s fuel), finds simple types (document, pair) in ~1ms
  - Tier 2: or-parallel for tier-1 failures, with 5 threads per symbol (one per top-level regex constructor: `cat`/`alt`/`rep`/`opt`/`sym`); first ground result wins
  - All threads use Chez `make-engine` for bounded computation and self-terminate after fuel exhaustion (no zombie threads, no memory leaks)
  - `regex-ground?` filter rejects results with unbound miniKanren variables (e.g., `_.0`)
  - Falls back to anti-unification when search times out or produces no ground result
- Anti-unification (fallback for multi-pattern synthesis):
  - `group-by-skeleton` → `transpose` → `anti-unify-column` → `anti-unify-group` → `anti-unify-patterns`
  - O(n*m) deterministic: no search, no unbound variables, no overgeneralization
  - `pattern-skeleton` maps named symbols → `*sym*`, keeps `*anonymous*`; two patterns group iff they have the same skeleton
  - Anonymous string inlining via `collect-anonymous-texts-at-position` + `inline-anonymous-texts`, which scans ALL patterns in a skeleton group
  - Skeleton groups ordered by `pattern-table` frequency (most common first)
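
The skeleton-grouping and column-wise steps above can be sketched as follows. This is a simplified Python illustration, not the project's code: child patterns are assumed to be non-empty tuples of type names, the input is assumed to be a dict from pattern to count, anonymous-text inlining is omitted, and the helper names are invented for the example.

```python
from collections import Counter

def skeleton(pattern):
    """Map named symbols to *sym*, keep *anonymous* in place."""
    return tuple(t if t == "*anonymous*" else "*sym*" for t in pattern)

def fold(op, items):
    """Right-nest a non-empty list into binary (op a (op b c)) form."""
    if len(items) == 1:
        return items[0]
    return (op, items[0], fold(op, items[1:]))

def anti_unify_column(column):
    distinct = list(dict.fromkeys(column))   # dedupe, keep first-seen order
    return fold("alt", [("sym", t) for t in distinct])

def anti_unify_patterns(pattern_counts):
    groups, weight = {}, Counter()
    for pattern, count in pattern_counts.items():
        key = skeleton(pattern)
        groups.setdefault(key, []).append(pattern)
        weight[key] += count
    branches = []
    for key, _ in weight.most_common():      # most frequent skeleton first
        columns = zip(*groups[key])          # transpose the group
        branches.append(fold("cat", [anti_unify_column(c) for c in columns]))
    return fold("alt", branches)

patterns = {("num",): 5, ("num", "plus", "num"): 2, ("num", "minus", "num"): 1}
print(anti_unify_patterns(patterns))
```

Because every column is built only from symbols actually observed at that position, the result contains no free variables, which is the property the fallback relies on.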
- PHOG-ordered alt branches: `phog-order-regex` post-processes both search and anti-unified regexes, reordering `alt` children by marginal PHOG frequency
- PHOG-weighted parsing via `sym-matcho`:
  - 7-arg signature: `(sym-matcho regex input rest phog-table nonterminal pos lsib)`
  - `pos` (child index) and `lsib` (left-sibling type) threaded through `cat`/`rep` branches via `project`
  - Alt branch uses `condp-budgeted` with context-specific PHOG weights (not just marginals): builds `(nonterminal pos nonterminal #f lsib)` context, calls `phog-lookup` for Witten-Bell-smoothed probabilities
  - Graceful degradation: unground vars or empty PHOG table → equal weights → same as old `conde`
  - `full-sym-matcho` via `case-lambda`: 2-arg `(regex pat)` or 4-arg `(regex pat phog-table nonterminal)`; passes `0 #f` for pos/lsib
- Character-level synthesis: leaf nonterminals generalize; learns `(rep (set (48 . 57)))` instead of memorizing `(alt (lit "43") (lit "1") ...)`
- Anonymous string handling: anonymous string leaves (e.g., `" = "`) are inlined as `(lit ...)` in synthesized rules
- `verify-grammar`: shallow check (did `parse?` consume the text?)
- `verify-grammar-structural`: deep check; parses via `parseo` + tree-reducer, compares parsed CST structure against original corpus CSTs
- 81 CSTs from JSONTestSuite, 11 nonterminal types
- Full pipeline: `extract-phog-stats` → `learn-grammar` → `verify-grammar`
- Rule synthesis works: search finds `rep`-based rules for all complex types in <25s total:
  - `document` → `(alt (sym array) (alt (sym object) ...))`
  - `pair` → `(cat (cat (sym string) (sym *anonymous*)) (alt (sym string) ...))`
  - `string` → `(rep (alt (sym *anonymous*) (alt (sym string_content) (sym escape_sequence))))`
  - `object` → `(rep (alt (sym *anonymous*) (sym pair)))`
  - `array` → `(rep (alt (sym *anonymous*) (alt (sym string) ...)))`
- Verification blocked: `verify-grammar` (character-level `parseo`) hangs on `rep`-of-`alt` rules (see Open Issues)

The arithmetic corpus `((expr (num "43") (*token* plus "+") (num "2")) ...)` produces:

    minus → (lit -)
    plus → (lit +)
    expr → (alt (sym num) (cat (sym num) (cat (alt (sym plus) (sym minus)) (sym num))))
    num → (rep (set (48 . 57)))

- No unbound variables (`_.0`), no overgeneralization
- Parses all corpus examples + generalizes to unseen valid inputs (`5+3`, `10-1`)
- Rejects invalid inputs
- PHOG ordering puts more frequent branches first
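
The `num` rule shows a leaf nonterminal generalizing from observed digit strings to a byte range. One way such a generalization could work is sketched below in Python. The class table, the "smallest single class that covers every character" heuristic, and the name `generalize_leaf` are assumptions made for illustration; the project's actual character-level synthesis is not shown in this document.

```python
# Candidate classes as inclusive code-point ranges (assumed ordering).
CLASSES = [("digit", (48, 57)), ("lower", (97, 122)), ("upper", (65, 90))]

def generalize_leaf(texts):
    """Return (rep (set (lo . hi))) if one class covers all characters,
    otherwise fall back to an alt chain of literals."""
    chars = {ord(c) for t in texts for c in t}
    for _name, (lo, hi) in CLASSES:
        if chars and all(lo <= c <= hi for c in chars):
            return ("rep", ("set", (lo, hi)))
    # No single class covers the corpus: memorize the observed texts.
    rx = None
    for t in reversed(sorted(set(texts))):
        rx = ("lit", t) if rx is None else ("alt", ("lit", t), rx)
    return rx

print(generalize_leaf(["43", "1", "2"]))
print(generalize_leaf(["a1", "b2"]))
```

A heuristic like this is deliberately permissive: any leaf made only of lowercase letters collapses to a lowercase range, even when the corpus contains a single fixed word.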

- `root.scm`: 35, covering regex constructors, matching, generation, synthesis, grammar rules
- `test-phog.scm`: 174, covering tree predicates, stats, Witten-Bell, `condp-static`, `condp-budgeted`, `mplus-biased`, `regex-weight-from-probs`, `sym-matcho` (weighted + unweighted + context-specific), position threading, anti-unification helpers (`group-by-skeleton`, anti-unify column/group/patterns), PHOG ordering, rule synthesis, grammar learning, verification, `set` primitive, char-level synthesis, anonymous strings, end-to-end quality
- `test-parseo.scm`: 32, covering identity/tree reducers, nested `sym`, matched text correctness
- `test-verify.scm`: 25, covering structural verification helpers, full pipeline structural checks

The old miniKanren-synthesized `expr` rule had `(opt _.0)` accepting any single character. Anti-unification produces exactly the least-general generalization, so there are no unbound variables by construction.

Previously, PHOG statistics were computed but never reached synthesis decisions. Now wired in at three levels:

- `phog-order-regex` reorders alt branches in synthesized rules by marginal PHOG frequency
- `sym-matcho` uses `condp-budgeted` with context-specific PHOG weights at alt choice points during parsing
- `mplus-biased` gives proportionally more search budget to higher-weight branches (not just static reordering)
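
The first of these levels, reordering `alt` branches by frequency, can be sketched in Python as below. This is an illustration only: regexes are nested tuples, `freq` is assumed to map a whole branch to its marginal count, and `order_alts` is a hypothetical name rather than the project's `phog-order-regex`.

```python
def flatten_alt(rx):
    """Collect the leaves of a nested binary alt chain."""
    if rx[0] == "alt":
        return flatten_alt(rx[1]) + flatten_alt(rx[2])
    return [rx]

def order_alts(rx, freq):
    """Recursively reorder alt branches by descending frequency."""
    op = rx[0]
    if op == "alt":
        branches = [order_alts(b, freq) for b in flatten_alt(rx)]
        # list.sort is stable, so ties keep their original order.
        branches.sort(key=lambda b: -freq.get(b, 0))
        out = branches[-1]
        for b in reversed(branches[:-1]):
            out = ("alt", b, out)
        return out
    if op == "cat":
        return (op, order_alts(rx[1], freq), order_alts(rx[2], freq))
    if op in ("rep", "opt"):
        return (op, order_alts(rx[1], freq))
    return rx  # lit, sym, set: nothing to reorder

rx = ("alt", ("sym", "array"), ("alt", ("sym", "object"), ("sym", "string")))
freq = {("sym", "object"): 10, ("sym", "array"): 5, ("sym", "string"): 3}
print(order_alts(rx, freq))
```

Reordering changes only which branch is tried first; the language accepted by the rule is unchanged.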

The pattern table now drives frequency ordering of skeleton groups in `anti-unify-patterns` (most common pattern skeleton first in the top-level `alt`).

Replaced O(exp) miniKanren `(run budget (regex) ...)` search with O(n*m) deterministic anti-unification. Dead code (`syntho`, `synth-weight`, `make-synth-ctx`, etc.) cleaned up. Symbol-level search is now available as the primary path via engine-bounded parallel tiered search.

`group-by-length` merged same-length patterns with different anonymous/symbol positions, causing column-wise anti-unification to produce broken regexes. Replaced with `group-by-skeleton`, which groups patterns only when their anonymous positions align. `anti-unify-column` now scans all patterns in the group (not just the representative) for anonymous text collection.

Or-parallel tiered search from `try-search.scm` is now the primary multi-pattern synthesis path in `learn-grammar`. Key engineering:

- Bulk parallelism: all multi-pattern symbols searched simultaneously (not sequentially), total wall-time ~25s for JSON corpus
- Engine-bounded threads: Chez `make-engine` wraps each thread's computation with a fuel limit. Threads self-terminate after exhausting ticks, which eliminates the zombie thread / 27GB RAM leak that occurred with bare `fork-thread`
- Ground filter: `regex-ground?` rejects search results containing unbound miniKanren variables, falling back to anti-unification
- Graceful fallback: `synth-rule-for-symbol` is now `case-lambda`; it takes an optional `search-results` alist and uses anti-unification when no search result is available
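
The combination of a fuel limit, a ground filter, and a fallback can be sketched in Python as follows. This is only an analogy: Python has no counterpart to Chez engines, so fuel here counts candidate answers rather than engine ticks, and unbound variables are assumed to appear as strings beginning with `_.` (a literal text that happens to start with `_.` would be misclassified). All function names are hypothetical.

```python
from itertools import islice, repeat

def is_ground(rx) -> bool:
    """True when no reified logic variable such as '_.0' remains in rx."""
    if isinstance(rx, str):
        return not rx.startswith("_.")
    if isinstance(rx, tuple):
        return all(is_ground(part) for part in rx)
    return True

def first_ground(candidates, fuel):
    """Inspect at most `fuel` candidates; return the first ground one."""
    for rx in islice(candidates, fuel):
        if is_ground(rx):
            return rx
    return None

def synthesize(candidates, fuel, fallback):
    """Use the search result when it is ground, else call the fallback."""
    result = first_ground(candidates, fuel)
    return result if result is not None else fallback()

# An endless stream of non-ground answers still terminates.
endless = repeat(("opt", "_.0"))
print(synthesize(endless, 100, lambda: ("sym", "from-anti-unification")))
```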

`sym-matcho` now threads `pos` and `lsib` through the regex traversal. Alt branches build a 5-element context key and call `phog-lookup` for Witten-Bell-smoothed, context-specific probabilities. Falls back to marginal/uniform gracefully when context data is sparse.

`mplus-biased` implements proportional budget allocation at the stream level, following the same `case-inf` protocol as `mplus`. `condp-budgeted` computes budget ratios (weight/min-weight) and chains branches with `mplus-biased`. Zero changes to mk.scm: the abstraction sits entirely in `phog.scm`.
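
A rough Python analogue of the budget computation and the biased interleaving is given below. Generators stand in for miniKanren streams, so nothing here follows the `case-inf` protocol; the rounding rule in `budgets` and both function names are assumptions for illustration.

```python
from itertools import islice

def budgets(weights):
    """Budget per branch: weight / min-weight, rounded, at least 1."""
    positive = [w for w in weights if w > 0]
    if not positive:
        return [1] * len(weights)      # degrade to fair interleaving
    low = min(positive)
    return [max(1, round(w / low)) for w in weights]

def interleave_biased(left, right, k):
    """Yield up to k items from left for every 1 item from right."""
    left, right = iter(left), iter(right)
    left_done = right_done = False
    while not (left_done and right_done):
        if not left_done:
            chunk = list(islice(left, k))
            left_done = len(chunk) < k
            yield from chunk
        if not right_done:
            try:
                yield next(right)
            except StopIteration:
                right_done = True

print(budgets([6, 2, 2]))
print(list(interleave_biased("aaaaaa", "bb", 3)))
```

With equal weights every budget is 1, so the interleaving reduces to plain alternation, which matches the graceful-degradation behaviour described for `condp-budgeted`.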

The search produces correct `rep`-of-`alt` rules (e.g., `(rep (alt (sym *anonymous*) (sym pair)))` for `object`), but `verify-grammar`, which uses character-level `parseo`, hangs. The root cause is exponential backtracking: `rep` tries every possible way to split the input into repetitions, and each split tries every `alt` branch, which in turn does character-level matching via `matcho`. A 17-character object took 44 seconds; longer inputs never terminate.

This is the critical blocker. The learned rules are correct at the symbol level but untestable at the character level.
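
To get a feel for the growth rate, the short Python sketch below counts how many ways an n-character input can be cut into non-empty consecutive pieces, which is the search space a `rep` faces when its body can match any substring. It is a counting argument under that assumption, not a model of `parseo` itself.

```python
def count_segmentations(n: int) -> int:
    """Ways to cut an n-character string into non-empty consecutive pieces."""
    ways = [1] + [0] * n               # ways[0]: the empty prefix
    for end in range(1, n + 1):
        # The last piece may start at any earlier cut point.
        ways[end] = sum(ways[start] for start in range(end))
    return ways[n]

for n in (5, 10, 17):
    print(n, count_segmentations(n))
```

The count doubles with every added character (2^(n-1) for n ≥ 1), so a 17-character input already admits 65,536 segmentations before any `alt` branching is considered.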
Possible solutions:
- Symbol-level verification: verify rules against child-type patterns via `full-sym-matcho` (symbol-level, not character-level). This bypasses `parseo` entirely and directly tests what the search actually operates on. Fast and correct, but doesn't validate character-level parsing end-to-end.
- Bounded verification: wrap `parseo` calls in `make-engine` with a fuel limit. Report "inconclusive" instead of hanging. Simple but doesn't actually verify.
- Fix `*anonymous*`: currently `(rep (any))`, which is too permissive. Replace with corpus-observed literal texts (attempted; reverted per user request). This would make `parseo` tractable by eliminating the infinite loop in `*anonymous*` expansion, but requires careful handling of which texts to include.

Search-produced regexes reference `(sym *anonymous*)` for structural tokens (braces, commas, colons, whitespace). The catch-all `(rep (any))` matches any string, which:

- Makes character-level `parseo` exponentially slow (see issue #1)
- Means the grammar doesn't validate punctuation, only structure

User explicitly deferred this: "let's not kill `*anonymous*` just yet."

JSON corpus nodes like `(true "true")` aren't in `(*token* name text)` format, so `cst-collect-token-types` returns `()`. These get overly permissive char-level rules (`(rep (set (97 . 122)))`) instead of `(lit "true")`. Need a heuristic: nonterminals with a single `*anonymous*` child where all instances share the same text → treat as token.

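
The proposed heuristic is simple enough to sketch directly. The Python below is a hypothetical illustration, not existing project code: nodes are tuples such as `("true", "true")`, and `token_like_text` and `rule_for` are names invented for the example.

```python
def token_like_text(instances):
    """Return the shared text if every instance has exactly one anonymous
    string child and all texts agree; otherwise return None."""
    texts = set()
    for node in instances:
        if len(node) != 2 or not isinstance(node[1], str):
            return None
        texts.add(node[1])
    return texts.pop() if len(texts) == 1 else None

def rule_for(instances):
    """Literal rule for token-like nonterminals, None for everything else."""
    text = token_like_text(instances)
    return ("lit", text) if text is not None else None

print(rule_for([("true", "true"), ("true", "true")]))
print(rule_for([("num", "43"), ("num", "1")]))
```

Nonterminals whose instances disagree on the text, such as `num`, return `None` here and would continue through the existing character-level synthesis.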
Anti-unification makes negatives unnecessary for single-nonterminal synthesis (no overgeneralization by construction). Negatives remain useful for:
- Ambiguity resolution: when two nonterminals have overlapping patterns, negatives from one can disambiguate the other
- Held-out evaluation: split corpus, validate grammar on unseen examples
- Alist performance: nested alists with `assoc`/`equal?`. Fine for toy corpora, needs hash tables for real ones.
- Incremental learning: `learn-grammar` rebuilds from scratch each time. No incremental update path yet.
- Redundant alt branches: anti-unification produces `(alt (lit x) (lit x))` when all patterns in a skeleton group agree on the same literal at a position. Cosmetic; it doesn't affect correctness. (Only happens when the anti-unification fallback is used.)