This document is for AI agents editing this codebase. It explains the architecture, invariants, and non-obvious constraints that will break things if violated.
Racksynth learns a grammar from a corpus of concrete syntax trees (CSTs), then uses that grammar to parse text back into CSTs. It is built on miniKanren (a relational/logic programming engine) and uses PHOG (Probabilistic Higher-Order Grammar) statistics to bias search toward more probable parses.
The pipeline: corpus of CSTs in -> extract statistics -> anti-unify patterns -> synthesize regex rules -> register rules -> parse new text via relational regex matching.
Load order is strict. Every file assumes the previous ones are loaded.
mk-vicare.scm Trie/intmap for Chez Scheme substitution maps
|
mk.scm faster-miniKanren core (DO NOT MODIFY — see below)
|
gram.scm Regex engine, matcho, parseo, rule registry, reducers
|
tree.scm CST representation, predicates, traversal, corpus queries
|
phog.scm PHOG statistics, weighted search, anti-unification, grammar learning
Test files each load gram.scm (which chains to mk-vicare + mk), then tree.scm, then phog.scm. Run them with chez --script <file>.scm.
| Test file | What it covers |
|---|---|
root.scm |
Core regex matching, generation, synthesis, grammar rules |
test-parseo.scm |
parseo with reducers, CST output correctness |
test-phog.scm |
Everything in tree.scm and phog.scm — the main test suite |
test-verify.scm |
Structural verification of learned grammars |
Always run all four after any change. They're fast (< 2 seconds total).
mk.scm is the upstream faster-miniKanren implementation. We have not modified it and do not intend to. All custom search behavior (mplus-biased, condp-static, condp-budgeted) is implemented in phog.scm using the public case-inf protocol. If you think you need to change mk.scm, you're wrong — find another way.
There are two distinct matchers that operate on different domains. Confusing them is the most common source of bugs.
- Operates on lists of characters (
(#\h #\e #\l #\l #\o)) - Regex nodes:
lit,cat,alt,opt,rep,set,sym setuses byte ranges:(set (48 . 57))matches digit charssymdereferences the rule registry (*rules*) with cycle detectionmatchois purely relational — can run backwards to generate strings (exceptset, which requires ground input)parseois likematchobut threads a reducer to produce structured output
- Operates on lists of symbols (
(num plus num)) - Same regex node types but matches type-name symbols, not characters
- Has 7 parameters:
(sym-matcho regex input rest phog-table nonterminal pos lsib) - The last 4 are for PHOG weighting. Pass
'() #f 0 #fwhen you don't need weighting. full-sym-matchois acase-lambdaconvenience wrapper (2-arg or 4-arg forms, fills in0 #ffor pos/lsib)- Used during grammar learning to verify synthesized rules against observed patterns
Do not pass character lists to sym-matcho or symbol lists to matcho.
;; Nonterminal: (symbol children...)
'(expr (num "43") (*token* plus "+") (num "2"))
;; Token: (*token* name text)
'(*token* plus "+")
;; Anonymous leaf: bare string
" = "The *token* tag is a regular symbol. token? checks for it with (eq? (car x) '*token*). cst-node? explicitly excludes tokens: (and (pair? x) (symbol? (car x)) (not (token? x))).
Invariant: concatenating all leaves left-to-right via cst->string reconstructs the original source text exactly. Every byte is accounted for. Anonymous string leaves hold whitespace, punctuation, etc. that wasn't given a named token by the upstream parser.
There are two global registries. Both are alists mutated with set!.
The rule registry. define-rule prepends, lookup-rule searches with assq, clear-rules! resets to '(). learn-grammar calls clear-rules! at the start, then registers all learned rules. If you call learn-grammar twice, the first grammar is wiped.
Per-nonterminal context function overrides. Almost never used (all current code uses default-context-fn). register-context-fn! prepends. There is no clear-context-fns! — tests manually (set! *context-fns* '()) to clean up.
miniKanren search streams have four shapes. All custom stream combinators (mplus-biased, condp-static, condp-budgeted) must handle all four:
#f failure (no results)
(lambda () ...) suspended computation (thunk)
state single result (not a pair-with-procedure)
(cons state thunk) result + continuation
case-inf is a macro that dispatches on these. Study the mplus definition in mk.scm before writing new stream combinators. The discrimination logic is: #f = failure, procedure? = suspension, (pair? x) && (procedure? (cdr x)) = result+continuation, else = single result.
Landmine: a single state object must never be (cons non-proc non-proc) where the cdr is a procedure, or it'll be misinterpreted as result+continuation. This is handled by miniKanren's state representation (it's always a record, not a raw pair), but be aware if you ever construct streams manually.
Both take interleaved weight thunk weight thunk ... args where each thunk is (lambda (st) ...) returning a search stream.
condp-static: sorts by weight descending, chains with standardmplus. The highest-weight branch is tried first, but interleaving is still fair (round-robin).condp-budgeted: sorts by weight descending, chains withmplus-biased. The budget per branch ismax(1, round(weight / min-weight)). A branch with 10x the weight gets 10 search steps per 1 step of the next branch.
Both call (state-with-scope st (new-scope)) before executing thunks — this is required for correct miniKanren scoping in conde-like forms. Do not skip this.
sym-matcho threads pos (child index) and lsib (left-sibling type symbol) through the traversal. These are used to build context keys for phog-lookup.
The cat and rep branches use project to compute updated values after matching. This only works when input and mid are ground (parsing mode). When either is a logic variable (synthesis mode), it falls back to pos=0, lsib=#f. The guard is:
(project (input mid pos lsib)
(if (or (var? input) (var? mid))
;; synthesis fallback
...
;; parsing mode with position tracking
...))Do not remove the var? guards. Calling length on a logic variable will crash.
project is miniKanren's escape hatch: it walks the substitution to get ground values for logic variables, then runs ordinary Scheme code. Inside project:
var?checks if a value is still an unbound logic variable- You can call any Scheme function on ground values
- You return a goal (created by
==,conde,fresh, etc.) - You can return
succeedorfailfor boolean decisions
Landmine: if you project a variable that isn't ground yet, you get the raw variable object (a vector). Calling length, car, list-ref etc. on it will crash or return nonsense. Always check var? first.
Grammar rule synthesis does not use miniKanren search. It uses deterministic anti-unification:
- Group patterns by length
- Sort groups by corpus frequency
- For each group, transpose into columns
- For each column, if all values are identical ->
(sym x), otherwise ->(alt (sym a) (sym b) ...) - Concatenate columns into
catchains - If multiple length groups, combine with top-level
alt
This is O(n*m), produces no unbound variables, and never overgeneralizes. The previous miniKanren-based synthesis (syntho, synth-weight, etc.) was removed because it was exponentially slow and produced (opt _.0) garbage.
Do not try to re-introduce miniKanren search for rule synthesis. It was tried and failed.
learn-grammar does two passes:
- Register token types (
*token*nodes) as literal rules - Synthesize nonterminal rules (which may reference token names via
sym)
If a nonterminal rule references a token name that isn't registered yet, lookup-rule will crash. The two-pass order is load-bearing.
- R6RS / Chez Scheme, not Racket
list-sort(notsort)printfwith~aformat specifierlet-values/let*-valuesfor multiple return valuesdefine-record-type,fxsra,fxslletc. available but mostly used in mk-vicare.scmlast-pairis a Chez builtin (returns the last pair of a list)filteris available- No
hash-tableusage yet — everything is alists withassoc/assq
The set regex node (byte-range matching) uses project internally and requires ground input:
(project (c ranges)
(if (or (var? c) (var? ranges))
fail
...))This means (run 5 (s) (full-matcho (rep (digit)) s)) returns () — it can't generate characters from a range. Only lit-based regexes support backward generation. This is a known limitation, not a bug.
PHOG probabilities use Chez Scheme's exact rationals (1/3, 7/10, etc.), not floats. This is intentional — Witten-Bell smoothing produces exact fractions that sum to exactly 1. Use exact->inexact only for display. Do not convert to floats internally; it accumulates rounding error in the interpolation chain.
-
Passing wrong number of args to sym-matcho: it takes 7. If you see
incorrect argument count, count the args.full-sym-matchois the convenience wrapper. -
Forgetting
clear-rules!: if you learn a grammar and then learn another, stale rules from the first one may shadow the second.learn-grammarcallsclear-rules!internally, but if you're manually registering rules, clear first. -
Confusing
regex-weightandregex-weight-from-probs:regex-weighttakes a phog-table and returns marginal counts (used at synthesis time byphog-order-regex).regex-weight-from-probstakes a probability distribution fromphog-lookupand returns context-specific probabilities (used at parse time bysym-matcho). They serve different purposes at different stages. -
Testing parse? on prefix matches:
parse?requires full consumption (rest ='()). If your test input has trailing garbage,parse?correctly returns#f. This was fixed recently — older code had a bug whereparse?accepted prefix matches. -
Modifying regex AST structure: all regex nodes are binary.
(alt a b c)is invalid — it must be(alt a (alt b c)). The constructor functionsalt,cathandle this automatically (they're variadic and build right-associative trees). But if you construct AST nodes manually with quasiquote, you must get the nesting right. -
Vectors are variables: miniKanren uses vectors as logic variables (
(var? x)=(vector? x)). Never put actual vectors into terms. If you need vector-like data, use lists.