Racksynth — Agent Guide

This document is for AI agents editing this codebase. It explains the architecture, invariants, and non-obvious constraints that will break things if violated.

What This Project Does

Racksynth learns a grammar from a corpus of concrete syntax trees (CSTs), then uses that grammar to parse text back into CSTs. It is built on miniKanren (a relational/logic programming engine) and uses PHOG (Probabilistic Higher-Order Grammar) statistics to bias search toward more probable parses.

The pipeline: corpus of CSTs in -> extract statistics -> anti-unify patterns -> synthesize regex rules -> register rules -> parse new text via relational regex matching.

File Map and Load Order

Load order is strict. Every file assumes the previous ones are loaded.

mk-vicare.scm   Trie/intmap for Chez Scheme substitution maps
      |
   mk.scm       faster-miniKanren core (DO NOT MODIFY — see below)
      |
   gram.scm     Regex engine, matcho, parseo, rule registry, reducers
      |
   tree.scm     CST representation, predicates, traversal, corpus queries
      |
   phog.scm     PHOG statistics, weighted search, anti-unification, grammar learning

Test files each load gram.scm (which chains to mk-vicare + mk), then tree.scm, then phog.scm. Run them with chez --script <file>.scm.

Test file	What it covers
`root.scm`	Core regex matching, generation, synthesis, grammar rules
`test-parseo.scm`	parseo with reducers, CST output correctness
`test-phog.scm`	Everything in tree.scm and phog.scm — the main test suite
`test-verify.scm`	Structural verification of learned grammars

Always run all four after any change. They're fast (< 2 seconds total).

Do Not Modify mk.scm

mk.scm is the upstream faster-miniKanren implementation. We have not modified it and do not intend to. All custom search behavior (mplus-biased, condp-static, condp-budgeted) is implemented in phog.scm using the public case-inf protocol. If you think you need to change mk.scm, you're wrong — find another way.

Two Separate Regex Worlds

There are two distinct matchers that operate on different domains. Confusing them is the most common source of bugs.

Character-level: `matcho` / `parseo` (gram.scm)

Operates on lists of characters ((#\h #\e #\l #\l #\o))
Regex nodes: lit, cat, alt, opt, rep, set, sym
set uses byte ranges: (set (48 . 57)) matches digit chars
sym dereferences the rule registry (*rules*) with cycle detection
matcho is purely relational — can run backwards to generate strings (except set, which requires ground input)
parseo is like matcho but threads a reducer to produce structured output

Symbol-level: `sym-matcho` (phog.scm)

Operates on lists of symbols ((num plus num))
Same regex node types but matches type-name symbols, not characters
Has 7 parameters: (sym-matcho regex input rest phog-table nonterminal pos lsib)
The last 4 are for PHOG weighting. Pass '() #f 0 #f when you don't need weighting.
full-sym-matcho is a case-lambda convenience wrapper (2-arg or 4-arg forms, fills in 0 #f for pos/lsib)
Used during grammar learning to verify synthesized rules against observed patterns

Do not pass character lists to sym-matcho or symbol lists to matcho.

CST Format — Three Node Types

;; Nonterminal: (symbol children...)
'(expr (num "43") (*token* plus "+") (num "2"))

;; Token: (*token* name text)
'(*token* plus "+")

;; Anonymous leaf: bare string
" = "

The *token* tag is a regular symbol. token? checks for it with (eq? (car x) '*token*). cst-node? explicitly excludes tokens: (and (pair? x) (symbol? (car x)) (not (token? x))).

Invariant: concatenating all leaves left-to-right via cst->string reconstructs the original source text exactly. Every byte is accounted for. Anonymous string leaves hold whitespace, punctuation, etc. that wasn't given a named token by the upstream parser.

Global Mutable State

There are two global registries. Both are alists mutated with set!.

`rules` (gram.scm)

The rule registry. define-rule prepends, lookup-rule searches with assq, clear-rules! resets to '(). learn-grammar calls clear-rules! at the start, then registers all learned rules. If you call learn-grammar twice, the first grammar is wiped.

`context-fns` (phog.scm)

Per-nonterminal context function overrides. Almost never used (all current code uses default-context-fn). register-context-fn! prepends. There is no clear-context-fns! — tests manually (set! *context-fns* '()) to clean up.

The case-inf Protocol

miniKanren search streams have four shapes. All custom stream combinators (mplus-biased, condp-static, condp-budgeted) must handle all four:

#f                    failure (no results)
(lambda () ...)       suspended computation (thunk)
state                 single result (not a pair-with-procedure)
(cons state thunk)    result + continuation

case-inf is a macro that dispatches on these. Study the mplus definition in mk.scm before writing new stream combinators. The discrimination logic is: #f = failure, procedure? = suspension, (pair? x) && (procedure? (cdr x)) = result+continuation, else = single result.

Landmine: a single state object must never be (cons non-proc non-proc) where the cdr is a procedure, or it'll be misinterpreted as result+continuation. This is handled by miniKanren's state representation (it's always a record, not a raw pair), but be aware if you ever construct streams manually.

condp-static vs condp-budgeted

Both take interleaved weight thunk weight thunk ... args where each thunk is (lambda (st) ...) returning a search stream.

condp-static: sorts by weight descending, chains with standard mplus. The highest-weight branch is tried first, but interleaving is still fair (round-robin).
condp-budgeted: sorts by weight descending, chains with mplus-biased. The budget per branch is max(1, round(weight / min-weight)). A branch with 10x the weight gets 10 search steps per 1 step of the next branch.

Both call (state-with-scope st (new-scope)) before executing thunks — this is required for correct miniKanren scoping in conde-like forms. Do not skip this.

sym-matcho Position Threading

sym-matcho threads pos (child index) and lsib (left-sibling type symbol) through the traversal. These are used to build context keys for phog-lookup.

The cat and rep branches use project to compute updated values after matching. This only works when input and mid are ground (parsing mode). When either is a logic variable (synthesis mode), it falls back to pos=0, lsib=#f. The guard is:

(project (input mid pos lsib)
  (if (or (var? input) (var? mid))
      ;; synthesis fallback
      ...
      ;; parsing mode with position tracking
      ...))

Do not remove the var? guards. Calling length on a logic variable will crash.

project — Escaping to Scheme

project is miniKanren's escape hatch: it walks the substitution to get ground values for logic variables, then runs ordinary Scheme code. Inside project:

var? checks if a value is still an unbound logic variable
You can call any Scheme function on ground values
You return a goal (created by ==, conde, fresh, etc.)
You can return succeed or fail for boolean decisions

Landmine: if you project a variable that isn't ground yet, you get the raw variable object (a vector). Calling length, car, list-ref etc. on it will crash or return nonsense. Always check var? first.

Anti-Unification (Not miniKanren Search)

Grammar rule synthesis does not use miniKanren search. It uses deterministic anti-unification:

Group patterns by length
Sort groups by corpus frequency
For each group, transpose into columns
For each column, if all values are identical -> (sym x), otherwise -> (alt (sym a) (sym b) ...)
Concatenate columns into cat chains
If multiple length groups, combine with top-level alt

This is O(n*m), produces no unbound variables, and never overgeneralizes. The previous miniKanren-based synthesis (syntho, synth-weight, etc.) was removed because it was exponentially slow and produced (opt _.0) garbage.

Do not try to re-introduce miniKanren search for rule synthesis. It was tried and failed.

Token Rules Must Be Registered First

learn-grammar does two passes:

Register token types (*token* nodes) as literal rules
Synthesize nonterminal rules (which may reference token names via sym)

If a nonterminal rule references a token name that isn't registered yet, lookup-rule will crash. The two-pass order is load-bearing.

Chez Scheme Specifics

R6RS / Chez Scheme, not Racket
list-sort (not sort)
printf with ~a format specifier
let-values / let*-values for multiple return values
define-record-type, fxsra, fxsll etc. available but mostly used in mk-vicare.scm
last-pair is a Chez builtin (returns the last pair of a list)
filter is available
No hash-table usage yet — everything is alists with assoc/assq

set Nodes Cannot Generate Backwards

The set regex node (byte-range matching) uses project internally and requires ground input:

(project (c ranges)
  (if (or (var? c) (var? ranges))
      fail
      ...))

This means (run 5 (s) (full-matcho (rep (digit)) s)) returns () — it can't generate characters from a range. Only lit-based regexes support backward generation. This is a known limitation, not a bug.

Exact Arithmetic

PHOG probabilities use Chez Scheme's exact rationals (1/3, 7/10, etc.), not floats. This is intentional — Witten-Bell smoothing produces exact fractions that sum to exactly 1. Use exact->inexact only for display. Do not convert to floats internally; it accumulates rounding error in the interpolation chain.

Common Mistakes

Passing wrong number of args to sym-matcho: it takes 7. If you see incorrect argument count, count the args. full-sym-matcho is the convenience wrapper.
Forgetting clear-rules!: if you learn a grammar and then learn another, stale rules from the first one may shadow the second. learn-grammar calls clear-rules! internally, but if you're manually registering rules, clear first.
Confusing regex-weight and regex-weight-from-probs: regex-weight takes a phog-table and returns marginal counts (used at synthesis time by phog-order-regex). regex-weight-from-probs takes a probability distribution from phog-lookup and returns context-specific probabilities (used at parse time by sym-matcho). They serve different purposes at different stages.
Testing parse? on prefix matches: parse? requires full consumption (rest = '()). If your test input has trailing garbage, parse? correctly returns #f. This was fixed recently — older code had a bug where parse? accepted prefix matches.
Modifying regex AST structure: all regex nodes are binary. (alt a b c) is invalid — it must be (alt a (alt b c)). The constructor functions alt, cat handle this automatically (they're variadic and build right-associative trees). But if you construct AST nodes manually with quasiquote, you must get the nesting right.
Vectors are variables: miniKanren uses vectors as logic variables ((var? x) = (vector? x)). Never put actual vectors into terms. If you need vector-like data, use lists.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Racksynth — Agent Guide

What This Project Does

File Map and Load Order

Do Not Modify mk.scm

Two Separate Regex Worlds

Character-level: `matcho` / `parseo` (gram.scm)

Symbol-level: `sym-matcho` (phog.scm)

CST Format — Three Node Types

Global Mutable State

`rules` (gram.scm)

`context-fns` (phog.scm)

The case-inf Protocol

condp-static vs condp-budgeted

sym-matcho Position Threading

project — Escaping to Scheme

Anti-Unification (Not miniKanren Search)

Token Rules Must Be Registered First

Chez Scheme Specifics

set Nodes Cannot Generate Backwards

Exact Arithmetic

Common Mistakes

FilesExpand file tree

AGENTS.md

Latest commit

History

AGENTS.md

File metadata and controls

Racksynth — Agent Guide

What This Project Does

File Map and Load Order

Do Not Modify mk.scm

Two Separate Regex Worlds

Character-level: matcho / parseo (gram.scm)

Symbol-level: sym-matcho (phog.scm)

CST Format — Three Node Types

Global Mutable State

*rules* (gram.scm)

*context-fns* (phog.scm)

The case-inf Protocol

condp-static vs condp-budgeted

sym-matcho Position Threading

project — Escaping to Scheme

Anti-Unification (Not miniKanren Search)

Token Rules Must Be Registered First

Chez Scheme Specifics

set Nodes Cannot Generate Backwards

Exact Arithmetic

Common Mistakes

Character-level: `matcho` / `parseo` (gram.scm)

Symbol-level: `sym-matcho` (phog.scm)

`rules` (gram.scm)

`context-fns` (phog.scm)