Skip to content

Latest commit

 

History

History
206 lines (138 loc) · 11.4 KB

File metadata and controls

206 lines (138 loc) · 11.4 KB

Racksynth — Agent Guide

This document is for AI agents editing this codebase. It explains the architecture, invariants, and non-obvious constraints that will break things if violated.

What This Project Does

Racksynth learns a grammar from a corpus of concrete syntax trees (CSTs), then uses that grammar to parse text back into CSTs. It is built on miniKanren (a relational/logic programming engine) and uses PHOG (Probabilistic Higher-Order Grammar) statistics to bias search toward more probable parses.

The pipeline: corpus of CSTs in -> extract statistics -> anti-unify patterns -> synthesize regex rules -> register rules -> parse new text via relational regex matching.

File Map and Load Order

Load order is strict. Every file assumes the previous ones are loaded.

mk-vicare.scm   Trie/intmap for Chez Scheme substitution maps
      |
   mk.scm       faster-miniKanren core (DO NOT MODIFY — see below)
      |
   gram.scm     Regex engine, matcho, parseo, rule registry, reducers
      |
   tree.scm     CST representation, predicates, traversal, corpus queries
      |
   phog.scm     PHOG statistics, weighted search, anti-unification, grammar learning

Test files each load gram.scm (which chains to mk-vicare + mk), then tree.scm, then phog.scm. Run them with chez --script <file>.scm.

Test file What it covers
root.scm Core regex matching, generation, synthesis, grammar rules
test-parseo.scm parseo with reducers, CST output correctness
test-phog.scm Everything in tree.scm and phog.scm — the main test suite
test-verify.scm Structural verification of learned grammars

Always run all four after any change. They're fast (< 2 seconds total).

Do Not Modify mk.scm

mk.scm is the upstream faster-miniKanren implementation. We have not modified it and do not intend to. All custom search behavior (mplus-biased, condp-static, condp-budgeted) is implemented in phog.scm using the public case-inf protocol. If you think you need to change mk.scm, you're wrong — find another way.

Two Separate Regex Worlds

There are two distinct matchers that operate on different domains. Confusing them is the most common source of bugs.

Character-level: matcho / parseo (gram.scm)

  • Operates on lists of characters ((#\h #\e #\l #\l #\o))
  • Regex nodes: lit, cat, alt, opt, rep, set, sym
  • set uses byte ranges: (set (48 . 57)) matches digit chars
  • sym dereferences the rule registry (*rules*) with cycle detection
  • matcho is purely relational — can run backwards to generate strings (except set, which requires ground input)
  • parseo is like matcho but threads a reducer to produce structured output

Symbol-level: sym-matcho (phog.scm)

  • Operates on lists of symbols ((num plus num))
  • Same regex node types but matches type-name symbols, not characters
  • Has 7 parameters: (sym-matcho regex input rest phog-table nonterminal pos lsib)
  • The last 4 are for PHOG weighting. Pass '() #f 0 #f when you don't need weighting.
  • full-sym-matcho is a case-lambda convenience wrapper (2-arg or 4-arg forms, fills in 0 #f for pos/lsib)
  • Used during grammar learning to verify synthesized rules against observed patterns

Do not pass character lists to sym-matcho or symbol lists to matcho.

CST Format — Three Node Types

;; Nonterminal: (symbol children...)
'(expr (num "43") (*token* plus "+") (num "2"))

;; Token: (*token* name text)
'(*token* plus "+")

;; Anonymous leaf: bare string
" = "

The *token* tag is a regular symbol. token? checks for it with (eq? (car x) '*token*). cst-node? explicitly excludes tokens: (and (pair? x) (symbol? (car x)) (not (token? x))).

Invariant: concatenating all leaves left-to-right via cst->string reconstructs the original source text exactly. Every byte is accounted for. Anonymous string leaves hold whitespace, punctuation, etc. that wasn't given a named token by the upstream parser.

Global Mutable State

There are two global registries. Both are alists mutated with set!.

*rules* (gram.scm)

The rule registry. define-rule prepends, lookup-rule searches with assq, clear-rules! resets to '(). learn-grammar calls clear-rules! at the start, then registers all learned rules. If you call learn-grammar twice, the first grammar is wiped.

*context-fns* (phog.scm)

Per-nonterminal context function overrides. Almost never used (all current code uses default-context-fn). register-context-fn! prepends. There is no clear-context-fns! — tests manually (set! *context-fns* '()) to clean up.

The case-inf Protocol

miniKanren search streams have four shapes. All custom stream combinators (mplus-biased, condp-static, condp-budgeted) must handle all four:

#f                    failure (no results)
(lambda () ...)       suspended computation (thunk)
state                 single result (not a pair-with-procedure)
(cons state thunk)    result + continuation

case-inf is a macro that dispatches on these. Study the mplus definition in mk.scm before writing new stream combinators. The discrimination logic is: #f = failure, procedure? = suspension, (pair? x) && (procedure? (cdr x)) = result+continuation, else = single result.

Landmine: a single state object must never be (cons non-proc non-proc) where the cdr is a procedure, or it'll be misinterpreted as result+continuation. This is handled by miniKanren's state representation (it's always a record, not a raw pair), but be aware if you ever construct streams manually.

condp-static vs condp-budgeted

Both take interleaved weight thunk weight thunk ... args where each thunk is (lambda (st) ...) returning a search stream.

  • condp-static: sorts by weight descending, chains with standard mplus. The highest-weight branch is tried first, but interleaving is still fair (round-robin).
  • condp-budgeted: sorts by weight descending, chains with mplus-biased. The budget per branch is max(1, round(weight / min-weight)). A branch with 10x the weight gets 10 search steps per 1 step of the next branch.

Both call (state-with-scope st (new-scope)) before executing thunks — this is required for correct miniKanren scoping in conde-like forms. Do not skip this.

sym-matcho Position Threading

sym-matcho threads pos (child index) and lsib (left-sibling type symbol) through the traversal. These are used to build context keys for phog-lookup.

The cat and rep branches use project to compute updated values after matching. This only works when input and mid are ground (parsing mode). When either is a logic variable (synthesis mode), it falls back to pos=0, lsib=#f. The guard is:

(project (input mid pos lsib)
  (if (or (var? input) (var? mid))
      ;; synthesis fallback
      ...
      ;; parsing mode with position tracking
      ...))

Do not remove the var? guards. Calling length on a logic variable will crash.

project — Escaping to Scheme

project is miniKanren's escape hatch: it walks the substitution to get ground values for logic variables, then runs ordinary Scheme code. Inside project:

  • var? checks if a value is still an unbound logic variable
  • You can call any Scheme function on ground values
  • You return a goal (created by ==, conde, fresh, etc.)
  • You can return succeed or fail for boolean decisions

Landmine: if you project a variable that isn't ground yet, you get the raw variable object (a vector). Calling length, car, list-ref etc. on it will crash or return nonsense. Always check var? first.

Anti-Unification (Not miniKanren Search)

Grammar rule synthesis does not use miniKanren search. It uses deterministic anti-unification:

  1. Group patterns by length
  2. Sort groups by corpus frequency
  3. For each group, transpose into columns
  4. For each column, if all values are identical -> (sym x), otherwise -> (alt (sym a) (sym b) ...)
  5. Concatenate columns into cat chains
  6. If multiple length groups, combine with top-level alt

This is O(n*m), produces no unbound variables, and never overgeneralizes. The previous miniKanren-based synthesis (syntho, synth-weight, etc.) was removed because it was exponentially slow and produced (opt _.0) garbage.

Do not try to re-introduce miniKanren search for rule synthesis. It was tried and failed.

Token Rules Must Be Registered First

learn-grammar does two passes:

  1. Register token types (*token* nodes) as literal rules
  2. Synthesize nonterminal rules (which may reference token names via sym)

If a nonterminal rule references a token name that isn't registered yet, lookup-rule will crash. The two-pass order is load-bearing.

Chez Scheme Specifics

  • R6RS / Chez Scheme, not Racket
  • list-sort (not sort)
  • printf with ~a format specifier
  • let-values / let*-values for multiple return values
  • define-record-type, fxsra, fxsll etc. available but mostly used in mk-vicare.scm
  • last-pair is a Chez builtin (returns the last pair of a list)
  • filter is available
  • No hash-table usage yet — everything is alists with assoc/assq

set Nodes Cannot Generate Backwards

The set regex node (byte-range matching) uses project internally and requires ground input:

(project (c ranges)
  (if (or (var? c) (var? ranges))
      fail
      ...))

This means (run 5 (s) (full-matcho (rep (digit)) s)) returns () — it can't generate characters from a range. Only lit-based regexes support backward generation. This is a known limitation, not a bug.

Exact Arithmetic

PHOG probabilities use Chez Scheme's exact rationals (1/3, 7/10, etc.), not floats. This is intentional — Witten-Bell smoothing produces exact fractions that sum to exactly 1. Use exact->inexact only for display. Do not convert to floats internally; it accumulates rounding error in the interpolation chain.

Common Mistakes

  1. Passing wrong number of args to sym-matcho: it takes 7. If you see incorrect argument count, count the args. full-sym-matcho is the convenience wrapper.

  2. Forgetting clear-rules!: if you learn a grammar and then learn another, stale rules from the first one may shadow the second. learn-grammar calls clear-rules! internally, but if you're manually registering rules, clear first.

  3. Confusing regex-weight and regex-weight-from-probs: regex-weight takes a phog-table and returns marginal counts (used at synthesis time by phog-order-regex). regex-weight-from-probs takes a probability distribution from phog-lookup and returns context-specific probabilities (used at parse time by sym-matcho). They serve different purposes at different stages.

  4. Testing parse? on prefix matches: parse? requires full consumption (rest = '()). If your test input has trailing garbage, parse? correctly returns #f. This was fixed recently — older code had a bug where parse? accepted prefix matches.

  5. Modifying regex AST structure: all regex nodes are binary. (alt a b c) is invalid — it must be (alt a (alt b c)). The constructor functions alt, cat handle this automatically (they're variadic and build right-associative trees). But if you construct AST nodes manually with quasiquote, you must get the nesting right.

  6. Vectors are variables: miniKanren uses vectors as logic variables ((var? x) = (vector? x)). Never put actual vectors into terms. If you need vector-like data, use lists.