
PHOG-Weighted Grammar Synthesis — Progress

Current Capabilities

Regex Engine (gram.scm)

  • 3-letter primitives: lit, cat, alt, opt, rep, sym, set
  • set byte-range primitive: (set (lo . hi) ...) for character class matching (like re2/rust regex crate)
  • Named classes: digit, lower, upper, alpha, alnum, blank, any
  • matcho: relational regex matching with cycle detection for recursive grammars
  • parseo: like matcho but threads a reducer function, produces structured output at sym boundaries
  • make-tree-reducer builds CST nodes, make-identity-reducer returns matched text
  • Rule registry: define-rule, lookup-rule, clear-rules!
  • full-matcho: can synthesize regex from positive/negative examples (from root.scm era)
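As a rough illustration of the regex algebra above (not the actual engine — matcho in gram.scm is relational miniKanren), here is a plain functional matcher in Python over the same constructors, assuming rep means zero-or-more and cat/alt are binary:

```python
# Illustrative sketch of the s-expression regex algebra from gram.scm.
# Regexes are tuples: ("lit", text), ("set", (lo, hi), ...), ("cat", a, b),
# ("alt", a, b), ("opt", a), ("rep", a).

def matches(rx, s):
    """Return the set of suffixes left after matching rx against a prefix of s."""
    op = rx[0]
    if op == "lit":                      # exact prefix match
        t = rx[1]
        return {s[len(t):]} if s.startswith(t) else set()
    if op == "set":                      # one byte within any (lo . hi) range
        ok = s and any(lo <= ord(s[0]) <= hi for lo, hi in rx[1:])
        return {s[1:]} if ok else set()
    if op == "cat":                      # sequence
        return {r2 for r1 in matches(rx[1], s) for r2 in matches(rx[2], r1)}
    if op == "alt":                      # alternation
        return matches(rx[1], s) | matches(rx[2], s)
    if op == "opt":                      # zero or one
        return {s} | matches(rx[1], s)
    if op == "rep":                      # zero or more (fixpoint over suffixes)
        out, frontier = {s}, {s}
        while frontier:
            frontier = {r for f in frontier for r in matches(rx[1], f)} - out
            out |= frontier
        return out
    raise ValueError(op)

def full_match(rx, s):
    """True iff rx consumes all of s."""
    return "" in matches(rx, s)
```

For example, the learned num rule (rep (set (48 . 57))) becomes ("rep", ("set", (48, 57))) and accepts "43" but not "4a".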

CST Representation (tree.scm)

  • Three-way node distinction: nonterminal (symbol children...), token (*token* name text), anonymous string leaf "text"
  • cst->string reconstructs original source from any CST node
  • cst-child-types extracts type-symbol sequence from a node's children
  • cst-walk traverses corpus with full 5-element context (node-type, position, parent-type, grandparent-type, left-sibling-type)
  • Corpus queries: cst-collect-types, cst-collect-instances, cst-collect-child-patterns, cst-collect-token-types
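The three node shapes and the two basic traversals can be sketched as follows (a Python illustration of cst->string and cst-child-types, not the tree.scm code itself):

```python
# CST node shapes from tree.scm, modeled as Python tuples/strings:
#   nonterminal: (symbol, child, ...)
#   token:       ("*token*", name, text)
#   anonymous:   "text"

def cst_to_string(node):
    """Reconstruct the source text covered by a CST node."""
    if isinstance(node, str):             # anonymous string leaf
        return node
    if node[0] == "*token*":              # (*token* name text)
        return node[2]
    return "".join(cst_to_string(c) for c in node[1:])  # nonterminal

def cst_child_types(node):
    """Type symbol per child: nonterminal name, token name, or *anonymous*."""
    types = []
    for c in node[1:]:
        if isinstance(c, str):
            types.append("*anonymous*")
        elif c[0] == "*token*":
            types.append(c[1])
        else:
            types.append(c[0])
    return types
```

On the arithmetic example (expr (num "43") (*token* plus "+") (num "2")), cst_to_string yields "43+2" and cst_child_types yields (num plus num).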

PHOG Statistics (phog.scm)

  • extract-phog-stats builds two tables from a corpus:
    1. PHOG table: nonterminal → (context → ((production . count) ...))
    2. Pattern table: nonterminal → ((child-pattern . count) ...)
  • phog-lookup with Witten-Bell smoothing and 5-level backoff chain (full context → drop features right-to-left → global merge)
  • Per-nonterminal context functions via *context-fns* registry; default uses all 5 features
  • condp-static: weighted conde — sorts branches by descending weight, chains via mplus
  • condp-budgeted: like condp-static but chains with mplus-biased — proportional search budget allocation (weight/min-weight ratio per branch)
  • mplus-biased: budget-based interleaving — gives left stream k steps per 1 step of right, follows case-inf protocol (no mk.scm changes needed)
  • regex-weight-from-probs: extracts weight of a regex branch from a context-specific phog-lookup probability distribution
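One level of the Witten-Bell interpolation used by phog-lookup can be sketched like this (an illustrative re-derivation, not the phog.scm code — the real lookup chains five such levels of backoff):

```python
# Witten-Bell smoothing for one context level: interpolate the context's
# raw counts with the next backoff level's distribution, giving the backoff
# a weight proportional to T, the number of distinct productions seen here.

def witten_bell(counts, backoff):
    """counts: {production: count} observed in one context.
    backoff: {production: prob} from the next level of the backoff chain."""
    n = sum(counts.values())      # total observations in this context
    t = len(counts)               # distinct productions seen in this context
    if n == 0:
        return dict(backoff)      # unseen context: fall straight through
    return {p: (counts.get(p, 0) + t * backoff.get(p, 0.0)) / (n + t)
            for p in set(counts) | set(backoff)}
```

With counts {a: 3, b: 1} and a backoff distribution {a: 0.5, b: 0.25, c: 0.25}, the smoothed probabilities are a ≈ 0.667, b = 0.25, c ≈ 0.083 — unseen-in-context productions get nonzero mass, and the result still sums to 1.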

Grammar Learning Pipeline (phog.scm)

  • learn-grammar: end-to-end — registers token types as literal rules, runs bulk parallel search, then synthesizes each nonterminal
  • learn-single-rule: dispatches between empty nodes, token-like nodes, and nonterminals
  • Tiered parallel search (primary path for multi-pattern nonterminals):
    • bulk-search-rules launches all multi-pattern symbols in parallel
    • Tier 1: one engine-bounded thread per symbol (~2s fuel), finds simple types (document, pair) in ~1ms
    • Tier 2: or-parallel for tier-1 failures — 5 threads per symbol (one per top-level regex constructor: cat/alt/rep/opt/sym), first ground result wins
    • All threads use Chez make-engine for bounded computation — self-terminate after fuel exhaustion (no zombie threads, no memory leaks)
    • regex-ground? filter rejects results with unbound miniKanren variables (e.g., _.0)
    • Falls back to anti-unification when search times out or produces no ground result
  • Anti-unification (fallback for multi-pattern synthesis):
    • group-by-skeleton → transpose → anti-unify-column → anti-unify-group → anti-unify-patterns
    • O(n*m) deterministic — no search, no unbound variables, no overgeneralization
    • pattern-skeleton maps named symbols → *sym*, keeps *anonymous* — two patterns group iff same skeleton
    • Anonymous string inlining via collect-anonymous-texts-at-position + inline-anonymous-texts — scans ALL patterns in a skeleton group
    • Skeleton groups ordered by pattern-table frequency (most common first)
  • PHOG-ordered alt branches: phog-order-regex post-processes both search and anti-unified regexes, reordering alt children by marginal PHOG frequency
  • PHOG-weighted parsing via sym-matcho:
    • 7-arg signature: (sym-matcho regex input rest phog-table nonterminal pos lsib)
    • pos (child index) and lsib (left-sibling type) threaded through cat/rep branches via project
    • Alt branch uses condp-budgeted with context-specific PHOG weights (not just marginals): builds (nonterminal pos nonterminal #f lsib) context, calls phog-lookup for Witten-Bell-smoothed probabilities
    • Graceful degradation: unground vars or empty PHOG table → equal weights → same as old conde
    • full-sym-matcho via case-lambda: 2-arg (regex pat) or 4-arg (regex pat phog-table nonterminal) — passes 0 #f for pos/lsib
  • Character-level synthesis: leaf nonterminals generalize — learns (rep (set (48 . 57))) instead of memorizing (alt (lit "43") (lit "1") ...)
  • Anonymous string handling: anonymous string leaves (e.g., " = ") are inlined as (lit ...) in synthesized rules
  • verify-grammar: shallow check — did parse? consume the text?
  • verify-grammar-structural: deep check — parses via parseo + tree-reducer, compares parsed CST structure against original corpus CSTs
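The character-level generalization step above — learning (rep (set (48 . 57))) rather than memorizing literals — can be sketched as follows. This is a hypothetical helper (synth_char_rule is not a real function name); it only covers bytes actually observed, whereas the real synthesizer can also widen to named classes like digit:

```python
# Collapse a leaf nonterminal's observed texts into a byte-range rule
# (rep (set (lo . hi) ...)) instead of (alt (lit "43") (lit "1") ...).

def synth_char_rule(texts):
    bytes_seen = sorted({ord(ch) for t in texts for ch in t})
    ranges, start = [], None
    for b in bytes_seen:                  # merge adjacent bytes into ranges
        if start is None:
            start = prev = b
        elif b == prev + 1:
            prev = b
        else:
            ranges.append((start, prev))
            start = prev = b
    if start is not None:
        ranges.append((start, prev))
    return ("rep", ("set", *ranges))
```

Feeding it the texts "43", "1", "250" yields ("rep", ("set", (48, 53))) — one contiguous digit range rather than three memorized literals.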

JSON Corpus (run-json.scm, corpus-json.scm)

  • 81 CSTs from JSONTestSuite, 11 nonterminal types
  • Full pipeline: extract-phog-stats → learn-grammar → verify-grammar
  • Rule synthesis works: search finds rep-based rules for all complex types in <25s total:
    • document → (alt (sym array) (alt (sym object) ...))
    • pair → (cat (cat (sym string) (sym *anonymous*)) (alt (sym string) ...))
    • string → (rep (alt (sym *anonymous*) (alt (sym string_content) (sym escape_sequence))))
    • object → (rep (alt (sym *anonymous*) (sym pair)))
    • array → (rep (alt (sym *anonymous*) (alt (sym string) ...)))
  • Verification blocked: verify-grammar (character-level parseo) hangs on rep-of-alt rules — see Open Issues

Output Quality

The arithmetic corpus ((expr (num "43") (*token* plus "+") (num "2")) ...) produces:

minus → (lit -)
plus  → (lit +)
expr  → (alt (sym num) (cat (sym num) (cat (alt (sym plus) (sym minus)) (sym num))))
num   → (rep (set (48 . 57)))
  • No unbound variables (_.0), no overgeneralization
  • Parses all corpus examples + generalizes to unseen valid inputs (5+3, 10-1)
  • Rejects invalid inputs
  • PHOG ordering puts more frequent branches first

Test Coverage (266 total, all passing)

  • root.scm: 35 — regex constructors, matching, generation, synthesis, grammar rules
  • test-phog.scm: 174 — tree predicates, stats, Witten-Bell, condp-static, condp-budgeted, mplus-biased, regex-weight-from-probs, sym-matcho (weighted + unweighted + context-specific), position threading, anti-unification helpers (group-by-skeleton, anti-unify column/group/patterns), PHOG ordering, rule synthesis, grammar learning, verification, set primitive, char-level synthesis, anonymous strings, end-to-end quality
  • test-parseo.scm: 32 — identity/tree reducers, nested sym, matched text correctness
  • test-verify.scm: 25 — structural verification helpers, full pipeline structural checks

Problems Solved

(opt _.0) overgeneralization — FIXED

The old miniKanren-synthesized expr rule had (opt _.0) accepting any single character. Anti-unification produces exactly the least-general generalization — no unbound variables by construction.
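The column-wise construction can be sketched like this (an illustrative Python version; names are hypothetical, and it uses n-ary cat/alt where the Scheme regexes nest binary constructors):

```python
# Column-wise anti-unification over same-skeleton child-type patterns:
# a column where all patterns agree keeps its symbol; a disagreeing column
# becomes an alt over exactly the symbols observed — the least general
# generalization, with no unbound variables by construction.

def anti_unify(patterns):
    assert len({len(p) for p in patterns}) == 1   # same skeleton => same length
    out = []
    for column in zip(*patterns):                 # transpose: one column per slot
        syms = sorted(set(column))
        out.append(("sym", syms[0]) if len(syms) == 1
                   else ("alt", *[("sym", s) for s in syms]))
    return ("cat", *out) if len(out) > 1 else out[0]
```

Anti-unifying (num plus num) with (num minus num) yields cat(num, alt(minus, plus), num) — matching the expr rule shown under Output Quality, and never an (opt _.0)-style wildcard.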

PHOG weights dead — FIXED

Previously, PHOG statistics were computed but never reached synthesis decisions. Now wired in at three levels:

  1. phog-order-regex reorders alt branches in synthesized rules by marginal PHOG frequency
  2. sym-matcho uses condp-budgeted with context-specific PHOG weights at alt choice points during parsing
  3. mplus-biased gives proportionally more search budget to higher-weight branches (not just static reordering)

pattern-table unused — FIXED

Now drives frequency ordering of skeleton groups in anti-unify-patterns (most common pattern skeleton first in top-level alt).

Exponential synthesis — FIXED

Replaced the exponential-time miniKanren (run budget (regex) ...) search with O(n*m) deterministic anti-unification. Dead code (syntho, synth-weight, make-synth-ctx, etc.) cleaned up. Symbol-level search now available as the primary path via engine-bounded parallel tiered search.

Anti-unification structural misalignment — FIXED

group-by-length merged same-length patterns with different anonymous/symbol positions, causing column-wise anti-unification to produce broken regexes. Replaced with group-by-skeleton — groups only when anonymous positions align. anti-unify-column now scans all patterns in the group (not just the representative) for anonymous text collection.
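The grouping invariant can be sketched in Python (an illustration of pattern-skeleton and group-by-skeleton, not the Scheme code):

```python
# A pattern's skeleton maps every named symbol to *sym* but keeps *anonymous*
# in place, so two patterns land in the same group only when their anonymous
# positions align — same length alone is not enough.

def pattern_skeleton(pattern):
    return tuple("*anonymous*" if t == "*anonymous*" else "*sym*"
                 for t in pattern)

def group_by_skeleton(patterns):
    groups = {}
    for p in patterns:
        groups.setdefault(pattern_skeleton(p), []).append(p)
    return groups
```

(string *anonymous* number) and (string *anonymous* string) share a skeleton and are anti-unified column-wise; (*anonymous* pair) gets its own group even though a same-length named pattern might exist.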

Search integrated into learn-grammar — DONE

Or-parallel tiered search from try-search.scm is now the primary multi-pattern synthesis path in learn-grammar. Key engineering:

  • Bulk parallelism: all multi-pattern symbols searched simultaneously (not sequentially), total wall-time ~25s for JSON corpus
  • Engine-bounded threads: Chez make-engine wraps each thread's computation with a fuel limit. Threads self-terminate after exhausting ticks — eliminates the zombie thread / 27GB RAM leak that occurred with bare fork-thread
  • Ground filter: regex-ground? rejects search results containing unbound miniKanren variables, falling back to anti-unification
  • Graceful fallback: synth-rule-for-symbol is now case-lambda — takes optional search-results alist, uses anti-unification when no search result available

Context-specific PHOG in sym-matcho — DONE

sym-matcho now threads pos and lsib through the regex traversal. Alt branches build a 5-element context key and call phog-lookup for Witten-Bell-smoothed, context-specific probabilities. Falls back to marginal/uniform gracefully when context data is sparse.

Weighted interleaving — DONE (no mk.scm changes needed)

mplus-biased implements proportional budget allocation at the stream level, following the same case-inf protocol as mplus. condp-budgeted computes budget ratios (weight/min-weight) and chains branches with mplus-biased. Zero changes to mk.scm — the abstraction sits entirely in phog.scm.
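The budget idea can be illustrated with Python generators (a sketch only — the real mplus-biased operates on miniKanren streams via the case-inf protocol, not eager iterators):

```python
# Biased interleaving: take k items from the favored (higher-weight) stream
# for every 1 from the other, instead of strict 1:1 interleaving. In
# condp-budgeted, k is the branch's weight/min-weight ratio.

def mplus_biased(left, right, k):
    left, right = iter(left), iter(right)
    done_l = done_r = False
    while not (done_l and done_r):
        for _ in range(k):                 # k steps of the favored stream
            try:
                yield next(left)
            except StopIteration:
                done_l = True
                break
        if not done_r:
            try:
                yield next(right)          # one step of the other stream
            except StopIteration:
                done_r = True
```

With k=2 the left stream surfaces answers twice as fast, but the right stream is never starved — every round still yields one of its items, preserving completeness.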


Open Issues

1. Verification hangs with rep-based rules (blocking)

The search produces correct rep-of-alt rules (e.g., (rep (alt (sym *anonymous*) (sym pair))) for object), but verify-grammar — which uses character-level parseo — hangs. The root cause is exponential backtracking: rep tries every possible way to split the input into repetitions, and each split tries every alt branch, which in turn does character-level matching via matcho. A 17-character object took 44 seconds; longer inputs never terminate.

This is the critical blocker. The learned rules are correct at the symbol level but untestable at the character level.

Possible solutions:

  • Symbol-level verification: verify rules against child-type patterns via full-sym-matcho (symbol-level, not character-level). This bypasses parseo entirely and directly tests what the search actually operates on. Fast and correct, but doesn't validate character-level parsing end-to-end.
  • Bounded verification: wrap parseo calls in make-engine with a fuel limit. Report "inconclusive" instead of hanging. Simple but doesn't actually verify.
  • Fix *anonymous*: currently (rep (any)) — too permissive. Replace with corpus-observed literal texts (attempted; reverted per user request). This would make parseo tractable by eliminating the infinite loop in *anonymous* expansion, but requires careful handling of which texts to include.

2. *anonymous* rule is (rep (any)) — too permissive for parsing

Search-produced regexes reference (sym *anonymous*) for structural tokens (braces, commas, colons, whitespace). The catch-all (rep (any)) matches any string, which:

  • Makes character-level parseo exponentially slow (see issue #1)
  • Means the grammar doesn't validate punctuation — only structure

User explicitly deferred this: "let's not kill *anonymous* just yet."

3. Token type detection for non-*token* corpora

JSON corpus nodes like (true "true") aren't in (*token* name text) format, so cst-collect-token-types returns (). These get overly permissive char-level rules (e.g., (rep (set (97 . 122)))) instead of (lit "true"). Need a heuristic: nonterminals with a single *anonymous* child where all instances share the same text → treat as token.
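The proposed heuristic could look like this (a hypothetical sketch — tokenish is not an existing function):

```python
# A nonterminal whose every corpus instance is a single anonymous string
# with identical text behaves like a token: emit (lit text) for it instead
# of an over-permissive character-class rule.

def tokenish(instances):
    """instances: CST nodes for one nonterminal, e.g. ("true", "true")."""
    texts = set()
    for node in instances:
        if len(node) != 2 or not isinstance(node[1], str):
            return None               # not a single-anonymous-child node
        texts.add(node[1])
    return ("lit", texts.pop()) if len(texts) == 1 else None
```

(true "true") instances collapse to (lit "true"), while (number "43") / (number "1") correctly fail the shared-text check and fall through to character-level synthesis.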


Deferred Enhancements

Negative examples for disambiguation

Anti-unification makes negatives unnecessary for single-nonterminal synthesis (no overgeneralization by construction). Negatives remain useful for:

  • Ambiguity resolution: when two nonterminals have overlapping patterns, negatives from one can disambiguate the other
  • Held-out evaluation: split corpus, validate grammar on unseen examples

Performance

  • Alist performance: nested alists with assoc/equal?. Fine for toy corpora, needs hash tables for real ones.
  • Incremental learning: learn-grammar rebuilds from scratch each time. No incremental update path yet.

Cosmetic

  • Redundant alt branches: anti-unification produces (alt (lit x) (lit x)) when all patterns in a skeleton group agree on the same literal at a position. Cosmetic — doesn't affect correctness. (Only happens when anti-unification fallback is used.)