
PHOG-Weighted Grammar Synthesis — Progress

Current Capabilities

Regex Engine (gram.scm)

  • 3-letter primitives: lit, cat, alt, opt, rep, sym, set
  • set byte-range primitive: (set (lo . hi) ...) for character class matching (like re2/rust regex crate)
  • Named classes: digit, lower, upper, alpha, alnum, blank, any
  • matcho: relational regex matching with cycle detection for recursive grammars
  • parseo: like matcho but threads a reducer function, produces structured output at sym boundaries
  • make-tree-reducer builds CST nodes, make-identity-reducer returns matched text
  • Rule registry: define-rule, lookup-rule, clear-rules!
  • full-matcho: can synthesize regex from positive/negative examples (from root.scm era)
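As a rough illustration of the regex algebra above (not the actual engine — matcho in gram.scm is relational miniKanren), here is a plain functional matcher in Python over the same constructors, assuming rep means zero-or-more and cat/alt are binary:

```python
# Illustrative sketch of the s-expression regex algebra from gram.scm.
# Regexes are tuples: ("lit", text), ("set", (lo, hi), ...), ("cat", a, b),
# ("alt", a, b), ("opt", a), ("rep", a).

def matches(rx, s):
    """Return the set of suffixes left after matching rx against a prefix of s."""
    op = rx[0]
    if op == "lit":                      # exact prefix match
        t = rx[1]
        return {s[len(t):]} if s.startswith(t) else set()
    if op == "set":                      # one byte within any (lo . hi) range
        ok = s and any(lo <= ord(s[0]) <= hi for lo, hi in rx[1:])
        return {s[1:]} if ok else set()
    if op == "cat":                      # sequence
        return {r2 for r1 in matches(rx[1], s) for r2 in matches(rx[2], r1)}
    if op == "alt":                      # alternation
        return matches(rx[1], s) | matches(rx[2], s)
    if op == "opt":                      # zero or one
        return {s} | matches(rx[1], s)
    if op == "rep":                      # zero or more (fixpoint over suffixes)
        out, frontier = {s}, {s}
        while frontier:
            frontier = {r for f in frontier for r in matches(rx[1], f)} - out
            out |= frontier
        return out
    raise ValueError(op)

def full_match(rx, s):
    """True iff rx consumes all of s."""
    return "" in matches(rx, s)
```

For example, the learned num rule (rep (set (48 . 57))) becomes ("rep", ("set", (48, 57))) and accepts "43" but not "4a".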

CST Representation (tree.scm)

  • Three-way node distinction: nonterminal (symbol children...), token (*token* name text), anonymous string leaf "text"
  • cst->string reconstructs original source from any CST node
  • cst-child-types extracts type-symbol sequence from a node's children
  • cst-walk traverses corpus with full 5-element context (node-type, position, parent-type, grandparent-type, left-sibling-type)
  • Corpus queries: cst-collect-types, cst-collect-instances, cst-collect-child-patterns, cst-collect-token-types
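The three node shapes and the two basic traversals can be sketched as follows (a Python illustration of cst->string and cst-child-types, not the tree.scm code itself):

```python
# CST node shapes from tree.scm, modeled as Python tuples/strings:
#   nonterminal: (symbol, child, ...)
#   token:       ("*token*", name, text)
#   anonymous:   "text"

def cst_to_string(node):
    """Reconstruct the source text covered by a CST node."""
    if isinstance(node, str):             # anonymous string leaf
        return node
    if node[0] == "*token*":              # (*token* name text)
        return node[2]
    return "".join(cst_to_string(c) for c in node[1:])  # nonterminal

def cst_child_types(node):
    """Type symbol per child: nonterminal name, token name, or *anonymous*."""
    types = []
    for c in node[1:]:
        if isinstance(c, str):
            types.append("*anonymous*")
        elif c[0] == "*token*":
            types.append(c[1])
        else:
            types.append(c[0])
    return types
```

On the arithmetic example (expr (num "43") (*token* plus "+") (num "2")), cst_to_string yields "43+2" and cst_child_types yields (num plus num).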

PHOG Statistics (phog.scm)

  • extract-phog-stats builds two tables from a corpus:
    1. PHOG table: nonterminal → (context → ((production . count) ...))
    2. Pattern table: nonterminal → ((child-pattern . count) ...)
  • phog-lookup with Witten-Bell smoothing and 5-level backoff chain (full context → drop features right-to-left → global merge)
  • Per-nonterminal context functions via *context-fns* registry; default uses all 5 features
  • condp-static: weighted conde — sorts branches by descending weight, chains via mplus
  • condp-budgeted: like condp-static but chains with mplus-biased — proportional search budget allocation (weight/min-weight ratio per branch)
  • mplus-biased: budget-based interleaving — gives left stream k steps per 1 step of right, follows case-inf protocol (no mk.scm changes needed)
  • regex-weight-from-probs: extracts weight of a regex branch from a context-specific phog-lookup probability distribution
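One level of the Witten-Bell interpolation used by phog-lookup can be sketched like this (an illustrative re-derivation, not the phog.scm code — the real lookup chains five such levels of backoff):

```python
# Witten-Bell smoothing for one context level: interpolate the context's
# raw counts with the next backoff level's distribution, giving the backoff
# a weight proportional to T, the number of distinct productions seen here.

def witten_bell(counts, backoff):
    """counts: {production: count} observed in one context.
    backoff: {production: prob} from the next level of the backoff chain."""
    n = sum(counts.values())      # total observations in this context
    t = len(counts)               # distinct productions seen in this context
    if n == 0:
        return dict(backoff)      # unseen context: fall straight through
    return {p: (counts.get(p, 0) + t * backoff.get(p, 0.0)) / (n + t)
            for p in set(counts) | set(backoff)}
```

With counts {a: 3, b: 1} and a backoff distribution {a: 0.5, b: 0.25, c: 0.25}, the smoothed probabilities are a ≈ 0.667, b = 0.25, c ≈ 0.083 — unseen-in-context productions get nonzero mass, and the result still sums to 1.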

Grammar Learning Pipeline (phog.scm)

  • learn-grammar: end-to-end — registers token types as literal rules, runs bulk parallel search, then synthesizes each nonterminal
  • learn-single-rule: dispatches between empty nodes, token-like nodes, and nonterminals
  • Tiered parallel search (primary path for multi-pattern nonterminals):
    • bulk-search-rules launches all multi-pattern symbols in parallel
    • Tier 1: one engine-bounded thread per symbol (~2s fuel), finds simple types (document, pair) in ~1ms
    • Tier 2: or-parallel for tier-1 failures — 5 threads per symbol (one per top-level regex constructor: cat/alt/rep/opt/sym), first ground result wins
    • All threads use Chez make-engine for bounded computation — self-terminate after fuel exhaustion (no zombie threads, no memory leaks)
    • regex-ground? filter rejects results with unbound miniKanren variables (e.g., _.0)
    • Falls back to anti-unification when search times out or produces no ground result
  • Anti-unification (fallback for multi-pattern synthesis):
    • group-by-skeleton → transpose → anti-unify-column → anti-unify-group → anti-unify-patterns
    • O(n*m) deterministic — no search, no unbound variables, no overgeneralization
    • pattern-skeleton maps named symbols → *sym*, keeps *anonymous* — two patterns group iff same skeleton
    • Anonymous string inlining via collect-anonymous-texts-at-position + inline-anonymous-texts — scans ALL patterns in a skeleton group
    • Skeleton groups ordered by pattern-table frequency (most common first)
  • PHOG-ordered alt branches: phog-order-regex post-processes both search and anti-unified regexes, reordering alt children by marginal PHOG frequency
  • PHOG-weighted parsing via sym-matcho:
    • 7-arg signature: (sym-matcho regex input rest phog-table nonterminal pos lsib)
    • pos (child index) and lsib (left-sibling type) threaded through cat/rep branches via project
    • Alt branch uses condp-budgeted with context-specific PHOG weights (not just marginals): builds (nonterminal pos nonterminal #f lsib) context, calls phog-lookup for Witten-Bell-smoothed probabilities
    • Graceful degradation: unground vars or empty PHOG table → equal weights → same as old conde
    • full-sym-matcho via case-lambda: 2-arg (regex pat) or 4-arg (regex pat phog-table nonterminal) — passes 0 #f for pos/lsib
  • Character-level synthesis: leaf nonterminals generalize — learns (rep (set (48 . 57))) instead of memorizing (alt (lit "43") (lit "1") ...)
  • Anonymous string handling: anonymous string leaves (e.g., " = ") are inlined as (lit ...) in synthesized rules
  • verify-grammar: shallow check — did parse? consume the text?
  • verify-grammar-structural: deep check — parses via parseo + tree-reducer, compares parsed CST structure against original corpus CSTs
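The character-level generalization step above — learning (rep (set (48 . 57))) rather than memorizing literals — can be sketched as follows. This is a hypothetical helper (synth_char_rule is not a real function name); it only covers bytes actually observed, whereas the real synthesizer can also widen to named classes like digit:

```python
# Collapse a leaf nonterminal's observed texts into a byte-range rule
# (rep (set (lo . hi) ...)) instead of (alt (lit "43") (lit "1") ...).

def synth_char_rule(texts):
    bytes_seen = sorted({ord(ch) for t in texts for ch in t})
    ranges, start = [], None
    for b in bytes_seen:                  # merge adjacent bytes into ranges
        if start is None:
            start = prev = b
        elif b == prev + 1:
            prev = b
        else:
            ranges.append((start, prev))
            start = prev = b
    if start is not None:
        ranges.append((start, prev))
    return ("rep", ("set", *ranges))
```

Feeding it the texts "43", "1", "250" yields ("rep", ("set", (48, 53))) — one contiguous digit range rather than three memorized literals.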

JSON Corpus (run-json.scm, corpus-json.scm)

  • 81 CSTs from JSONTestSuite, 11 nonterminal types
  • Full pipeline: extract-phog-stats → learn-grammar → verify-grammar
  • Rule synthesis works: search finds rep-based rules for all complex types in <25s total:
    • document → (alt (sym array) (alt (sym object) ...))
    • pair → (cat (cat (sym string) (sym *anonymous*)) (alt (sym string) ...))
    • string → (rep (alt (sym *anonymous*) (alt (sym string_content) (sym escape_sequence))))
    • object → (rep (alt (sym *anonymous*) (sym pair)))
    • array → (rep (alt (sym *anonymous*) (alt (sym string) ...)))
  • Verification blocked: verify-grammar (character-level parseo) hangs on rep-of-alt rules — see Open Issues

Output Quality

The arithmetic corpus ((expr (num "43") (*token* plus "+") (num "2")) ...) produces:

minus → (lit -)
plus  → (lit +)
expr  → (alt (sym num) (cat (sym num) (cat (alt (sym plus) (sym minus)) (sym num))))
num   → (rep (set (48 . 57)))
  • No unbound variables (_.0), no overgeneralization
  • Parses all corpus examples + generalizes to unseen valid inputs (5+3, 10-1)
  • Rejects invalid inputs
  • PHOG ordering puts more frequent branches first

Test Coverage (266 total, all passing)

  • root.scm: 35 — regex constructors, matching, generation, synthesis, grammar rules
  • test-phog.scm: 174 — tree predicates, stats, Witten-Bell, condp-static, condp-budgeted, mplus-biased, regex-weight-from-probs, sym-matcho (weighted + unweighted + context-specific), position threading, anti-unification helpers (group-by-skeleton, anti-unify column/group/patterns), PHOG ordering, rule synthesis, grammar learning, verification, set primitive, char-level synthesis, anonymous strings, end-to-end quality
  • test-parseo.scm: 32 — identity/tree reducers, nested sym, matched text correctness
  • test-verify.scm: 25 — structural verification helpers, full pipeline structural checks

Problems Solved

(opt _.0) overgeneralization — FIXED

The old miniKanren-synthesized expr rule had (opt _.0) accepting any single character. Anti-unification produces exactly the least-general generalization — no unbound variables by construction.
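The column-wise construction can be sketched like this (an illustrative Python version; names are hypothetical, and it uses n-ary cat/alt where the Scheme regexes nest binary constructors):

```python
# Column-wise anti-unification over same-skeleton child-type patterns:
# a column where all patterns agree keeps its symbol; a disagreeing column
# becomes an alt over exactly the symbols observed — the least general
# generalization, with no unbound variables by construction.

def anti_unify(patterns):
    assert len({len(p) for p in patterns}) == 1   # same skeleton => same length
    out = []
    for column in zip(*patterns):                 # transpose: one column per slot
        syms = sorted(set(column))
        out.append(("sym", syms[0]) if len(syms) == 1
                   else ("alt", *[("sym", s) for s in syms]))
    return ("cat", *out) if len(out) > 1 else out[0]
```

Anti-unifying (num plus num) with (num minus num) yields cat(num, alt(minus, plus), num) — matching the expr rule shown under Output Quality, and never an (opt _.0)-style wildcard.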

PHOG weights dead — FIXED

Previously, PHOG statistics were computed but never reached synthesis decisions. Now wired in at three levels:

  1. phog-order-regex reorders alt branches in synthesized rules by marginal PHOG frequency
  2. sym-matcho uses condp-budgeted with context-specific PHOG weights at alt choice points during parsing
  3. mplus-biased gives proportionally more search budget to higher-weight branches (not just static reordering)

pattern-table unused — FIXED

Now drives frequency ordering of skeleton groups in anti-unify-patterns (most common pattern skeleton first in top-level alt).

Exponential synthesis — FIXED

Replaced the exponential-time miniKanren (run budget (regex) ...) search with O(n*m) deterministic anti-unification. Dead code (syntho, synth-weight, make-synth-ctx, etc.) cleaned up. Symbol-level search now available as the primary path via engine-bounded parallel tiered search.

Anti-unification structural misalignment — FIXED

group-by-length merged same-length patterns with different anonymous/symbol positions, causing column-wise anti-unification to produce broken regexes. Replaced with group-by-skeleton — groups only when anonymous positions align. anti-unify-column now scans all patterns in the group (not just the representative) for anonymous text collection.
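The grouping invariant can be sketched in Python (an illustration of pattern-skeleton and group-by-skeleton, not the Scheme code):

```python
# A pattern's skeleton maps every named symbol to *sym* but keeps *anonymous*
# in place, so two patterns land in the same group only when their anonymous
# positions align — same length alone is not enough.

def pattern_skeleton(pattern):
    return tuple("*anonymous*" if t == "*anonymous*" else "*sym*"
                 for t in pattern)

def group_by_skeleton(patterns):
    groups = {}
    for p in patterns:
        groups.setdefault(pattern_skeleton(p), []).append(p)
    return groups
```

(string *anonymous* number) and (string *anonymous* string) share a skeleton and are anti-unified column-wise; (*anonymous* pair) gets its own group even though a same-length named pattern might exist.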

Search integrated into learn-grammar — DONE

Or-parallel tiered search from try-search.scm is now the primary multi-pattern synthesis path in learn-grammar. Key engineering:

  • Bulk parallelism: all multi-pattern symbols searched simultaneously (not sequentially), total wall-time ~25s for JSON corpus
  • Engine-bounded threads: Chez make-engine wraps each thread's computation with a fuel limit. Threads self-terminate after exhausting ticks — eliminates the zombie thread / 27GB RAM leak that occurred with bare fork-thread
  • Ground filter: regex-ground? rejects search results containing unbound miniKanren variables, falling back to anti-unification
  • Graceful fallback: synth-rule-for-symbol is now case-lambda — takes optional search-results alist, uses anti-unification when no search result available

Context-specific PHOG in sym-matcho — DONE

sym-matcho now threads pos and lsib through the regex traversal. Alt branches build a 5-element context key and call phog-lookup for Witten-Bell-smoothed, context-specific probabilities. Falls back to marginal/uniform gracefully when context data is sparse.

Weighted interleaving — DONE (no mk.scm changes needed)

mplus-biased implements proportional budget allocation at the stream level, following the same case-inf protocol as mplus. condp-budgeted computes budget ratios (weight/min-weight) and chains branches with mplus-biased. Zero changes to mk.scm — the abstraction sits entirely in phog.scm.
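The budget idea can be illustrated with Python generators (a sketch only — the real mplus-biased operates on miniKanren streams via the case-inf protocol, not eager iterators):

```python
# Biased interleaving: take k items from the favored (higher-weight) stream
# for every 1 from the other, instead of strict 1:1 interleaving. In
# condp-budgeted, k is the branch's weight/min-weight ratio.

def mplus_biased(left, right, k):
    left, right = iter(left), iter(right)
    done_l = done_r = False
    while not (done_l and done_r):
        for _ in range(k):                 # k steps of the favored stream
            try:
                yield next(left)
            except StopIteration:
                done_l = True
                break
        if not done_r:
            try:
                yield next(right)          # one step of the other stream
            except StopIteration:
                done_r = True
```

With k=2 the left stream surfaces answers twice as fast, but the right stream is never starved — every round still yields one of its items, preserving completeness.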


Open Issues

1. Verification hangs with rep-based rules (blocking)

The search produces correct rep-of-alt rules (e.g., (rep (alt (sym *anonymous*) (sym pair))) for object), but verify-grammar — which uses character-level parseo — hangs. The root cause is exponential backtracking: rep tries every possible way to split the input into repetitions, and each split tries every alt branch, which in turn does character-level matching via matcho. A 17-character object took 44 seconds; longer inputs never terminate.

This is the critical blocker. The learned rules are correct at the symbol level but untestable at the character level.

Possible solutions:

  • Symbol-level verification: verify rules against child-type patterns via full-sym-matcho (symbol-level, not character-level). This bypasses parseo entirely and directly tests what the search actually operates on. Fast and correct, but doesn't validate character-level parsing end-to-end.
  • Bounded verification: wrap parseo calls in make-engine with a fuel limit. Report "inconclusive" instead of hanging. Simple but doesn't actually verify.
  • Fix *anonymous*: currently (rep (any)) — too permissive. Replace with corpus-observed literal texts (attempted; reverted per user request). This would make parseo tractable by eliminating the infinite loop in *anonymous* expansion, but requires careful handling of which texts to include.

2. *anonymous* rule is (rep (any)) — too permissive for parsing

Search-produced regexes reference (sym *anonymous*) for structural tokens (braces, commas, colons, whitespace). The catch-all (rep (any)) matches any string, which:

  • Makes character-level parseo exponentially slow (see issue #1)
  • Means the grammar doesn't validate punctuation — only structure

User explicitly deferred this: "let's not kill *anonymous* just yet."

3. Token type detection for non-*token* corpora

JSON corpus nodes like (true "true") aren't in (*token* name text) format, so cst-collect-token-types returns (). These get overly permissive char-level rules (e.g., (rep (set (97 . 122)))) instead of (lit "true"). Need a heuristic: nonterminals with a single *anonymous* child where all instances share the same text → treat as token.
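The proposed heuristic could look like this (a hypothetical sketch — tokenish is not an existing function):

```python
# A nonterminal whose every corpus instance is a single anonymous string
# with identical text behaves like a token: emit (lit text) for it instead
# of an over-permissive character-class rule.

def tokenish(instances):
    """instances: CST nodes for one nonterminal, e.g. ("true", "true")."""
    texts = set()
    for node in instances:
        if len(node) != 2 or not isinstance(node[1], str):
            return None               # not a single-anonymous-child node
        texts.add(node[1])
    return ("lit", texts.pop()) if len(texts) == 1 else None
```

(true "true") instances collapse to (lit "true"), while (number "43") / (number "1") correctly fail the shared-text check and fall through to character-level synthesis.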


Deferred Enhancements

Negative examples for disambiguation

Anti-unification makes negatives unnecessary for single-nonterminal synthesis (no overgeneralization by construction). Negatives remain useful for:

  • Ambiguity resolution: when two nonterminals have overlapping patterns, negatives from one can disambiguate the other
  • Held-out evaluation: split corpus, validate grammar on unseen examples

Performance

  • Alist performance: nested alists with assoc/equal?. Fine for toy corpora, needs hash tables for real ones.
  • Incremental learning: learn-grammar rebuilds from scratch each time. No incremental update path yet.

Cosmetic

  • Redundant alt branches: anti-unification produces (alt (lit x) (lit x)) when all patterns in a skeleton group agree on the same literal at a position. Cosmetic — doesn't affect correctness. (Only happens when anti-unification fallback is used.)