This repository implements the plan outlined in Project_Blueprint.md and
Synthetic_ARC_Task_Generator_Spec.md. The goal is to build a synthetic ARC
task generator, train JEPA representations, learn hierarchical option policies,
and evaluate out-of-distribution reasoning.
- `arcgen/` — Core grid utilities, primitive transformations, and program synthesizers for generating ARC-like tasks. Includes object extraction and relational helpers in `arcgen/objects.py`.
- `training/` — Training loops and experiment orchestration for JEPA, HRL, and Meta-JEPA components. Object tokenization and discrete latent modules live in `training/modules/`, object-centric JEPA helpers under `training/jepa/`, the typed DSL / enumerator scaffolding under `training/dsl/`, rollout mining + promotion utilities in `training/options/`, the few-shot solver pipeline in `training/solver/`, meta-rule clustering/training in `training/meta_jepa/`, and evaluation helpers in `training/eval/`.
- `envs/` — Environment wrappers exposing ARC-style tasks to RL agents.
- `configs/` — YAML configuration files for data generation, training, and evaluation runs.
- `scripts/` — Command-line entry points for dataset generation and training.
  - `generate_dataset.py` — Build synthetic manifests; supports curriculum schedules and `allowed_primitives` constraints.
  - `train_jepa.py` — Run manifest-backed JEPA pretraining (supports `--dry-run`, `--device`, and `--ddp`).
  - `train_meta_jepa.py` — Train the rule-family encoder on JSONL tasks with contrastive loss.
  - `train_guidance.py` — Fit the DSL neural guidance scorer on synthetic tasks, using a JEPA encoder for latent features.
  - `train_hierarchical.py` — Launch RLlib PPO over the latent option environment using configs under `configs/training/rl/`.
  - `evaluate_arc.py` — Run the evaluation/ablation suite and emit JSON metrics.
  - `run_jepa_ablation.py` — LeJEPA ablations (InfoNCE vs +VQ/relational/invariance/SIGReg) with JSON/Markdown summaries.
  - `prepare_program_triples.py` — Convert generator manifests into (input, program, output) triples for counterfactual JEPA.
  - `train_program_conditioned_jepa.py` — Train the program-conditioned JEPA for latent counterfactual prediction.
  - `train_active_reasoner.py` — Train the hypothesis-search policy on `HypothesisSearchEnv` (PPO-style updates using counterfactual JEPA latents).
  - `pretrain_bc.py` — Bootstrap the active reasoner with behavioral cloning traces before PPO fine-tuning.
- `tests/` — Unit and integration tests covering generators, models, and envs.
The default DSL registry (`training/dsl/primitives.py`) now spans topology and control-flow operators in addition to basic geometry:

- Topology: `flood_fill`, `connected_components`, `shape_bbox`, `shape_centroid`, and `shape_area` expose component-level reasoning hooks.
- Collections: `components_filter_by_*`, `components_map_to_subgrids`, and `components_fold_overlay` implement map/filter/fold combinators over the new `ShapeList`/`GridList` types.
- Logic: integer/shape-list predicates plus `if_then_else` allow conditional program branches.
Over ten regression tests live in `tests/test_dsl_primitives.py`, ensuring the few-shot solver enumerator can rely on the extended registry without regressions.
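To make the combinator semantics concrete, here is a toy, self-contained sketch of the map/filter/fold pattern the collections primitives implement; the `Shape` class and the plain comprehensions are illustrative stand-ins, not the registry's actual types or signatures.

```python
from dataclasses import dataclass

@dataclass
class Shape:
    """Illustrative stand-in for a connected component (not the registry type)."""
    cells: list[tuple[int, int]]

    @property
    def area(self) -> int:
        return len(self.cells)

shapes = [Shape([(0, 0)]), Shape([(1, 1), (1, 2), (2, 1)])]

big = [s for s in shapes if s.area >= 2]    # filter step (cf. components_filter_by_*)
areas = [s.area for s in big]               # map step (cf. components_map_to_subgrids)
total = sum(areas)                          # fold step (cf. components_fold_overlay)
result = total if total > 2 else 0          # conditional branch (cf. if_then_else)
```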
- Create a Python environment with uv: `uv venv --python 3.11 .venv && source .venv/bin/activate`.
- Install dependencies via uv: `uv pip install --python .venv/bin/python -r requirements.txt` (use `requirements-rl.txt` for RL/Ray runs; LeJEPA code is vendored locally, so no extra pip package is needed).
- Generate synthetic datasets (pick a recipe below) and start pretraining.
- Baseline sequential mix — `python scripts/generate_dataset.py --config configs/data/pilot.yaml`
- Curriculum (atomic + sequential) — `python scripts/generate_dataset.py --config configs/data/pilot_curriculum.yaml`
- OOD slice (large grids, constrained primitives) — `python scripts/generate_dataset.py --config configs/data/pilot_ood.yaml`
Each config writes a manifest and `summary.json` to its `output_root`. The generator now accepts `task_schedule` at the config root (or under `generator`) and an optional `generator.allowed_primitives` for structural constraints/outlier pockets.
Control program complexity directly inside the `program` block:

```yaml
program:
  max_depth: 4          # optional safety clamp for any phase
  length_schedule:
    sequential:         # phase-specific histogram (atomic/sequential or roman numerals)
      2: 0.5
      3: 0.3
      4: 0.2
    atomic:
      1: 1.0
```

Supplying a single mapping (e.g., `{1: 0.4, 2: 0.4, 3: 0.2}`) applies to every phase. Weights are normalised automatically and entries beyond `max_depth` are dropped. The generator stores the sampled `program_length` per task, and `summary.json` now reports both descriptive stats and a literal `program_length_histogram` to verify the curriculum.
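As a sketch of the documented semantics (not the generator's actual code), sampling a program length from one of these histograms amounts to:

```python
import random

def sample_program_length(histogram: dict[int, float], max_depth: int | None = None) -> int:
    """Drop entries beyond max_depth, renormalise, and sample one length."""
    kept = {length: w for length, w in histogram.items()
            if max_depth is None or length <= max_depth}
    total = sum(kept.values())
    lengths = list(kept)
    weights = [w / total for w in kept.values()]  # normalised automatically
    return random.choices(lengths, weights=weights, k=1)[0]

# e.g. the `sequential` phase above, clamped at max_depth=4
print(sample_program_length({2: 0.5, 3: 0.3, 4: 0.2}, max_depth=4))
```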
Evaluate a manifest to sanity-check solve rates and program counts:
```bash
PYTHONPATH=. .venv/bin/python scripts/evaluate_arc.py --tasks data/pilot_curriculum/manifest.jsonl --output artifacts/eval/pilot_curriculum.json
```
To benchmark against the official ARC dev (training) set, point the harness at the directory (or individual JSON file) that contains the canonical tasks:
```bash
PYTHONPATH=. .venv/bin/python scripts/evaluate_arc.py \
  --arc-dev-root /path/to/arc-dataset/training \
  --output artifacts/eval/arc_dev.json
```
The loader validates each ARC file, uses all provided train pairs as few-shot examples, and checks predictions against any available test outputs.
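For reference, canonical ARC task files are JSON with `train` and `test` lists of input/output grid pairs; a minimal loader (a sketch, not the harness's actual code) looks like:

```python
import json
from pathlib import Path

def load_arc_task(path: Path):
    """Read one canonical ARC task file: grids are lists of rows of ints 0-9."""
    task = json.loads(path.read_text())
    assert {"train", "test"} <= task.keys()
    few_shot = [(pair["input"], pair["output"]) for pair in task["train"]]
    tests = task["test"]  # each entry has 'input'; 'output' may be withheld
    return few_shot, tests
```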
The repo tracks a baseline evaluation against the official ARC-1 training split.
1. Download the dataset once (the official JSON bundle is open-source):

   ```bash
   git clone --depth 1 https://github.com/fchollet/ARC external/ARC
   ```

2. Run the evaluation harness against the training directory:

   ```bash
   PYTHONPATH=. .venv/bin/python scripts/evaluate_arc.py \
     --arc-dev-root external/ARC/data/training \
     --output artifacts/eval/arc_dev_baseline.json
   ```
Current baseline (DSL enumerator, `max_nodes=3`, no meta priors available for ARC-1):

| Variant | Success Rate | Avg. Programs Tested |
|---|---|---|
| `dsl_only` | 2.25% | 235.57 |
| `meta_guided`* | 2.25% | 235.57 |
*Meta-guided mode falls back to the vanilla registry on ARC-1 because no rule traces are provided to derive primitive histograms.
The JSON summary for reproducibility lives at `artifacts/eval/arc_dev_baseline.json`.
Use `torchrun` to launch multi-process JEPA training on a single node. Example tiny run:

```bash
PYTHONPATH=. torchrun --standalone --nproc_per_node=2 \
  .venv/bin/python scripts/train_jepa.py \
  --config configs/training/jepa_ddp_tiny.yaml \
  --ddp --ddp-backend gloo
```

Use the `nccl` backend on CUDA machines; set `--device cuda` or rely on `LOCAL_RANK` device assignment. The script skips checkpointing and logging on non-rank-0 workers.
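The per-worker setup that `torchrun` implies follows the standard PyTorch pattern; a minimal sketch (not the script's actual code) of device assignment and the rank-0 guard:

```python
import os
import torch
import torch.distributed as dist

def ddp_setup():
    """Pick the backend and device per worker; gate side effects on rank 0."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    is_main = dist.get_rank() == 0  # only this worker checkpoints/logs
    return device, is_main
```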
See CONTRIBUTING.md for environment setup (uv), testing expectations, Beads workflow,
and doc/style guidelines.
Long JEPA runs avoid Python tokenization overhead by precomputing object tokens once and streaming them from disk:
```bash
PYTHONPATH=. .venv/bin/python scripts/pretokenize_jepa.py \
  --config configs/training/jepa_pretrain.yaml \
  --output artifacts/tokenized/pilot_curriculum
```

This reads the manifest/config defaults, writes sharded `.pt` tensors plus a `metadata.json` descriptor, and preserves per-sample metadata in each shard. Point the trainer at the tokenized directory to enable the zero-tokenization path (use the same tokenizer and `data.context_window` you precomputed with):

```yaml
# configs/training/jepa_pretrain.yaml
pre_tokenized:
  path: artifacts/tokenized/pilot_curriculum
```

`scripts/train_jepa.py` automatically switches to the new `TokenizedPairDataset` when `pre_tokenized.path` is set; otherwise it falls back to manifest-time tokenization.
Need to sanity-check tokenizer throughput? Run the micro-benchmark:
```bash
PYTHONPATH=. .venv/bin/python scripts/benchmark_tokenizer.py --samples 512 --height 24 --width 24 --respect-colors --connectivity 8
```

The script times both the legacy path and the vectorized implementation and validates numerical parity. For heavy ARC grids (120×120, 1024 object slots, 512 color features) the command below reports a ~3.1× speedup on a CPU-only run:

```bash
PYTHONPATH=. .venv/bin/python scripts/benchmark_tokenizer.py \
  --samples 16 --height 120 --width 120 \
  --respect-colors --connectivity 8 \
  --max-objects 1024 --max-color-features 512
```
Run full JEPA training against any manifest:
```bash
PYTHONPATH=. .venv/bin/python scripts/train_jepa.py --config configs/training/jepa_pretrain.yaml --device cpu
```

Pass `--device cuda` on GPU boxes. Checkpoints and `metrics.json` land in `artifacts/jepa/pretrain/` (configurable via `training.checkpoint_dir`). Use `--dry-run` for a single dummy optimisation step.
For full A6000 runs, start from `configs/training/jepa_pretrain_gpu.yaml`; it sets a larger batch, a longer schedule, and defaults to TensorBoard logging under `artifacts/jepa/pretrain_gpu/tensorboard/`.
Mixed precision is controlled via `training.amp` in the config. Set it to `true` on CUDA hosts to enable `torch.cuda.amp` autocast + `GradScaler`; the loop automatically reverts to FP32 if CUDA/AMP is unavailable (or when forcing `--device cpu`).
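A minimal sketch of the standard autocast + GradScaler pattern that `training.amp` toggles (the trainer's actual loop differs in detail); with `enabled=False` the same code degrades to plain FP32:

```python
import torch

def make_amp_step(device: torch.device, amp: bool):
    """Build a training-step closure; AMP is active only on CUDA with amp=True."""
    enabled = amp and device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=enabled)  # no-op when disabled

    def step(model, batch, optimizer):
        optimizer.zero_grad()
        with torch.autocast(device_type=device.type, enabled=enabled):
            loss = model(batch)            # forward pass under autocast
        scaler.scale(loss).backward()      # scaled backward (identity if disabled)
        scaler.step(optimizer)
        scaler.update()
        return loss.detach()

    return step
```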
BYOL-style target encoder: set `loss.use_target_encoder=true` and `loss.target_ema_decay` (a value in `(0, 1]`) in the JEPA config to enable a stop-gradient EMA copy of the encoder/projection head. The training loop automatically keeps the target network in sync via EMA updates and routes the contrastive loss through the stabilised branch.
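The EMA rule that `loss.target_ema_decay` controls is the standard one; a sketch (the real update lives in the JEPA trainer):

```python
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, decay: float):
    """target <- decay * target + (1 - decay) * online, with no gradient flow."""
    for p_tgt, p_onl in zip(target.parameters(), online.parameters()):
        p_tgt.mul_(decay).add_(p_onl, alpha=1.0 - decay)
```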
Validation splits & early stopping: set `data.validation.split` (alias `split_ratio`) to carve off a deterministic validation subset from the manifest/tokenized dataset. Optional keys include `batch_size` and `seed` for per-split loader overrides. Pair it with a `training.early_stopping` block to halt runs when the validation InfoNCE loss stops improving:

```yaml
data:
  validation:
    split: 0.1        # reserve 10% of samples for validation
    batch_size: 128   # optional override
    seed: 17          # deterministic split seed
training:
  early_stopping:
    enabled: true
    patience: 4
    min_delta: 1.0e-3
```

`train_jepa.py` logs `val/loss` to TensorBoard when enabled, includes validation curves in `metrics.json`, and stops once the patience budget is exhausted. Early stopping requires a validation split.
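The `patience`/`min_delta` semantics are the usual ones; a sketch (not the trainer's exact bookkeeping):

```python
class EarlyStopping:
    """Stop when validation loss fails to improve by min_delta for patience epochs."""

    def __init__(self, patience: int = 4, min_delta: float = 1e-3):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset budget
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```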
SIGReg + embedding diagnostics: set `sigreg.weight > 0` to enable the LeJEPA isotropic Gaussian penalty (implementation vendored under `lejepa/`). Toggle `diagnostics.embedding_metrics.enabled=true` to log isotropy, codebook usage, and rank stats during training. Both options are available in JEPA configs and are exercised in `scripts/run_jepa_ablation.py`.
Use `scripts/validate_jepa_correlation.py` to quantify how predictive the JEPA validation loss is for downstream solving success. The script:

- Loads each checkpoint and evaluates the InfoNCE loss on a validation manifest.
- Builds a latent-distance-guided solver that ranks DSL programs by how closely their JEPA embeddings match the target embeddings.
- Runs the solver on either a synthetic JSONL manifest (`--tasks`) or the ARC dev set (`--arc-dev-root`).
- Logs per-checkpoint metrics plus the Pearson correlation between loss and solve rate.
Example (tiny smoke test):
```bash
PYTHONPATH=. .venv/bin/python scripts/validate_jepa_correlation.py \
  --jepa-config configs/training/jepa_tiny.yaml \
  --checkpoints artifacts/jepa/tiny_run/checkpoint_epoch_0001.pt \
    artifacts/jepa/tiny_run/checkpoint_epoch_0002.pt \
    artifacts/jepa/tiny_run/checkpoint_epoch_0003.pt \
  --val-manifest data/tiny_manifest.jsonl \
  --tasks data/ood_surprise_tasks.jsonl \
  --output artifacts/eval/jepa_correlation_demo.json
```
A strong negative correlation (e.g., ≤ −0.7) means lower JEPA loss predicts higher solve rates and can be used for model selection without running the full solver suite. Weak or positive correlations signal that the current JEPA objective is not aligned with downstream performance and warrant investigation (data quality, augmentations, projection head capacity, etc.).
Each run emits a JSON summary (see `artifacts/eval/jepa_correlation_demo.json` for a sample) that records checkpoint paths, validation loss, solver success rate, average programs evaluated, and the overall correlation statistic.
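Reproducing the correlation statistic from such a summary is a one-liner; the numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical per-checkpoint values read from a correlation summary JSON.
val_losses = np.array([2.31, 1.87, 1.52])
solve_rates = np.array([0.04, 0.09, 0.15])

r = np.corrcoef(val_losses, solve_rates)[0, 1]
print(f"Pearson r = {r:.2f}")  # strongly negative: lower loss, higher solve rate
```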
Quick path to the RL-driven hypothesis search stack:

- Generate program triples from a synthetic manifest: `python scripts/prepare_program_triples.py --manifest data/pilot_curriculum/manifest.jsonl --output data/program_triples.jsonl`.
- Train the program-conditioned JEPA on those triples: `python scripts/train_program_conditioned_jepa.py --config configs/training/jepa_program_conditioned.yaml` (set `sigreg.weight` if you want LeJEPA regularization here too).
- (Optional) Bootstrap RL with behavioral cloning traces: `python scripts/pretrain_bc.py --config configs/training/rl/active_reasoner_bc.yaml`.
- Train the Active Reasoner policy in `HypothesisSearchEnv`: `python scripts/train_active_reasoner.py --config configs/training/rl/active_reasoner.yaml`.
`HypothesisSearchEnv` consumes JEPA latents and program traces instead of raw grids, so the triple dataset and program-conditioned JEPA are required inputs. Reward shaping and curriculum are configured inside the RL YAML (see the `HypothesisRewardConfig` keys in `training/reasoner/hypothesis_env.py`).
`scripts/train_meta_jepa.py` now accepts `--val-split` (fraction of rule families reserved for validation), `--split-seed`, and `--early-stopping-*` CLI flags. Example:

```bash
PYTHONPATH=. .venv/bin/python scripts/train_meta_jepa.py \
  --tasks artifacts/synthetic/pilot.jsonl \
  --epochs 30 --batch-size 64 --lr 5e-4 \
  --val-split 0.15 --split-seed 13 \
  --early-stopping-patience 5 --early-stopping-min-delta 5e-4
```
The trainer reports both training/validation losses per epoch and stops once validation does not improve within the configured patience window. Early stopping requires a non-zero validation split.
`scripts/train_guidance.py` honours the `train.validation_split`, `train.split_seed`, and `train.early_stopping` keys in the DSL guidance YAML file:
```yaml
train:
  epochs: 25
  batch_size: 64
  validation_split: 0.2
  split_seed: 21
  early_stopping:
    enabled: true
    patience: 6
    min_delta: 5e-4
```
The script automatically builds train/validation DataLoaders, prints the validation loss per epoch, and short-circuits when the patience budget expires.
The RL stack uses RLlib. Install the optional dependency first:
```bash
pip install "ray[rllib]"
```
Then launch PPO on the latent option environment:
```bash
PYTHONPATH=. .venv/bin/python scripts/train_hierarchical.py \
  --config configs/training/rl/ppo_latent_env.yaml \
  --output-dir artifacts/rl/ppo_run --stop-iters 2
```
The sample config wires RLlib into `ArcLatentOptionEnv`, loads JEPA encoder settings via `jepa_config_path`, and exposes knobs for reward shaping, curriculum schedules, and PPO hyper-parameters. Metrics plus the final checkpoint path are stored under the chosen output directory.
- Collect rollouts (random or trained policies) with `scripts/rollout_latent_env.py --env-config configs/training/rl/latent_rollout_env.yaml --jepa-config <jepa.yaml> --episodes 100 --output data/latent_option_traces.jsonl`. Each step now records the exact `grid_before`/`grid_after` pairs required for replay.
- Extract candidate macros via `python scripts/discover_options.py --env-config configs/training/rl/latent_rollout_env.yaml --traces data/latent_option_traces.jsonl --output artifacts/options/discovered.json --min-support 3 --max-length 3`.
- (Optional) Pass `--promote` to validate registration in the DSL registry; the JSON summary lists each auto-named primitive and its underlying option sequence so you can wire them into future solver runs.
`training/options/traces.py` exposes `load_option_episodes_from_traces(...)` if you want to plug the mined episodes directly into the few-shot solver or custom discovery logic.
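For custom discovery logic, the support-based mining behind `--min-support`/`--max-length` boils down to counting contiguous action n-grams; a toy, self-contained sketch (the real implementation works over full latent option traces):

```python
from collections import Counter

def mine_macros(episodes: list[list[int]], min_support: int = 3, max_length: int = 3):
    """Count contiguous action n-grams across episodes; keep the frequent ones."""
    counts: Counter = Counter()
    for actions in episodes:
        for n in range(2, max_length + 1):
            for i in range(len(actions) - n + 1):
                counts[tuple(actions[i:i + n])] += 1
    return [seq for seq, c in counts.items() if c >= min_support]

print(mine_macros([[1, 2, 3, 1, 2], [1, 2, 4], [5, 1, 2]], max_length=2))
# [(1, 2)]: the only bigram seen at least three times
```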
This project is under active development. Consult the blueprint documents for the full roadmap and open Beads issues for next steps.
- Python deps are tracked via `requirements*.txt` files in the repo root. Use `uv venv --python 3.11 .venv` followed by `uv pip install --python .venv/bin/python -r requirements.txt`.
- Install optional extras as needed: `requirements-dev.txt` for test tooling (pytest); `requirements-rl.txt` for RLlib-based PPO/A2C training. LeJEPA utilities are vendored under the `lejepa/` directory (no separate pip install required).
- `scipy>=1.11` is included in `requirements.txt` to power the vectorized object tokenization path; when it is missing the code falls back to the legacy Python implementation (see the sketch below).
- See `docs/DEPENDENCIES.md` for the full process, package descriptions, and contribution guidelines.
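The fallback follows the usual try-import pattern; a self-contained sketch (the use of `scipy.ndimage.label` here is an assumption; the repo's actual guard lives in the tokenizer module):

```python
import numpy as np

try:
    from scipy import ndimage  # vectorized connected components (assumed entry point)

    def label_components(mask: np.ndarray) -> np.ndarray:
        return ndimage.label(mask)[0]
except ImportError:
    def label_components(mask: np.ndarray) -> np.ndarray:
        """Legacy pure-Python fallback: 4-connected flood fill."""
        labels = np.zeros(mask.shape, dtype=int)
        current = 0
        for start in zip(*np.nonzero(mask)):
            if labels[start]:
                continue
            current += 1
            stack = [start]
            while stack:
                r, c = stack.pop()
                if not (0 <= r < mask.shape[0] and 0 <= c < mask.shape[1]):
                    continue
                if not mask[r, c] or labels[r, c]:
                    continue
                labels[r, c] = current
                stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        return labels
```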
Architecture choices are tracked in `docs/adr/`.

- `0001-rl-engine.md` — RLlib selected for hierarchical option training.
- `0002-jepa-objective.md` — JEPA uses multi-step InfoNCE with a memory queue and a specified augmentation policy.