
Commit a20b028 (1 parent: 02c2d34)

docs: add eval runner handoff note [S:DOCS v1] handoff note pass

File tree: 6 files changed, +321 −0 lines

AGENTS.md

Lines changed: 10 additions & 0 deletions
@@ -500,6 +500,16 @@ class MetaCognition:
 Test Result: pytest tests/test_solve_with_budget_memory.py -q
 Notes: solve_with_budget catches MemoryError, reports memerror count, runs gc per task

+[X] Step 4.3 UPDATE5 - Public eval runner and Makefile convenience added
+Date: 2025-09-13
+Test Result: pytest tests/test_eval_public_script.py -q
+Notes: Chunked evaluation with memory guards
+
+[X] Step 4.3 UPDATE6 - Public eval runner handoff documented
+Date: 2025-09-14
+Test Result: pytest tests/test_eval_public_script.py -q
+Notes: Added HANDOFF.md with runbook
+
 ```

HANDOFF.md

Lines changed: 100 additions & 0 deletions
Public Eval Runner — Handoff

Overview

Adds a one-command public evaluation runner that executes the solver across the ARC public evaluation set in chunks, writes a valid submission JSON, and (optionally) scores it against the public solutions if they are available.

• Primary entrypoint: scripts/eval_public.sh
• Optional convenience: make eval_public
• Memory-safe defaults (thread caps, float32 guidance via sitecustomize.py, process-friendly allocator env)
• Logs and artifacts are created deterministically
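The chunking strategy is deliberately simple: sort the task ids, then slice the list into fixed-size batches. A minimal sketch (the ids below are made-up stand-ins for the eval set):

```python
# Split sorted task ids into fixed-size chunks, as the runner does.
# BATCH and the synthetic ids are illustrative, not the script's data.
BATCH = 50
ids = sorted(f"task_{i:03d}" for i in range(120))
chunks = [ids[i:i + BATCH] for i in range(0, len(ids), BATCH)]
print(len(chunks), [len(c) for c in chunks])  # 3 [50, 50, 20]
```

Sorting first keeps chunk membership deterministic across runs, which is what makes the logs and artifacts reproducible.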
Assumptions

• data/arc-agi_evaluation_challenges.json exists (required to run).
• data/arc-agi_evaluation_solutions.json exists only if local scoring is desired.
• The repository root is the working directory when running the script or the Makefile target.

Files Added

• scripts/eval_public.sh – chunked runner plus optional scoring.
• (On first run) sitecustomize.py is created at the repo root to enforce float32 guidance init and thread caps.

Note: No changes to solver code are required for this runner.
How to Run

Direct (bash):

bash scripts/eval_public.sh

With knobs:

BATCH=40 OUT=submission/eval_public.json bash scripts/eval_public.sh

Makefile (optional):

make eval_public                                   # uses defaults
make eval_public BATCH=40 OUT=submission/eval_public.json

Outputs & Logs

• Submission JSON: ${OUT} (default: submission/full_submission.json)
• Console prints per-chunk progress:

[chunk 1/8] solved 50/50 in 27.3s
...
Wrote submission/full_submission.json with 400 items in 214.8s

• If data/arc-agi_evaluation_solutions.json is present, prints:

EVAL SCORE (public): N/M = P%
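The score is exact-match over the task ids present in both the submission and the solutions. A hedged sketch of that comparison (the grids below are toy data, not real ARC tasks):

```python
# Exact-match scoring over the intersection of predicted and
# ground-truth task ids; pred/gt contents here are illustrative.
def score(pred, gt):
    ids = sorted(set(pred) & set(gt))
    ok = sum(1 for tid in ids if pred[tid] == gt[tid])
    return ok, len(ids)

pred = {"a": [[[1, 0], [0, 1]]], "b": [[[2]]]}
gt = {"a": [[[1, 0], [0, 1]]], "b": [[[3]]], "c": [[[4]]]}
ok, total = score(pred, gt)
print(f"EVAL SCORE (public): {ok}/{total} = {ok / total * 100:.2f}%")
```

Note that tasks missing from either side ("c" above) are simply excluded from the denominator rather than counted as failures.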
Feature Flags / Config

• BATCH (default 50): number of tasks per chunk; lower it if memory is tight.
• OUT (default submission/full_submission.json): path to the final submission.
• Env vars set by the script for stability:
  • PYTHONUNBUFFERED=1, PYTHONMALLOC=malloc, MALLOC_ARENA_MAX=2
  • OPENBLAS_NUM_THREADS=1, OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, NUMEXPR_NUM_THREADS=1
• sitecustomize.py (auto-created if missing) defaults NumPy standard-normal draws to float32 and trims malloc arenas on exit.
Rollback

• Remove scripts/eval_public.sh and any Makefile target that references it.
• Delete sitecustomize.py if you don't want repo-local Python customization.
• No other files are touched.

Acceptance Criteria

• AC-1: scripts/eval_public.sh completes on Colab/T4 or A100 with the default BATCH=50 without OOM.
• AC-2: Produces a submission with one entry per eval task at ${OUT}.
• AC-3: If public solutions exist, prints a final score line EVAL SCORE (public): N/M = P%.
• AC-4: Per-chunk progress is visible on stdout.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| MemoryError / process killed mid-run | Search/guidance RAM spike | Lower BATCH (e.g., 30); ensure sitecustomize.py exists (the script auto-creates it); keep thread envs at 1 |
| Hanging with no output | Buffered child output | The runner uses unbuffered flags; re-run the script (don't call the solver directly) |
| SystemError: ufunc equal | Huge boolean temporary in an equality check | Harmless as a one-off; usually disappears when BATCH is lowered; longer-term fix: bytewise equality in the solver |
| Submission has fewer items than expected | Exception while solving a task | The runner records empty outputs on error to keep the submission shape; inspect solver logs for specific failures |
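The shape-keeping behavior in the last row can be sketched as follows (`solve_one` and the task dicts are hypothetical stand-ins for the repo's solver API):

```python
# Record an empty output on failure so the submission keeps exactly
# one entry per task; solve_one and tasks are illustrative stand-ins.
def run(tasks, solve_one):
    preds = []
    for tid, task in tasks.items():
        try:
            preds.append({"task_id": tid, "outputs": solve_one(task)})
        except Exception:
            preds.append({"task_id": tid, "outputs": []})
    return preds

def solve_one(task):
    if task.get("bad"):
        raise ValueError("simulated solver failure")
    return [[[0]]]

preds = run({"t1": {}, "t2": {"bad": True}}, solve_one)
print([p["outputs"] for p in preds])  # [[[[0]]], []]
```

An empty `outputs` list scores zero for that task but keeps the submission valid, which is why a short submission usually points at per-task exceptions rather than a runner bug.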
Security / Cleanroom Notes

• The runner does not train or tune on evaluation solutions.
• Local scoring (if solutions are present) is a single post-run sanity check; avoid iterative tuning on eval to preserve generalization.

Next Steps (separate PRs)

• P0: guidance reuse per process; op pre-validation (translate, pad) to avoid exceptions.
• P1: symmetry-canonical hashing & invariant filters.
• P2: hard-negative mining for guidance retrain.

Quick Sanity Commands

# default run
bash scripts/eval_public.sh

# tighter memory profile
BATCH=30 bash scripts/eval_public.sh

# custom output path
OUT=submission/public_eval_YYYYMMDD.json bash scripts/eval_public.sh

That's it: this gives Codex and the Colab teammate a reliable "press go" for public eval, plus a clean rollback and clear acceptance criteria.

Makefile

Lines changed: 21 additions & 0 deletions
PY ?= python3
OUT ?= submission/full_submission.json
BATCH ?= 50

.PHONY: deps train submit eval_public

deps:
	$(PY) -m pip install -r requirements.txt

train:
	$(PY) -u tools/build_memory.py --train_json data/arc-agi_training_challenges.json
	$(PY) -u tools/train_guidance_on_arc.py \
		--train-challenges data/arc-agi_training_challenges.json \
		--train-solutions data/arc-agi_training_solutions.json \
		--out neural_guidance_model.json

submit:
	$(PY) -u arc_submit.py --out $(OUT)

eval_public:
	BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh

README.md

Lines changed: 12 additions & 0 deletions
@@ -85,6 +85,18 @@ solver = ARCSolver(use_enhancements=True)
 result = solver.solve_task(task)
 ```

+### 4. Public Evaluation Runner
+
+```bash
+bash scripts/eval_public.sh
+```
+
+Or via Makefile:
+
+```bash
+make eval_public
+```
+
 ## How It Works

 ### Enhanced Pipeline

scripts/eval_public.sh

Lines changed: 141 additions & 0 deletions
#!/usr/bin/env bash
# [S:ALG v1] runner=chunked_public_eval pass
set -euo pipefail

ROOT="${ROOT:-$(pwd)}"
PY="${PY:-python3}"
BATCH="${BATCH:-50}"   # tasks per chunk (tune if memory is tight)
OUT="${OUT:-submission/full_submission.json}"
LOGDIR="$ROOT/runlogs"
mkdir -p "$LOGDIR" "$(dirname "$OUT")"

# Memory-friendly defaults
export PYTHONUNBUFFERED=1 PYTHONMALLOC=malloc MALLOC_ARENA_MAX=2
export OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1

# 1) Ensure sitecustomize.py exists (float32 default + malloc trim)
if [[ ! -f "$ROOT/sitecustomize.py" ]]; then
cat > "$ROOT/sitecustomize.py" <<'PY'
import os, atexit, ctypes, numpy as np
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("NUMEXPR_NUM_THREADS", "1")
_orig = np.random.Generator.standard_normal
def _stdnorm(self, size=None, dtype=np.float32, out=None):  # default float32
    return _orig(self, size=size, dtype=dtype, out=out)
np.random.Generator.standard_normal = _stdnorm
try:
    libc = ctypes.CDLL("libc.so.6")
    atexit.register(lambda: libc.malloc_trim(0))
except Exception:
    pass
PY
fi

# 2) Chunked submission using a pure-Python runner (no --only flag required)
"$PY" - "$BATCH" "$OUT" <<'PY'
import json, os, sys, time
from pathlib import Path

BATCH = int(sys.argv[1])
OUT = sys.argv[2]
ROOT = Path(os.getcwd())
sys.path.append(str(ROOT))
from arc_solver.solver import solve_task  # repo API

def loadj(p):
    with open(p, "r") as f:
        return json.load(f)

eval_ch = loadj(ROOT / "data/arc-agi_evaluation_challenges.json")

# Build {task_id: task_obj} from either a list or a dict layout
E = {}
if isinstance(eval_ch, list):
    for it in eval_ch:
        tid = it.get("task_id") or it.get("id")
        if tid is not None:
            E[str(tid)] = it
elif isinstance(eval_ch, dict):
    for k, v in eval_ch.items():
        E[str(k)] = v

ids = sorted(E.keys())
chunks = [ids[i:i + BATCH] for i in range(0, len(ids), BATCH)]
all_preds = []
start = time.time()

for ci, chunk in enumerate(chunks, 1):
    t0 = time.time()
    ok = 0
    for tid in chunk:
        task = E[tid]
        try:
            pred = solve_task(task)  # returns list-of-test-grids (or a single grid)
            # single 2D grid (rows of numbers) -> wrap into a list of grids
            if pred and isinstance(pred[0], (list, tuple)) and pred[0] \
                    and isinstance(pred[0][0], (int, float)):
                pred = [pred]
            all_preds.append({"task_id": tid, "outputs": pred})
            ok += 1
        except Exception:
            # record empty prediction on error to keep submission shape stable
            all_preds.append({"task_id": tid, "outputs": []})
    dt = time.time() - t0
    print(f"[chunk {ci}/{len(chunks)}] solved {ok}/{len(chunk)} in {dt:.1f}s", flush=True)

# Write final submission
os.makedirs(os.path.dirname(OUT), exist_ok=True)
with open(OUT, "w") as f:
    json.dump(all_preds, f)
print(f"Wrote {OUT} with {len(all_preds)} items in {time.time() - start:.1f}s", flush=True)
PY

# 3) Score against public eval solutions (if present)
if [[ -f data/arc-agi_evaluation_solutions.json ]]; then
"$PY" - "$OUT" <<'PY'
import json, sys

sub = json.load(open(sys.argv[1]))
sol = json.load(open("data/arc-agi_evaluation_solutions.json"))

def norm(grids):
    # wrap a single 2D grid (rows of numbers) into a list of grids
    if grids and isinstance(grids[0], (list, tuple)) and grids[0] \
            and isinstance(grids[0][0], (int, float)):
        grids = [grids]
    return grids

pred = {}
if isinstance(sub, list):
    for it in sub:
        tid = it.get("task_id") or it.get("id")
        out = it.get("outputs") or it.get("output")
        if tid is not None and out is not None:
            pred[str(tid)] = norm(out)

gt = {}
if isinstance(sol, list):
    for it in sol:
        tid = it.get("task_id") or it.get("id")
        out = it.get("solutions") or it.get("outputs") or it.get("solution")
        if tid is not None and out is not None:
            gt[str(tid)] = norm(out)
elif isinstance(sol, dict):
    for k, v in sol.items():
        gt[str(k)] = norm(v)

ids = sorted(set(pred) & set(gt))
ok = sum(1 for tid in ids if pred[tid] == gt[tid])
total = len(ids)
pct = (ok / total * 100.0) if total else 0.0
print(f"EVAL SCORE (public): {ok}/{total} = {pct:.2f}%")
PY
else
echo "Note: public solutions not found at data/arc-agi_evaluation_solutions.json; skipping score."
fi

echo "Full submission at: $OUT"

tests/test_eval_public_script.py

Lines changed: 37 additions & 0 deletions
# [S:TEST v1] eval_public_script pass
import json
import os
import shutil
import subprocess
from pathlib import Path


def test_eval_public_script_runs(tmp_path):
    repo_root = Path(__file__).resolve().parents[1]
    data_file = repo_root / "data/arc-agi_evaluation_challenges.json"
    backup = tmp_path / "arc-agi_evaluation_challenges.json.bak"
    shutil.copy(data_file, backup)
    try:
        # Shrink the eval set to a single task so the run is fast
        with open(data_file) as f:
            all_data = json.load(f)
        first_id = next(iter(all_data))
        minimal = {first_id: all_data[first_id]}
        with open(data_file, "w") as f:
            json.dump(minimal, f)
        env = os.environ.copy()
        env["BATCH"] = "1"
        env["OUT"] = "submission/full_submission.json"
        subprocess.run(
            ["bash", str(repo_root / "scripts/eval_public.sh")],
            cwd=repo_root, check=True, env=env,
        )
        out_file = repo_root / env["OUT"]
        assert out_file.exists()
        with open(out_file) as f:
            sub = json.load(f)
        assert len(sub) == 1
    finally:
        # Restore the full eval set and clean up artifacts
        shutil.move(str(backup), data_file)
        site = repo_root / "sitecustomize.py"
        if site.exists():
            site.unlink()
        out_file = repo_root / "submission/full_submission.json"
        if out_file.exists():
            out_file.unlink()
