
Commit a20b028 (1 parent: 02c2d34)

docs: add eval runner handoff note [S:DOCS v1] handoff note pass

File tree: 6 files changed, +321 −0 lines

AGENTS.md

Lines changed: 10 additions & 0 deletions
@@ -500,6 +500,16 @@ class MetaCognition:
 Test Result: pytest tests/test_solve_with_budget_memory.py -q
 Notes: solve_with_budget catches MemoryError, reports memerror count, runs gc per task

+[X] Step 4.3 UPDATE5 - Public eval runner and Makefile convenience added
+Date: 2025-09-13
+Test Result: pytest tests/test_eval_public_script.py -q
+Notes: Chunked evaluation with memory guards
+
+[X] Step 4.3 UPDATE6 - Public eval runner handoff documented
+Date: 2025-09-14
+Test Result: pytest tests/test_eval_public_script.py -q
+Notes: Added HANDOFF.md with runbook
+
 ```

HANDOFF.md

Lines changed: 100 additions & 0 deletions
Public Eval Runner — Handoff

Overview

Adds a one-command public evaluation runner that executes the solver across the ARC public evaluation set in chunks, writes a valid submission JSON, and (optionally) scores it against the public solutions if they are available.

• Primary entrypoint: scripts/eval_public.sh
• Optional convenience: make eval_public
• Memory-safe defaults (thread caps, float32 guidance via sitecustomize.py, process-friendly allocator env)
• Logs and artifacts are created deterministically
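The chunking strategy is deliberately simple: sort the task ids, then slice the list into fixed-size batches. A minimal sketch (the ids below are made-up stand-ins for the eval set):

```python
# Split sorted task ids into fixed-size chunks, as the runner does.
# BATCH and the synthetic ids are illustrative, not the script's data.
BATCH = 50
ids = sorted(f"task_{i:03d}" for i in range(120))
chunks = [ids[i:i + BATCH] for i in range(0, len(ids), BATCH)]
print(len(chunks), [len(c) for c in chunks])  # 3 [50, 50, 20]
```

Sorting first keeps chunk membership deterministic across runs, which is what makes the logs and artifacts reproducible.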
Assumptions

• data/arc-agi_evaluation_challenges.json exists (required to run).
• data/arc-agi_evaluation_solutions.json exists only if local scoring is desired.
• The repository root is the working directory when running the script or the Makefile target.

Files Added

• scripts/eval_public.sh – chunked runner plus optional scoring.
• (On first run) sitecustomize.py is created at the repo root to enforce float32 guidance init and thread caps.

Note: No changes to solver code are required for this runner.
How to Run

Direct (bash):

bash scripts/eval_public.sh

With knobs:

BATCH=40 OUT=submission/eval_public.json bash scripts/eval_public.sh

Makefile (optional):

make eval_public                                   # uses defaults
make eval_public BATCH=40 OUT=submission/eval_public.json

Outputs & Logs

• Submission JSON: ${OUT} (default: submission/full_submission.json)
• Console prints per-chunk progress:

[chunk 1/8] solved 50/50 in 27.3s
...
Wrote submission/full_submission.json with 400 items in 214.8s

• If data/arc-agi_evaluation_solutions.json is present, prints:

EVAL SCORE (public): N/M = P%
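The score is exact-match over the task ids present in both the submission and the solutions. A hedged sketch of that comparison (the grids below are toy data, not real ARC tasks):

```python
# Exact-match scoring over the intersection of predicted and
# ground-truth task ids; pred/gt contents here are illustrative.
def score(pred, gt):
    ids = sorted(set(pred) & set(gt))
    ok = sum(1 for tid in ids if pred[tid] == gt[tid])
    return ok, len(ids)

pred = {"a": [[[1, 0], [0, 1]]], "b": [[[2]]]}
gt = {"a": [[[1, 0], [0, 1]]], "b": [[[3]]], "c": [[[4]]]}
ok, total = score(pred, gt)
print(f"EVAL SCORE (public): {ok}/{total} = {ok / total * 100:.2f}%")
```

Note that tasks missing from either side ("c" above) are simply excluded from the denominator rather than counted as failures.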
Feature Flags / Config

• BATCH (default 50): number of tasks per chunk; lower it if memory is tight.
• OUT (default submission/full_submission.json): path to the final submission.
• Env vars set by the script for stability:
  • PYTHONUNBUFFERED=1, PYTHONMALLOC=malloc, MALLOC_ARENA_MAX=2
  • OPENBLAS_NUM_THREADS=1, OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, NUMEXPR_NUM_THREADS=1
• sitecustomize.py (auto-created if missing) defaults NumPy standard-normal draws to float32 and trims malloc arenas on exit.
Rollback

• Remove scripts/eval_public.sh and any Makefile target that references it.
• Delete sitecustomize.py if you don't want repo-local Python customization.
• No other files are touched.

Acceptance Criteria

• AC-1: scripts/eval_public.sh completes on Colab/T4 or A100 with the default BATCH=50 without OOM.
• AC-2: Produces a submission with one entry per eval task at ${OUT}.
• AC-3: If public solutions exist, prints a final score line EVAL SCORE (public): N/M = P%.
• AC-4: Per-chunk progress is visible on stdout.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| MemoryError / process killed mid-run | Search/guidance RAM spike | Lower BATCH (e.g., 30); ensure sitecustomize.py exists (the script auto-creates it); keep thread envs at 1 |
| Hanging with no output | Buffered child output | The runner uses unbuffered flags; re-run the script (don't call the solver directly) |
| SystemError: ufunc equal | Huge boolean temporary in an equality check | Harmless as a one-off; usually disappears when BATCH is lowered; longer-term fix: bytewise equality in the solver |
| Submission has fewer items than expected | Exception while solving a task | The runner records empty outputs on error to keep the submission shape; inspect solver logs for specific failures |
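The shape-keeping behavior in the last row can be sketched as follows (`solve_one` and the task dicts are hypothetical stand-ins for the repo's solver API):

```python
# Record an empty output on failure so the submission keeps exactly
# one entry per task; solve_one and tasks are illustrative stand-ins.
def run(tasks, solve_one):
    preds = []
    for tid, task in tasks.items():
        try:
            preds.append({"task_id": tid, "outputs": solve_one(task)})
        except Exception:
            preds.append({"task_id": tid, "outputs": []})
    return preds

def solve_one(task):
    if task.get("bad"):
        raise ValueError("simulated solver failure")
    return [[[0]]]

preds = run({"t1": {}, "t2": {"bad": True}}, solve_one)
print([p["outputs"] for p in preds])  # [[[[0]]], []]
```

An empty `outputs` list scores zero for that task but keeps the submission valid, which is why a short submission usually points at per-task exceptions rather than a runner bug.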
Security / Cleanroom Notes

• The runner does not train or tune on evaluation solutions.
• Local scoring (if solutions are present) is a single post-run sanity check; avoid iterative tuning on eval to preserve generalization.

Next Steps (separate PRs)

• P0: guidance reuse per process; op pre-validation (translate, pad) to avoid exceptions.
• P1: symmetry-canonical hashing & invariant filters.
• P2: hard-negative mining for guidance retrain.

Quick Sanity Commands

# default run
bash scripts/eval_public.sh

# tighter memory profile
BATCH=30 bash scripts/eval_public.sh

# custom output path
OUT=submission/public_eval_YYYYMMDD.json bash scripts/eval_public.sh

That's it: this gives Codex and the Colab teammate a reliable "press go" for public eval, plus a clean rollback and clear acceptance criteria.

Makefile

Lines changed: 21 additions & 0 deletions
PY ?= python3
OUT ?= submission/full_submission.json
BATCH ?= 50

.PHONY: deps train submit eval_public

deps:
	$(PY) -m pip install -r requirements.txt

train:
	$(PY) -u tools/build_memory.py --train_json data/arc-agi_training_challenges.json
	$(PY) -u tools/train_guidance_on_arc.py \
		--train-challenges data/arc-agi_training_challenges.json \
		--train-solutions data/arc-agi_training_solutions.json \
		--out neural_guidance_model.json

submit:
	$(PY) -u arc_submit.py --out $(OUT)

eval_public:
	BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh

README.md

Lines changed: 12 additions & 0 deletions
@@ -85,6 +85,18 @@ solver = ARCSolver(use_enhancements=True)
 result = solver.solve_task(task)
 ```

+### 4. Public Evaluation Runner
+
+```bash
+bash scripts/eval_public.sh
+```
+
+Or via Makefile:
+
+```bash
+make eval_public
+```
+
 ## How It Works

 ### Enhanced Pipeline

scripts/eval_public.sh

Lines changed: 141 additions & 0 deletions
#!/usr/bin/env bash
# [S:ALG v1] runner=chunked_public_eval pass
set -euo pipefail

ROOT="${ROOT:-$(pwd)}"
PY="${PY:-python3}"
BATCH="${BATCH:-50}"   # tasks per chunk (tune if memory is tight)
OUT="${OUT:-submission/full_submission.json}"
LOGDIR="$ROOT/runlogs"
mkdir -p "$LOGDIR" "$(dirname "$OUT")"

# Memory-friendly defaults
export PYTHONUNBUFFERED=1 PYTHONMALLOC=malloc MALLOC_ARENA_MAX=2
export OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1

# 1) Ensure sitecustomize.py exists (float32 default + malloc trim)
if [[ ! -f "$ROOT/sitecustomize.py" ]]; then
cat > "$ROOT/sitecustomize.py" <<'PY'
import os, atexit, ctypes, numpy as np
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("NUMEXPR_NUM_THREADS", "1")
_orig = np.random.Generator.standard_normal
def _stdnorm(self, size=None, dtype=np.float32, out=None):  # default float32
    return _orig(self, size=size, dtype=dtype, out=out)
np.random.Generator.standard_normal = _stdnorm
try:
    libc = ctypes.CDLL("libc.so.6")
    atexit.register(lambda: libc.malloc_trim(0))
except Exception:
    pass
PY
fi

# 2) Chunked submission using a pure-Python runner (no --only flag required)
"$PY" - "$BATCH" "$OUT" <<'PY'
import json, os, sys, time
from pathlib import Path

BATCH = int(sys.argv[1])
OUT = sys.argv[2]
ROOT = Path(os.getcwd())
sys.path.append(str(ROOT))
from arc_solver.solver import solve_task  # repo API

def loadj(p):
    with open(p, "r") as f:
        return json.load(f)

eval_ch = loadj(ROOT / "data/arc-agi_evaluation_challenges.json")

# Build {task_id: task_obj} from either a list or a dict layout
E = {}
if isinstance(eval_ch, list):
    for it in eval_ch:
        tid = it.get("task_id") or it.get("id")
        if tid is not None:
            E[str(tid)] = it
elif isinstance(eval_ch, dict):
    for k, v in eval_ch.items():
        E[str(k)] = v

ids = sorted(E.keys())
chunks = [ids[i:i + BATCH] for i in range(0, len(ids), BATCH)]
all_preds = []
start = time.time()

for ci, chunk in enumerate(chunks, 1):
    t0 = time.time()
    ok = 0
    for tid in chunk:
        task = E[tid]
        try:
            pred = solve_task(task)  # returns list-of-test-grids (or a single grid)
            # single 2D grid (rows of numbers) -> wrap into a list of grids
            if pred and isinstance(pred[0], (list, tuple)) and pred[0] \
                    and isinstance(pred[0][0], (int, float)):
                pred = [pred]
            all_preds.append({"task_id": tid, "outputs": pred})
            ok += 1
        except Exception:
            # record empty prediction on error to keep submission shape stable
            all_preds.append({"task_id": tid, "outputs": []})
    dt = time.time() - t0
    print(f"[chunk {ci}/{len(chunks)}] solved {ok}/{len(chunk)} in {dt:.1f}s", flush=True)

# Write final submission
os.makedirs(os.path.dirname(OUT), exist_ok=True)
with open(OUT, "w") as f:
    json.dump(all_preds, f)
print(f"Wrote {OUT} with {len(all_preds)} items in {time.time() - start:.1f}s", flush=True)
PY

# 3) Score against public eval solutions (if present)
if [[ -f data/arc-agi_evaluation_solutions.json ]]; then
"$PY" - "$OUT" <<'PY'
import json, sys

sub = json.load(open(sys.argv[1]))
sol = json.load(open("data/arc-agi_evaluation_solutions.json"))

def norm(grids):
    # wrap a single 2D grid (rows of numbers) into a list of grids
    if grids and isinstance(grids[0], (list, tuple)) and grids[0] \
            and isinstance(grids[0][0], (int, float)):
        grids = [grids]
    return grids

pred = {}
if isinstance(sub, list):
    for it in sub:
        tid = it.get("task_id") or it.get("id")
        out = it.get("outputs") or it.get("output")
        if tid is not None and out is not None:
            pred[str(tid)] = norm(out)

gt = {}
if isinstance(sol, list):
    for it in sol:
        tid = it.get("task_id") or it.get("id")
        out = it.get("solutions") or it.get("outputs") or it.get("solution")
        if tid is not None and out is not None:
            gt[str(tid)] = norm(out)
elif isinstance(sol, dict):
    for k, v in sol.items():
        gt[str(k)] = norm(v)

ids = sorted(set(pred) & set(gt))
ok = sum(1 for tid in ids if pred[tid] == gt[tid])
total = len(ids)
pct = (ok / total * 100.0) if total else 0.0
print(f"EVAL SCORE (public): {ok}/{total} = {pct:.2f}%")
PY
else
echo "Note: public solutions not found at data/arc-agi_evaluation_solutions.json; skipping score."
fi

echo "Full submission at: $OUT"

tests/test_eval_public_script.py

Lines changed: 37 additions & 0 deletions
# [S:TEST v1] eval_public_script pass
import json
import os
import shutil
import subprocess
from pathlib import Path


def test_eval_public_script_runs(tmp_path):
    repo_root = Path(__file__).resolve().parents[1]
    data_file = repo_root / "data/arc-agi_evaluation_challenges.json"
    backup = tmp_path / "arc-agi_evaluation_challenges.json.bak"
    shutil.copy(data_file, backup)
    try:
        # Shrink the eval set to a single task so the run is fast
        with open(data_file) as f:
            all_data = json.load(f)
        first_id = next(iter(all_data))
        minimal = {first_id: all_data[first_id]}
        with open(data_file, "w") as f:
            json.dump(minimal, f)
        env = os.environ.copy()
        env["BATCH"] = "1"
        env["OUT"] = "submission/full_submission.json"
        subprocess.run(
            ["bash", str(repo_root / "scripts/eval_public.sh")],
            cwd=repo_root, check=True, env=env,
        )
        out_file = repo_root / env["OUT"]
        assert out_file.exists()
        with open(out_file) as f:
            sub = json.load(f)
        assert len(sub) == 1
    finally:
        # Restore the full eval set and clean up artifacts
        shutil.move(str(backup), data_file)
        site = repo_root / "sitecustomize.py"
        if site.exists():
            site.unlink()
        out_file = repo_root / "submission/full_submission.json"
        if out_file.exists():
            out_file.unlink()
