feat: add public evaluation runner #16
@@ -0,0 +1,100 @@
Public Eval Runner — Handoff

Overview

Adds a one-command public evaluation runner that executes the solver across the ARC public evaluation set in chunks, writes a valid submission JSON, and (optionally) scores it against public solutions if available.
•Primary entrypoint: scripts/eval_public.sh
•Optional convenience: make eval_public
•Memory-safe defaults (thread caps, float32 guidance via sitecustomize.py, process-friendly allocator env)
•Logs and artifacts are created deterministically

Assumptions
•data/arc-agi_evaluation_challenges.json exists (required to run).
•data/arc-agi_evaluation_solutions.json exists only if local scoring is desired.
•The repository root is the working directory when running the script/Makefile.
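These assumptions can be checked up front. A minimal sketch (the `preflight` helper is illustrative, not part of the repo):

```python
from pathlib import Path

def preflight(root="."):
    # required: evaluation challenges; optional: solutions (enables local scoring)
    data = Path(root) / "data"
    has_challenges = (data / "arc-agi_evaluation_challenges.json").is_file()
    has_solutions = (data / "arc-agi_evaluation_solutions.json").is_file()
    return has_challenges, has_solutions
```

Run it from the repository root; a `(False, _)` result means the runner cannot start.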

Files Added
•scripts/eval_public.sh – chunked runner + optional scoring.
•(On first run) sitecustomize.py is created at the repo root to enforce float32 guidance init and thread caps.

Note: No changes to solver code are required for this runner.

How to Run

Direct (bash)

bash scripts/eval_public.sh

With knobs

BATCH=40 OUT=submission/eval_public.json bash scripts/eval_public.sh

Makefile (optional)

make eval_public   # uses defaults
make eval_public BATCH=40 OUT=submission/eval_public.json

Outputs & Logs
•Submission JSON: ${OUT} (default: submission/full_submission.json)
•Console prints per-chunk progress:

[chunk 1/8] solved 50/50 in 27.3s
...
Wrote submission/full_submission.json with 400 items in 214.8s

•If data/arc-agi_evaluation_solutions.json is present, the script prints:

EVAL SCORE (public): N/M = P%
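The submission is a JSON list with one object per task. A quick shape check, assuming the `{"task_id": ..., "outputs": [...]}` layout the runner writes (`check_submission` is a hypothetical helper, not part of the repo):

```python
import json
import os
import tempfile

def check_submission(path, expected_ids):
    # one entry per eval task; report any task ids missing from the submission
    with open(path) as f:
        sub = json.load(f)
    seen = {str(item.get("task_id") or item.get("id")) for item in sub}
    missing = sorted(set(map(str, expected_ids)) - seen)
    return len(sub), missing

# demo against a throwaway file standing in for ${OUT}
tmp = os.path.join(tempfile.gettempdir(), "demo_submission.json")
with open(tmp, "w") as f:
    json.dump([{"task_id": "t1", "outputs": [[[0, 1], [1, 0]]]}], f)
n, missing = check_submission(tmp, ["t1", "t2"])
```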

Feature Flags / Config
•BATCH (default 50): number of tasks per chunk; lower it if memory is tight.
•OUT (default submission/full_submission.json): path to the final submission.
•Env set by the script for stability:
•PYTHONUNBUFFERED=1, PYTHONMALLOC=malloc, MALLOC_ARENA_MAX=2
•OPENBLAS_NUM_THREADS=1, OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, NUMEXPR_NUM_THREADS=1
•sitecustomize.py (auto-created if missing) defaults NumPy standard-normal draws to float32 and trims malloc arenas on exit.

Rollback
•Remove scripts/eval_public.sh and any Makefile target that references it.
•Delete sitecustomize.py if you don’t want repo-local Python customization.
•No other files are touched.

Acceptance Criteria
•AC-1: scripts/eval_public.sh completes on Colab/T4 or A100 with the default BATCH=50 without OOM.
•AC-2: Produces a submission with one entry per eval task at ${OUT}.
•AC-3: If public solutions exist, prints a final score line EVAL SCORE (public): N/M = P%.
•AC-4: Per-chunk progress is visible in stdout.

Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| MemoryError / process killed mid-run | Search/guidance RAM spike | Lower BATCH (e.g., 30); ensure sitecustomize.py exists (the script auto-creates it); keep thread envs at 1 |
| Hanging with no output | Buffered child output | The runner uses unbuffered flags; re-run the script (don’t call the solver directly) |
| SystemError: ufunc equal | Huge boolean temporary in an equality check | Harmless as a one-off; usually disappears when BATCH is lowered; longer-term fix: bytewise equality in the solver |
| Submission has fewer items than expected | Exception while solving a task | The runner records empty outputs on error to keep the shape; inspect solver logs for specific failures |
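The “bytewise equality” fix mentioned above could be sketched as follows (an assumption about the solver-side remedy, not code from this PR):

```python
import numpy as np

def grids_equal(a, b):
    # compare raw uint8 buffers instead of elementwise ==, which would
    # materialize a large boolean temporary on big grids
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    return a.shape == b.shape and a.tobytes() == b.tobytes()
```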

Security / Cleanroom Notes
•The runner does not train or tune on evaluation solutions.
•Local scoring (if solutions are present) is a single post-run sanity check; avoid iterative tuning on eval to preserve generalization.

Next Steps (separate PRs)
•P0: guidance reuse per process; op pre-validation (translate, pad) to avoid exceptions.
•P1: symmetry-canonical hashing & invariant filters.
•P2: hard-negative mining for guidance retrain.
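For P1, symmetry-canonical hashing could look like this sketch: take the lexicographically smallest byte encoding over the 8 dihedral transforms, so all rotated/flipped variants of a grid map to one key (illustrative, not part of this PR):

```python
import numpy as np

def canonical_form(grid):
    # smallest byte encoding over rotations and flips; the shape is prepended
    # so grids with the same cells but different dimensions stay distinct
    g = np.asarray(grid, dtype=np.uint8)
    variants = []
    for k in range(4):
        r = np.rot90(g, k)
        variants.append(bytes(r.shape) + r.tobytes())
        f = np.fliplr(r)
        variants.append(bytes(f.shape) + f.tobytes())
    return min(variants)
```

`hash(canonical_form(g))` then serves as a symmetry-invariant dictionary key.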

⸻

Quick Sanity Commands

# default run
bash scripts/eval_public.sh

# tighter memory profile
BATCH=30 bash scripts/eval_public.sh

# custom output path
OUT=submission/public_eval_YYYYMMDD.json bash scripts/eval_public.sh

That’s it—this gives Codex and the Colab teammate a reliable “press go” for public eval, plus a clean rollback and clear acceptance criteria.
@@ -0,0 +1,21 @@
PY ?= python3
OUT ?= submission/full_submission.json
BATCH ?= 50

.PHONY: deps train submit eval_public

deps:
	$(PY) -m pip install -r requirements.txt

train:
	$(PY) -u tools/build_memory.py --train_json data/arc-agi_training_challenges.json
	$(PY) -u tools/train_guidance_on_arc.py \
		--train-challenges data/arc-agi_training_challenges.json \
		--train-solutions data/arc-agi_training_solutions.json \
		--out neural_guidance_model.json

submit:
	$(PY) -u arc_submit.py --out $(OUT)

eval_public:
	BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh
@@ -0,0 +1,141 @@
#!/usr/bin/env bash
# [S:ALG v1] runner=chunked_public_eval pass
set -euo pipefail

ROOT="${ROOT:-$(pwd)}"
PY="${PY:-python3}"
BATCH="${BATCH:-50}"   # tasks per chunk (tune if memory is tight)
OUT="${OUT:-submission/full_submission.json}"
LOGDIR="$ROOT/runlogs"
mkdir -p "$LOGDIR" "$(dirname "$OUT")"

# Memory-friendly defaults
export PYTHONUNBUFFERED=1 PYTHONMALLOC=malloc MALLOC_ARENA_MAX=2
export OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1

# 1) Ensure sitecustomize.py exists (float32 + trim)
if [[ ! -f "$ROOT/sitecustomize.py" ]]; then
  cat > "$ROOT/sitecustomize.py" <<'PY'
import os, atexit, ctypes, numpy as np
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("NUMEXPR_NUM_THREADS", "1")
_orig = np.random.Generator.standard_normal
def _stdnorm(self, size=None, dtype=np.float32, out=None):  # default float32
    return _orig(self, size=size, dtype=dtype, out=out)
np.random.Generator.standard_normal = _stdnorm
try:
    libc = ctypes.CDLL("libc.so.6")
    atexit.register(lambda: libc.malloc_trim(0))
except Exception:
    pass
PY
fi

# 2) Chunked submission using a pure-Python runner (no --only flag required)
"$PY" - "$BATCH" "$OUT" <<'PY'
import json, os, sys, time
from pathlib import Path

BATCH = int(sys.argv[1])
OUT = sys.argv[2]
ROOT = Path(os.getcwd())
sys.path.append(str(ROOT))
from arc_solver.solver import solve_task  # repo API

def loadj(p):
    with open(p, "r") as f:
        return json.load(f)

eval_ch = loadj(ROOT / "data/arc-agi_evaluation_challenges.json")

# Build {task_id: task_obj}
E = {}
if isinstance(eval_ch, list):
    for it in eval_ch:
        tid = it.get("task_id") or it.get("id")
        if tid is not None:
            E[str(tid)] = it
elif isinstance(eval_ch, dict):
    for k, v in eval_ch.items():
        E[str(k)] = v

ids = sorted(E.keys())
chunks = [ids[i:i+BATCH] for i in range(0, len(ids), BATCH)]
all_preds = []
start = time.time()

for ci, chunk in enumerate(chunks, 1):
    t0 = time.time()
    ok = 0
    for tid in chunk:
        task = E[tid]
        try:
            pred = solve_task(task)  # returns list-of-test-grids (or a single grid)
            # single 2D grid (rows of ints) -> wrap into a one-element list
            if pred and all(isinstance(r, (list, tuple)) and r and isinstance(r[0], (int, float)) for r in pred):
                pred = [pred]
            all_preds.append({"task_id": tid, "outputs": pred})

Review comment on lines +70 to +78: [P1] Handle solver results without KeyError. The chunk loop assumes
            ok += 1
        except Exception:
            # record an empty prediction on error to keep the submission shape stable
            all_preds.append({"task_id": tid, "outputs": []})
    dt = time.time() - t0
    print(f"[chunk {ci}/{len(chunks)}] solved {ok}/{len(chunk)} in {dt:.1f}s", flush=True)

# Write final submission
os.makedirs(os.path.dirname(OUT), exist_ok=True)
with open(OUT, "w") as f:
    json.dump(all_preds, f)
print(f"Wrote {OUT} with {len(all_preds)} items in {time.time()-start:.1f}s", flush=True)
PY

# 3) Score against public eval solutions (if present)
if [[ -f data/arc-agi_evaluation_solutions.json ]]; then
  "$PY" - "$OUT" <<'PY'
import json, sys

sub = json.load(open(sys.argv[1]))
sol = json.load(open("data/arc-agi_evaluation_solutions.json"))

def norm(grids):
    # single 2D grid (rows of ints) -> wrap into a one-element list
    if grids and all(isinstance(r, (list, tuple)) and r and isinstance(r[0], (int, float)) for r in grids):
        grids = [grids]
    return grids

pred = {}
if isinstance(sub, list):
    for it in sub:
        tid = it.get("task_id") or it.get("id")
        out = it.get("outputs") or it.get("output")
        if tid is not None and out is not None:
            pred[str(tid)] = norm(out)

gt = {}
if isinstance(sol, list):
    for it in sol:
        tid = it.get("task_id") or it.get("id")
        out = it.get("solutions") or it.get("outputs") or it.get("solution")
        if tid is not None and out is not None:
            gt[str(tid)] = norm(out)
elif isinstance(sol, dict):
    for k, v in sol.items():
        gt[str(k)] = norm(v)

ids = sorted(set(pred) & set(gt))
ok = 0
for tid in ids:
    p, g = pred[tid], gt[tid]
    if len(p) == len(g) and all(pp == gg for pp, gg in zip(p, g)):
        ok += 1
total = len(ids)
pct = (ok / total * 100.0) if total else 0.0
print(f"EVAL SCORE (public): {ok}/{total} = {pct:.2f}%")
PY
else
  echo "Note: public solutions not found at data/arc-agi_evaluation_solutions.json; skipping score."
fi

echo "Full submission at: $OUT"
@@ -0,0 +1,37 @@
# [S:TEST v1] eval_public_script pass
import json
import os
import shutil
import subprocess
from pathlib import Path


def test_eval_public_script_runs(tmp_path):
    repo_root = Path(__file__).resolve().parents[1]
    data_file = repo_root / "data/arc-agi_evaluation_challenges.json"
    backup = tmp_path / "arc-agi_evaluation_challenges.json.bak"
    shutil.copy(data_file, backup)
    try:
        with open(data_file) as f:
            all_data = json.load(f)
        first_id = next(iter(all_data))
        minimal = {first_id: all_data[first_id]}
        with open(data_file, "w") as f:
            json.dump(minimal, f)
        env = os.environ.copy()
        env["BATCH"] = "1"
        env["OUT"] = "submission/full_submission.json"
        subprocess.run(
            ["bash", str(repo_root / "scripts/eval_public.sh")],
            cwd=repo_root, check=True, env=env,
        )
        out_file = repo_root / env["OUT"]
        assert out_file.exists()
        with open(out_file) as f:
            sub = json.load(f)
        assert len(sub) == 1
    finally:
        shutil.move(str(backup), data_file)
        site = repo_root / "sitecustomize.py"
        if site.exists():
            site.unlink()
        out_file = repo_root / "submission/full_submission.json"
        if out_file.exists():
            out_file.unlink()

Review comment: [P1] Insert tabs so Make targets are runnable
All recipe lines in the new Makefile start at column 1 rather than being prefixed by a tab. GNU Make requires each command in a rule to begin with a tab; otherwise make eval_public (and the other targets advertised in the README) fails immediately with “missing separator”. Each command line under deps, train, submit, and eval_public needs a leading tab.