
Conversation

@tylerbessire (Owner) commented Sep 13, 2025

Summary

  • add a chunked public evaluation runner with memory guards
  • wire up the Makefile eval_public target and document usage in the README
  • cover the runner with a unit test and a progress log marker
  • provide a detailed handoff note documenting assumptions, the runbook, and rollback

Testing

  • BATCH=1 OUT=submission/full_submission.json bash scripts/eval_public.sh
  • pytest -q (fails: ModuleNotFoundError: No module named 'arc_solver.enhanced_solver', ModuleNotFoundError: No module named 'scipy')
  • pytest tests/test_eval_public_script.py -q

https://chatgpt.com/codex/tasks/task_e_68c5498b028c8322bc22e845ad6e8c06

Summary by CodeRabbit

  • New Features
    • Introduced a public evaluation runner with chunked execution, memory-safe defaults, and optional scoring against public solutions.
    • Added a Makefile target to streamline dependency setup, training, submission, and public evaluation.
  • Documentation
    • Added a handoff/runbook detailing the public eval workflow, feature flags, rollback steps, and troubleshooting.
    • Updated README with a “Public Evaluation Runner” section and usage guidance.
    • Appended update notes to AGENTS.
  • Tests
    • Added an automated test validating the public evaluation workflow and submission output generation.

@coderabbitai bot commented Sep 13, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a public evaluation runner workflow: a new shell script for chunked ARC evaluation with memory/thread guards, accompanying Makefile targets, documentation updates (README, AGENTS updates, HANDOFF runbook), and a test validating the script’s end-to-end execution on a minimal dataset.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docs: Public Eval Runner<br>`AGENTS.md`, `README.md`, `HANDOFF.md` | Added documentation for the public eval runner, Makefile usage, handoff/runbook details, and dated notes in AGENTS updates. No code/API changes. |
| Build & Automation<br>`Makefile` | Introduces targets deps, train, submit, and eval_public; variables with defaults (PY, OUT, BATCH); wires eval_public to `scripts/eval_public.sh`. |
| Eval Script<br>`scripts/eval_public.sh` | New script: sets env/thread limits, provisions `sitecustomize.py`, runs chunked Python evaluation over the ARC public set, writes submission JSON, and optionally scores against public solutions. |
| Tests<br>`tests/test_eval_public_script.py` | New test executes the eval script on a single-challenge slice, asserts output file integrity, and cleans up artifacts. |
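
To make the script row concrete, here is a minimal sketch of the guard pattern it describes; the thread-cap choices and the `arc_eval_runner` entrypoint are assumptions, not the PR's actual code:

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the env/thread guards described in the table above.
set -euo pipefail

# Cap BLAS/OpenMP thread pools so a single worker stays memory-safe.
export OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1

BATCH="${BATCH:-1}"                           # tasks per chunk
OUT="${OUT:-submission/full_submission.json}" # matches the Testing command above

# Hypothetical entrypoint; the real script inlines the chunked Python runner.
python3 -u -m arc_eval_runner --batch "$BATCH" --out "$OUT"
```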

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant SH as scripts/eval_public.sh
  participant SC as sitecustomize.py (auto)
  participant Py as Python Runner
  participant SV as arc_solver.solver
  participant FS as Submission JSON
  participant GT as Public Solutions (optional)

  Dev->>SH: Run eval_public (BATCH, OUT)
  SH->>SH: Configure env & thread caps
  SH->>SC: Ensure sitecustomize.py present
  SH->>Py: Launch chunked evaluation
  Py->>Py: Load evaluation challenges
  loop For each chunk
    Py->>SV: solve_task(task)
    alt success
      SV-->>Py: Predicted grid(s)
      Py->>FS: Append {task_id, outputs}
    else error
      Py->>FS: Append {task_id, outputs: []}
    end
  end
  Py-->>SH: Completed with results
  opt If solutions available
    SH->>GT: Compare predictions vs. ground truth
    GT-->>SH: Public score printed
  end
  SH-->>Dev: Path to submission and logs
```
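
In Python, the chunked loop the diagram depicts could look roughly like this; a sketch only, with the function name, chunk size, and file paths as assumptions rather than the script's actual code:

```python
import json

from arc_solver.solver import solve_task  # the SV participant above


def run_public_eval(challenges_path: str, out_path: str, chunk_size: int = 25) -> None:
    """Evaluate public tasks in chunks and write a submission JSON."""
    with open(challenges_path) as f:
        challenges = json.load(f)
    task_ids = sorted(challenges)
    all_preds = []
    for start in range(0, len(task_ids), chunk_size):
        for tid in task_ids[start:start + chunk_size]:
            try:
                outputs = solve_task(challenges[tid])  # predicted grid(s)
            except Exception:
                outputs = []  # on error, record an empty output for the task
            all_preds.append({"task_id": tid, "outputs": outputs})
    with open(out_path, "w") as f:
        json.dump(all_preds, f)
```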

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump my paw—new scripts take flight,
In tidy chunks we chew the night.
Makefile hums, the tests approve,
Logs hop by—such nimble groove.
A JSON trail where answers bloom—
Public eval, boom-boom-boom! 🥕🐇


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02c2d34 and a20b028.

📒 Files selected for processing (6)
  • AGENTS.md (1 hunks)
  • HANDOFF.md (1 hunks)
  • Makefile (1 hunks)
  • README.md (1 hunks)
  • scripts/eval_public.sh (1 hunks)
  • tests/test_eval_public_script.py (1 hunks)


@tylerbessire merged commit 3e167b6 into main Sep 13, 2025
2 of 6 checks passed
@tylerbessire deleted the codex/add-public-eval-runner-and-codex-todo branch September 13, 2025 12:09

@chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.

About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".

Comment on lines +70 to +78
```python
for tid in chunk:
    task = E[tid]
    try:
        pred = solve_task(task) # returns list-of-test-grids (or a single grid)
        if pred and isinstance(pred[0], (list, tuple)) and pred and isinstance(pred[0][0], (list, tuple)):
            # single 2D grid -> wrap
            if all(isinstance(r,(list,tuple)) and r and isinstance(r[0],(int,float)) for r in pred):
                pred = [pred]
        all_preds.append({"task_id": tid, "outputs": pred})
```


[P1] Handle solver results without KeyError

The chunk loop assumes solve_task returns a list and immediately indexes pred[0]. arc_solver.solver.solve_task actually returns a dict ({"attempt_1": ..., "attempt_2": ...}), so pred[0] raises KeyError and the except path records an empty output for every task. This makes the runner claim 0/N solved and produces a submission full of empty outputs. The script should extract the list of attempts from the dict instead of treating the result as a sequence.
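
A minimal sketch of the suggested handling, assuming the attempt keys named above; everything else in the loop is left as quoted:

```python
pred = solve_task(task)
if isinstance(pred, dict):
    # solve_task returns {"attempt_1": ..., "attempt_2": ...}; collect the
    # attempt grids into a list rather than indexing pred[0].
    pred = [pred[k] for k in ("attempt_1", "attempt_2") if pred.get(k) is not None]
all_preds.append({"task_id": tid, "outputs": pred})
```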


Comment on lines +7 to +21
```make
deps:
$(PY) -m pip install -r requirements.txt

train:
$(PY) -u tools/build_memory.py --train_json data/arc-agi_training_challenges.json
$(PY) -u tools/train_guidance_on_arc.py \
--train-challenges data/arc-agi_training_challenges.json \
--train-solutions data/arc-agi_training_solutions.json \
--out neural_guidance_model.json

submit:
$(PY) -u arc_submit.py --out $(OUT)

eval_public:
BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh
```


[P1] Insert tabs so Make targets are runnable

All recipe lines in the new Makefile start at column 1 rather than being prefixed by a tab. GNU Make requires each command in a rule to begin with a tab; otherwise make eval_public (and the other targets advertised in README) fail immediately with “missing separator”. Each command line under deps, train, submit, and eval_public needs a leading tab.
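
A minimal sketch of the corrected rules, with recipe lines tab-indented; the `PY` default is an assumption, while the `BATCH`/`OUT` defaults mirror the Testing commands above:

```make
PY    ?= python3   # assumed default; the PR defines its own
OUT   ?= submission/full_submission.json
BATCH ?= 1

deps:
	$(PY) -m pip install -r requirements.txt   # leading tab, as GNU Make requires

eval_public:
	BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh
```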

