
Conversation

@tylerbessire (Owner) commented Sep 13, 2025

Summary

  • add a chunked public evaluation runner with memory guards
  • wire up the Makefile eval_public target and document usage in the README
  • cover the runner with a unit test and a progress log marker
  • provide a detailed handoff note documenting assumptions, the runbook, and rollback

Testing

  • BATCH=1 OUT=submission/full_submission.json bash scripts/eval_public.sh
  • pytest -q (fails: ModuleNotFoundError: No module named 'arc_solver.enhanced_solver', ModuleNotFoundError: No module named 'scipy')
  • pytest tests/test_eval_public_script.py -q

https://chatgpt.com/codex/tasks/task_e_68c5498b028c8322bc22e845ad6e8c06

Summary by CodeRabbit

  • New Features
    • Introduced a public evaluation runner with chunked execution, memory-safe defaults, and optional scoring against public solutions.
    • Added a Makefile target to streamline dependency setup, training, submission, and public evaluation.
  • Documentation
    • Added a handoff/runbook detailing the public eval workflow, feature flags, rollback steps, and troubleshooting.
    • Updated README with a “Public Evaluation Runner” section and usage guidance.
    • Appended update notes to AGENTS.
  • Tests
    • Added an automated test validating the public evaluation workflow and submission output generation.

@coderabbitai bot commented Sep 13, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a public evaluation runner workflow: a new shell script for chunked ARC evaluation with memory/thread guards, accompanying Makefile targets, documentation updates (README, AGENTS updates, HANDOFF runbook), and a test validating the script’s end-to-end execution on a minimal dataset.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docs: Public Eval Runner<br>`AGENTS.md`, `README.md`, `HANDOFF.md` | Added documentation for the public eval runner, Makefile usage, handoff/runbook details, and dated notes in AGENTS updates. No code/API changes. |
| Build & Automation<br>`Makefile` | Introduces targets deps, train, submit, and eval_public; variables with defaults (PY, OUT, BATCH); wires eval_public to `scripts/eval_public.sh`. |
| Eval Script<br>`scripts/eval_public.sh` | New script: sets env/thread limits, provisions `sitecustomize.py`, runs chunked Python evaluation over the ARC public set, writes submission JSON, and optionally scores against public solutions. |
| Tests<br>`tests/test_eval_public_script.py` | New test executes the eval script on a single-challenge slice, asserts output file integrity, and cleans up artifacts. |
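
To make the script row concrete, here is a minimal sketch of the guard pattern it describes; the thread-cap choices and the `arc_eval_runner` entrypoint are assumptions, not the PR's actual code:

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the env/thread guards described in the table above.
set -euo pipefail

# Cap BLAS/OpenMP thread pools so a single worker stays memory-safe.
export OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1

BATCH="${BATCH:-1}"                           # tasks per chunk
OUT="${OUT:-submission/full_submission.json}" # matches the Testing command above

# Hypothetical entrypoint; the real script inlines the chunked Python runner.
python3 -u -m arc_eval_runner --batch "$BATCH" --out "$OUT"
```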

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant SH as scripts/eval_public.sh
  participant SC as sitecustomize.py (auto)
  participant Py as Python Runner
  participant SV as arc_solver.solver
  participant FS as Submission JSON
  participant GT as Public Solutions (optional)

  Dev->>SH: Run eval_public (BATCH, OUT)
  SH->>SH: Configure env & thread caps
  SH->>SC: Ensure sitecustomize.py present
  SH->>Py: Launch chunked evaluation
  Py->>Py: Load evaluation challenges
  loop For each chunk
    Py->>SV: solve_task(task)
    alt success
      SV-->>Py: Predicted grid(s)
      Py->>FS: Append {task_id, outputs}
    else error
      Py->>FS: Append {task_id, outputs: []}
    end
  end
  Py-->>SH: Completed with results
  opt If solutions available
    SH->>GT: Compare predictions vs. ground truth
    GT-->>SH: Public score printed
  end
  SH-->>Dev: Path to submission and logs
```
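
In Python, the chunked loop the diagram depicts could look roughly like this; a sketch only, with the function name, chunk size, and file paths as assumptions rather than the script's actual code:

```python
import json

from arc_solver.solver import solve_task  # the SV participant above


def run_public_eval(challenges_path: str, out_path: str, chunk_size: int = 25) -> None:
    """Evaluate public tasks in chunks and write a submission JSON."""
    with open(challenges_path) as f:
        challenges = json.load(f)
    task_ids = sorted(challenges)
    all_preds = []
    for start in range(0, len(task_ids), chunk_size):
        for tid in task_ids[start:start + chunk_size]:
            try:
                outputs = solve_task(challenges[tid])  # predicted grid(s)
            except Exception:
                outputs = []  # on error, record an empty output for the task
            all_preds.append({"task_id": tid, "outputs": outputs})
    with open(out_path, "w") as f:
        json.dump(all_preds, f)
```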

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump my paw—new scripts take flight,
In tidy chunks we chew the night.
Makefile hums, the tests approve,
Logs hop by—such nimble groove.
A JSON trail where answers bloom—
Public eval, boom-boom-boom! 🥕🐇


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02c2d34 and a20b028.

📒 Files selected for processing (6)
  • AGENTS.md (1 hunks)
  • HANDOFF.md (1 hunks)
  • Makefile (1 hunks)
  • README.md (1 hunks)
  • scripts/eval_public.sh (1 hunks)
  • tests/test_eval_public_script.py (1 hunks)


@tylerbessire merged commit 3e167b6 into main Sep 13, 2025
2 of 6 checks passed
@tylerbessire deleted the codex/add-public-eval-runner-and-codex-todo branch September 13, 2025 12:09

@chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.

About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".

Comment on lines +70 to +78
```python
for tid in chunk:
    task = E[tid]
    try:
        pred = solve_task(task) # returns list-of-test-grids (or a single grid)
        if pred and isinstance(pred[0], (list, tuple)) and pred and isinstance(pred[0][0], (list, tuple)):
            # single 2D grid -> wrap
            if all(isinstance(r,(list,tuple)) and r and isinstance(r[0],(int,float)) for r in pred):
                pred = [pred]
        all_preds.append({"task_id": tid, "outputs": pred})
```


[P1] Handle solver results without KeyError

The chunk loop assumes solve_task returns a list and immediately indexes pred[0]. arc_solver.solver.solve_task actually returns a dict ({"attempt_1": ..., "attempt_2": ...}), so pred[0] raises KeyError and the except path records an empty output for every task. This makes the runner claim 0/N solved and produces a submission full of empty outputs. The script should extract the list of attempts from the dict instead of treating the result as a sequence.
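
A minimal sketch of the suggested handling, assuming the attempt keys named above; everything else in the loop is left as quoted:

```python
pred = solve_task(task)
if isinstance(pred, dict):
    # solve_task returns {"attempt_1": ..., "attempt_2": ...}; collect the
    # attempt grids into a list rather than indexing pred[0].
    pred = [pred[k] for k in ("attempt_1", "attempt_2") if pred.get(k) is not None]
all_preds.append({"task_id": tid, "outputs": pred})
```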


Comment on lines +7 to +21
```make
deps:
$(PY) -m pip install -r requirements.txt

train:
$(PY) -u tools/build_memory.py --train_json data/arc-agi_training_challenges.json
$(PY) -u tools/train_guidance_on_arc.py \
--train-challenges data/arc-agi_training_challenges.json \
--train-solutions data/arc-agi_training_solutions.json \
--out neural_guidance_model.json

submit:
$(PY) -u arc_submit.py --out $(OUT)

eval_public:
BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh
```


[P1] Insert tabs so Make targets are runnable

All recipe lines in the new Makefile start at column 1 rather than being prefixed by a tab. GNU Make requires each command in a rule to begin with a tab; otherwise make eval_public (and the other targets advertised in README) fail immediately with “missing separator”. Each command line under deps, train, submit, and eval_public needs a leading tab.
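
A minimal sketch of the corrected rules, with recipe lines tab-indented; the `PY` default is an assumption, while the `BATCH`/`OUT` defaults mirror the Testing commands above:

```make
PY    ?= python3   # assumed default; the PR defines its own
OUT   ?= submission/full_submission.json
BATCH ?= 1

deps:
	$(PY) -m pip install -r requirements.txt   # leading tab, as GNU Make requires

eval_public:
	BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh
```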

