feat: add public evaluation runner #16
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

Adds a public evaluation runner workflow: a new shell script for chunked ARC evaluation with memory/thread guards, accompanying Makefile targets, documentation updates (README, AGENTS updates, HANDOFF runbook), and a test validating the script's end-to-end execution on a minimal dataset.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Dev as Developer
participant SH as scripts/eval_public.sh
participant SC as sitecustomize.py (auto)
participant Py as Python Runner
participant SV as arc_solver.solver
participant FS as Submission JSON
participant GT as Public Solutions (optional)
Dev->>SH: Run eval_public (BATCH, OUT)
SH->>SH: Configure env & thread caps
SH->>SC: Ensure sitecustomize.py present
SH->>Py: Launch chunked evaluation
Py->>Py: Load evaluation challenges
loop For each chunk
Py->>SV: solve_task(task)
alt success
SV-->>Py: Predicted grid(s)
Py->>FS: Append {task_id, outputs}
else error
Py->>FS: Append {task_id, outputs: []}
end
end
Py-->>SH: Completed with results
opt If solutions available
SH->>GT: Compare predictions vs. ground truth
GT-->>SH: Public score printed
end
SH-->>Dev: Path to submission and logs
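To make the flow above concrete, here is a minimal sketch of the chunked loop the diagram describes. It assumes arc_solver.solver.solve_task as the solver entry point (the function named in the review comment below); the dataset path, the chunk size constant, and the output path are illustrative stand-ins for the script's BATCH and OUT environment variables, and the shape of the solver's return value is exactly the point raised further down.

```python
import json
import os

from arc_solver.solver import solve_task  # entry point named in the review below

EVAL_PATH = "data/arc-agi_evaluation_challenges.json"  # assumed path, mirroring the training files in the Makefile below
BATCH = 8                                              # stands in for the BATCH environment variable
OUT = "submission/full_submission.json"                # stands in for the OUT environment variable

with open(EVAL_PATH) as f:
    E = json.load(f)                                   # {task_id: task}

task_ids = sorted(E)
all_preds = []
for start in range(0, len(task_ids), BATCH):
    for tid in task_ids[start:start + BATCH]:          # process one chunk at a time
        try:
            pred = solve_task(E[tid])
            all_preds.append({"task_id": tid, "outputs": pred})
        except Exception:
            all_preds.append({"task_id": tid, "outputs": []})  # record an empty result on failure

os.makedirs(os.path.dirname(OUT), exist_ok=True)
with open(OUT, "w") as f:
    json.dump(all_preds, f)
```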
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.
for tid in chunk:
    task = E[tid]
    try:
        pred = solve_task(task)  # returns list-of-test-grids (or a single grid)
        if pred and isinstance(pred[0], (list, tuple)) and pred and isinstance(pred[0][0], (list, tuple)):
            # single 2D grid -> wrap
            if all(isinstance(r,(list,tuple)) and r and isinstance(r[0],(int,float)) for r in pred):
                pred = [pred]
        all_preds.append({"task_id": tid, "outputs": pred})
[P1] Handle solver results without KeyError
The chunk loop assumes solve_task returns a list and immediately indexes pred[0]. arc_solver.solver.solve_task actually returns a dict ({"attempt_1": ..., "attempt_2": ...}), so pred[0] raises KeyError and the except path records an empty output for every task. This makes the runner claim 0/N solved and produces a submission full of empty outputs. The script should extract the list of attempts from the dict instead of treating the result as a sequence.
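A minimal sketch of the suggested handling, assuming the dict shape {"attempt_1": ..., "attempt_2": ...} described above (the helper name extract_outputs is hypothetical, and any other return shape is passed through unchanged as a fallback):

```python
def extract_outputs(result):
    """Turn a solver result into the list of attempt grids for the submission.

    Assumes the dict shape {"attempt_1": ..., "attempt_2": ...} described in the
    review comment above; other shapes are passed through to the existing handling.
    """
    if isinstance(result, dict):
        return [result[k] for k in ("attempt_1", "attempt_2") if k in result]
    return result  # older/list-style return values keep the current behavior
```

Inside the loop, the success path could then become all_preds.append({"task_id": tid, "outputs": extract_outputs(solve_task(task))}), avoiding the KeyError.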
deps:
$(PY) -m pip install -r requirements.txt

train:
$(PY) -u tools/build_memory.py --train_json data/arc-agi_training_challenges.json
$(PY) -u tools/train_guidance_on_arc.py \
--train-challenges data/arc-agi_training_challenges.json \
--train-solutions data/arc-agi_training_solutions.json \
--out neural_guidance_model.json

submit:
$(PY) -u arc_submit.py --out $(OUT)

eval_public:
BATCH=$(BATCH) OUT=$(OUT) bash scripts/eval_public.sh
[P1] Insert tabs so Make targets are runnable
All recipe lines in the new Makefile start at column 1 rather than being prefixed by a tab. GNU Make requires each command in a rule to begin with a tab; otherwise make eval_public (and the other targets advertised in README) fail immediately with “missing separator”. Each command line under deps, train, submit, and eval_public needs a leading tab.
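If it helps to double-check the fix mechanically, here is a small Python snippet (a hypothetical helper, not part of this PR) that applies a rough heuristic to flag rule-body lines that do not begin with a tab:

```python
from pathlib import Path

# Rough heuristic (hypothetical helper, not part of this PR): inside a rule,
# every non-blank, non-comment body line should begin with a tab character.
bad, in_rule = [], False
for n, line in enumerate(Path("Makefile").read_text().splitlines(), start=1):
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        in_rule = False                      # blank/comment line: assume the rule ended
    elif line.startswith("\t"):
        pass                                 # correctly tabbed recipe line
    elif stripped.endswith(":") or ": " in line:
        in_rule = True                       # looks like a target line, e.g. "eval_public:"
    elif in_rule:
        bad.append(n)                        # column-1 body line -> "missing separator"

print("recipe lines missing a leading tab:", bad or "none")
```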
Summary
Testing
- BATCH=1 OUT=submission/full_submission.json bash scripts/eval_public.sh
- pytest -q (fails: ModuleNotFoundError: No module named 'arc_solver.enhanced_solver', ModuleNotFoundError: No module named 'scipy')
- pytest tests/test_eval_public_script.py -q

https://chatgpt.com/codex/tasks/task_e_68c5498b028c8322bc22e845ad6e8c06
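For context, here is a sketch of what an end-to-end test like tests/test_eval_public_script.py might look like, assuming the script honors the BATCH and OUT variables used above. How the minimal dataset is wired in is not visible in this excerpt, so the sketch only exercises the invocation and the output file:

```python
import json
import os
import subprocess


def test_eval_public_script(tmp_path):
    """Run the runner end-to-end and check that it writes a parseable submission."""
    out = tmp_path / "submission.json"
    env = {**os.environ, "BATCH": "1", "OUT": str(out)}
    subprocess.run(["bash", "scripts/eval_public.sh"], check=True, env=env)
    assert out.exists()
    json.loads(out.read_text())  # the submission must at least be valid JSON
```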