Conversation
Force-pushed 023f509 to b892629
Force-pushed b892629 to 63ea730
mdarcy220 left a comment:
The one thing that worries me about adding these params to the suite tasks is that it introduces a new potential source of discrepancies (where two runs on supposedly the "same" task have different task params). I think it would be ideal if we used the parent tasks for the single-task demos (e.g. core_bench, super, sqa instead of core_bench_validation, super_validation, sqa_dev), so we keep the configurable stuff in the parent tasks and the suite tasks represent a fully-specified/instantiated task configuration.
  --solver astabench/solvers/react/basic_agent.py@instantiated_basic_agent \
  --model openai/gpt-4.1-nano \
  --limit 1 \
  -T limit=1 \
Last I checked, inspect eval-set (and by extension astabench eval) could not take a -T param (this was the original reason for having the separate _validation and _test tasks).
In my testing, the -T parameter was passed through to the task initializer
Another reason we defined separate val/test tasks was that it wasn't possible to pass through different task params to different tasks
Modify `core_bench` to only download the capsules that will be used. (The `--limit` option is not currently forwarded to the task initializer.) To actually pass this limit to the task on the CLI, you need to use `-T limit=`, so add that to the `demo.sh` scripts.

Alternative to #93
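For reference, the distinction discussed above can be sketched as a demo.sh excerpt (the `astabench eval` invocation and task name are assumptions based on the diff snippet; only the flag semantics are the point here):

```shell
# Hypothetical demo.sh excerpt. Note the two "limit" knobs are different:
#   --limit 1    caps how many samples the eval harness runs
#   -T limit=1   is forwarded to the task initializer as a task param,
#                so core_bench only downloads the capsules it will use
astabench eval core_bench \
  --solver astabench/solvers/react/basic_agent.py@instantiated_basic_agent \
  --model openai/gpt-4.1-nano \
  --limit 1 \
  -T limit=1
```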