Refactor eval configs: add EvalRunConfig and separate run-level vs env-level settings #737
+163
−161
Motivation
- […] `env_id`, `env_args`, `num_examples`, `rollouts_per_example`) so TOML / env pyproject settings are decoupled from run-wide options.
Description
- […] `EvalRunConfig` pydantic model and reduced `EvalConfig` to env-only fields (`env_id`, `env_args`, `num_examples`, `rollouts_per_example`) in `verifiers/types.py`.
- […] `run_evaluation` now accepts `(env_config: EvalConfig, run_config: EvalRunConfig)` and `run_multi_evaluation` accepts an `EvalRunConfig`, and result-path construction now takes both run and env config via `get_eval_results_path(run_config, env_config)` in `verifiers/utils/path_utils.py` and `verifiers/utils/eval_utils.py`.
- […] `verifiers/scripts/eval.py` so per-env TOML settings build `EvalConfig` objects, while run-level settings (model, endpoints, sampling args, concurrency, saving, headers, etc.) are collected into a single `EvalRunConfig` constructed from CLI flags and endpoint registry lookup.
- […] `docs/evaluation.md` and `docs/reference.md` to reflect the new TOML scope (per-env fields) and the new `EvalRunConfig` type and precedence: TOML controls only per-env fields and CLI controls run-level options.
Testing
- […] `pytest` or CI run was requested).
Codex Task
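For reviewers, a minimal sketch of the split described in this PR. It is an illustration under stated assumptions, not the code in `verifiers/types.py`: standard-library dataclasses stand in for the pydantic models so the snippet runs without dependencies, every `EvalRunConfig` field name, all default values, and the results-path layout are hypothetical, and only the four `EvalConfig` field names and the `get_eval_results_path(run_config, env_config)` argument order come from the description.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class EvalConfig:
    """Per-environment settings: the four fields named in this PR."""
    env_id: str
    env_args: dict[str, Any] = field(default_factory=dict)
    num_examples: int = 5            # default value is an assumption
    rollouts_per_example: int = 3    # default value is an assumption


@dataclass
class EvalRunConfig:
    """Run-wide settings. All field names below are hypothetical."""
    model: str
    sampling_args: dict[str, Any] = field(default_factory=dict)
    max_concurrent: int = 32
    save_results: bool = False
    output_dir: str = "outputs"


def get_eval_results_path(run_config: EvalRunConfig, env_config: EvalConfig) -> Path:
    """Build a results path from both configs.

    The directory layout is an assumption made for illustration; the point is
    only that the model comes from the run config and the env id from the env config.
    """
    model_slug = run_config.model.replace("/", "--")
    return Path(run_config.output_dir) / "evals" / f"{env_config.env_id}--{model_slug}"


# One run config is shared; each environment gets its own env config.
run = EvalRunConfig(model="org/model")
envs = [EvalConfig(env_id="env-a"), EvalConfig(env_id="env-b", num_examples=2)]
paths = [get_eval_results_path(run, env) for env in envs]
```

Keeping the run config free of per-env fields means the same object can be reused unchanged across every environment in a multi-env run.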
Note
Refactors evaluation configuration to separate per-environment settings from run-level options and enable cleaner multi-env runs.
- […] `EvalRunConfig` and reduces `EvalConfig` to env-only fields (`env_id`, `env_args`, `num_examples`, `rollouts_per_example`)
- […] `verifiers/scripts/eval.py`) to build per-env `EvalConfig` from TOML/CLI and a single run-wide `EvalRunConfig` (model, endpoints, sampling, concurrency, saving, headers)
- […] `run_evaluation(env_config, run_config)` and `run_multi_evaluation(run_config)`; result paths now use `get_eval_results_path(run_config, env_config)`
- […] `MultiEvalConfig`; migrates imports/usages accordingly
- […] `docs/evaluation.md`, `docs/reference.md`) to clarify per-env TOML scope, precedence, and new config types

Written by Cursor Bugbot for commit ea99227. This will update automatically on new commits.
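The precedence rule summarized above (TOML supplies per-env fields, CLI supplies run-level options) could look roughly like the sketch below. It is not the logic in `verifiers/scripts/eval.py`: the flag names, the defaults, the `max_concurrent` field, and the choice to reject run-level keys found in an env table are all assumptions, and already-parsed TOML tables are represented as plain dicts.

```python
import argparse
from dataclasses import dataclass, field
from typing import Any

# The per-env keys named in this PR.
ENV_KEYS = {"env_id", "env_args", "num_examples", "rollouts_per_example"}


@dataclass
class EvalConfig:
    env_id: str
    env_args: dict[str, Any] = field(default_factory=dict)
    num_examples: int = 5            # default value is an assumption
    rollouts_per_example: int = 3    # default value is an assumption


@dataclass
class EvalRunConfig:
    model: str
    max_concurrent: int              # hypothetical field name


def build_env_configs(tables: list[dict[str, Any]]) -> list[EvalConfig]:
    """Turn parsed per-env TOML tables into EvalConfig objects."""
    configs = []
    for table in tables:
        unknown = set(table) - ENV_KEYS
        if unknown:
            # Assumption: this sketch rejects run-level keys rather than ignoring them.
            raise ValueError(f"not a per-env setting: {sorted(unknown)}")
        configs.append(EvalConfig(**table))
    return configs


def build_run_config(argv: list[str]) -> EvalRunConfig:
    """Collect run-level options from CLI flags (flag names are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="example-model")
    parser.add_argument("--max-concurrent", type=int, default=32)
    args = parser.parse_args(argv)
    return EvalRunConfig(model=args.model, max_concurrent=args.max_concurrent)


env_configs = build_env_configs([{"env_id": "env-a", "num_examples": 2}, {"env_id": "env-b"}])
run_config = build_run_config(["--model", "org/model"])
```

With this shape, a run-level key such as `model` placed in an env table fails fast instead of silently competing with the CLI value.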