@willccbb commented Jan 16, 2026

Motivation

  • Narrow the per-environment config to only the fields sourced from TOML (env_id, env_args, num_examples, rollouts_per_example), so that TOML and env pyproject settings are decoupled from run-wide options (see the TOML sketch after this list).
  • Allow multiple environments per run while keeping run-level settings (model, sampling, concurrency, saving) shared and overridable from the CLI.
  • Simplify configuration precedence and make CLI overrides for run-level options unambiguous.
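
For illustration, a multi-environment TOML under this scoping might look like the sketch below. The `[[env]]` table name is an assumption, not confirmed by this PR; the four per-env fields are the ones the PR scopes to TOML:

```toml
# Hypothetical multi-env eval config: only per-env fields live in TOML.
# Run-level options (model, sampling, concurrency, saving) come from the CLI.

[[env]]  # table name is an assumption
env_id = "gsm8k"
num_examples = 100
rollouts_per_example = 4

[[env]]
env_id = "math-python"
env_args = { difficulty = "hard" }  # forwarded to the environment
num_examples = 50
```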

Description

  • Introduced a new EvalRunConfig pydantic model and reduced EvalConfig to env-only fields (env_id, env_args, num_examples, rollouts_per_example) in verifiers/types.py.
  • Updated the evaluation runner signatures: run_evaluation now accepts (env_config: EvalConfig, run_config: EvalRunConfig), run_multi_evaluation accepts an EvalRunConfig, and result-path construction now takes both configs via get_eval_results_path(run_config, env_config) in verifiers/utils/path_utils.py and verifiers/utils/eval_utils.py.
  • Reworked CLI resolution in verifiers/scripts/eval.py: per-env TOML settings build EvalConfig objects, while run-level settings (model, endpoints, sampling args, concurrency, saving, headers, etc.) are collected into a single EvalRunConfig constructed from CLI flags and an endpoint-registry lookup.
  • Updated docs/evaluation.md and docs/reference.md to reflect the new TOML scope and the EvalRunConfig type, and to document the precedence rule: TOML controls only per-env fields; the CLI controls run-level options.
  • Adjusted imports, logging messages, and save logic to use the new config split, preserving existing behavior for parallel multi-env runs (see the sketch after this list).
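
As a rough sketch of the resulting split (a minimal illustration, not the actual verifiers definitions; every field beyond env_id, env_args, num_examples, and rollouts_per_example is an assumed name):

```python
# Minimal sketch of the config split; fields marked "assumed" are
# illustrative guesses, not the actual verifiers API.
from pydantic import BaseModel, Field


class EvalConfig(BaseModel):
    """Per-environment settings, sourced from TOML."""

    env_id: str
    env_args: dict = Field(default_factory=dict)
    num_examples: int = -1  # assumed default: evaluate all examples
    rollouts_per_example: int = 1


class EvalRunConfig(BaseModel):
    """Run-wide settings, shared across environments and set from the CLI."""

    model: str
    api_base_url: str | None = None  # assumed name (endpoint registry lookup)
    sampling_args: dict = Field(default_factory=dict)
    max_concurrent: int = 32  # assumed name and default
    save_results: bool = True  # assumed name


# Per the new runner signature, a single-env call would look like:
env_config = EvalConfig(env_id="gsm8k", rollouts_per_example=4)
run_config = EvalRunConfig(model="gpt-4.1-mini")
# run_evaluation(env_config, run_config)
```

The point of the split is that env_config varies per environment while run_config is built once from CLI flags and shared across all environments in the run.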

Testing

  • No automated tests were executed as part of this change (no pytest or CI run was requested).

Note

Refactors evaluation configuration to separate per-environment settings from run-level options and enable cleaner multi-env runs.

  • Introduces EvalRunConfig and reduces EvalConfig to env-only fields (env_id, env_args, num_examples, rollouts_per_example)
  • Reworks CLI (verifiers/scripts/eval.py) to build per-env EvalConfig from TOML/CLI and a single run-wide EvalRunConfig (model, endpoints, sampling, concurrency, saving, headers)
  • Updates runner APIs: run_evaluation(env_config, run_config) and run_multi_evaluation(run_config); result paths now use get_eval_results_path(run_config, env_config)
  • Removes MultiEvalConfig and migrates imports/usages accordingly (a rough migration sketch follows this list)
  • Updates docs (docs/evaluation.md, docs/reference.md) to clarify per-env TOML scope, precedence, and new config types
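
For downstream code that constructed MultiEvalConfig directly, the shape of the migration is roughly as follows; the old field names and the way env configs reach run_multi_evaluation are assumptions, since the PR only states that the function now accepts an EvalRunConfig:

```python
# Before (MultiEvalConfig is removed in this PR; field names are assumptions):
#   cfg = MultiEvalConfig(env_ids=["gsm8k", "math-python"], model="gpt-4.1-mini")
#   run_multi_evaluation(cfg)

# After: one EvalConfig per environment plus a single run-wide EvalRunConfig.
env_configs = [
    EvalConfig(env_id="gsm8k", rollouts_per_example=4),
    EvalConfig(env_id="math-python", num_examples=50),
]
run_config = EvalRunConfig(model="gpt-4.1-mini")
# run_multi_evaluation(run_config)  # how env_configs are attached is an
# assumption; the PR only confirms the EvalRunConfig parameter
```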

Written by Cursor Bugbot for commit ea99227.
