@willccbb commented Jan 16, 2026

Motivation

  • Narrow the per-environment config to only the fields sourced from TOML (env_id, env_args, num_examples, rollouts_per_example), so that TOML and env pyproject settings are decoupled from run-wide options (see the TOML sketch after this list).
  • Allow multiple environments per run while keeping run-level settings (model, sampling, concurrency, saving) shared and overridable from the CLI.
  • Simplify configuration precedence and make CLI overrides for run-level options unambiguous.
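
For illustration, a multi-environment TOML under this scoping might look like the sketch below. The `[[env]]` table name is an assumption, not confirmed by this PR; the four per-env fields are the ones the PR scopes to TOML:

```toml
# Hypothetical multi-env eval config: only per-env fields live in TOML.
# Run-level options (model, sampling, concurrency, saving) come from the CLI.

[[env]]  # table name is an assumption
env_id = "gsm8k"
num_examples = 100
rollouts_per_example = 4

[[env]]
env_id = "math-python"
env_args = { difficulty = "hard" }  # forwarded to the environment
num_examples = 50
```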

Description

  • Introduced a new EvalRunConfig pydantic model and reduced EvalConfig to env-only fields (env_id, env_args, num_examples, rollouts_per_example) in verifiers/types.py.
  • Updated the evaluation runner signatures: run_evaluation now accepts (env_config: EvalConfig, run_config: EvalRunConfig), run_multi_evaluation accepts an EvalRunConfig, and result-path construction now takes both configs via get_eval_results_path(run_config, env_config) in verifiers/utils/path_utils.py and verifiers/utils/eval_utils.py.
  • Reworked CLI resolution in verifiers/scripts/eval.py: per-env TOML settings build EvalConfig objects, while run-level settings (model, endpoints, sampling args, concurrency, saving, headers, etc.) are collected into a single EvalRunConfig constructed from CLI flags and an endpoint-registry lookup.
  • Updated docs/evaluation.md and docs/reference.md to reflect the new TOML scope and the EvalRunConfig type, and to document the precedence rule: TOML controls only per-env fields; the CLI controls run-level options.
  • Adjusted imports, logging messages, and save logic to use the new config split, preserving existing behavior for parallel multi-env runs (see the sketch after this list).
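
As a rough sketch of the resulting split (a minimal illustration, not the actual verifiers definitions; every field beyond env_id, env_args, num_examples, and rollouts_per_example is an assumed name):

```python
# Minimal sketch of the config split; fields marked "assumed" are
# illustrative guesses, not the actual verifiers API.
from pydantic import BaseModel, Field


class EvalConfig(BaseModel):
    """Per-environment settings, sourced from TOML."""

    env_id: str
    env_args: dict = Field(default_factory=dict)
    num_examples: int = -1  # assumed default: evaluate all examples
    rollouts_per_example: int = 1


class EvalRunConfig(BaseModel):
    """Run-wide settings, shared across environments and set from the CLI."""

    model: str
    api_base_url: str | None = None  # assumed name (endpoint registry lookup)
    sampling_args: dict = Field(default_factory=dict)
    max_concurrent: int = 32  # assumed name and default
    save_results: bool = True  # assumed name


# Per the new runner signature, a single-env call would look like:
env_config = EvalConfig(env_id="gsm8k", rollouts_per_example=4)
run_config = EvalRunConfig(model="gpt-4.1-mini")
# run_evaluation(env_config, run_config)
```

The point of the split is that env_config varies per environment while run_config is built once from CLI flags and shared across all environments in the run.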

Testing

  • No automated tests were executed as part of this change (no pytest or CI run was requested).

Note

Refactors evaluation configuration to separate per-environment settings from run-level options and enable cleaner multi-env runs.

  • Introduces EvalRunConfig and reduces EvalConfig to env-only fields (env_id, env_args, num_examples, rollouts_per_example)
  • Reworks CLI (verifiers/scripts/eval.py) to build per-env EvalConfig from TOML/CLI and a single run-wide EvalRunConfig (model, endpoints, sampling, concurrency, saving, headers)
  • Updates runner APIs: run_evaluation(env_config, run_config) and run_multi_evaluation(run_config); result paths now use get_eval_results_path(run_config, env_config)
  • Removes MultiEvalConfig and migrates imports/usages accordingly (a rough migration sketch follows this list)
  • Updates docs (docs/evaluation.md, docs/reference.md) to clarify per-env TOML scope, precedence, and new config types
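
For downstream code that constructed MultiEvalConfig directly, the shape of the migration is roughly as follows; the old field names and the way env configs reach run_multi_evaluation are assumptions, since the PR only states that the function now accepts an EvalRunConfig:

```python
# Before (MultiEvalConfig is removed in this PR; field names are assumptions):
#   cfg = MultiEvalConfig(env_ids=["gsm8k", "math-python"], model="gpt-4.1-mini")
#   run_multi_evaluation(cfg)

# After: one EvalConfig per environment plus a single run-wide EvalRunConfig.
env_configs = [
    EvalConfig(env_id="gsm8k", rollouts_per_example=4),
    EvalConfig(env_id="math-python", num_examples=50),
]
run_config = EvalRunConfig(model="gpt-4.1-mini")
# run_multi_evaluation(run_config)  # how env_configs are attached is an
# assumption; the PR only confirms the EvalRunConfig parameter
```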

Written by Cursor Bugbot for commit ea99227.
