multi-env evals config #734

mikasenghaas · 2026-01-15T15:38:37Z

Description

This PR implements evaluating multiple environments in parallel via vf-eval. For more details check the updated docs.

This PR is mainly concerned with the config system. Cosmetic updates will be shipped separately; see, e.g., #735.

Examples

By default, we still evaluate a single env, with no changes to the interface:

uv run vf-eval gsm8k -n5 -r3

To configure multi-environment evaluation, specify a comma-separated list of env IDs:

uv run vf-eval gsm8k,alphabet-sort -n5 -r3

Note that all environments use their default configuration. Since CLI arguments apply to all environments, values can only be changed for all environments at once. For more fine-grained configuration, see below.
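As a rough sketch of how a comma-separated positional argument could be expanded into one entry per environment: the helper name `split_env_ids` and the empty-id check are assumptions for illustration, not code from this PR. Only the `{"env_id": ...}` entry shape mirrors the diff hunk in `verifiers/scripts/eval.py`.

```python
def split_env_ids(env_id_or_path: str) -> list[dict[str, str]]:
    """Expand 'a,b' into one raw config entry per env id (hypothetical helper)."""
    # Split on commas and tolerate stray whitespace around each id.
    env_ids = [part.strip() for part in env_id_or_path.split(",")]
    # Assumption: an empty id (e.g. a trailing comma) is treated as an error.
    if any(not env_id for env_id in env_ids):
        raise ValueError(f"Empty env id in {env_id_or_path!r}")
    return [{"env_id": env_id} for env_id in env_ids]


print(split_env_ids("gsm8k,alphabet-sort"))
# prints [{'env_id': 'gsm8k'}, {'env_id': 'alphabet-sort'}]
```

A single env ID passes through the same path and yields a one-element list, which is one way to keep the single-env interface unchanged.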

To configure multi-environment evaluation with (potentially) different arguments for each environment, specify a path to a TOML config file:

uv run vf-eval configs/evals/debug.toml -n5 -r3

# configs/evals/debug.toml
[[env]]
id = "gsm8k"
num_examples = 1
rollouts_per_example = 1

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Introduces parallel multi-environment evaluation and a more flexible CLI.

CLI positional env_id_or_path now accepts a single env ID, a comma-separated list, or a TOML file; per-env settings resolve with precedence: TOML > CLI > env defaults > global
New MultiEvalConfig and run_multi_evaluation() execute all envs concurrently; refactors single-run flow and centralizes result printing/performance reporting
Adds TOML helpers is_toml_config() and load_toml_config() with validation; simplifies print_results and moves event loop lag monitoring to multi-run; reduces lag monitor log level to debug
Removes print_results from EvalConfig; retains existing flags/behavior for single-env runs
Expands docs with multi-env usage and precedence; adds example config configs/evals/debug.toml
Adds comprehensive tests covering CLI parsing, TOML loading/validation, multi-env config merging, and precedence
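
The precedence order named above (TOML > CLI > env defaults > global) can be sketched as a lookup through layered dicts. This is an assumed illustration, not the merging code from this PR: the function name `resolve_setting`, the treatment of `None` as "unset", and all values below are hypothetical.

```python
def resolve_setting(key, toml_env, cli_args, env_defaults, global_defaults):
    """Return the first set value, searching highest precedence first (sketch)."""
    # Order encodes the stated precedence: TOML > CLI > env defaults > global.
    for layer in (toml_env, cli_args, env_defaults, global_defaults):
        # Assumption: None means "not provided" at that layer.
        if layer.get(key) is not None:
            return layer[key]
    raise KeyError(key)


# Hypothetical values for a single environment.
toml_env = {"num_examples": 1}
cli_args = {"num_examples": 5, "rollouts_per_example": 3}
env_defaults = {"rollouts_per_example": 2}
global_defaults = {"num_examples": 10, "rollouts_per_example": 1}

layers = (toml_env, cli_args, env_defaults, global_defaults)
print(resolve_setting("num_examples", *layers))          # prints 1 (from TOML)
print(resolve_setting("rollouts_per_example", *layers))  # prints 3 (from CLI)
```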

Written by Cursor Bugbot for commit c4d690d. This will update automatically on new commits.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-16T10:46:41Z

verifiers/scripts/eval.py

+        raw_multi_env_config = [{"env_id": env_id} for env_id in env_ids]
+    else:
+        # single-eval env
+        raw_multi_env_config = [{"env_id": args.env_id_or_path}]


Missing TOML path gives confusing module-not-found error

Low Severity

When a user provides a path ending in .toml but the file doesn't exist (e.g., typo in path like config/debug.toml instead of configs/debug.toml), is_toml_config returns False because Path.is_file() fails. The code then falls through to treating the path as an environment ID, causing a confusing "module not found" error instead of "TOML config file not found". Since no valid environment ID would end in .toml, paths with this extension should check for file existence and give a clear error message when missing.

Additional Locations (1)

verifiers/utils/eval_utils.py#L73-L76
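
One possible shape for the suggested fix, as a sketch only: check the `.toml` suffix first and raise a clear error when the file is missing, instead of falling through to env-ID handling. The function name `classify_target` and its return values are hypothetical and not part of this PR; only `pathlib` calls from the standard library are used.

```python
from pathlib import Path


def classify_target(env_id_or_path: str) -> str:
    """Return 'toml' or 'env_id'; fail loudly on a missing .toml path (sketch)."""
    path = Path(env_id_or_path)
    if path.suffix == ".toml":
        # A .toml suffix is never a valid env id, so a missing file is an error.
        if not path.is_file():
            raise FileNotFoundError(f"TOML config file not found: {path}")
        return "toml"
    return "env_id"


print(classify_target("gsm8k"))  # prints env_id
try:
    classify_target("config/does-not-exist.toml")
except FileNotFoundError as exc:
    print(exc)
```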

mikasenghaas added 4 commits January 15, 2026 13:10

simple multi eval scaffolding via toml config

7fba751

add debug config

3072cf8

demote to debug log

3343f3b

move around logs

a80e9ac

mikasenghaas mentioned this pull request Jan 15, 2026

eval tui #735

Draft

13 tasks

mikasenghaas added 10 commits January 15, 2026 16:12

fix tests

63279d4

support comma-separated list

d976669

fix precedence

d23210b

minor

f34fee0

fix schema validation

73d2dcc

minor fix

d499fa8

update tests

cfc9ca0

add unit tests

084e684

revert pbar desc

f39c27e

update docs

3c361e5

mikasenghaas requested a review from willccbb January 15, 2026 17:22

mikasenghaas marked this pull request as ready for review January 15, 2026 17:23

mikasenghaas changed the title ~~multi-env evals~~ multi-env evals config Jan 15, 2026

cursor bot reviewed Jan 15, 2026

verifiers/utils/eval_utils.py (outdated, resolved)

verifiers/scripts/eval.py (outdated, resolved)

verifiers/scripts/eval.py (outdated, resolved)

mikasenghaas added 3 commits January 15, 2026 17:30

typo

9f8bb55

fix mutation

98ed4b6

validation for env ids

1c6a73e

cursor bot reviewed Jan 15, 2026

verifiers/utils/eval_utils.py (resolved)

mikasenghaas added 2 commits January 15, 2026 21:31

fix resolution issue

501a638

move debug config

c4d690d

cursor bot reviewed Jan 16, 2026
