Commit 2e63c8a

feat: add multi-run orchestration — repeat, multi-agent, batch, and compare (v1.8.0-alpha.1)
- Extract evalRunSingle() from monolithic RunE for reusable eval execution
- Add --repeat N flag for statistical evaluation with aggregate stats
- Support comma-separated --agent/--model/--reasoning for quick A/B comparison
- Add 'sanity batch --config runs.toml' for complex multi-run configurations
- Add 'sanity compare' command for side-by-side result comparison
- Update Road to v2 documentation with v1.8.0 multi-run section
- Include multi-run implementation plan documentation
1 parent 9c0e0b4 commit 2e63c8a

9 files changed, +2940 -538 lines changed

docs/ROAD-TO-V2-Overhaul.md

Lines changed: 20 additions & 2 deletions
@@ -1,6 +1,6 @@
 # Road to v2.x.x Overhaul: What's Changed Since v1.6.1

-If you want the short version: `1.7.x` is a fairness and reliability release. We reduced information leakage during eval, made infrastructure failures easier to reason about, and tightened task contracts where hidden requirements were too implicit.
+If you want the short version: `1.7.x` is a fairness and reliability release. We reduced information leakage during eval, made infrastructure failures easier to reason about, and tightened task contracts where hidden requirements were too implicit. `1.8.x` adds multi-run orchestration for statistical rigor and cross-agent comparison.

 ## Why we changed anything

@@ -9,7 +9,7 @@ By the time we reached `v1.6.1`, we had enough evaluation history to see recurri
 - transient agent or infrastructure problems were being mixed with normal failures,
 - and a few tasks required behavior that was technically tested but not clearly stated.

-`1.7.x` is the pass where we cleaned those up.
+`1.7.x` is the pass where we cleaned those up. `1.8.x` builds on that foundation to make multi-run evaluation a first-class workflow.

 ## What changed in evaluation behavior

@@ -70,6 +70,24 @@ The `--use-mcp-tools` prompt was rewritten to be minimal and neutral. The previo

 A new `mcp_prompt` config field was added to `AgentConfig`, allowing per-agent MCP tool guidance (e.g., telling Gemini to use `@web` search). This is appended under an `AGENT-SPECIFIC TOOLS:` header when `--use-mcp-tools` is set.

+## What changed in v1.8.0 — multi-run orchestration
+
+Single eval runs are useful for quick checks, but meaningful comparison requires repetition and side-by-side analysis. `1.8.0` adds three complementary features that share a common orchestration layer:
+
+**`--repeat N` flag.** Run the same eval configuration N times to measure variance. Each repeat gets its own subdirectory under a multi-run parent, and the harness produces aggregate statistics (mean, stddev, min, max) for pass rate and weighted score across repeats.
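
Reviewer note: the aggregate statistics named above could be computed along the following lines. This is a minimal sketch, not the harness's actual code. The `Aggregate` type and `Summarize` function are hypothetical names, and the choice of the sample standard deviation (n-1 denominator) is an assumption, since the doc does not say which form is used.

```go
package main

import (
	"fmt"
	"math"
)

// Aggregate holds summary statistics for one metric across repeats.
// Hypothetical type; the real harness may structure this differently.
type Aggregate struct {
	Mean, StdDev, Min, Max float64
}

// Summarize computes mean, sample standard deviation (n-1 denominator),
// min, and max. The sample form is an assumption.
func Summarize(values []float64) (Aggregate, error) {
	if len(values) == 0 {
		return Aggregate{}, fmt.Errorf("no values to summarize")
	}
	agg := Aggregate{Min: values[0], Max: values[0]}
	sum := 0.0
	for _, v := range values {
		sum += v
		agg.Min = math.Min(agg.Min, v)
		agg.Max = math.Max(agg.Max, v)
	}
	agg.Mean = sum / float64(len(values))
	// With a single repeat there is no spread to measure; StdDev stays 0.
	if len(values) > 1 {
		sq := 0.0
		for _, v := range values {
			d := v - agg.Mean
			sq += d * d
		}
		agg.StdDev = math.Sqrt(sq / float64(len(values)-1))
	}
	return agg, nil
}

func main() {
	// Hypothetical pass rates from a --repeat 3 run.
	passRates := []float64{0.5, 0.75, 1.0}
	agg, err := Summarize(passRates)
	if err != nil {
		panic(err)
	}
	fmt.Printf("mean=%.3f stddev=%.3f min=%.3f max=%.3f\n",
		agg.Mean, agg.StdDev, agg.Min, agg.Max)
}
```

The same function would apply unchanged to weighted scores, since both metrics are plain per-repeat floats in this sketch.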
+
+**Comma-separated multi-agent.** Quick A/B comparison from the CLI without batch files. `--agent codex,opencode --model gpt-5.2,kimi-k2.5` broadcasts shared flags across agents and produces a multi-run directory with per-agent subdirectories and a comparison summary. The `--repeat` flag composes with multi-agent: `--agent codex,opencode --repeat 3` produces 6 total runs.
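
Reviewer note: one way to read the flag expansion described above is sketched below. The pairing rule is an assumption: a comma-separated `--model` list is matched to agents by position, and a single value is broadcast to every agent. `RunSpec` and `ExpandRuns` are hypothetical names, not taken from the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// RunSpec describes one concrete eval run (hypothetical type).
type RunSpec struct {
	Agent  string
	Model  string
	Repeat int // 1-based repeat index
}

// splitList splits a comma-separated flag value and trims whitespace.
func splitList(s string) []string {
	if strings.TrimSpace(s) == "" {
		return nil
	}
	parts := strings.Split(s, ",")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// pick returns the i-th value, or the only value when a single one was
// given (the assumed "broadcast" behavior), or "" when none was given.
func pick(values []string, i int) string {
	switch len(values) {
	case 0:
		return ""
	case 1:
		return values[0]
	default:
		return values[i]
	}
}

// ExpandRuns turns flag values into agents x repeats concrete runs.
func ExpandRuns(agentFlag, modelFlag string, repeat int) ([]RunSpec, error) {
	agents := splitList(agentFlag)
	models := splitList(modelFlag)
	if len(agents) == 0 {
		return nil, fmt.Errorf("at least one agent is required")
	}
	if len(models) > 1 && len(models) != len(agents) {
		return nil, fmt.Errorf("got %d models for %d agents", len(models), len(agents))
	}
	if repeat < 1 {
		repeat = 1
	}
	runs := make([]RunSpec, 0, len(agents)*repeat)
	for i, agent := range agents {
		for r := 1; r <= repeat; r++ {
			runs = append(runs, RunSpec{Agent: agent, Model: pick(models, i), Repeat: r})
		}
	}
	return runs, nil
}

func main() {
	runs, err := ExpandRuns("codex,opencode", "gpt-5.2,kimi-k2.5", 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(runs), "runs")
	for _, r := range runs {
		fmt.Printf("%s/%s repeat %d\n", r.Agent, r.Model, r.Repeat)
	}
}
```

Under these assumptions, two agents with `--repeat 3` expand to six runs, matching the count stated in the doc.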
+
+**`sanity batch --config runs.toml`.** For complex multi-run configurations with per-run overrides, shared defaults, and repeat support. The batch file is TOML with a `[shared]` section for defaults and `[[runs]]` entries for per-run overrides.
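
Reviewer note: the shared-defaults-plus-overrides model could resolve as in the sketch below. Only the `[shared]` and `[[runs]]` table names come from the doc. The field names (`Agent`, `Model`, `Repeat`) and the rule that an unset field falls back to the shared value are assumptions, and TOML parsing itself is left out.

```go
package main

import "fmt"

// RunConfig mirrors one table in the batch file. Field names are
// hypothetical; the actual schema is not shown in this diff.
type RunConfig struct {
	Agent  string
	Model  string
	Repeat int
}

// BatchConfig mirrors the assumed file layout: one [shared] table
// plus any number of [[runs]] entries.
type BatchConfig struct {
	Shared RunConfig
	Runs   []RunConfig
}

// Resolve applies each run's overrides on top of the shared defaults.
// A zero-valued field in a run means "not set, use the shared value".
func (b BatchConfig) Resolve() []RunConfig {
	out := make([]RunConfig, 0, len(b.Runs))
	for _, r := range b.Runs {
		merged := b.Shared
		if r.Agent != "" {
			merged.Agent = r.Agent
		}
		if r.Model != "" {
			merged.Model = r.Model
		}
		if r.Repeat != 0 {
			merged.Repeat = r.Repeat
		}
		out = append(out, merged)
	}
	return out
}

func main() {
	batch := BatchConfig{
		Shared: RunConfig{Model: "gpt-5.2", Repeat: 3},
		Runs: []RunConfig{
			{Agent: "codex"},
			{Agent: "opencode", Model: "kimi-k2.5", Repeat: 5},
		},
	}
	for _, r := range batch.Resolve() {
		fmt.Printf("%+v\n", r)
	}
}
```

Using the zero value as "unset" keeps the sketch short, but it cannot express an explicit override back to an empty value; pointer fields would be one way around that.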
+
+**`sanity compare <dir1> <dir2> ...`.** Load summaries from multiple eval result directories and produce a side-by-side comparison table. Works with any combination of single-run and multi-run directories.
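
Reviewer note: a comparison table like the one described above can be rendered with the standard library's `text/tabwriter`. This is only a sketch. The `Summary` type, its fields, and the column layout are hypothetical, and loading summaries from result directories is omitted.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// Summary is a hypothetical, already-loaded result summary for one run.
type Summary struct {
	Name          string
	PassRate      float64 // fraction in [0, 1]
	WeightedScore float64
}

// RenderComparison formats summaries as an aligned plain-text table,
// one row per run, preceded by a header row.
func RenderComparison(summaries []Summary) string {
	var sb strings.Builder
	// minwidth 0, tabwidth 4, padding 2, pad with spaces, no flags.
	tw := tabwriter.NewWriter(&sb, 0, 4, 2, ' ', 0)
	fmt.Fprintln(tw, "RUN\tPASS RATE\tWEIGHTED SCORE")
	for _, s := range summaries {
		fmt.Fprintf(tw, "%s\t%.1f%%\t%.2f\n", s.Name, s.PassRate*100, s.WeightedScore)
	}
	tw.Flush() // tabwriter buffers until Flush so it can align columns
	return sb.String()
}

func main() {
	// Made-up numbers, for illustration only.
	fmt.Print(RenderComparison([]Summary{
		{Name: "codex", PassRate: 0.8, WeightedScore: 71.5},
		{Name: "opencode", PassRate: 0.75, WeightedScore: 69.25},
	}))
}
```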
+
+The internal change that enables all of this is the extraction of the monolithic `evalCmd.RunE` into a reusable `evalRunSingle()` function. The top-level `RunE` is now a thin orchestration wrapper that parses multi-agent specs, manages the multi-run directory structure, and delegates individual runs.
+
+Multi-run directories use a `multi-run.json` config and `multi-run-state.json` for resume support. Interrupted multi-runs can be resumed with `--resume`, which detects completed sub-runs and continues from where it left off.
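
Reviewer note: resume detection might work roughly as sketched below, by filtering the planned sub-runs against those recorded as finished. The JSON shape (a `completed` array of sub-run IDs) and the names `MultiRunState` and `PendingRuns` are assumptions. The actual `multi-run-state.json` schema is not shown in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MultiRunState is a hypothetical shape for the resume state file.
type MultiRunState struct {
	Completed []string `json:"completed"`
}

// PendingRuns returns the planned sub-runs that are not yet recorded
// as completed, preserving the planned order.
func PendingRuns(planned []string, stateJSON []byte) ([]string, error) {
	var state MultiRunState
	if err := json.Unmarshal(stateJSON, &state); err != nil {
		return nil, fmt.Errorf("parse state: %w", err)
	}
	done := make(map[string]bool, len(state.Completed))
	for _, id := range state.Completed {
		done[id] = true
	}
	var pending []string
	for _, id := range planned {
		if !done[id] {
			pending = append(pending, id)
		}
	}
	return pending, nil
}

func main() {
	// Hypothetical sub-run IDs; a real run would read the state from disk.
	planned := []string{"codex/run-1", "codex/run-2", "opencode/run-1"}
	state := []byte(`{"completed": ["codex/run-1"]}`)
	pending, err := PendingRuns(planned, state)
	if err != nil {
		panic(err)
	}
	fmt.Println("pending:", pending)
}
```

Keeping the planned order means a resumed multi-run continues in the same sequence as the original invocation.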
+
+Phase 5 (parallel runs with `--parallel-runs`) is deferred to a future release.
+
 ## Compatibility and comparing old runs

 `1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode: