- Extract evalRunSingle() from monolithic RunE for reusable eval execution
- Add --repeat N flag for statistical evaluation with aggregate stats
- Support comma-separated --agent/--model/--reasoning for quick A/B comparison
- Add 'sanity batch --config runs.toml' for complex multi-run configurations
- Add 'sanity compare' command for side-by-side result comparison
- Update Road to v2 documentation with v1.8.0 multi-run section
- Include multi-run implementation plan documentation
docs/ROAD-TO-V2-Overhaul.md (20 additions, 2 deletions)
@@ -1,6 +1,6 @@
# Road to v2.x.x Overhaul: What's Changed Since v1.6.1
-
If you want the short version: `1.7.x` is a fairness and reliability release. We reduced information leakage during eval, made infrastructure failures easier to reason about, and tightened task contracts where hidden requirements were too implicit.
+
If you want the short version: `1.7.x` is a fairness and reliability release. We reduced information leakage during eval, made infrastructure failures easier to reason about, and tightened task contracts where hidden requirements were too implicit. `1.8.x` adds multi-run orchestration for statistical rigor and cross-agent comparison.
## Why we changed anything
@@ -9,7 +9,7 @@ By the time we reached `v1.6.1`, we had enough evaluation history to see recurri
- transient agent or infrastructure problems were being mixed with normal failures,
- and a few tasks required behavior that was technically tested but not clearly stated.
-
`1.7.x` is the pass where we cleaned those up.
+
`1.7.x` is the pass where we cleaned those up. `1.8.x` builds on that foundation to make multi-run evaluation a first-class workflow.
## What changed in evaluation behavior
@@ -70,6 +70,24 @@ The `--use-mcp-tools` prompt was rewritten to be minimal and neutral. The previo
A new `mcp_prompt` config field was added to `AgentConfig`, allowing per-agent MCP tool guidance (e.g., telling Gemini to use `@web` search). This is appended under an `AGENT-SPECIFIC TOOLS:` header when `--use-mcp-tools` is set.
+
## What changed in v1.8.0 — multi-run orchestration
+
+
Single eval runs are useful for quick checks, but meaningful comparison requires repetition and side-by-side analysis. `1.8.0` adds four complementary features that share a common orchestration layer:
+
+
**`--repeat N` flag.** Run the same eval configuration N times to measure variance. Each repeat gets its own subdirectory under a multi-run parent, and the harness produces aggregate statistics (mean, stddev, min, max) for pass rate and weighted score across repeats.
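The aggregate statistics can be illustrated with a minimal Go sketch. The `Aggregate` type and `aggregate` function are hypothetical names, not the harness's actual code, and the use of the sample standard deviation (dividing by N-1) is an assumption; the harness may use the population form instead.

```go
package main

import (
	"fmt"
	"math"
)

// Aggregate holds summary statistics for one metric across repeats.
// Hypothetical type, introduced only for this illustration.
type Aggregate struct {
	Mean, StdDev, Min, Max float64
}

// aggregate computes mean, sample standard deviation, min, and max.
// It returns a zero Aggregate for an empty slice, and a StdDev of 0
// when there is only one value.
func aggregate(values []float64) Aggregate {
	if len(values) == 0 {
		return Aggregate{}
	}
	agg := Aggregate{Min: values[0], Max: values[0]}
	sum := 0.0
	for _, v := range values {
		sum += v
		agg.Min = math.Min(agg.Min, v)
		agg.Max = math.Max(agg.Max, v)
	}
	agg.Mean = sum / float64(len(values))
	if len(values) > 1 {
		sumSq := 0.0
		for _, v := range values {
			d := v - agg.Mean
			sumSq += d * d
		}
		// Assumption: sample standard deviation (N-1 in the denominator).
		agg.StdDev = math.Sqrt(sumSq / float64(len(values)-1))
	}
	return agg
}

func main() {
	// Made-up pass rates from three repeats of the same configuration.
	passRates := []float64{0.50, 0.75, 1.00}
	a := aggregate(passRates)
	fmt.Printf("mean=%.3f stddev=%.3f min=%.2f max=%.2f\n", a.Mean, a.StdDev, a.Min, a.Max)
}
```

The same function would apply unchanged to weighted scores, since both metrics are plain per-repeat numbers.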
+
+
**Comma-separated multi-agent.** Quick A/B comparison from the CLI without batch files. `--agent codex,opencode --model gpt-5.2,kimi-k2.5` broadcasts shared flags across agents and produces a multi-run directory with per-agent subdirectories and a comparison summary. The `--repeat` flag composes with multi-agent: `--agent codex,opencode --repeat 3` produces 6 total runs.
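As a rough sketch of how comma-separated specs could expand into individual runs, consider the Go snippet below. `RunSpec` and `expandRuns` are hypothetical names. The pairing rule is also an assumption: a single model is broadcast to every agent, while a list of the same length as the agent list is paired by position.

```go
package main

import (
	"fmt"
	"strings"
)

// RunSpec describes one sub-run. Hypothetical type for illustration.
type RunSpec struct {
	Agent  string
	Model  string
	Repeat int // 1-based repeat index
}

// expandRuns splits comma-separated agent and model lists and crosses
// the resulting agents with the repeat count.
// Assumption: one model is broadcast to all agents; otherwise models
// must match the agents one-to-one, by position.
func expandRuns(agents, models string, repeat int) ([]RunSpec, error) {
	agentList := strings.Split(agents, ",")
	modelList := strings.Split(models, ",")
	if len(modelList) != 1 && len(modelList) != len(agentList) {
		return nil, fmt.Errorf("got %d models for %d agents", len(modelList), len(agentList))
	}
	if repeat < 1 {
		repeat = 1
	}
	var runs []RunSpec
	for i, agent := range agentList {
		model := modelList[0]
		if len(modelList) > 1 {
			model = modelList[i]
		}
		for r := 1; r <= repeat; r++ {
			runs = append(runs, RunSpec{
				Agent:  strings.TrimSpace(agent),
				Model:  strings.TrimSpace(model),
				Repeat: r,
			})
		}
	}
	return runs, nil
}

func main() {
	runs, err := expandRuns("codex,opencode", "gpt-5.2,kimi-k2.5", 3)
	if err != nil {
		panic(err)
	}
	for _, r := range runs {
		fmt.Printf("%s %s repeat %d\n", r.Agent, r.Model, r.Repeat)
	}
}
```

With two agents and `repeat` set to 3, the sketch yields six `RunSpec` values, matching the run count described above.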
+
+
**`sanity batch --config runs.toml`.** For complex multi-run configurations with per-run overrides, shared defaults, and repeat support. The batch file is TOML with a `[shared]` section for defaults and `[[runs]]` entries for per-run overrides.
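The relationship between `[shared]` defaults and `[[runs]]` overrides can be sketched in Go as a simple merge. The field names (`Agent`, `Model`, `Repeat`) are assumptions chosen to mirror the CLI flags, not the actual batch schema, and TOML parsing is omitted because the Go standard library has no TOML decoder.

```go
package main

import "fmt"

// RunConfig holds per-run settings. Hypothetical fields; a zero value
// means "not set" and falls back to the shared default.
type RunConfig struct {
	Agent  string
	Model  string
	Repeat int
}

// BatchConfig mirrors the [shared] + [[runs]] layout of the batch file.
type BatchConfig struct {
	Shared RunConfig
	Runs   []RunConfig
}

// resolve returns one fully populated RunConfig per [[runs]] entry,
// starting from the shared defaults and applying per-run overrides.
func resolve(b BatchConfig) []RunConfig {
	resolved := make([]RunConfig, 0, len(b.Runs))
	for _, run := range b.Runs {
		merged := b.Shared // copy the defaults
		if run.Agent != "" {
			merged.Agent = run.Agent
		}
		if run.Model != "" {
			merged.Model = run.Model
		}
		if run.Repeat != 0 {
			merged.Repeat = run.Repeat
		}
		resolved = append(resolved, merged)
	}
	return resolved
}

func main() {
	batch := BatchConfig{
		Shared: RunConfig{Model: "gpt-5.2", Repeat: 3},
		Runs: []RunConfig{
			{Agent: "codex"},                                  // inherits model and repeat
			{Agent: "opencode", Model: "kimi-k2.5", Repeat: 1}, // overrides both
		},
	}
	for _, rc := range resolve(batch) {
		fmt.Printf("%+v\n", rc)
	}
}
```

Using zero values as "not set" keeps the sketch short; a real schema that needs to distinguish an explicit zero from an absent key would likely use pointer fields instead.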
+
+
**`sanity compare <dir1> <dir2> ...`.** Load summaries from multiple eval result directories and produce a side-by-side comparison table. Works with any combination of single-run and multi-run directories.
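A side-by-side table of this kind could be rendered with the standard `text/tabwriter` package, as in the sketch below. The `Summary` type, its fields, and the column headings are assumptions for illustration; the actual summary format and table layout may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// Summary is a hypothetical, already-loaded result summary for one
// eval result directory.
type Summary struct {
	Name          string
	PassRate      float64 // fraction in [0, 1]
	WeightedScore float64
}

// renderComparison formats the summaries as an aligned text table,
// one row per result directory.
func renderComparison(summaries []Summary) string {
	var sb strings.Builder
	w := tabwriter.NewWriter(&sb, 0, 4, 2, ' ', 0)
	fmt.Fprintln(w, "RUN\tPASS RATE\tWEIGHTED SCORE")
	for _, s := range summaries {
		fmt.Fprintf(w, "%s\t%.1f%%\t%.2f\n", s.Name, s.PassRate*100, s.WeightedScore)
	}
	w.Flush() // tabwriter buffers until Flush so it can align columns
	return sb.String()
}

func main() {
	// Made-up numbers, for layout only.
	fmt.Print(renderComparison([]Summary{
		{Name: "codex", PassRate: 0.80, WeightedScore: 41.5},
		{Name: "opencode", PassRate: 0.65, WeightedScore: 33.0},
	}))
}
```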
+
+
The internal change that enables all of this is the extraction of the monolithic `evalCmd.RunE` into a reusable `evalRunSingle()` function. The top-level `RunE` is now a thin orchestration wrapper that parses multi-agent specs, manages the multi-run directory structure, and delegates individual runs.
+
+
Multi-run directories use a `multi-run.json` config and `multi-run-state.json` for resume support. Interrupted multi-runs can be resumed with `--resume`, which detects completed sub-runs and continues from where it left off.
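The resume step amounts to filtering the planned sub-runs against the recorded state. The Go sketch below assumes a hypothetical state schema with a single `completed` list of sub-run names; the real `multi-run-state.json` layout is not shown in this document and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MultiRunState is a hypothetical schema for multi-run-state.json.
type MultiRunState struct {
	Completed []string `json:"completed"`
}

// pendingRuns returns the planned sub-runs that are not yet recorded
// as completed, preserving the planned order.
func pendingRuns(planned []string, stateJSON []byte) ([]string, error) {
	var state MultiRunState
	if err := json.Unmarshal(stateJSON, &state); err != nil {
		return nil, fmt.Errorf("parse state: %w", err)
	}
	done := make(map[string]bool, len(state.Completed))
	for _, name := range state.Completed {
		done[name] = true
	}
	var pending []string
	for _, name := range planned {
		if !done[name] {
			pending = append(pending, name)
		}
	}
	return pending, nil
}

func main() {
	planned := []string{"codex-1", "codex-2", "opencode-1"} // made-up sub-run names
	state := []byte(`{"completed": ["codex-1"]}`)
	pending, err := pendingRuns(planned, state)
	if err != nil {
		panic(err)
	}
	fmt.Println(pending)
}
```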
+
+
Phase 5 (parallel runs with `--parallel-runs`) is deferred to a future release.
+
## Compatibility and comparing old runs
`1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode: