docs/ROAD-TO-V2-Overhaul.md (46 additions & 2 deletions)
@@ -84,7 +84,7 @@ Single eval runs are useful for quick checks, but meaningful comparison requires
The internal change that enables all of this is the extraction of the monolithic `evalCmd.RunE` into a reusable `evalRunSingle()` function. The top-level `RunE` is now a thin orchestration wrapper that parses multi-agent specs, manages the multi-run directory structure, and delegates individual runs.
-
Multi-run directories use a `multi-run.json` config and `multi-run-state.json` for resume support. Interrupted multi-runs can be resumed with `--resume`, which detects completed sub-runs and continues from where it left off.
+
Multi-run directories use a `multi-run-config.json` config and `multi-run-state.json` for resume support. Interrupted multi-runs can be resumed with `--resume`, which detects completed sub-runs and continues from where it left off.
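
As a rough illustration of the resume behavior described above, the sketch below selects the sub-runs that still need to execute. The real schema of `multi-run-state.json` is not shown in this document, so the `sub_runs` list, its `id` and `status` fields, and the `pending_sub_runs` helper are all hypothetical.

```python
import json


def pending_sub_runs(state_text: str) -> list[str]:
    """Return IDs of sub-runs that are not yet completed, in file order."""
    state = json.loads(state_text)
    # Hypothetical schema: {"sub_runs": [{"id": ..., "status": ...}, ...]}
    return [
        run["id"]
        for run in state.get("sub_runs", [])
        if run.get("status") != "completed"
    ]


# Example state for an interrupted multi-run (made-up IDs and statuses).
example_state = json.dumps({
    "sub_runs": [
        {"id": "agent-a/run-1", "status": "completed"},
        {"id": "agent-a/run-2", "status": "interrupted"},
        {"id": "agent-b/run-1", "status": "pending"},
    ]
})

print(pending_sub_runs(example_state))
```

A resume pass would then re-dispatch only the returned IDs and leave completed sub-run directories untouched.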
Phase 5 (parallel runs with `--parallel-runs`) is deferred to a future release.
@@ -100,6 +100,40 @@ Operationally, this means ARM users get deterministic behavior:
- either a matching image runs normally,
- or the harness exits with an actionable platform mismatch message telling you to publish/build the needed architecture or override image config.
+
## What changed in v1.8.0-alpha.4 — output contracts and auditability
+
This pre-release tightened output invariants so resume/verification tooling and downstream parsers can rely on stable files even in edge cases.
+
- `agent.log` now gets a deterministic `HARNESS:` timeout footer when the agent times out.
+
- `validation.log` is now always written and always ends with a `HARNESS:` footer (`command`, `exit_code`, `duration_seconds`, `timed_out`, optional `run_error`), even when raw validator output is empty.
+
- `run-config.json`, `summary.json`, and `submission.json` now emit key boolean/counter fields explicitly even when false/zero (for example `use_mcp_tools`, `disable_mcp`, `no_sandbox`, `legacy`, and retry counters). This avoids schema drift across runs.
+
- Validation commands were normalized for several tasks so task contracts, execution behavior, and log metadata stay aligned.
+
The net effect is better auditability and fewer ambiguous "empty log" outcomes.
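
To show how a downstream parser might lean on the `validation.log` footer contract, here is a minimal sketch. The field names come from the list above, but the on-disk layout (one `HARNESS: key=value` pair per line) and both helper functions are assumptions made for illustration, not the harness's documented format.

```python
# Footer fields named in the text; `run_error` is optional and not required here.
REQUIRED_FIELDS = ("command", "exit_code", "duration_seconds", "timed_out")


def parse_harness_footer(log_text: str) -> dict[str, str]:
    """Collect key=value pairs from lines that start with 'HARNESS:'."""
    fields: dict[str, str] = {}
    for line in log_text.splitlines():
        if not line.startswith("HARNESS:"):
            continue  # raw validator output, not footer metadata
        body = line[len("HARNESS:"):].strip()
        key, sep, value = body.partition("=")
        if sep:  # ignore HARNESS lines that carry no key=value pair
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(fields: dict[str, str]) -> list[str]:
    """List required footer fields that are absent."""
    return [name for name in REQUIRED_FIELDS if name not in fields]


# Assumed layout: empty validator output followed by the footer.
sample_log = "\n".join([
    "HARNESS: command=go test -run=TestFoo ./...",
    "HARNESS: exit_code=1",
    "HARNESS: duration_seconds=12.5",
    "HARNESS: timed_out=false",
])

footer = parse_harness_footer(sample_log)
print(footer["exit_code"], missing_fields(footer))
```

Because the footer is always written, a consumer can treat a log with missing required fields as a contract violation rather than guessing at what an empty file means.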
+
## What changed in v1.8.0-alpha.5 and alpha.6 — sandbox hardening
+
The sandbox model moved from a narrow writable-dir override to an explicit compatibility allowlist plus stricter masking.
+
- Added `[sandbox] shared_readwrite_dirs` and `[sandbox] shared_readonly_dirs` to configuration, with broad defaults covering common auth/cache/toolchain locations.
+
- Bubblewrap setup now masks non-allowlisted top-level directories under `$HOME`, then bind-mounts allowlisted paths with explicit read/write or read-only intent.
+
- Denylist masking was expanded for sensitive host paths (including `tasks`, `eval-results`, and `sessions`), with extra entries configurable via `[sandbox] readable_denylist`.
+
- Fixed a symlink canonicalization regression in denylist handling so masked paths remain masked even when traversed through symlinked parents.
+
This keeps agents functional (expected config/cache access still works) while reducing exposure to unrelated host data.
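
The allowlist-plus-masking policy can be pictured with a small decision function. This is only a sketch under stated assumptions: the helper name, the rule that the denylist wins over both allowlists, and treating every entry as relative to `$HOME` are illustrative choices, and it works on pure paths, so it does not model the symlink canonicalization the real sandbox performs.

```python
from pathlib import PurePosixPath


def classify_home_path(path, home, readwrite, readonly, denylist):
    """Return 'masked', 'readwrite', or 'readonly' for a path under home."""
    target = PurePosixPath(path)
    root = PurePosixPath(home)

    def within(entry: str) -> bool:
        # True when target is the entry itself or lives somewhere below it.
        base = root / entry
        return target == base or base in target.parents

    if any(within(entry) for entry in denylist):
        return "masked"  # assumption: denylist takes precedence
    if any(within(entry) for entry in readwrite):
        return "readwrite"
    if any(within(entry) for entry in readonly):
        return "readonly"
    return "masked"  # anything not allowlisted stays hidden


# Hypothetical configuration values, not the shipped defaults.
HOME = "/home/dev"
RW = [".cache"]
RO = [".config"]
DENY = ["tasks", "eval-results", "sessions"]

for p in ("/home/dev/.cache/pip/wheel", "/home/dev/Documents/notes.txt"):
    print(p, classify_home_path(p, HOME, RW, RO, DENY))
```

Defaulting to `masked` mirrors the shift the text describes: access is granted only by an explicit entry, never by omission.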
+
## What changed after v1.8.0-alpha.6 — failure taxonomy, telemetry, and resumable externals
+
Recent `1.8.x` work focused on separating model failures from provider/auth/infra failures and making that visible in outputs.
- External failures (`auth`, `quota_exhausted`, `infra`) are now treated as resumable skips: they are excluded from pass/fail scoring for that run, task artifacts are cleaned, and the harness prints a `--resume` command to retry only those tasks.
+
- Added early-stop protection when quota exhaustion repeats: after 5 consecutive `quota_exhausted` outcomes, eval stops and preserves resumable state instead of burning more requests.
+
- Summary/submission/report outputs now include richer counters (`auth_affected_tasks`, `infra_affected_tasks`, `total_infra_retries`, plus quota counters) and report-level failure-class breakdown tables.
+
- Added behavior telemetry from agent logs (self-test command count, toolchain install attempts, out-of-workspace read attempts) with confidence flags so analysis can distinguish strict parsing from fallback heuristics.
+
The practical outcome is cleaner benchmarking: weighted scores reflect task-solving ability, while external instability is tracked separately and recoverably.
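
The scoring and early-stop rules above can be sketched as a single pass over task outcomes. The three external class names and the threshold of 5 come from the text; the `pass`/`fail` labels, the `run_eval` helper, and the use of unweighted counts instead of the harness's weighted scores are simplifying assumptions.

```python
EXTERNAL_CLASSES = {"auth", "quota_exhausted", "infra"}
QUOTA_STOP_THRESHOLD = 5  # consecutive quota_exhausted outcomes before stopping


def run_eval(outcomes):
    """Walk (task, outcome) pairs.

    Returns (passed, scored, resumable_tasks, stopped_early).
    """
    passed = 0
    scored = 0
    resumable = []
    quota_streak = 0
    for task, outcome in outcomes:
        if outcome in EXTERNAL_CLASSES:
            # External failures are skipped for scoring and kept for --resume.
            resumable.append(task)
            if outcome == "quota_exhausted":
                quota_streak += 1
            else:
                quota_streak = 0
            if quota_streak >= QUOTA_STOP_THRESHOLD:
                return passed, scored, resumable, True
            continue
        quota_streak = 0
        scored += 1
        if outcome == "pass":
            passed += 1
    return passed, scored, resumable, False


mixed = [("t1", "pass"), ("t2", "auth"), ("t3", "fail"), ("t4", "pass")]
print(run_eval(mixed))

exhausted = [("a", "pass")] + [(f"q{i}", "quota_exhausted") for i in range(1, 7)]
print(run_eval(exhausted))
```

In the second example the loop stops at the fifth consecutive quota failure, so the sixth task is never attempted and no further requests are spent.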
## Compatibility and comparing old runs
`1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode: