Commit c962f0a

docs: Add v1.8.x release notes covering output contracts, sandbox hardening, failure taxonomy, and telemetry to ROAD-TO-V2-Overhaul.md.
1 parent d027b85; commit c962f0a

1 file changed: +46 -2 lines changed
docs/ROAD-TO-V2-Overhaul.md

Lines changed: 46 additions & 2 deletions
@@ -84,7 +84,7 @@ Single eval runs are useful for quick checks, but meaningful comparison requires
8484

8585
The internal change that enables all of this is the extraction of the monolithic `evalCmd.RunE` into a reusable `evalRunSingle()` function. The top-level `RunE` is now a thin orchestration wrapper that parses multi-agent specs, manages the multi-run directory structure, and delegates individual runs.
8686

87-
Multi-run directories use a `multi-run.json` config and `multi-run-state.json` for resume support. Interrupted multi-runs can be resumed with `--resume`, which detects completed sub-runs and continues from where it left off.
87+
Multi-run directories use a `multi-run-config.json` config and `multi-run-state.json` for resume support. Interrupted multi-runs can be resumed with `--resume`, which detects completed sub-runs and continues from where it left off.
8888

8989
Phase 5 (parallel runs with `--parallel-runs`) is deferred to a future release.
9090

@@ -100,6 +100,40 @@ Operationally, this means ARM users get deterministic behavior:
100100
- either a matching image runs normally,
101101
- or the harness exits with an actionable platform mismatch message telling you to publish/build the needed architecture or override image config.
102102

103+
## What changed in v1.8.0-alpha.4 — output contracts and auditability
104+
105+
This pre-release tightened output invariants so resume/verification tooling and downstream parsers can rely on stable files even in edge cases.
106+
107+
- `agent.log` now gets a deterministic `HARNESS:` timeout footer when the agent times out.
108+
- `validation.log` is now always written and always ends with a `HARNESS:` footer (`command`, `exit_code`, `duration_seconds`, `timed_out`, optional `run_error`), even when raw validator output is empty.
109+
- `run-config.json`, `summary.json`, and `submission.json` now emit key boolean/counter fields explicitly even when false/zero (for example `use_mcp_tools`, `disable_mcp`, `no_sandbox`, `legacy`, and retry counters). This avoids schema drift across runs.
110+
- Validation commands were normalized for several tasks so task contracts, execution behavior, and log metadata stay aligned.
111+
112+
The net effect is better auditability and fewer ambiguous "empty log" outcomes.
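
One way to picture the "emit fields explicitly" change is Go's `encoding/json` behavior: a struct field tagged with `omitempty` disappears from the output when it is false or zero, while an untagged field is always written. The sketch below is only an illustration under that assumption. The JSON key names come from the notes above, but the struct types and the `encode` helper are hypothetical and are not taken from the harness source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// explicitFlags is a hypothetical subset of run-config.json. Without
// `omitempty`, encoding/json writes every field, including false values.
type explicitFlags struct {
	UseMCPTools bool `json:"use_mcp_tools"`
	DisableMCP  bool `json:"disable_mcp"`
	NoSandbox   bool `json:"no_sandbox"`
	Legacy      bool `json:"legacy"`
}

// driftingFlags shows the drift-prone alternative: `omitempty` drops
// false values, so the set of keys can differ from run to run.
type driftingFlags struct {
	UseMCPTools bool `json:"use_mcp_tools,omitempty"`
	DisableMCP  bool `json:"disable_mcp,omitempty"`
	NoSandbox   bool `json:"no_sandbox,omitempty"`
	Legacy      bool `json:"legacy,omitempty"`
}

// encode marshals v to a JSON string.
func encode(v interface{}) (string, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	explicit, _ := encode(explicitFlags{})
	drifting, _ := encode(driftingFlags{})
	fmt.Println(explicit) // all four keys present, each false
	fmt.Println(drifting) // {}
}
```

A parser that expects `no_sandbox` to exist can read the first form unconditionally; with the second form it has to treat a missing key as false.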
113+
114+
## What changed in v1.8.0-alpha.5 and alpha.6 — sandbox hardening
115+
116+
The sandbox model moved from a narrow writable-dir override to an explicit compatibility allowlist plus stricter masking.
117+
118+
- Added `[sandbox] shared_readwrite_dirs` and `[sandbox] shared_readonly_dirs` to configuration, with broad defaults covering common auth/cache/toolchain locations.
119+
- Bubblewrap setup now masks non-allowlisted top-level directories under `$HOME`, then bind-mounts allowlisted paths with explicit read/write or read-only intent.
120+
- Denylist masking was expanded for sensitive host paths (including `tasks`, `eval-results`, and `sessions`), with extra entries configurable via `[sandbox] readable_denylist`.
121+
- Fixed a symlink canonicalization regression in denylist handling so masked paths remain masked even when traversed through symlinked parents.
122+
123+
This keeps agents functional (expected config/cache access still works) while reducing exposure to unrelated host data.
124+
125+
## What changed after v1.8.0-alpha.6 — failure taxonomy, telemetry, and resumable externals
126+
127+
Recent `1.8.x` work focused on separating model failures from provider/auth/infra failures and making that visible in outputs.
128+
129+
- Added `FailureClass` to per-task results: `none`, `quota_recoverable`, `quota_exhausted`, `auth`, `infra`, `integrity`, `validation_error`, `validation_timeout`.
130+
- External failures (`auth`, `quota_exhausted`, `infra`) are now treated as resumable skips: they are excluded from pass/fail scoring for that run, task artifacts are cleaned, and the harness prints a `--resume` command to retry only those tasks.
131+
- Added early-stop protection when quota exhaustion repeats: after 5 consecutive `quota_exhausted` outcomes, eval stops and preserves resumable state instead of burning more requests.
132+
- Summary/submission/report outputs now include richer counters (`auth_affected_tasks`, `infra_affected_tasks`, `total_infra_retries`, plus quota counters) and report-level failure-class breakdown tables.
133+
- Added behavior telemetry from agent logs (self-test command count, toolchain install attempts, out-of-workspace read attempts) with confidence flags so analysis can distinguish strict parsing from fallback heuristics.
134+
135+
The practical outcome is cleaner benchmarking: weighted scores reflect task-solving ability, while external instability is tracked separately and recoverably.
136+
103137
## Compatibility and comparing old runs
104138

105139
`1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode:
@@ -116,10 +150,20 @@ This document covers:
116150
- baseline: `v1.6.1`
117151
- through: current `HEAD` on `main`
118152

119-
Main commits in range:
153+
Notable commits in range (chronological):
120154
- `a3a2758`
121155
- `c69c19e`
122156
- `5e972ea`
123157
- `f192caf`
124158
- `3567905`
125159
- `ba25c9a`
160+
- `a17ac5c`
161+
- `9c0e0b4`
162+
- `2e63c8a`
163+
- `0dab2c5`
164+
- `6a67453`
165+
- `7e2d943`
166+
- `6e6993f`
167+
- `8b8f55d`
168+
- `6f61a69`
169+
- `d027b85`
