| 1 | +# Road to v2.x.x Overhaul: What's Changed Since v1.6.1 |
| 2 | + |
| 3 | +If you want the short version: `1.7.x` is a fairness and reliability release. We reduced information leakage during eval, made infrastructure failures easier to reason about, and tightened task contracts where hidden requirements were too implicit. |
| 4 | + |
| 5 | +## Why we changed anything |
| 6 | + |
| 7 | +By the time we reached `v1.6.1`, we had enough evaluation history to see recurring issues: |
| 8 | +- some outcomes were influenced by workspace visibility rather than pure task solving, |
| 9 | +- transient agent or infrastructure problems were being mixed with normal failures, |
| 10 | +- and a few tasks required behavior that was technically tested but not clearly stated. |
| 11 | + |
| 12 | +`1.7.x` is the pass where we cleaned those up. |
| 13 | + |
| 14 | +## What changed in evaluation behavior |
| 15 | + |
| 16 | +The biggest shift is workspace isolation during `sanity eval`. |
| 17 | + |
| 18 | +Agents now run in isolated temporary workspaces under `/tmp`, and then the harness copies the resulting code back into `eval-results` for validation. In practical terms, agents cannot inspect sibling tasks, prior eval outputs, or their own running log stream while solving. |
| 19 | + |
| 20 | +This is the key fairness change in `v1.7.0`. It removes a class of accidental side-channel advantages and makes comparisons more defensible. |
| 21 | + |
| 22 | +We also moved `agent.log` placement so it lives in task output directories (`eval-results/.../<task>/agent.log`) rather than inside the active workspace. |
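
The isolation and log placement described above can be sketched roughly as follows. This is an illustrative sketch, not the harness's actual code: `run_isolated`, the `sanity-eval-` prefix, and the `workspace` subdirectory name are all assumptions made for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_isolated(task_dir: Path, out_dir: Path, agent_cmd: list[str]) -> int:
    """Run agent_cmd in a throwaway copy of task_dir and return its exit code.

    Hypothetical helper: names and layout are illustrative assumptions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory uses the system temp location (typically /tmp on Linux).
    with tempfile.TemporaryDirectory(prefix="sanity-eval-") as tmp:
        workspace = Path(tmp) / task_dir.name
        # The agent only ever sees a copy of this one task.
        shutil.copytree(task_dir, workspace)
        # The log is written to the output directory, outside the workspace,
        # so the agent cannot read its own log stream while solving.
        with open(out_dir / "agent.log", "wb") as log:
            proc = subprocess.run(
                agent_cmd, cwd=workspace, stdout=log, stderr=subprocess.STDOUT
            )
        # Copy the resulting code back so it can be validated afterwards.
        shutil.copytree(workspace, out_dir / "workspace", dirs_exist_ok=True)
    return proc.returncode
```

Because the agent's working directory is the temporary copy, sibling tasks and earlier results are simply not on any path it was given; the original task directory is never modified.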
| 23 | + |
| 24 | +## What changed in failure handling |
| 25 | + |
| 26 | +Another major improvement is how we handle infra-style failures. |
| 27 | + |
| 28 | +Previously, empty or broken agent runs could look like regular task failures. In `1.7.x` we detect these cases explicitly, retry them, and keep the resume flow clear. As a result, pass/fail numbers better reflect model behavior rather than execution flakiness. |
| 29 | + |
| 30 | +Resume messaging and bookkeeping were also improved so interrupted or infra-affected runs are easier to continue safely. |
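
One way to picture the separation between infrastructure failures and ordinary task failures is the sketch below. The `AgentRun` fields, the `classify` heuristic (non-zero exit or no changed files counts as an infra error), and the retry count are assumptions for illustration; the real detection rules may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"            # agent produced work, but tests rejected it
    INFRA_ERROR = "infra_error"  # agent crashed or produced nothing


@dataclass
class AgentRun:
    exit_code: int
    files_changed: int
    tests_passed: bool


def classify(run: AgentRun) -> Outcome:
    # Assumed heuristic: a crash or an empty result says nothing about
    # the model's ability, so it is not counted as a task failure.
    if run.exit_code != 0 or run.files_changed == 0:
        return Outcome.INFRA_ERROR
    return Outcome.PASSED if run.tests_passed else Outcome.FAILED


def run_with_retry(attempt: Callable[[], AgentRun], max_retries: int = 2) -> Outcome:
    """Retry only infra errors; real passes and failures are final."""
    outcome = Outcome.INFRA_ERROR
    for _ in range(max_retries + 1):
        outcome = classify(attempt())
        if outcome is not Outcome.INFRA_ERROR:
            break
    return outcome
```

Keeping `INFRA_ERROR` as its own outcome, rather than folding it into `FAILED`, is what lets a resumed run pick up only the affected tasks.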
| 31 | + |
| 32 | +## What changed in prompts and test stability |
| 33 | + |
| 34 | +We updated eval prompts to include explicit toolchain versions. That was a practical fix for frequent version mismatch mistakes, especially in ecosystems with fast API churn. |
| 35 | + |
| 36 | +We also removed machine-dependent wall-clock assertions from hidden tests in `rust/regex-lite`. The task is still challenging, but now runtime pressure is enforced by harness/container timeouts instead of host-specific timing thresholds. |
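
To illustrate the difference, here is a minimal sketch of enforcing runtime from the outside with a timeout instead of asserting on elapsed time inside a test. The function name `run_tests` and the string results are assumptions for the example, not the harness's interface.

```python
import subprocess


def run_tests(cmd: list[str], timeout_s: float) -> str:
    """Run a test command and report 'pass', 'fail', or 'timeout'.

    Hypothetical helper: the budget is set by the caller (harness or
    container), not by a timing threshold hard-coded in the test itself.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child process is killed once the budget is exceeded.
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"
```

A slow solution still fails, but the limit is a single generous budget chosen by the runner rather than a per-assertion threshold that varies with host speed.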
| 37 | + |
| 38 | +## What changed in task specs |
| 39 | + |
| 40 | +Several low-pass tasks were failing for reasons that looked more like under-specified requirements than true capability gaps. |
| 41 | +For those tasks, we tightened textual contracts in stubs and comments so hidden expectations are inferable without giving away implementation details. |
| 42 | + |
| 43 | +Tasks updated in this pass: |
| 44 | +- `tasks/typescript/promise-pool/promise_pool.ts` |
| 45 | +- `tasks/dart/future-pool/lib/future_pool.dart` |
| 46 | +- `tasks/typescript/csv-lite/csv.ts` |
| 47 | +- `tasks/rust/macros/lib.rs` |
| 48 | +- `tasks/kotlin/channel-multiplexer/src/main/kotlin/ChannelMultiplexer.kt` |
| 49 | +- `tasks/dart/reactive-cache/lib/reactive_cache.dart` |
| 50 | +- `tasks/dart/isolate-pool/lib/isolate_pool.dart` |
| 51 | +- `tasks/kotlin/flow-processor/src/main/kotlin/FlowProcessor.kt` |
| 52 | +- `tasks/zig/arena-allocator/arena.zig` |
| 53 | +- `tasks/zig/comptime-json/json.zig` |
| 54 | + |
| 55 | +The intent here was high-signal evaluation: remove "mind-reading" requirements without turning tasks into copy-paste exercises. |
| 56 | + |
| 57 | +## Compatibility and comparing old runs |
| 58 | + |
| 59 | +`1.7.x` intentionally behaves differently from `v1.6.1`. If you are comparing against historical leaderboard-era runs, use legacy mode: |
| 60 | + |
| 61 | +```bash |
| 62 | +./sanity eval --legacy |
| 63 | +``` |
| 64 | + |
| 65 | +Use default mode for current evaluations. Use legacy mode only when you need apples-to-apples historical comparison. |
| 66 | + |
| 67 | +## Commit range |
| 68 | + |
| 69 | +This document covers: |
| 70 | +- baseline: `v1.6.1` |
| 71 | +- through: current `HEAD` on `main` |
| 72 | + |
| 73 | +Main commits in range: |
| 74 | +- `a3a2758` |
| 75 | +- `c69c19e` |
| 76 | +- `5e972ea` |
| 77 | +- `f192caf` |
| 78 | +- `3567905` |
| 79 | +- `ba25c9a` |