Commit e95f236

docs: introduce ROAD-TO-V2-Overhaul documentation, update README to link it, and remove a redundant variable capture in eval_prompt_test.go.
1 parent ba25c9a commit e95f236

3 files changed, 84 additions and 1 deletion

README.md

Lines changed: 5 additions & 0 deletions
@@ -23,6 +23,7 @@ A lightweight evaluation harness for coding agents that runs high-signal, compac
 - [How It Works](#how-it-works)
 - [Output](#output)
 - [Architecture](#architecture)
+- [Version History](#version-history)
 - [Contributing](#contributing)
 - [License](#license)
 
@@ -280,6 +281,10 @@ sanityharness/
 
 See [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md) for architecture details.
 
+## Version History
+
+For a full summary of all changes since `v1.6.1` (the entire `1.7.x` line and further), see [docs/ROAD-TO-V2-Overhaul.md](docs/ROAD-TO-V2-Overhaul.md).
+
 ## Contributing
 
 Contributions are welcome! Please see [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md) for guidelines.

docs/ROAD-TO-V2-Overhaul.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# Road to v2.x.x Overhaul: What's Changed Since v1.6.1

If you want the short version: `1.7.x` is a fairness and reliability release. We reduced information leakage during eval, made infrastructure failures easier to reason about, and tightened task contracts where hidden requirements were too implicit.

## Why we changed anything

By the time we reached `v1.6.1`, we had enough evaluation history to see recurring issues:
- some outcomes were influenced by workspace visibility rather than pure task solving,
- transient agent or infrastructure problems were being mixed with normal failures,
- and a few tasks required behavior that was technically tested but not clearly stated.

`1.7.x` is the pass where we cleaned those up.

## What changed in evaluation behavior

The biggest shift is workspace isolation during `sanity eval`.

Agents now run in isolated temporary workspaces under `/tmp`, and then the harness copies the resulting code back into `eval-results` for validation. In practical terms, agents cannot inspect sibling tasks, prior eval outputs, or their own running log stream while solving.

This is the key fairness change in `v1.7.0`. It removes a class of accidental side-channel advantages and makes comparisons more defensible.

We also moved `agent.log` placement so it lives in task output directories (`eval-results/.../<task>/agent.log`) rather than inside the active workspace.

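The isolate-then-copy-back flow described above can be sketched roughly as follows. This is an illustration, not the harness's actual code: `runIsolated`, `copyTree`, `demo`, and the stand-in agent are hypothetical names, and skipping the copy-back when the agent errors is an assumption made for brevity. `os.MkdirTemp("", ...)` creates the directory under the system temp location, which is usually `/tmp` on Linux.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// copyTree copies directories and regular files from src into dst,
// preserving relative paths. Symlinks and special files are skipped.
func copyTree(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		if !d.Type().IsRegular() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		return os.WriteFile(target, data, 0o644)
	})
}

// runIsolated (hypothetical) stages taskDir into a fresh temp workspace,
// runs the agent there, and copies the result into outDir for validation.
// The agent only ever receives the temp path, so it cannot see sibling
// tasks or earlier results.
func runIsolated(taskDir, outDir string, agent func(workspace string) error) error {
	workspace, err := os.MkdirTemp("", "sanity-eval-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(workspace)

	if err := copyTree(taskDir, workspace); err != nil {
		return err
	}
	if err := agent(workspace); err != nil {
		return err // assumption: no copy-back when the agent itself fails
	}
	return copyTree(workspace, outDir)
}

// demo runs a stand-in agent and returns the file names copied back.
func demo() ([]string, error) {
	taskDir, err := os.MkdirTemp("", "task-*")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(taskDir)
	outDir, err := os.MkdirTemp("", "eval-results-*")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(outDir)

	if err := os.WriteFile(filepath.Join(taskDir, "stub.txt"), []byte("TODO"), 0o644); err != nil {
		return nil, err
	}
	agent := func(workspace string) error {
		return os.WriteFile(filepath.Join(workspace, "solution.txt"), []byte("done"), 0o644)
	}
	if err := runIsolated(taskDir, outDir, agent); err != nil {
		return nil, err
	}
	entries, err := os.ReadDir(outDir) // sorted by file name
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

func main() {
	names, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Keeping logs out of the workspace follows the same idea: anything the agent should not read is written only to the output directory, never to the temp path it runs in.
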
## What changed in failure handling

Another major improvement is how we handle infra-style failures.

Previously, empty or broken agent runs could look like regular task failures. In `1.7.x`, we detect these cases explicitly, retry appropriately, and keep the resume flow clear. As a result, pass/fail numbers better reflect model behavior rather than random execution flakiness.

Resume messaging and bookkeeping were also improved so interrupted or infra-affected runs are easier to continue safely.

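One way to keep infrastructure failures out of the pass/fail numbers is to classify each attempt before scoring it. The sketch below is a hedged illustration only: the `Outcome` values, the `AgentResult` fields, and the "no changed files means an empty run" heuristic are assumptions, not the harness's real types or detection rules.

```go
package main

import "fmt"

// Outcome separates real task results from infrastructure problems.
type Outcome int

const (
	Passed Outcome = iota
	Failed
	InfraFailure
)

// AgentResult is a hypothetical summary of one agent run.
type AgentResult struct {
	ChangedFiles int   // assumption: zero changed files marks an empty run
	Err          error // non-nil when the agent process itself broke
}

// classify treats broken or empty runs as infrastructure failures and only
// consults the test verdict when the agent actually produced something.
func classify(r AgentResult, testsPassed bool) Outcome {
	if r.Err != nil || r.ChangedFiles == 0 {
		return InfraFailure
	}
	if testsPassed {
		return Passed
	}
	return Failed
}

// runWithRetry retries only infrastructure failures, up to maxAttempts.
// It returns the final outcome and the number of attempts used.
func runWithRetry(maxAttempts int, attempt func(n int) (AgentResult, bool)) (Outcome, int) {
	outcome := InfraFailure
	for n := 1; n <= maxAttempts; n++ {
		r, ok := attempt(n)
		outcome = classify(r, ok)
		if outcome != InfraFailure {
			return outcome, n
		}
	}
	return outcome, maxAttempts
}

func main() {
	// The first attempt produces nothing; the second one solves the task.
	outcome, attempts := runWithRetry(3, func(n int) (AgentResult, bool) {
		if n == 1 {
			return AgentResult{}, false
		}
		return AgentResult{ChangedFiles: 2}, true
	})
	fmt.Println(outcome == Passed, attempts) // prints: true 2
}
```

A genuine `Failed` result is never retried here, so retries cannot inflate pass rates; only runs that produced no usable signal get another attempt.
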
## What changed in prompts and test stability

We updated eval prompts to include explicit toolchain versions. That was a practical fix for frequent version mismatch mistakes, especially in ecosystems with fast API churn.

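As a rough illustration of what "explicit toolchain versions" in a prompt can look like, here is a minimal sketch. `buildPrompt`, the prompt wording, and the `1.2.3` version strings are all placeholders invented for this example; the real prompt builder and its output format are not shown in this commit.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildPrompt (hypothetical) appends a sorted list of toolchain versions to
// the task description so the agent does not have to guess which compiler
// or SDK release the hidden tests run against.
func buildPrompt(task string, toolchain map[string]string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Task: %s\n", task)
	if len(toolchain) == 0 {
		return b.String()
	}
	b.WriteString("Toolchain versions:\n")
	names := make([]string, 0, len(toolchain))
	for name := range toolchain {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable prompts
	for _, name := range names {
		fmt.Fprintf(&b, "- %s %s\n", name, toolchain[name])
	}
	return b.String()
}

func main() {
	// Version numbers below are placeholders, not the harness's real pins.
	fmt.Print(buildPrompt("rust/regex-lite", map[string]string{
		"rustc": "1.2.3",
		"cargo": "1.2.3",
	}))
}
```

Sorting the tool names keeps the prompt byte-for-byte stable across runs, which matters when prompts are compared or cached.
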
We also removed machine-dependent wall-clock assertions from hidden tests in `rust/regex-lite`. The task is still challenging, but now runtime pressure is enforced by harness/container timeouts instead of host-specific timing thresholds.

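The difference between a wall-clock assertion inside a test and a timeout owned by the harness can be sketched as below. This is an assumption-laden example: `runWithDeadline` and `demo` are hypothetical names, and the real harness may rely on container limits rather than an in-process deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithDeadline (hypothetical) enforces a single outer time limit on a
// task. The task contains no timing thresholds of its own, so a slow host
// changes nothing unless the outer limit is actually exceeded.
func runWithDeadline(parent context.Context, limit time.Duration, task func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- task(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo runs one fast task and one slow task and reports whether the fast
// one succeeded and whether the slow one hit the deadline.
func demo() (fastOK, slowTimedOut bool) {
	fast := func(ctx context.Context) error { return nil }
	slow := func(ctx context.Context) error {
		select {
		case <-time.After(2 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	fastOK = runWithDeadline(context.Background(), time.Second, fast) == nil
	err := runWithDeadline(context.Background(), 50*time.Millisecond, slow)
	slowTimedOut = errors.Is(err, context.DeadlineExceeded)
	return fastOK, slowTimedOut
}

func main() {
	fastOK, slowTimedOut := demo()
	fmt.Println(fastOK, slowTimedOut) // prints: true true
}
```

With this shape, the limit is one generous, configurable number per task instead of many per-test thresholds tuned to a particular machine.
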
## What changed in task specs

Several tasks with low pass rates were failing for reasons that looked more like under-specified requirements than true capability gaps.
For those tasks, we tightened textual contracts in stubs and comments so hidden expectations are inferable without giving away implementation details.

Tasks updated in this pass:
- `tasks/typescript/promise-pool/promise_pool.ts`
- `tasks/dart/future-pool/lib/future_pool.dart`
- `tasks/typescript/csv-lite/csv.ts`
- `tasks/rust/macros/lib.rs`
- `tasks/kotlin/channel-multiplexer/src/main/kotlin/ChannelMultiplexer.kt`
- `tasks/dart/reactive-cache/lib/reactive_cache.dart`
- `tasks/dart/isolate-pool/lib/isolate_pool.dart`
- `tasks/kotlin/flow-processor/src/main/kotlin/FlowProcessor.kt`
- `tasks/zig/arena-allocator/arena.zig`
- `tasks/zig/comptime-json/json.zig`

The intent here was high-signal evaluation: remove "mind-reading" requirements, but do not turn tasks into copy-paste exercises.

## Compatibility and comparing old runs

`1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode:

```bash
./sanity eval --legacy
```

Use default mode for current evaluations. Use legacy mode only when you need apples-to-apples historical comparison.

## Commit range

This document covers:
- baseline: `v1.6.1`
- through: current `HEAD` on `main`

Main commits in range:
- `a3a2758`
- `c69c19e`
- `5e972ea`
- `f192caf`
- `3567905`
- `ba25c9a`

internal/cli/eval_prompt_test.go

Lines changed: 0 additions & 1 deletion
@@ -112,7 +112,6 @@ func TestBuildAgentPromptIncludesToolchainInfo(t *testing.T) {
 	}
 
 	for _, tc := range tests {
-		tc := tc
 		t.Run(tc.name, func(t *testing.T) {
 			t.Parallel()
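Why the `tc := tc` line can be dropped: since Go 1.22, each iteration of a `for` loop gets its own copy of the loop variable, so a closure that runs after the loop (as a parallel subtest does) no longer sees only the last value. The sketch below assumes the module targets Go 1.22 or later, which the commit does not state explicitly; `captureAll` is a hypothetical helper, and the `//go:build go1.22` line pins this file to the new semantics.

```go
//go:build go1.22

package main

import "fmt"

// captureAll stores one closure per element and calls them only after the
// loop has finished, mirroring how t.Run with t.Parallel defers the subtest
// body until the enclosing loop is done.
func captureAll(names []string) []string {
	var deferred []func() string
	for _, name := range names {
		// No `name := name` copy needed: with Go 1.22+ loop semantics each
		// iteration has its own `name`.
		deferred = append(deferred, func() string { return name })
	}
	out := make([]string, 0, len(deferred))
	for _, f := range deferred {
		out = append(out, f())
	}
	return out
}

func main() {
	fmt.Println(captureAll([]string{"a", "b", "c"})) // prints [a b c] on Go 1.22+
}
```

Under the pre-1.22 rules the same code would return the last element three times, which is the bug the explicit copy used to guard against.
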
