Commit c7e303c

Authored by that-github-user, unknown, and claude
Major README update: all commands, Copeland scoring, new features (#143)
* Major README update: add all commands, Copeland scoring, new features

  README was severely outdated — missing Copeland scoring, evaluate, undo, clean, config, compare, stats filters, --retry, --file, --scoring, --whitespace-insensitive, --no-color, --output-format, --preview, --dry-run, Bedrock support, and the technical report link. Added: Commands section with all 10 commands and flags, Scoring section explaining Copeland pairwise ranking, updated example output with the Copeland table, updated comparison table, technical reports link.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Major README update + fix .git file deletion during agent runs

  README: add all 10 commands, Copeland scoring section, all new flags (--scoring, --retry, --file, --whitespace-insensitive, --no-color, --output-format, --threshold, --test-timeout, --dry-run, --preview), Bedrock support, technical report link, updated example output.

  Runner: back up the .git pointer file before spawning the agent; restore it after the agent completes if it was deleted. Fixes the critical dogfooding bug where long-running Opus agents would lose the worktree git context.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: unknown <that-github-user@github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 72bf6fa commit c7e303c

2 files changed: +137 −27 lines

README.md

Lines changed: 117 additions & 27 deletions
@@ -14,13 +14,14 @@
 <p align="center">
 <a href="#quick-start">Quick Start</a> &middot;
 <a href="#how-it-works">How It Works</a> &middot;
+<a href="#commands">Commands</a> &middot;
 <a href="CONTRIBUTING.md">Contributing</a> &middot;
 <a href="#references">References</a>
 </p>
 
 ---
 
-Run N parallel Claude Code agents on the same task, then select the best result via test execution and convergence analysis. Based on the principle that **the aggregate of independent attempts outperforms any single attempt** — proven in [ensemble ML](https://en.wikipedia.org/wiki/Ensemble_learning), [superforecasting](https://en.wikipedia.org/wiki/Superforecasting), and [LLM code generation research](#references).
+Run N parallel Claude Code agents on the same task, then select the best result via test execution and **Copeland pairwise scoring**. Based on the principle that **the aggregate of independent attempts outperforms any single attempt** — proven in [ensemble ML](https://en.wikipedia.org/wiki/Ensemble_learning), [superforecasting](https://en.wikipedia.org/wiki/Superforecasting), and [LLM code generation research](#references).
 
 ## Quick start
 
@@ -36,8 +37,18 @@ thinktank run "fix the authentication bypass"
 # Run 5 agents with test verification
 thinktank run "fix the race condition" -n 5 -t "npm test"
 
+# Read prompt from a file (avoids shell expansion issues)
+thinktank run -f task.md -n 5 -t "npm test"
+
+# Pipe prompt from stdin
+echo "refactor the parser" | thinktank run -n 3
+
 # Apply the best result
 thinktank apply
+
+# Set persistent defaults
+thinktank config set attempts 5
+thinktank config set model opus
 ```
 
 Requires [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code) installed and authenticated.
@@ -82,9 +93,17 @@ Use `--model` to select a Claude model: `sonnet` (default), `opus`, `haiku`, or
 2. Each agent independently solves the task (no shared context = true independence)
 3. Runs your **test suite** on each result
 4. Analyzes **convergence** — did the agents agree on an approach?
-5. **Recommends** the best candidate (tests passing + consensus + smallest diff)
+5. **Recommends** the best candidate via Copeland pairwise scoring
 6. You review and `thinktank apply`
 
+## Scoring
+
+The default scoring method is **Copeland pairwise ranking**. Every agent is compared head-to-head against every other agent across four criteria: tests passed, convergence group size, minimal file scope, and test files contributed. The agent that wins the most pairwise matchups is recommended.
+
+An alternative `--scoring weighted` method is also available, which assigns point values to tests (100), convergence (50), and diff size (10). A third method, **Borda count** (rank aggregation), is available for comparison via `thinktank evaluate`.
+
+Use `thinktank evaluate` to compare how all three scoring methods rank your results. See [docs/scoring-evaluation.md](docs/scoring-evaluation.md) for the full analysis.
+
 ## Why this works
 
 Every model ever benchmarked shows **pass@5 >> pass@1**. The gap between "one attempt" and "best of five" is one of the largest free reliability gains in AI coding. But no tool exposes this — until now.
@@ -102,30 +121,92 @@ The key insight: **parallel attempts cost more tokens but not more time.** All a
 - **High-stakes changes** — auth, payments, security, data migrations
 - **Ambiguous tasks** — multiple valid approaches, need to see the spread
 - **Complex refactors** — many files, easy to miss something
-- **Unfamiliar codebases** — agents might go the wrong direction
+- **Unfamiliar codebases** — multiple attempts reduce the chance of going down the wrong path
 
-## Usage
+## Commands
 
-```bash
-# Run with defaults (3 agents, sonnet model)
-thinktank run "add rate limiting to the API"
+### `thinktank run [prompt]`
 
-# Run 5 agents with test verification
-thinktank run "fix the race condition in the cache layer" -n 5 -t "npm test"
+Run N parallel agents on a task.
 
-# Use a specific model
-thinktank run "migrate callbacks to async/await" --model opus -n 3
+| Flag | Description |
+|------|-------------|
+| `-n, --attempts <N>` | Number of parallel agents (default: 3, max: 20) |
+| `-f, --file <path>` | Read prompt from a file |
+| `-t, --test-cmd <cmd>` | Test command to verify results |
+| `--test-timeout <sec>` | Timeout for test command in seconds (default: 120, max: 600) |
+| `--timeout <sec>` | Timeout per agent in seconds (default: 600, max: 1800) |
+| `--model <model>` | Claude model: sonnet, opus, haiku, or full ID |
+| `-r, --runner <name>` | AI coding tool to use (default: claude-code) |
+| `--scoring <method>` | Scoring method: `copeland` (default) or `weighted` |
+| `--threshold <number>` | Convergence clustering similarity threshold, 0.0–1.0 (default: 0.3) |
+| `--whitespace-insensitive` | Ignore whitespace in convergence comparison |
+| `--retry` | Re-run only failed/timed-out agents from the last run |
+| `--output-format <fmt>` | Output format: `text` (default) or `json` |
+| `--no-color` | Disable colored output |
+| `--verbose` | Show detailed agent output |
 
-# Apply the recommended result
-thinktank apply
+### `thinktank apply`
+
+Apply the recommended agent's changes to your working tree.
+
+| Flag | Description |
+|------|-------------|
+| `-a, --agent <N>` | Apply a specific agent's result instead of the recommended one |
+| `-p, --preview` | Show the diff without applying |
+| `-d, --dry-run` | Same as `--preview` (alias) |
+
+### `thinktank undo`
+
+Reverse the last applied diff.
+
+### `thinktank list [run-number]`
+
+List all past runs, or show details for a specific run.
+
+### `thinktank compare <agentA> <agentB>`
 
-# Apply a specific agent's result
-thinktank apply --agent 2
+Compare two agents' results side by side.
 
-# View the last run's results
-thinktank list
+### `thinktank stats`
+
+Show aggregate statistics across all runs.
+
+| Flag | Description |
+|------|-------------|
+| `--model <name>` | Filter to runs using a specific model |
+| `--since <date>` | Show runs from this date onward (ISO 8601) |
+| `--until <date>` | Show runs up to this date (ISO 8601) |
+| `--passed-only` | Only runs where at least one agent passed tests |
+
+### `thinktank evaluate`
+
+Compare scoring methods (weighted vs Copeland vs Borda) across all runs to see how they differ in recommendations.
+
+### `thinktank clean`
+
+Remove thinktank worktrees and branches. Add `--all` to also delete `.thinktank/` run history.
+
+### `thinktank config set|get|list`
+
+View and update persistent configuration (stored in `.thinktank/config.json`).
+
+```bash
+thinktank config set attempts 5   # persistent default
+thinktank config set model opus
+thinktank config get attempts
+thinktank config list             # show all values
 ```
 
+Available keys: `attempts`, `model`, `timeout`, `runner`, `threshold`, `testTimeout`.
+
+## Pre-flight checks
+
+Before spawning agents, thinktank validates the environment:
+
+1. **Disk space** — warns if there isn't enough room for N worktrees
+2. **Test suite** — if `--test-cmd` is set, runs the tests once on the main branch to verify the suite passes before spending tokens on parallel agents
+
 ## Example output
 
 ```
@@ -152,21 +233,27 @@ Convergence
 Strong consensus — 3/5 agents changed the same files
 Files: src/middleware/auth.ts, tests/auth.test.ts
 
-Agents [3]: ████░░░░░░░░░░░░░░░░ 20%
-Divergent approach — 1/5 agents went a different direction
-Files: src/middleware/auth.ts, src/utils/jwt.ts, tests/auth.test.ts
+Copeland Pairwise Scoring
+────────────────────────────────────────────────────────────
+  Agent   Tests   Converge   Scope   TestCov   Copeland
+  ──────────────────────────────────────────────────────────
+  >> #1    +1        +2        0       +1         +4
+     #2    +1        +2        0        0         +3
+     #3    +1        -3       -4        0         -6
+     #4    -4        -3       +4       -1         -4
+     #5    +1        +2        0        0         +3
 
-Recommended: Agent #1 (highest score based on tests + convergence + diff size)
+Recommended: Agent #1 (Copeland winner)
 ```
 
 ## How it compares
 
-| Approach | Reliability | Cost | Speed |
-|----------|-------------|------|-------|
-| Single Claude Code run | pass@1 | 1x | Fastest |
-| **thinktank (N=3)** | **~pass@3** | **3x** | **Same wall time** |
-| **thinktank (N=5)** | **~pass@5** | **5x** | **Same wall time** |
-| Manual retry loop | pass@k (sequential) | kx | k × slower |
+| Approach | Reliability | Cost | Speed | Selection |
+|----------|-------------|------|-------|-----------|
+| Single Claude Code run | pass@1 | 1x | Fastest | N/A |
+| **thinktank (N=3)** | **~pass@3** | **3x** | **Same wall time** | **Copeland pairwise** |
+| **thinktank (N=5)** | **~pass@5** | **5x** | **Same wall time** | **Copeland pairwise** |
+| Manual retry loop | pass@k (sequential) | kx | k × slower | Manual |
 
 ## References
 
@@ -183,3 +270,6 @@ Convergence
 ### Ensemble theory
 - *Superforecasting* — Tetlock & Gardner. The aggregate of independent forecasters consistently beats individuals.
 - *The Wisdom of Crowds* — Surowiecki. Independent estimates, when aggregated, converge on truth.
+
+### Technical reports
+- [Scoring Method Evaluation](docs/scoring-evaluation.md) — Copeland vs Weighted vs Borda across 21 runs. Key finding: Copeland and Borda agree 86%, weighted disagrees ~40%.
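
The Copeland ranking described in the new Scoring section can be sketched as follows. This is a minimal illustration, not the project's implementation: the metric names and the per-criterion sign-comparison rule are assumptions inferred from the README text and the example table (each agent earns +1 per criterion won and -1 per criterion lost against every other agent).

```typescript
// Hypothetical per-agent metrics; field names are illustrative only.
interface AgentMetrics {
  id: number;
  testsPassed: number;    // higher is better
  convergeGroup: number;  // size of this agent's convergence cluster; higher is better
  filesTouched: number;   // lower is better (minimal file scope)
  testFilesAdded: number; // higher is better (test coverage contributed)
}

// Copeland scoring: sum, over all opponents and all criteria, of
// +1 for a win, -1 for a loss, 0 for a tie on that criterion.
function copelandScores(agents: AgentMetrics[]): Map<number, number> {
  const criteria = [
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(a.testsPassed - b.testsPassed),
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(a.convergeGroup - b.convergeGroup),
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(b.filesTouched - a.filesTouched),
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(a.testFilesAdded - b.testFilesAdded),
  ];
  const scores = new Map<number, number>();
  for (const a of agents) {
    let total = 0;
    for (const b of agents) {
      if (a.id === b.id) continue;
      for (const cmp of criteria) total += cmp(a, b);
    }
    scores.set(a.id, total);
  }
  return scores;
}
```

The recommended agent is then simply the one with the maximum score, matching the ">> #1" row in the example output above.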

src/runners/claude-code.ts

Lines changed: 20 additions & 0 deletions
@@ -1,4 +1,6 @@
 import { spawn } from "node:child_process";
+import { readFile, writeFile } from "node:fs/promises";
+import { join } from "node:path";
 import type { AgentResult } from "../types.js";
 import { getDiff, getDiffStats } from "../utils/git.js";
 import type { Runner, RunnerOptions } from "./base.js";
@@ -20,6 +22,15 @@ export const claudeCodeRunner: Runner = {
   async run(id: number, opts: RunnerOptions): Promise<AgentResult> {
     const start = Date.now();
 
+    // Backup the .git pointer file — agents sometimes delete it during long runs
+    const gitFilePath = join(opts.worktreePath, ".git");
+    let gitFileBackup: string | null = null;
+    try {
+      gitFileBackup = await readFile(gitFilePath, "utf-8");
+    } catch {
+      // Not a worktree or .git is a directory — skip backup
+    }
+
     return new Promise((resolve) => {
       let output = "";
       let error = "";
@@ -86,6 +97,15 @@ export const claudeCodeRunner: Runner = {
       if (settled) return;
       settled = true;
 
+      // Restore .git file if the agent deleted it during execution
+      if (gitFileBackup) {
+        try {
+          await readFile(gitFilePath, "utf-8");
+        } catch {
+          await writeFile(gitFilePath, gitFileBackup).catch(() => {});
+        }
+      }
+
       const duration = Date.now() - start;
       const diff = await getDiff(opts.worktreePath);
       const stats = await getDiffStats(opts.worktreePath);
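
The backup/restore logic this commit adds can be exercised in isolation. The sketch below is a standalone, hedged illustration of the same pattern (the helper name and demo paths are hypothetical, not from the repo): in a linked git worktree, `.git` is a small pointer *file* containing a `gitdir:` line, so it can be snapshotted as text before running a subprocess and written back if the subprocess deleted it.

```typescript
import { mkdtemp, readFile, writeFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Run a task with the worktree's .git pointer file guarded:
// snapshot it first, restore it afterwards if the task deleted it.
async function withGitPointerGuard(worktree: string, task: () => Promise<void>) {
  const gitFile = join(worktree, ".git");
  let backup: string | null = null;
  try {
    backup = await readFile(gitFile, "utf-8"); // throws if .git is a directory
  } catch {
    // Not a linked worktree — nothing to guard.
  }
  try {
    await task();
  } finally {
    if (backup !== null) {
      try {
        await readFile(gitFile, "utf-8"); // still present? leave it alone
      } catch {
        await writeFile(gitFile, backup); // restore the deleted pointer
      }
    }
  }
}

// Demo: simulate an agent that deletes the pointer mid-run.
const dir = await mkdtemp(join(tmpdir(), "tt-"));
await writeFile(join(dir, ".git"), "gitdir: /repo/.git/worktrees/agent-1\n");
await withGitPointerGuard(dir, async () => rm(join(dir, ".git")));
const restored = await readFile(join(dir, ".git"), "utf-8");
// restored now holds the original pointer content again
await rm(dir, { recursive: true, force: true });
```

Wrapping the agent in a `finally`-style guard (rather than restoring only on the success path, as the commit does inside its settle handler) also covers the timeout and error exits.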
