Run N parallel Claude Code agents on the same task, then select the best result via test execution and **Copeland pairwise scoring**. Based on the principle that **the aggregate of independent attempts outperforms any single attempt** — proven in [ensemble ML](https://en.wikipedia.org/wiki/Ensemble_learning), [superforecasting](https://en.wikipedia.org/wiki/Superforecasting), and [LLM code generation research](#references).
## Quick start
```bash
thinktank run "fix the authentication bypass"

# Run 5 agents with test verification
thinktank run "fix the race condition" -n 5 -t "npm test"

# Read prompt from a file (avoids shell expansion issues)
thinktank run -f task.md -n 5 -t "npm test"

# Pipe prompt from stdin
echo "refactor the parser" | thinktank run -n 3

# Apply the best result
thinktank apply

# Set persistent defaults
thinktank config set attempts 5
thinktank config set model opus
```
Requires [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code) installed and authenticated.
Use `--model` to select a Claude model: `sonnet` (default), `opus`, `haiku`, or a full model ID.

## How it works

1. Spawns **N parallel agents** on your task
2. Each agent independently solves the task (no shared context = true independence)
3. Runs your **test suite** on each result
4. Analyzes **convergence** — did the agents agree on an approach?
5. **Recommends** the best candidate via Copeland pairwise scoring
6. You review and `thinktank apply`
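Step 4 (convergence) can be approximated by fingerprinting each attempt's change set and grouping identical fingerprints. The file-path heuristic below is an illustrative sketch, not necessarily how thinktank actually measures agreement:

```python
import hashlib
from collections import defaultdict

def fingerprint(changed_files):
    """Hash the sorted set of touched paths as a coarse 'approach' signature."""
    joined = "\n".join(sorted(changed_files))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]

# Hypothetical results: which files each agent touched.
results = {
    "agent-1": ["src/auth.ts", "src/session.ts"],
    "agent-2": ["src/auth.ts", "src/session.ts"],
    "agent-3": ["src/middleware.ts"],
}

# Agents with identical fingerprints took the same approach.
groups = defaultdict(list)
for agent, files in results.items():
    groups[fingerprint(files)].append(agent)

# The largest group is the consensus approach.
consensus = max(groups.values(), key=len)
```

A real implementation would likely normalize the diffs themselves rather than just the file list, but the grouping logic is the same.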
## Scoring
The default scoring method is **Copeland pairwise ranking**. Every agent is compared head-to-head against every other agent across four criteria: tests passed, convergence group size, minimal file scope, and test files contributed. The agent that wins the most pairwise matchups is recommended.
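The mechanics can be sketched as follows. The criterion names, example numbers, and tie handling are illustrative assumptions, not the tool's actual implementation:

```python
from itertools import combinations

# Hypothetical per-agent results on the four criteria. Higher is better
# everywhere except files_changed (smaller scope wins).
attempts = {
    "agent-1": {"tests_passed": 12, "convergence_group": 3, "files_changed": 2, "test_files_added": 1},
    "agent-2": {"tests_passed": 12, "convergence_group": 3, "files_changed": 5, "test_files_added": 0},
    "agent-3": {"tests_passed": 10, "convergence_group": 1, "files_changed": 2, "test_files_added": 2},
}

def margin(a, b):
    """Net criteria won by a over b (files_changed inverted: fewer is better)."""
    net = 0
    for key in ("tests_passed", "convergence_group", "test_files_added"):
        net += (a[key] > b[key]) - (a[key] < b[key])
    net += (a["files_changed"] < b["files_changed"]) - (a["files_changed"] > b["files_changed"])
    return net

# Copeland: +1 for each head-to-head win, -1 for each loss, 0 for a tie.
copeland = {name: 0 for name in attempts}
for x, y in combinations(attempts, 2):
    m = margin(attempts[x], attempts[y])
    if m > 0:
        copeland[x] += 1; copeland[y] -= 1
    elif m < 0:
        copeland[y] += 1; copeland[x] -= 1

winner = max(copeland, key=copeland.get)
```

Here `agent-1` wins both of its matchups (same tests and convergence as `agent-2` but a smaller diff plus a test file; strictly better than `agent-3` on tests and convergence), so it is recommended.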
An alternative method, `--scoring weighted`, assigns point values to tests passed (100), convergence (50), and diff size (10). A third method, **Borda count** (rank aggregation), is available for comparison via `thinktank evaluate`.
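A rough sketch of how those two alternatives differ; the diff-size normalization and the ranking inputs are assumptions here, not the tool's actual formulas:

```python
from collections import defaultdict

def weighted_score(tests_passed, in_consensus_group, diff_lines, max_diff_lines):
    """Point values from the README: 100 per passing test, 50 for consensus,
    up to 10 for a small diff (hypothetical normalization)."""
    score = 100 * tests_passed + (50 if in_consensus_group else 0)
    score += 10 * (1 - diff_lines / max_diff_lines)
    return score

def borda(rankings):
    """Borda count: each per-criterion ranking awards n-1 points to 1st place,
    n-2 to 2nd, and so on; totals across rankings decide the winner."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for place, name in enumerate(ranking):
            points[name] += n - 1 - place
    return dict(points)
```

Weighted scoring collapses everything to one absolute number, so a large test margin can swamp the other criteria; Borda and Copeland only look at relative order per criterion.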
Use `thinktank evaluate` to compare how all three scoring methods rank your results. See [docs/scoring-evaluation.md](docs/scoring-evaluation.md) for the full analysis.
## Why this works
Every model ever benchmarked shows **pass@5 >> pass@1**. The gap between "one attempt" and "best of five" is one of the largest free reliability gains in AI coding. But no tool exposes this — until now.
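The arithmetic behind that gap: if a model solves a task with probability p on one attempt, and attempts are independent, at least one of k attempts succeeds with probability 1 - (1 - p)^k. A quick check (the 60% figure is illustrative):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

# A task a model solves 60% of the time on a single try:
print(pass_at_k(0.60, 1))  # 0.6
print(pass_at_k(0.60, 5))  # ~0.99: five attempts nearly guarantee one success
```

Real agent attempts are not fully independent (same model, same codebase), so the true gain is smaller, but the direction holds.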
The key insight: **parallel attempts cost more tokens but not more time.** All agents run concurrently.
- **High-stakes changes** — auth, payments, security, data migrations
- **Ambiguous tasks** — multiple valid approaches, need to see the spread
- **Complex refactors** — many files, easy to miss something
- **Unfamiliar codebases** — multiple attempts reduce the chance of going down the wrong path
## Commands
### `thinktank run [prompt]`

Run N parallel agents on a task.

| Flag | Description |
|------|-------------|
| `-n, --attempts <N>` | Number of parallel agents (default: 3, max: 20) |
| `-f, --file <path>` | Read prompt from a file |
| `-t, --test-cmd <cmd>` | Test command to verify results |
| `--test-timeout <sec>` | Timeout for the test command in seconds (default: 120, max: 600) |
| `--timeout <sec>` | Timeout per agent in seconds (default: 600, max: 1800) |
| `--model <model>` | Claude model: `sonnet`, `opus`, `haiku`, or full ID |
| `-r, --runner <name>` | AI coding tool to use (default: `claude-code`) |
| `--scoring <method>` | Scoring method: `copeland` (default) or `weighted` |