Run N parallel Claude Code agents on the same task, each in an isolated worktree, then select/merge the best result using test execution as the oracle.
Academic backing:
- AlphaCode: generate ~1M samples, filter by execution, submit the top 10. Competitive-programming-level performance.
- CodeT: Dual execution agreement (code + generated tests) for ranking.
- MBR-Exec: Majority voting on execution output, not code text.
- pass@k research: pass@5 dramatically beats pass@1 for every model tested.
- Superforecasting principle: aggregate of independent attempts beats any single attempt.
The gap: Nobody has productized this for real-world software engineering (only competitive programming benchmarks). No AI coding tool has ensemble/parallel mode.
AlphaCode had test cases. Real-world tasks often don't have comprehensive tests. What do we use?
Options:
a) Existing test suite — run `npm test` / `pytest` after each attempt. Pick the attempt(s) that pass.
b) AI-generated tests — before running the task, generate test cases for the EXPECTED behavior. Use those to filter. (CodeT approach)
c) Consensus/convergence — if 4/5 agents changed the same files in similar ways, that's likely correct.
d) Multi-signal scoring — combine: tests pass + diff size (smaller = better) + no new warnings + convergence with other attempts.
e) Human selection — present the top 2-3 candidates, human picks.
Best approach for v0.1: (a) + (c) + (e). Run tests, check convergence, present candidates to human.
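Even with v0.1's (a) + (c) + (e), the candidates need an ordering before the human sees them, which is essentially option (d). A minimal multi-signal ranking sketch in TypeScript; the field names and weights are illustrative assumptions, not a fixed design:

```typescript
// Illustrative multi-signal scoring for ranking ensemble attempts (option d).
// All field names and weights below are assumptions, not a fixed design.
interface Attempt {
  id: number;
  testsPassed: boolean;
  linesChanged: number; // total lines added + removed in the diff
  newWarnings: number;  // warnings introduced vs. baseline
  convergence: number;  // 0..1: similarity to the other attempts
}

function scoreAttempt(a: Attempt, maxLinesChanged: number): number {
  const testSignal = a.testsPassed ? 1 : 0; // hard signal: tests are the oracle
  const sizeSignal =
    maxLinesChanged > 0 ? 1 - a.linesChanged / maxLinesChanged : 1; // smaller diffs score higher
  const warningSignal = a.newWarnings === 0 ? 1 : 0;
  // Weights are a starting guess: tests dominate, convergence breaks ties.
  return 0.5 * testSignal + 0.2 * a.convergence + 0.2 * sizeSignal + 0.1 * warningSignal;
}

function rankAttempts(attempts: Attempt[]): Attempt[] {
  const maxLines = Math.max(...attempts.map((a) => a.linesChanged));
  return [...attempts].sort(
    (x, y) => scoreAttempt(y, maxLines) - scoreAttempt(x, maxLines)
  );
}
```

The point of keeping this a pure function: the weights become tunable from real ensemble data later.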
Option A: N separate claude CLI processes
- Spawn N `claude -p "task" --output-format json` processes
- Each in its own git worktree
- Fully isolated, truly parallel
- PRO: Simple, uses existing CLI
- CON: Each process pays full context-loading cost (reads codebase N times)
Option B: N subagents within one Claude Code session
- Use Claude Code's native Agent tool with `isolation: "worktree"`
- Spawn from a parent orchestrator session
- PRO: Native to Claude Code, could share initial codebase understanding
- CON: Subagents may not be fully independent (shared context could reduce diversity)
Option C: Hybrid — one planning agent, N execution agents
- Agent 0 reads the codebase and creates a task brief
- Agents 1-N each receive the brief and execute independently in worktrees
- PRO: Amortizes codebase reading cost, agents are still independent in execution
- CON: More complex orchestration
Best for v0.1: Option A (simplest, most isolated, most diverse results). Option C for v0.2.
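Option A's spawn loop is small. A sketch using Node's `child_process` and the CLI flags described above; the worktree and branch naming scheme is an illustrative assumption:

```typescript
// Sketch of Option A: N isolated `claude -p` runs, one per git worktree.
// Worktree paths and branch names below are illustrative assumptions.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Pure helper: the CLI invocation for one attempt (flags as described above).
function claudeArgs(task: string): string[] {
  return ["-p", task, "--output-format", "json"];
}

function worktreePath(baseDir: string, attempt: number): string {
  return `${baseDir}/attempt-${attempt}`;
}

async function runEnsemble(task: string, n: number, baseDir: string) {
  const attempts = Array.from({ length: n }, (_, i) => i + 1);
  return Promise.allSettled(
    attempts.map(async (i) => {
      const dir = worktreePath(baseDir, i);
      // One worktree per attempt keeps the runs fully isolated.
      await run("git", ["worktree", "add", "-b", `ensemble/attempt-${i}`, dir]);
      const started = Date.now();
      const { stdout } = await run("claude", claudeArgs(task), { cwd: dir });
      return { attempt: i, dir, ms: Date.now() - started, output: stdout };
    })
  );
}
```

`Promise.allSettled` (rather than `Promise.all`) matters here: one crashed agent shouldn't discard the other N-1 results.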
After N runs complete:
- Show which passed tests, which didn't
- Show convergence map (which agents made similar changes)
- Show diff size comparison
- Let user inspect any candidate's diff
- Apply the selected candidate (or merge elements from multiple)
If a typical Claude Code task costs $0.50:
- 3 parallel runs = $1.50
- 5 parallel runs = $2.50
Is this worth it? Depends on the task:
- Fixing a production bug at 2am? Absolutely.
- Adding a button? Probably not.
The tool should help users choose: "This task has high complexity/risk, consider running ensemble mode."
- High-stakes changes (auth, payments, security)
- Ambiguous tasks (multiple valid approaches, need to see the spread)
- Complex refactors (many files, easy to miss something)
- Unfamiliar codebases (agent might go wrong direction)
- When tests exist (oracle is available for free)
When is it LEAST valuable?
- Simple, mechanical changes
- Tasks with one obvious approach
- Codebases with no test suite (no oracle)
`ensemble "fix the authentication bypass" --attempts 5`
- Spawns 5 parallel Claude Code processes in worktrees
- Waits for all to complete
- Runs tests on each result
- Presents ranked candidates
- User picks one, it gets applied to main branch
`consensus "add rate limiting to the API" --spread 5`
- Same mechanism but framed around CONVERGENCE
- Output emphasizes: "4/5 agents used token bucket algorithm. 1 used sliding window."
- Convergence = confidence signal
- Divergence = the task is ambiguous, needs clarification
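The convergence signal can start very simple: Jaccard similarity over the sets of files each attempt changed. (A fuller version would compare diff hunks too.) A sketch:

```typescript
// Rough convergence signal: Jaccard similarity of the file sets each
// attempt touched. A fuller version would also compare diff hunks.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((f) => b.has(f)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : inter / union;
}

// Mean similarity of each attempt to all the others: high = convergent.
function convergenceScores(changedFiles: string[][]): number[] {
  const sets = changedFiles.map((files) => new Set(files));
  return sets.map((s, i) => {
    const others = sets.filter((_, j) => j !== i);
    const total = others.reduce((sum, o) => sum + jaccard(s, o), 0);
    return others.length === 0 ? 1 : total / others.length;
  });
}
```

An outlier attempt then shows up directly as a low score, which is exactly the "1 used sliding window" divergence message above.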
`verify "refactor auth middleware to use sessions" --alternatives 3`
- Framed as: "I already have Claude Code's solution, but I want to VERIFY it by seeing if independent agents arrive at the same answer"
- More defensive positioning: not "get better code" but "validate the code you got"
- All of the above, PLUS:
- Collect anonymized ensemble data (convergence rates, pass rates, cost)
- Build a dataset of "what kinds of tasks benefit most from ensemble"
- Community contributes findings
- This becomes the RESEARCH angle — the open-source project isn't just a tool, it's advancing the field
| Name | Pitch | Vibe |
|---|---|---|
| `ensemble` | Direct, descriptive | Academic |
| `consensus` | Emphasizes convergence | Collaborative |
| `quorum` | "A quorum of agents agrees" | Clever, memorable |
| `swarm` | Multiple agents working together | Overused in AI |
| `spread` | Like ensemble forecasting / running the spread | Financial/forecasting |
| `chorus` | Multiple voices, one harmony | Musical, distinctive |
| `council` | "A council of agents deliberated" | Authoritative |
| `thinktank` | Already the repo name, fits "multiple minds" | Perfect? |
thinktank actually fits this concept better than the benchmark concept. A think tank IS an ensemble of independent thinkers producing recommendations. "Run a thinktank on your coding task" = spawn N agents, get the consensus.
- Worktree isolation is native — `isolation: "worktree"` in the Agent tool
- Headless CLI mode — `claude -p` enables scripted parallel runs
- Subagent architecture — designed for spawning child agents
- Cost visibility — token usage is trackable per session
- Hooks system — can instrument pre/post session for data collection
No other AI coding tool has ALL of these. Cursor can't spawn parallel isolated instances. Copilot doesn't have headless mode. This is a Claude Code structural advantage.
It might — and that could be GOOD. If Anthropic adds `--ensemble 5` to Claude Code, the project wins by being the proof-of-concept that drove the feature. In the meantime:
- The ensemble orchestration + selection logic is non-trivial
- The convergence analysis is a novel contribution
- The research data (what tasks benefit from ensemble) is the moat
- Technical moat: Medium — the orchestration is buildable by others, but the selection/merge algorithms and heuristics improve with usage data
- Data moat: HIGH — anonymized ensemble data (convergence rates by task type, pass rates by attempt count, optimal N for different complexities) doesn't exist ANYWHERE
- Community moat: Medium-high — contributors share findings, selection strategies, merge algorithms
- Research moat: HIGH — first to publish real-world ensemble coding results (not competitive programming)
Counter: You could, but: (a) you wouldn't run them in parallel (2x wall time), (b) you wouldn't have structured comparison, (c) you wouldn't have convergence analysis, (d) you wouldn't run tests automatically, (e) you wouldn't have historical data on what works. The tool makes the obvious-but-tedious thing effortless.
Counter: (a) Users choose when to use ensemble mode — it's opt-in for high-stakes tasks, (b) even N=3 dramatically improves reliability per the research, (c) $1.50 vs $0.50 for a fix that otherwise takes 30 min of debugging is cheap, (d) the cost of deploying a bad fix is much higher than 2 extra API calls.
Counter: True, but: (a) convergence alone is valuable even without tests, (b) the tool can auto-generate lightweight acceptance tests before running (CodeT approach), (c) for repos WITH tests, this is immediately valuable with zero setup.
Counter: (a) Temperature variation creates real diversity in approach, (b) research shows even same-model samples produce meaningfully different solutions, (c) the codebase exploration order varies (which files Claude reads first), creating path-dependent diversity, (d) future: could mix models (Claude + GPT) for true diversity.
Counter: (a) They optimized for competitive programming with perfect test oracles — real coding is harder and they may not see it as their lane, (b) they don't have an agent coding product, (c) the infrastructure (worktrees, CLI, hooks) didn't exist until Claude Code matured, (d) being second to market with a better implementation is fine — nobody has done it PERIOD.
Verdict: Survives all attacks. The weakest point is the test-oracle dependency, but convergence analysis partially compensates even without tests.
- Usefulness: 4.5/5 (anyone doing high-stakes AI coding wants more reliable results)
- Moat: 4.5/5 (ensemble data + research findings + Claude Code structural advantage)
- Shippability: 4/5 (CLI wrapper + parallel spawning + test runner + diff comparison)
- Community potential: 4.5/5 (contribute selection strategies, share ensemble data, publish findings)
- Total: 17.5/20
But wait — there's an adjustment. This idea has something none of the previous 34 had:
ACADEMIC VALIDATION + ZERO COMPETITORS + PLATFORM STRUCTURAL FIT
The research says this works. Nobody has built it. And Claude Code is the only platform where it's naturally possible. That combination is unique in this entire ideation process.
Adjusted score: 19/20 — the highest of any idea explored.
Hour 1-2: Scaffold
- TypeScript CLI project
- `thinktank run "task description" --attempts N`
- Configuration: default N, test command, worktree directory
Hour 3-4: Parallel Runner
- Spawn N `claude -p "task"` processes in separate git worktrees
- Capture output, timing, exit codes
- Wait for all to complete (with timeout)
Hour 5-6: Evaluator
- For each completed attempt:
- Run test suite, capture pass/fail
- Generate git diff
- Compute diff stats (files changed, lines added/removed)
- Compare diffs across attempts for convergence
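The diff stats can come straight from `git diff --numstat`, which prints tab-separated added/removed counts per file. A parsing sketch (binary files report `-` and are counted here as 0 lines):

```typescript
// Diff stats for the evaluator, parsed from `git diff --numstat` output
// (tab-separated: lines added, lines removed, file path; "-" for binary).
interface DiffStats {
  filesChanged: number;
  linesAdded: number;
  linesRemoved: number;
  files: string[];
}

function parseNumstat(numstat: string): DiffStats {
  const stats: DiffStats = { filesChanged: 0, linesAdded: 0, linesRemoved: 0, files: [] };
  for (const line of numstat.split("\n")) {
    const [added, removed, file] = line.split("\t");
    if (!file) continue; // skip blank/trailing lines
    stats.filesChanged += 1;
    stats.files.push(file);
    // Binary files report "-" instead of counts; treat them as 0 lines.
    stats.linesAdded += added === "-" ? 0 : Number(added);
    stats.linesRemoved += removed === "-" ? 0 : Number(removed);
  }
  return stats;
}
```

The `files` array doubles as the input to the convergence comparison, so one `git diff --numstat` per attempt feeds both signals.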
Hour 7-8: Reporter + Selection
- Display results table (pass/fail, diff size, convergence)
- Show convergence map ("Agents 1,3,5 took similar approach; Agent 2 diverged")
- Let user select which to apply
- Apply selected diff to main worktree
End of Day 1: Working CLI that runs N parallel Claude Code agents and helps you pick the best result.