Commit c7e303c

Authored by that-github-user, unknown, and claude
Major README update: all commands, Copeland scoring, new features (#143)
* Major README update: add all commands, Copeland scoring, new features

  README was severely outdated — missing Copeland scoring, evaluate, undo, clean, config, compare, stats filters, --retry, --file, --scoring, --whitespace-insensitive, --no-color, --output-format, --preview, --dry-run, Bedrock support, and the technical report link. Added: Commands section with all 10 commands and flags, Scoring section explaining Copeland pairwise ranking, updated example output with the Copeland table, updated comparison table, technical reports link.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Major README update + fix .git file deletion during agent runs

  README: add all 10 commands, Copeland scoring section, all new flags (--scoring, --retry, --file, --whitespace-insensitive, --no-color, --output-format, --threshold, --test-timeout, --dry-run, --preview), Bedrock support, technical report link, updated example output.

  Runner: back up the .git pointer file before spawning the agent; restore it after the agent completes if it was deleted. Fixes the critical dogfooding bug where long-running Opus agents would lose the worktree git context.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: unknown <that-github-user@github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 72bf6fa commit c7e303c

2 files changed: +137 −27 lines

README.md

Lines changed: 117 additions & 27 deletions
@@ -14,13 +14,14 @@
 <p align="center">
 <a href="#quick-start">Quick Start</a> &middot;
 <a href="#how-it-works">How It Works</a> &middot;
+<a href="#commands">Commands</a> &middot;
 <a href="CONTRIBUTING.md">Contributing</a> &middot;
 <a href="#references">References</a>
 </p>
 
 ---
 
-Run N parallel Claude Code agents on the same task, then select the best result via test execution and convergence analysis. Based on the principle that **the aggregate of independent attempts outperforms any single attempt** — proven in [ensemble ML](https://en.wikipedia.org/wiki/Ensemble_learning), [superforecasting](https://en.wikipedia.org/wiki/Superforecasting), and [LLM code generation research](#references).
+Run N parallel Claude Code agents on the same task, then select the best result via test execution and **Copeland pairwise scoring**. Based on the principle that **the aggregate of independent attempts outperforms any single attempt** — proven in [ensemble ML](https://en.wikipedia.org/wiki/Ensemble_learning), [superforecasting](https://en.wikipedia.org/wiki/Superforecasting), and [LLM code generation research](#references).
 
 ## Quick start
 
@@ -36,8 +37,18 @@ thinktank run "fix the authentication bypass"
 # Run 5 agents with test verification
 thinktank run "fix the race condition" -n 5 -t "npm test"
 
+# Read prompt from a file (avoids shell expansion issues)
+thinktank run -f task.md -n 5 -t "npm test"
+
+# Pipe prompt from stdin
+echo "refactor the parser" | thinktank run -n 3
+
 # Apply the best result
 thinktank apply
+
+# Set persistent defaults
+thinktank config set attempts 5
+thinktank config set model opus
 ```
 
 Requires [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code) installed and authenticated.
@@ -82,9 +93,17 @@ Use `--model` to select a Claude model: `sonnet` (default), `opus`, `haiku`, or
 2. Each agent independently solves the task (no shared context = true independence)
 3. Runs your **test suite** on each result
 4. Analyzes **convergence** — did the agents agree on an approach?
-5. **Recommends** the best candidate (tests passing + consensus + smallest diff)
+5. **Recommends** the best candidate via Copeland pairwise scoring
 6. You review and `thinktank apply`
 
+## Scoring
+
+The default scoring method is **Copeland pairwise ranking**. Every agent is compared head-to-head against every other agent across four criteria: tests passed, convergence group size, minimal file scope, and test files contributed. The agent that wins the most pairwise matchups is recommended.
+
+An alternative `--scoring weighted` method is also available, which assigns point values to tests (100), convergence (50), and diff size (10). A third method, **Borda count** (rank aggregation), is available for comparison via `thinktank evaluate`.
+
+Use `thinktank evaluate` to compare how all three scoring methods rank your results. See [docs/scoring-evaluation.md](docs/scoring-evaluation.md) for the full analysis.
+
 ## Why this works
 
 Every model ever benchmarked shows **pass@5 >> pass@1**. The gap between "one attempt" and "best of five" is one of the largest free reliability gains in AI coding. But no tool exposes this — until now.
@@ -102,30 +121,92 @@ The key insight: **parallel attempts cost more tokens but not more time.** All a
 - **High-stakes changes** — auth, payments, security, data migrations
 - **Ambiguous tasks** — multiple valid approaches, need to see the spread
 - **Complex refactors** — many files, easy to miss something
-- **Unfamiliar codebases** — agents might go the wrong direction
+- **Unfamiliar codebases** — multiple attempts reduce the chance of going down the wrong path
 
-## Usage
+## Commands
 
-```bash
-# Run with defaults (3 agents, sonnet model)
-thinktank run "add rate limiting to the API"
+### `thinktank run [prompt]`
 
-# Run 5 agents with test verification
-thinktank run "fix the race condition in the cache layer" -n 5 -t "npm test"
+Run N parallel agents on a task.
 
-# Use a specific model
-thinktank run "migrate callbacks to async/await" --model opus -n 3
+| Flag | Description |
+|------|-------------|
+| `-n, --attempts <N>` | Number of parallel agents (default: 3, max: 20) |
+| `-f, --file <path>` | Read prompt from a file |
+| `-t, --test-cmd <cmd>` | Test command to verify results |
+| `--test-timeout <sec>` | Timeout for test command in seconds (default: 120, max: 600) |
+| `--timeout <sec>` | Timeout per agent in seconds (default: 600, max: 1800) |
+| `--model <model>` | Claude model: sonnet, opus, haiku, or full ID |
+| `-r, --runner <name>` | AI coding tool to use (default: claude-code) |
+| `--scoring <method>` | Scoring method: `copeland` (default) or `weighted` |
+| `--threshold <number>` | Convergence clustering similarity threshold, 0.0–1.0 (default: 0.3) |
+| `--whitespace-insensitive` | Ignore whitespace in convergence comparison |
+| `--retry` | Re-run only failed/timed-out agents from the last run |
+| `--output-format <fmt>` | Output format: `text` (default) or `json` |
+| `--no-color` | Disable colored output |
+| `--verbose` | Show detailed agent output |
 
-# Apply the recommended result
-thinktank apply
+### `thinktank apply`
+
+Apply the recommended agent's changes to your working tree.
+
+| Flag | Description |
+|------|-------------|
+| `-a, --agent <N>` | Apply a specific agent's result instead of the recommended one |
+| `-p, --preview` | Show the diff without applying |
+| `-d, --dry-run` | Same as `--preview` (alias) |
+
+### `thinktank undo`
+
+Reverse the last applied diff.
+
+### `thinktank list [run-number]`
+
+List all past runs, or show details for a specific run.
+
+### `thinktank compare <agentA> <agentB>`
 
-# Apply a specific agent's result
-thinktank apply --agent 2
+Compare two agents' results side by side.
 
-# View the last run's results
-thinktank list
+### `thinktank stats`
+
+Show aggregate statistics across all runs.
+
+| Flag | Description |
+|------|-------------|
+| `--model <name>` | Filter to runs using a specific model |
+| `--since <date>` | Show runs from this date onward (ISO 8601) |
+| `--until <date>` | Show runs up to this date (ISO 8601) |
+| `--passed-only` | Only runs where at least one agent passed tests |
+
+### `thinktank evaluate`
+
+Compare scoring methods (weighted vs Copeland vs Borda) across all runs to see how they differ in recommendations.
+
+### `thinktank clean`
+
+Remove thinktank worktrees and branches. Add `--all` to also delete `.thinktank/` run history.
+
+### `thinktank config set|get|list`
+
+View and update persistent configuration (stored in `.thinktank/config.json`).
+
+```bash
+thinktank config set attempts 5   # persistent default
+thinktank config set model opus
+thinktank config get attempts
+thinktank config list             # show all values
 ```
 
+Available keys: `attempts`, `model`, `timeout`, `runner`, `threshold`, `testTimeout`.
+
+## Pre-flight checks
+
+Before spawning agents, thinktank validates the environment:
+
+1. **Disk space** — warns if there isn't enough room for N worktrees
+2. **Test suite** — if `--test-cmd` is set, runs the tests once on the main branch to verify the suite passes before spending tokens on parallel agents
+
 ## Example output
 
 ```
@@ -152,21 +233,27 @@ Convergence
 Strong consensus — 3/5 agents changed the same files
 Files: src/middleware/auth.ts, tests/auth.test.ts
 
-Agents [3]: ████░░░░░░░░░░░░░░░░ 20%
-Divergent approach — 1/5 agents went a different direction
-Files: src/middleware/auth.ts, src/utils/jwt.ts, tests/auth.test.ts
+Copeland Pairwise Scoring
+────────────────────────────────────────────────────────────
+  Agent   Tests   Converge   Scope   TestCov   Copeland
+  ──────────────────────────────────────────────────────────
+  >> #1    +1        +2        0       +1         +4
+     #2    +1        +2        0        0         +3
+     #3    +1        -3       -4        0         -6
+     #4    -4        -3       +4       -1         -4
+     #5    +1        +2        0        0         +3
 
-Recommended: Agent #1 (highest score based on tests + convergence + diff size)
+Recommended: Agent #1 (Copeland winner)
 ```
 
 ## How it compares
 
-| Approach | Reliability | Cost | Speed |
-|----------|-------------|------|-------|
-| Single Claude Code run | pass@1 | 1x | Fastest |
-| **thinktank (N=3)** | **~pass@3** | **3x** | **Same wall time** |
-| **thinktank (N=5)** | **~pass@5** | **5x** | **Same wall time** |
-| Manual retry loop | pass@k (sequential) | kx | k × slower |
+| Approach | Reliability | Cost | Speed | Selection |
+|----------|-------------|------|-------|-----------|
+| Single Claude Code run | pass@1 | 1x | Fastest | N/A |
+| **thinktank (N=3)** | **~pass@3** | **3x** | **Same wall time** | **Copeland pairwise** |
+| **thinktank (N=5)** | **~pass@5** | **5x** | **Same wall time** | **Copeland pairwise** |
+| Manual retry loop | pass@k (sequential) | kx | k × slower | Manual |
 
 ## References
 
@@ -183,3 +270,6 @@ Convergence
 ### Ensemble theory
 - *Superforecasting* — Tetlock & Gardner. The aggregate of independent forecasters consistently beats individuals.
 - *The Wisdom of Crowds* — Surowiecki. Independent estimates, when aggregated, converge on truth.
+
+### Technical reports
+- [Scoring Method Evaluation](docs/scoring-evaluation.md) — Copeland vs Weighted vs Borda across 21 runs. Key finding: Copeland and Borda agree 86%, weighted disagrees ~40%.
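
The Copeland ranking described in the new Scoring section can be sketched as follows. This is a minimal illustration, not the project's implementation: the metric names and the per-criterion sign-comparison rule are assumptions inferred from the README text and the example table (each agent earns +1 per criterion won and -1 per criterion lost against every other agent).

```typescript
// Hypothetical per-agent metrics; field names are illustrative only.
interface AgentMetrics {
  id: number;
  testsPassed: number;    // higher is better
  convergeGroup: number;  // size of this agent's convergence cluster; higher is better
  filesTouched: number;   // lower is better (minimal file scope)
  testFilesAdded: number; // higher is better (test coverage contributed)
}

// Copeland scoring: sum, over all opponents and all criteria, of
// +1 for a win, -1 for a loss, 0 for a tie on that criterion.
function copelandScores(agents: AgentMetrics[]): Map<number, number> {
  const criteria = [
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(a.testsPassed - b.testsPassed),
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(a.convergeGroup - b.convergeGroup),
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(b.filesTouched - a.filesTouched),
    (a: AgentMetrics, b: AgentMetrics) => Math.sign(a.testFilesAdded - b.testFilesAdded),
  ];
  const scores = new Map<number, number>();
  for (const a of agents) {
    let total = 0;
    for (const b of agents) {
      if (a.id === b.id) continue;
      for (const cmp of criteria) total += cmp(a, b);
    }
    scores.set(a.id, total);
  }
  return scores;
}
```

The recommended agent is then simply the one with the maximum score, matching the ">> #1" row in the example output above.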

src/runners/claude-code.ts

Lines changed: 20 additions & 0 deletions
@@ -1,4 +1,6 @@
 import { spawn } from "node:child_process";
+import { readFile, writeFile } from "node:fs/promises";
+import { join } from "node:path";
 import type { AgentResult } from "../types.js";
 import { getDiff, getDiffStats } from "../utils/git.js";
 import type { Runner, RunnerOptions } from "./base.js";
@@ -20,6 +22,15 @@ export const claudeCodeRunner: Runner = {
   async run(id: number, opts: RunnerOptions): Promise<AgentResult> {
     const start = Date.now();
 
+    // Backup the .git pointer file — agents sometimes delete it during long runs
+    const gitFilePath = join(opts.worktreePath, ".git");
+    let gitFileBackup: string | null = null;
+    try {
+      gitFileBackup = await readFile(gitFilePath, "utf-8");
+    } catch {
+      // Not a worktree or .git is a directory — skip backup
+    }
+
     return new Promise((resolve) => {
       let output = "";
       let error = "";
@@ -86,6 +97,15 @@ export const claudeCodeRunner: Runner = {
       if (settled) return;
       settled = true;
 
+      // Restore .git file if the agent deleted it during execution
+      if (gitFileBackup) {
+        try {
+          await readFile(gitFilePath, "utf-8");
+        } catch {
+          await writeFile(gitFilePath, gitFileBackup).catch(() => {});
+        }
+      }
+
       const duration = Date.now() - start;
       const diff = await getDiff(opts.worktreePath);
       const stats = await getDiffStats(opts.worktreePath);
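
The backup/restore logic this commit adds can be exercised in isolation. The sketch below is a standalone, hedged illustration of the same pattern (the helper name and demo paths are hypothetical, not from the repo): in a linked git worktree, `.git` is a small pointer *file* containing a `gitdir:` line, so it can be snapshotted as text before running a subprocess and written back if the subprocess deleted it.

```typescript
import { mkdtemp, readFile, writeFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Run a task with the worktree's .git pointer file guarded:
// snapshot it first, restore it afterwards if the task deleted it.
async function withGitPointerGuard(worktree: string, task: () => Promise<void>) {
  const gitFile = join(worktree, ".git");
  let backup: string | null = null;
  try {
    backup = await readFile(gitFile, "utf-8"); // throws if .git is a directory
  } catch {
    // Not a linked worktree — nothing to guard.
  }
  try {
    await task();
  } finally {
    if (backup !== null) {
      try {
        await readFile(gitFile, "utf-8"); // still present? leave it alone
      } catch {
        await writeFile(gitFile, backup); // restore the deleted pointer
      }
    }
  }
}

// Demo: simulate an agent that deletes the pointer mid-run.
const dir = await mkdtemp(join(tmpdir(), "tt-"));
await writeFile(join(dir, ".git"), "gitdir: /repo/.git/worktrees/agent-1\n");
await withGitPointerGuard(dir, async () => rm(join(dir, ".git")));
const restored = await readFile(join(dir, ".git"), "utf-8");
// restored now holds the original pointer content again
await rm(dir, { recursive: true, force: true });
```

Wrapping the agent in a `finally`-style guard (rather than restoring only on the success path, as the commit does inside its settle handler) also covers the timeout and error exits.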
