thinktank project — March 2026 (updated with n=73 dataset, 102 total runs)
thinktank runs N parallel AI coding agents on the same task and must recommend the "best" result. We evaluate three recommendation scoring methods — Weighted Sum, Copeland Pairwise, and Borda Count — across 73 usable ensemble coding runs (102 total) spanning 6 task types, 2 programming languages, and 5 distinct codebases. We find that Copeland and Borda converge on the same recommendation 96% of the time, while the Weighted Sum disagrees with Copeland 32% of the time. Cochran's Q test confirms the agreement rates differ significantly across method pairs (Q=23.4, p<0.0001). Cliff's delta indicates a small but real effect (d=0.181). These results support Copeland as the default scoring method, though the Wilcoxon signed-rank test finds no systematic ranking shift (p=0.99), suggesting the methods diverge primarily on the top-1 recommendation rather than on full rankings.
When multiple AI agents independently solve the same coding task, the system must select which solution to present to the user. This is a multi-criteria decision problem: an agent's solution should be judged on correctness (tests pass), consensus (convergence with other agents), efficiency (change scope), and test coverage contribution.
No single criterion suffices — an agent might pass all tests but produce an overly complex diff, or converge with other agents on a suboptimal approach.
**Weighted Sum.** Each criterion is assigned a point value. Total score = sum of points; highest score wins.
| Criterion | Points | Rationale |
|---|---|---|
| Tests pass | 100 | Correctness is paramount |
| Convergence | 0–50 | group_similarity × 50 |
| Diff size | 0–10 | Only penalizes outliers >2× median |
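A minimal sketch of the Weighted Sum rule above. The all-or-nothing handling of the 0–10 diff criterion is an assumption of this sketch; the real implementation may scale the penalty.

```python
from statistics import median

def weighted_score(tests_passed: bool, group_similarity: float,
                   diff_size: int, all_diff_sizes: list[int]) -> float:
    score = 100.0 if tests_passed else 0.0   # Tests pass: 100 pts
    score += group_similarity * 50.0         # Convergence: group_similarity x 50
    # Diff-size criterion (0-10 pts): only outliers at more than 2x the
    # median diff size lose credit (all-or-nothing is an assumption here).
    if diff_size <= 2 * median(all_diff_sizes):
        score += 10.0
    return score
```

Note how two agents that both pass tests are separated only by the 0–50 convergence term and the 0–10 diff term.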
**Copeland Pairwise** (social choice theory). Compare every pair of agents head-to-head on four criteria (tests passed, convergence group size, non-test files changed, test files changed). For each pair, the agent winning more criteria gets +1 and the loser −1; the highest cumulative score wins. The test-file criterion is capped at 3 files to prevent gaming.
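A sketch of the pairwise tournament. It assumes lower is better for non-test files changed (smaller change scope) and higher is better for the other three criteria; the metric names are illustrative.

```python
from itertools import combinations

def copeland_scores(agents: dict[str, dict]) -> dict[str, int]:
    """agents: name -> {tests_passed, group_size, files_changed, test_files}."""
    scores = {name: 0 for name in agents}
    for a, b in combinations(agents, 2):
        ra, rb = agents[a], agents[b]
        wins_a = wins_b = 0
        # Higher is better: tests passed, convergence group size.
        for key in ("tests_passed", "group_size"):
            if ra[key] > rb[key]:
                wins_a += 1
            elif rb[key] > ra[key]:
                wins_b += 1
        # Test files, capped at 3 to prevent gaming; higher is better.
        ta, tb = min(ra["test_files"], 3), min(rb["test_files"], 3)
        if ta > tb:
            wins_a += 1
        elif tb > ta:
            wins_b += 1
        # Non-test files changed: lower is better (smaller change scope).
        if ra["files_changed"] < rb["files_changed"]:
            wins_a += 1
        elif rb["files_changed"] < ra["files_changed"]:
            wins_b += 1
        # Head-to-head result: +1 to the winner, -1 to the loser.
        if wins_a > wins_b:
            scores[a] += 1
            scores[b] -= 1
        elif wins_b > wins_a:
            scores[b] += 1
            scores[a] -= 1
    return scores
```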
**Borda Count** (rank aggregation). Rank agents on each criterion independently and sum the ranks; the lowest total wins. This is equivalent to asking, "across all criteria, which agent is consistently near the top?"
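And a sketch of the Borda rule, using the simplified tie handling noted later in the limitations list (ties take the first available position rather than an averaged rank):

```python
def borda_totals(agents: dict[str, dict],
                 criteria: list[tuple[str, bool]]) -> dict[str, int]:
    """criteria: (metric key, higher_is_better) pairs. Lowest total wins.
    Ties take the first available rank (simplified, not averaged)."""
    totals = {name: 0 for name in agents}
    for key, higher_better in criteria:
        # Stable sort: tied agents keep their original order, so the first
        # one listed takes the better rank.
        order = sorted(agents, key=lambda n: agents[n][key], reverse=higher_better)
        for rank, name in enumerate(order, start=1):
            totals[name] += rank
    return totals
```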
| Property | Weighted | Copeland | Borda |
|---|---|---|---|
| Scale-independent | No — 100/50/10 weights are arbitrary | Yes — only ordinal comparisons | Yes — only ranks |
| Condorcet winner | Not guaranteed | Guaranteed | Not guaranteed |
| Sensitive to weight choice | Yes | No | No |
| Handles non-transitive preferences | Poorly | Well | Moderately |
We analyzed 73 usable ensemble coding runs (from 102 total) collected across multiple development sessions. The dataset spans:
| Task type | Language | Codebase | Runs |
|---|---|---|---|
| Feature development | TypeScript | thinktank main | ~25 |
| Bug fixes | TypeScript | thinktank main | ~10 |
| Refactoring | TypeScript | thinktank main | ~8 |
| A* pathfinding | Python, TypeScript | examples/astar-python, examples/astar | ~10 |
| ML regression | Python | examples/ml-regression | ~6 |
| ML classification | Python | examples/ml-classification | ~6 |
| Error handling, CLI features | TypeScript | thinktank main | ~4 |
Ensemble sizes: 2-agent (4 runs), 3-agent (12 runs), 4-agent (1 run), 5-agent (56 runs).
Note: An earlier version of this dataset included 11 runs with exit-127 test failures (test script not found in agent clones). These produced contaminated scoring data — agents were penalized for "test failures" when the test infrastructure didn't exist. These runs were identified and excluded after adding exit-127 detection to the test runner (see Section 4.4).
Inclusion criteria:
- At least 2 agents produced non-trivial diffs
- Scoring data available for all agents in the run
For each run, all three scoring methods re-scored the same set of agent results. We recorded which agent each method would recommend and computed:
- Pairwise agreement rates — how often do two methods pick the same agent?
- Unanimous agreement — how often do all three methods agree?
- Disagreement patterns — when methods disagree, which groupings form?
- Cochran's Q test — tests whether the three pairwise agreement rates are equal (generalizes McNemar's test to 3+ treatments)
- Wilcoxon signed-rank test — tests whether Weighted and Copeland produce systematically different agent rankings (not just different top-1 picks)
- Cliff's delta — effect size measure for the rank differences between methods
- Spearman rank correlation — per-run correlation between Weighted and Copeland full rankings
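The agreement metrics reduce to simple counting over per-run top-1 picks; a sketch (function name and data shape are illustrative):

```python
def agreement_summary(picks: list[tuple[str, str, str]]) -> dict[str, float]:
    """picks: one (weighted, copeland, borda) top-1 pick per run."""
    n = len(picks)
    return {
        "W=C": sum(w == c for w, c, _ in picks) / n,
        "W=B": sum(w == b for w, _, b in picks) / n,
        "C=B": sum(c == b for _, c, b in picks) / n,
        "unanimous": sum(w == c == b for w, c, b in picks) / n,
    }
```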
- The majority of runs are from a single codebase (thinktank itself); cross-project runs (A*, ML regression, ML classification) add diversity but are a minority
- No ground truth for "which agent is actually best" — we compare methods against each other, not against an oracle
- The Borda implementation uses a simplified ranking (tied ranks get first-available position, not averaged)
- Runs without test commands generate scoring data with less discriminative power on the test-pass criterion
| Comparison | n=21 (original) | n=73 (updated) |
|---|---|---|
| All three unanimous | 11/21 (52%) | 44/73 (60%) |
| Weighted = Copeland | 13/21 (62%) | 52/73 (71%) |
| Weighted = Borda | 12/21 (57%) | 46/73 (63%) |
| Copeland = Borda | 18/21 (86%) | 61/73 (84%) |
Note: With stored per-agent scores (n=53 subset), Copeland-Borda agreement is 96.2% (51/53).
| Ensemble size | Runs | W=C | C=B |
|---|---|---|---|
| 2-agent | 4 | 100% | 100% |
| 3-agent | 12 | 70% | 100% |
| 5-agent | 56 | 66% | 95% |
Disagreement concentrates in larger ensembles. With 2–3 agents, Copeland and Borda always agree. With 5 agents, W=C disagreement reaches 34%, while C=B stays at 95%.
Tests whether the three pairwise agreement rates (W=C, W=B, C=B) differ significantly.
- Q = 23.4, df = 2, p < 0.0001
- Significant: The agreement rates are not equal. Copeland-Borda agreement (96%) is significantly higher than Weighted-Copeland agreement (68%) and Weighted-Borda agreement (72%).
Tests whether Weighted and Copeland produce systematically different rankings (across all agents in all runs, not just the top-1 pick).
- n = 83 non-zero rank differences (from 236 total paired observations; 153 ties)
- W+ = 1740.5, W- = 1745.5
- z = -0.011, p = 0.99
- Not significant: The methods do not systematically rank agents differently across the full ranking. They diverge on the top-1 recommendation but produce similar overall orderings.
This combination — significant Cochran's Q but non-significant Wilcoxon — means the methods agree on which agents are generally good or bad, but disagree on which is best. The top-1 recommendation is the contentious decision.
Measures the magnitude of rank differences between Weighted and Copeland.
- 49 positive differences (Weighted ranks agent lower) vs 34 negative (Copeland ranks lower)
- d = 0.181 (small effect)
- Thresholds: negligible < 0.147, small < 0.33, medium < 0.474, large ≥ 0.474
The effect is small but real and consistent across sample sizes (0.183 at n=36, 0.181 at n=53).
Per-run correlation between Weighted and Copeland full rankings (n=53 runs with stored score data):
- Mean ρ = 0.613
- Median ρ = 1.000 (most runs have perfect agreement)
- Min = -0.700, Max = 1.000
- 60% of runs have perfect correlation; 25% have low or negative correlation
The bimodal distribution — most runs perfectly correlated, a meaningful minority anti-correlated — explains the apparent paradox: the methods usually agree, but when they disagree, they disagree dramatically.
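For runs without tied ranks, the per-run ρ above follows the textbook shortcut formula ρ = 1 − 6Σd²/(n(n² − 1)):

```python
def spearman_rho(ranks_a: list[int], ranks_b: list[int]) -> float:
    """Spearman correlation between two rankings (no-ties shortcut formula)."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Identical rankings give ρ = 1.0; a fully reversed 5-agent ranking gives ρ = −1.0.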
Copeland-Borda concordance is very strong. At 96% (n=53 with stored scores) or 84% (n=73 from evaluate), two mathematically independent methods — pairwise tournament and rank aggregation — converge on the same recommendation. This increased from the original 86% (n=21) after cleaning contaminated exit-127 runs from the dataset.
Weighted is the outlier. Weighted Sum disagrees with Copeland ~32% of the time. Cochran's Q confirms this is statistically significant (p<0.0001). The disagreement is driven by:
- Weight sensitivity: The 100/50/10 point allocation over-emphasizes test pass/fail. Two agents that both pass tests are differentiated only by convergence (50 pts) and diff outlier penalty (0–10 pts), creating thin margins.
- Scale distortion: Weighted conflates ordinal preferences with cardinal magnitudes. A 4-point gap may reflect arbitrary weight choices rather than meaningful quality.
- Ensemble size effect: 5-agent runs produce 34% W≠C disagreement vs 0% for 2-agent runs. More agents create more ranking permutations where weight sensitivity matters.
The methods diverge on top-1, not on overall ranking. The Wilcoxon non-significance (p=0.99) combined with Cochran's Q significance (p<0.0001) reveals a nuanced picture: all three methods generally agree on which agents are good and which are bad, but disagree on which single agent is best. Since thinktank must pick one agent to recommend, this top-1 divergence is the decision that matters.
Based on these findings, thinktank defaults to Copeland scoring:
- Theoretically principled: Copeland is Condorcet-consistent and scale-independent
- Empirically validated: 96% agreement with Borda (n=53 with stored scores, 84% at n=73) across diverse tasks and languages
- No arbitrary weights: Eliminates the 100/50/10 point allocation debate
- Transparent: Each criterion is a clear "win" or "loss" — easier for users to understand why an agent was recommended
- Statistically supported: Cochran's Q (p<0.0001) confirms Copeland-Borda agreement is significantly higher than Weighted-Copeland agreement
Weighted Sum remains useful when users want to explicitly emphasize one criterion (e.g., "I only care about tests passing"). The `--scoring weighted` flag remains available.
- Ground truth validation: Use LLM-as-judge or human evaluation to independently rate solution quality, then correlate with each scoring method's recommendations
- Multi-codebase evaluation: Expand the A* and ML examples to more languages and problem domains
- Kendall's W concordance: With controlled N=5 runs, compute inter-method concordance coefficient
- Multi-model ensembles: Test whether Claude + GPT + Gemini ensembles produce different scoring dynamics than single-model ensembles
- Ensemble test generation: Use thinktank to generate test suites before running implementations, catching bad assertions via convergence (see Section 4.4)
During development of the A* pathfinding examples, a single agent wrote a test asserting the shortest maze path was 13 steps. The actual shortest path is 9. This incorrect test became the oracle for 13+ thinktank runs, causing every correctly-implemented A* solution to appear as "failed." We initially diagnosed this as correlated model failure — all agents converging on the same wrong approach.
In reality, every agent was correct and the test was wrong. The signal was there: all agents passed 6/7 tests, failing only on test_maze. A single failing test across all agents should have prompted investigation of the test, not the agents.
This experience led to two changes:
- Exit-127 detection in the test runner — distinguishes "test command not found" from "tests ran and failed," preventing false penalties in Copeland scoring
- Issue #159: Ensemble test generation — using thinktank to write test suites via ensemble, where convergence analysis catches disagreements in expected values before they become the oracle
- Merlin, V., & Valognes, F. (2004). "On the coincidence of Condorcet and Borda winners." Theory and Decision, 57(3), 249–273.
- Tetlock, P., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
- Kambhampati, S. (2024). "LLMs Can't Plan, But Can Help Planning." arXiv:2402.01817.
- Li, Y., et al. (2022). "Competition-Level Code Generation with AlphaCode." Science, 378(6624).
- Arrow, K. J. (1951). Social Choice and Individual Values. Wiley.
- Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374.
- Romano, J., Kromrey, J. D., Coraggio, J., & Skowronek, J. (2006). "Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d?" AERA Annual Meeting.
```shell
# From the thinktank repository with .thinktank/ run data:
thinktank evaluate

# For the full statistical analysis:
python scripts/scoring-analysis.py
```

The evaluate command re-scores all past runs with all three methods and displays the comparison table. The Python script adds Wilcoxon, Cliff's delta, Cochran's Q, and Spearman correlation tests.
- Total run files: 102
- Usable for scoring: 73 (72%)
- Excluded: 29 (no diffs, single agent, missing scoring data, or exit-127)

By task type:

- TypeScript feature dev: ~25 runs
- TypeScript bug fixes: ~10 runs
- TypeScript refactoring: ~8 runs
- A* pathfinding (Py/TS): ~10 runs
- ML regression (Python): ~6 runs
- ML classification (Py): ~6 runs
- CLI/error handling: ~4 runs

By ensemble size:

- 2-agent: 4 runs
- 3-agent: 12 runs
- 4-agent: 1 run
- 5-agent: 56 runs
Cochran's Q test: Generalization of McNemar's test to k>2 matched groups. Each run is a block; the three treatments are whether each method pair agrees. Q follows chi-squared with k-1 df under H₀.
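The statistic can be computed directly from the standard formula; each row below is a run (block), each column a method pair, 1 = that pair agreed:

```python
def cochrans_q(table: list[list[int]]) -> float:
    """Cochran's Q over a blocks-by-treatments binary table."""
    k = len(table[0])                                  # treatments (method pairs)
    col = [sum(row[j] for row in table) for j in range(k)]
    rows = [sum(row) for row in table]
    total = sum(rows)
    numerator = (k - 1) * (k * sum(c * c for c in col) - total * total)
    denominator = k * total - sum(r * r for r in rows)
    return numerator / denominator                     # ~ chi2, k-1 df under H0
```

Runs where every pair agrees (or none does) contribute nothing to the denominator; only mixed runs carry information.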
Wilcoxon signed-rank test: Non-parametric test for paired samples. We pair each agent's Weighted rank with its Copeland rank across all runs. Normal approximation used for n=83 non-zero differences.
Cliff's delta: Non-parametric effect size. Computed as (n_concordant - n_discordant) / n_total on the paired rank differences. Thresholds per Romano et al. (2006): negligible < 0.147, small < 0.33, medium < 0.474, large ≥ 0.474.
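A sketch consistent with the reported numbers (49 positive, 34 negative, d ≈ 0.181), which implies n_total here counts only the non-zero paired differences:

```python
def cliffs_delta(rank_diffs: list[int]) -> float:
    """Effect size over paired rank differences (Weighted rank - Copeland rank).
    Assumes ties (zero differences) are excluded from the denominator."""
    pos = sum(1 for d in rank_diffs if d > 0)
    neg = sum(1 for d in rank_diffs if d < 0)
    nonzero = pos + neg
    return (pos - neg) / nonzero if nonzero else 0.0
```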