# Technical Report: Evaluating Recommendation Scoring Methods in Ensemble AI Coding
**thinktank project** — March 2026 (updated with n=73 dataset, 102 total runs)
## Abstract
thinktank runs N parallel AI coding agents on the same task and must recommend the "best" result. We evaluate three recommendation scoring methods — Weighted Sum, Copeland Pairwise, and Borda Count — across **73 usable ensemble coding runs** (102 total) spanning 6 task types, 2 programming languages, and 5 distinct codebases. We find that **Copeland and Borda converge on the same recommendation 96% of the time**, while Weighted Sum disagrees with Copeland 32% of the time. Cochran's Q test confirms the agreement rates differ significantly across method pairs (Q=23.4, p<0.0001). Cliff's delta indicates a small but real effect (d=0.181). These results support Copeland as the default scoring method, though the Wilcoxon signed-rank test finds no systematic ranking shift (p=0.99), suggesting the methods diverge primarily on the top-1 recommendation rather than on full rankings.
## 1. Background
### 2.1 Dataset
We analyzed **73 usable ensemble coding runs** (from 102 total) collected across multiple development sessions. The dataset spans:
| Task type | Language | Codebase | Runs |
|-----------|----------|----------|------|
| Feature development | TypeScript | thinktank main | ~25 |
Note: An earlier version of this dataset included 11 runs with exit-127 test failures (test script not found in agent clones). These produced contaminated scoring data — agents were penalized for "test failures" when the test infrastructure didn't exist. These runs were identified and excluded after adding exit-127 detection to the test runner (see Section 4.4).
Inclusion criteria:
- At least 2 agents produced non-trivial diffs
### 2.4 Limitations
- The majority of runs are from a single codebase (thinktank itself); cross-project runs (A*, ML regression, ML classification) add diversity but are a minority
- No ground truth for "which agent is actually best" — we compare methods against each other, not against an oracle
- The Borda implementation uses a simplified ranking (tied ranks get first-available position, not averaged)
- Runs without test commands generate scoring data with less discriminative power on the test-pass criterion
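The simplified Borda tie handling noted above is easy to see in a small sketch. The snippet below is illustrative only — the function names are hypothetical, not thinktank's actual implementation — and contrasts first-available ranks with the standard averaged-rank treatment:

```python
def borda_first_available(scores):
    # Sort descending; tied scores get distinct consecutive ranks
    # ("first-available"), as in the simplified implementation noted above.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos + 1
    return ranks

def borda_averaged(scores):
    # Standard tie handling: tied entries share the average of their positions.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average 1-based position of the tie group
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks
```

With scores `[3, 3, 1]`, the first-available variant returns `[1, 2, 3]` while the averaged variant returns `[1.5, 1.5, 3.0]` — the tie-breaking choice shifts Borda totals whenever criteria produce ties.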
Note: With stored per-agent scores (n=53 subset), Copeland-Borda agreement is **96.2%** (51/53).
### 3.2 Agreement by Ensemble Size
| Ensemble size | Runs | W=C | C=B |
|---------------|------|-----|-----|
| 2-agent | 4 | 100% | 100% |
| 3-agent | 12 | 70% | 100% |
| 5-agent | 56 | 66% | 95% |
Disagreement concentrates in larger ensembles. With 2–3 agents, Copeland and Borda always agree. With 5 agents, W=C disagreement reaches 34%, while C=B stays at 95%.
### 3.3 Statistical Tests
#### Cochran's Q Test
Tests whether the three pairwise agreement rates (W=C, W=B, C=B) differ significantly.
- Q = 23.4, df = 2, **p < 0.0001**
- **Significant**: The agreement rates are not equal. Copeland-Borda agreement (96%) is significantly higher than Weighted-Copeland agreement (68%) and Weighted-Borda agreement (72%).
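As a sanity check, Cochran's Q is simple enough to compute directly from per-run agreement indicators. A minimal Python sketch — not the project's analysis code, and the input layout is an assumption:

```python
def cochrans_q(rows):
    """Cochran's Q over k related binary outcomes.

    rows: one list per run, e.g. [w_eq_c, w_eq_b, c_eq_b], where
    1 means that method pair agreed on the top-1 pick for the run.
    """
    k = len(rows[0])
    col = [sum(r[j] for r in rows) for j in range(k)]  # per-pair agreement totals
    row_totals = [sum(r) for r in rows]                # per-run totals
    t = sum(col)
    num = (k - 1) * (k * sum(g * g for g in col) - t * t)
    den = k * t - sum(l * l for l in row_totals)
    return num / den  # compare against chi-square with k - 1 df
```

Under the null hypothesis of equal agreement rates, Q is chi-square distributed with k−1 degrees of freedom; with the three method pairs here, df = 2.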
#### Wilcoxon Signed-Rank Test
Tests whether Weighted and Copeland produce systematically different rankings (across all agents in all runs, not just the top-1 pick).
120
124
121
-
- n = 71 non-zero rank differences (from 157 total paired observations; 86 ties)
122
-
- W+ = 1280, W- = 1276
125
+
- n = 83 non-zero rank differences (from 236 total paired observations; 153 ties)
- W+ = 1740.5, W- = 1745.5
- z = -0.011, **p = 0.99**
- **Not significant**: The methods do not systematically rank agents differently across the full ranking. They diverge on the *top-1 recommendation* but produce similar overall orderings.
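The z statistic above can be reproduced from the per-agent rank differences. A minimal sketch using the normal approximation — zeros dropped, ties given average ranks, and no tie or continuity correction (an intentional simplification):

```python
import math

def wilcoxon_z(diffs):
    # Signed-rank z: drop zero differences, rank |d| with average ranks,
    # sum the ranks of positive differences, then normal-approximate.
    d = [x for x in diffs if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    w_plus = sum(ranks[i] for i in range(n) if d[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd
```

With W+ ≈ W− (as in the result above), z is near zero and the test cannot reject "no systematic shift."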
#### Cliff's Delta
Measures the magnitude of rank differences between Weighted and Copeland.
- Thresholds: negligible < 0.147, small < 0.33, medium < 0.474, large ≥ 0.474
The effect is small but real and consistent across sample sizes (0.183 at n=36, 0.181 at n=53).
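For reference, Cliff's delta is the probability that a value drawn from one group exceeds one from the other, minus the reverse. A minimal sketch with illustrative inputs:

```python
def cliffs_delta(xs, ys):
    # d = (#{x > y} - #{x < y}) / (|xs| * |ys|), ranging over [-1, 1]
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Against the thresholds listed above, values around 0.18 fall in the "small" band.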
#### Spearman Rank Correlation
Per-run correlation between Weighted and Copeland full rankings (n=53 runs with stored score data):
- Mean ρ = 0.613
- Median ρ = 1.000 (most runs have perfect agreement)
- Min = -0.700, Max = 1.000
- 60% of runs have perfect correlation; 25% have low or negative correlation
The bimodal distribution — most runs perfectly correlated, a meaningful minority anti-correlated — explains the paradox of "methods usually agree, but when they disagree, they disagree dramatically."
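The per-run ρ values behind this distribution can be computed as the Pearson correlation of tie-averaged ranks. A minimal sketch, not the project's analysis code:

```python
import math

def average_ranks(v):
    # 1-based ranks; tied values share the average of their positions
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    pos = 0
    while pos < len(v):
        end = pos
        while end + 1 < len(v) and v[order[end + 1]] == v[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman rho = Pearson correlation applied to the rank vectors
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Perfectly agreeing rankings give ρ = 1.0; fully reversed rankings give ρ = −1.0, matching the bimodal pattern described above.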
### 3.4 Interpretation
**Copeland-Borda concordance is very strong.** At 96% (n=53 with stored scores) or 84% (n=73 from the `evaluate` command), two mathematically independent methods — pairwise tournament and rank aggregation — converge on the same recommendation. This increased from the original 86% (n=21) after cleaning contaminated exit-127 runs from the dataset.
**Weighted is the outlier.** Weighted Sum disagrees with Copeland ~32% of the time. Cochran's Q confirms this is statistically significant (p<0.0001). The disagreement is driven by:
1. **Weight sensitivity**: The 100/50/10 point allocation over-emphasizes test pass/fail. Two agents that both pass tests are differentiated only by convergence (50 pts) and diff outlier penalty (0–10 pts), creating thin margins.
2. **Scale distortion**: Weighted conflates ordinal preferences with cardinal magnitudes. A 4-point gap may reflect arbitrary weight choices rather than meaningful quality.
3. **Ensemble size effect**: 5-agent runs produce 34% W≠C disagreement vs 0% for 2-agent runs. More agents create more ranking permutations where weight sensitivity matters.
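The thin-margin effect in point 1 can be made concrete. The sketch below uses a hypothetical 100/50/10-style formula (an assumption for illustration, not thinktank's actual code): two agents that both pass tests end up separated by only a couple of points.

```python
def weighted_score(tests_pass, convergence, outlier_penalty):
    # Hypothetical 100/50/10 allocation: tests dominate, and agents
    # that both pass are separated only by the remaining 60 points.
    return (100 if tests_pass else 0) + 50 * convergence + 10 * (1 - outlier_penalty)

agent_a = weighted_score(True, 0.80, 0.1)  # strong convergence, small penalty
agent_b = weighted_score(True, 0.84, 0.5)  # better convergence, bigger penalty
margin = agent_a - agent_b                 # a ~2-point margin decides the ranking
```

A margin this thin is easily flipped by a different weight choice, which is exactly the sensitivity Copeland's ordinal comparisons avoid.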
**The methods diverge on top-1, not on overall ranking.** The Wilcoxon non-significance (p=0.99) combined with Cochran's Q significance (p<0.0001) reveals a nuanced picture: all three methods generally agree on which agents are good and which are bad, but disagree on which single agent is *best*. Since thinktank must pick one agent to recommend, this top-1 divergence is the decision that matters.
## 4. Recommendations
Based on these findings, thinktank defaults to Copeland scoring:
1. **Theoretically principled**: Copeland is Condorcet-consistent and scale-independent
2. **Empirically validated**: 96% agreement with Borda (n=53 with stored scores, 84% at n=73) across diverse tasks and languages
3. **No arbitrary weights**: Eliminates the 100/50/10 point allocation debate
4. **Transparent**: Each criterion is a clear "win" or "loss" — easier for users to understand why an agent was recommended
5. **Statistically supported**: Cochran's Q (p<0.0001) confirms Copeland-Borda agreement is significantly higher than Weighted-Copeland agreement
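For readers unfamiliar with Copeland scoring, the pairwise-tournament idea can be sketched briefly. The criteria layout and tie rules below are simplified assumptions, not thinktank's exact implementation:

```python
def copeland(scores):
    """scores[agent][criterion], higher is better on every criterion.

    Each pair of agents is compared criterion by criterion; the pair's
    winner is the agent winning more criteria, and the Copeland score
    is pairwise wins minus pairwise losses."""
    n = len(scores)
    cope = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            wins_i = sum(1 for a, b in zip(scores[i], scores[j]) if a > b)
            wins_j = sum(1 for a, b in zip(scores[i], scores[j]) if a < b)
            if wins_i > wins_j:
                cope[i] += 1
                cope[j] -= 1
            elif wins_j > wins_i:
                cope[j] += 1
                cope[i] -= 1
    return cope
```

The agent with the highest Copeland score is recommended; because only the direction of each criterion comparison matters, the method is indifferent to how the criteria are scaled.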
### 4.2 Retain Weighted as Option
Weighted Sum remains useful when users want to explicitly emphasize one criterion.

### 4.3 Future Work
2. **Multi-codebase evaluation**: Expand the A* and ML examples to more languages and problem domains
3. **Kendall's W concordance**: With controlled N=5 runs, compute the inter-method concordance coefficient
4. **Multi-model ensembles**: Test whether Claude + GPT + Gemini ensembles produce different scoring dynamics than single-model ensembles
5. **Ensemble test generation**: Use thinktank to generate test suites before running implementations, catching bad assertions via convergence (see Section 4.4)
### 4.4 Lesson learned: validate your oracle
During development of the A* pathfinding examples, a single agent wrote a test asserting the shortest maze path was 13 steps. The actual shortest path is 9. This incorrect test became the oracle for 13+ thinktank runs, causing every correctly implemented A* solution to appear as "failed." We initially diagnosed this as correlated model failure — all agents converging on the same wrong approach.
In reality, **every agent was correct and the test was wrong.** The signal was there: all agents passed 6/7 tests, failing only on `test_maze`. A single failing test across all agents should have prompted investigation of the test, not the agents.
This experience led to two changes:
1. **Exit-127 detection** in the test runner — distinguishes "test command not found" from "tests ran and failed," preventing false penalties in Copeland scoring
2. **Issue #159: Ensemble test generation** — using thinktank to write test suites via ensemble, where convergence analysis catches disagreements in expected values before they become the oracle
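The exit-127 distinction in change 1 amounts to a small classifier. The sketch below uses hypothetical names (thinktank's actual runner differs); the key fact is that POSIX shells return 127 when the test command itself cannot be found, so that case must not count against the agent.

```python
def classify_test_result(returncode):
    # 0   -> tests ran and passed
    # 127 -> shell could not find the test command: missing test
    #        infrastructure, not the agent's fault -- do not penalize
    # else-> tests ran and failed
    if returncode == 0:
        return "passed"
    if returncode == 127:
        return "infrastructure_error"
    return "failed"
```

Only the "failed" bucket should feed the test-pass criterion in scoring; "infrastructure_error" runs are excluded from the dataset, as described in Section 2.1.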
## Appendix B: Dataset Composition
```
Total run files: 102
Usable for scoring: 73 (72%)
Excluded: 29 (no diffs, single agent, missing scoring data, or exit-127)
```