
Commit 6fcb11a

Fix accuracy ceiling: split-half bias correction, bootstrap CIs (#19)
* Fix accuracy ceiling methodology: split-half bias correction, CIs, rename MCTS→MC

  The old MC conditional ceiling (7.92% with 32 rollouts) was substantially inflated by the max-over-noisy-estimates bias. This commit:

  Rust (engine/src/random.rs, lib.rs):
  - Split-half bias correction: rollouts are divided into two independent halves — A selects the argmax move, B evaluates it — producing a downward-biased estimate that brackets the true ceiling with the upward-biased naive estimate.
  - Add game_idx field to PositionCeiling for clustered bootstrap.
  - Add progress logging via atomic counter (~5% increments to stderr).

  Python (scripts/compute_theoretical_ceiling.py):
  - Bootstrap 95% CIs clustered by game (not position) to account for within-game correlation.
  - Report both naive and corrected estimates with bias bracket width.
  - Rename MCTS→MC throughout.

  Results (5000 games, 128 rollouts, 5% sample rate, 59,969 positions):
  - Unconditional: 6.52% [6.44, 6.61]
  - MC naive (biased up): 7.34% [7.24, 7.43] (was 7.92% at 32 rollouts)
  - MC corrected (biased down): 6.67% [6.58, 6.76]
  - Bracket width: 0.66pp

  Docs: rewrite ACCURACY_CEILING.md with final numbers, honest limitations, bracket notation. Add CEILING_POSTMORTEM.md with full comparison. Update model cards, README, CLAUDE.md.

* Remove fallback defaults in load_ceilings, move monitoring script to local/

  load_ceilings() now fails loudly if conditional_corrected_ceiling or n_rollouts are missing from the JSON artifact, consistent with the project's policy of never silently falling back to misleading numbers.
1 parent a05ef6e commit 6fcb11a

File tree

12 files changed (+624, -137 lines)


CLAUDE.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -160,7 +160,7 @@ MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eva
 uv run python scripts/compute_theoretical_ceiling.py
 ```
 
-Computes upper bounds on top-1 accuracy for random games: unconditional (E[1/N_legal] = 6.43%), naive-conditioned (1-ply filter = 6.44%), MCTS-conditioned (32 rollouts = 7.92%). CPU-intensive.
+Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
 
 ### Export to HuggingFace
````

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -122,7 +122,7 @@ Despite training exclusively on random games, PAWN develops rich internal repres
 
 The model also achieves >99.8% legal move rate on the base and large variants, correctly identifying legal moves from move history alone.
 
-The [theoretical accuracy ceiling](docs/ACCURACY_CEILING.md) for random game prediction is 6.43% (unconditional) to 7.92% (MCTS-conditioned on outcome). All three models exceed the unconditional ceiling, confirming they learn structure beyond move legality.
+The [theoretical accuracy ceiling](docs/ACCURACY_CEILING.md) for random game prediction is 6.52% (unconditional). The MC-conditioned ceiling (Bayes-optimal with outcome knowledge) is estimated at [6.67%, 7.34%] via split-half bias correction. All three models exceed the unconditional ceiling, confirming they exploit the outcome token to make non-uniform predictions.
 
 ## Adapter Methods
```

cards/hf_model_card.md.j2

Lines changed: 2 additions & 3 deletions

```diff
@@ -88,13 +88,12 @@ This is the **{{ variant_label }}** variant ({{ params }} parameters). PAWN is d
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
 | Unconditioned (E\[1/N_legal\] = {{ "%.2f"|format(uncond_ceiling) }}%) | {{ uncond_ratio }}% |
-| Naive-conditioned (1-ply filter = {{ "%.2f"|format(naive_ceiling) }}%) | {{ naive_ratio }}% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = {{ "%.2f"|format(mcts_ceiling) }}%) | {{ mcts_ratio }}% |
+| Bayes-optimal conditioned (MC, {{ n_rollouts }} rollouts = \[{{ "%.2f"|format(mc_corrected_ceiling) }}, {{ "%.2f"|format(mc_naive_ceiling) }}\]%) | {{ mc_corrected_ratio }}–{{ mc_naive_ratio }}% |
 {% if probes %}
 
 ## Probe Results
```

cards/model/pawn-base.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -88,13 +88,12 @@ This is the **base (default)** variant (~35.8M parameters). PAWN is designed as
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.43%) | 109% |
-| Naive-conditioned (1-ply filter = 6.44%) | 109% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 89% |
+| Unconditioned (E\[1/N_legal\] = 6.52%) | 105% |
+| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 94–103% |
 
 
 ## Probe Results
```

cards/model/pawn-large.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -88,13 +88,12 @@ This is the **large** variant (~68.4M parameters). PAWN is designed as a frozen
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.43%) | 108% |
-| Naive-conditioned (1-ply filter = 6.44%) | 108% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 88% |
+| Unconditioned (E\[1/N_legal\] = 6.52%) | 106% |
+| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 95–104% |
 
 
 ## Probe Results
```

cards/model/pawn-small.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -88,13 +88,12 @@ This is the **small** variant (~9.5M parameters). PAWN is designed as a frozen b
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.43%) | 105% |
-| Naive-conditioned (1-ply filter = 6.44%) | 105% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 85% |
+| Unconditioned (E\[1/N_legal\] = 6.52%) | 103% |
+| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 92–101% |
 
 
 ## Probe Results
```

docs/ACCURACY_CEILING.md

Lines changed: 141 additions & 68 deletions
```diff
@@ -6,17 +6,17 @@ ceiling — no model, however large, can exceed it.
 
 ## Three ceilings
 
-### Unconditional ceiling: E[1/N_legal] = 6.43%
+### Unconditional ceiling: E[1/N_legal] = 6.52%
 
 At each position, the move is drawn uniformly from N legal moves. The best
-a predictor can do without any context is pick one at random: accuracy = 1/N.
-Averaged over all positions in random games, this gives **6.43%**.
+a predictor can do is pick one at random: accuracy = 1/N. Averaged over all
+positions in random games, this gives **6.52%** (95% CI [6.44, 6.61]).
 
-A model that exceeds this ceiling has learned something beyond just "which
-moves are legal" — it has learned to estimate the number of legal moves at
-each position and bias predictions toward positions with fewer options.
+A model that exceeds this ceiling has learned to use the outcome token to
+make non-uniform predictions — assigning higher probability to moves that
+are more consistent with the known game outcome.
 
-### Naive conditional ceiling: 6.44%
+### Naive conditional ceiling: 6.53%
 
 A zero-cost analytical estimate of outcome conditioning. At each position,
 legal moves that lead to an immediate terminal state with a *different*
@@ -26,95 +26,168 @@ This barely exceeds the unconditional ceiling (1.00x boost) because
 immediate terminal states are rare — most moves at most positions lead to
 non-terminal continuations, so the filter has almost nothing to exclude.
```
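The 1-ply filter described above can be sketched in a few lines of Python. This is an illustrative stand-in for the project's Rust implementation; the function name and its input encoding (immediate outcome per move, `None` when the game continues) are hypothetical:

```python
# Hypothetical sketch of the naive 1-ply filter (not the project's Rust code).
# For each legal move we only know its *immediate* result: None if the game
# continues, or a terminal outcome string if the move ends the game at once.

def naive_conditional_ceiling(immediate_outcomes, known_outcome):
    """Ceiling at one position: prune moves that terminate immediately with
    an outcome different from the known one, then pick uniformly."""
    k = len(immediate_outcomes)
    pruned = sum(
        1 for o in immediate_outcomes
        if o is not None and o != known_outcome
    )
    survivors = k - pruned
    # Assumed fallback: if everything were pruned, fall back to uniform.
    return 1.0 / survivors if survivors else 1.0 / k

# A position with 10 legal moves, one of which mates immediately; knowing
# the game actually ran to the ply limit prunes that move: 1/9 instead of 1/10.
moves = [None] * 9 + ["WHITE_CHECKMATES"]
print(round(naive_conditional_ceiling(moves, "PLY_LIMIT"), 4))  # 0.1111
```

As the document notes, immediate terminal states are rare in practice, so the pruned count is almost always zero and this ceiling barely exceeds 1/K.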
```diff
 
-### MCTS conditional ceiling: 7.92%
+### MC conditional ceiling: [6.67%, 7.34%]
 
 The full Monte Carlo estimate. At each sampled position, every legal move is
-tried and 32 random continuations are played out to estimate
+tried and random continuations are played out to estimate
 P(outcome | move, history). The Bayes-optimal predictor picks the move most
-consistent with the known outcome.
+consistent with the known outcome:
+
+    P(m_i | outcome, history) = P(outcome | m_i, history) / Σ_j P(outcome | m_j, history)
+
+The ceiling at each position is max_i P(m_i | outcome, history), and the
+overall ceiling is the mean over all positions.
```
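Per position, this computation reduces to a max-over-sum of the estimated outcome probabilities. A minimal sketch (illustrative only; the real computation lives in `engine/src/random.rs`, and the zero-total fallback is an assumption):

```python
# Per-position MC ceiling: posterior over moves with a uniform prior is
# p_i / sum_j p_j, so the Bayes-optimal hit rate is max(p) / sum(p).

def mc_position_ceiling(p_outcome_given_move):
    """max_i P(m_i | outcome, history) from estimates of P(outcome | m_i)."""
    total = sum(p_outcome_given_move)
    if total == 0.0:  # no rollout ever reached the known outcome (assumed fallback)
        return 1.0 / len(p_outcome_given_move)
    return max(p_outcome_given_move) / total

# Decisive outcome: one move is far more consistent with the known result.
print(round(mc_position_ceiling([0.9, 0.05, 0.05]), 3))  # 0.9
# Ply-limit-like outcome: all moves nearly equally consistent, so close to 1/N.
print(round(mc_position_ceiling([0.9, 0.88, 0.92]), 3))  # 0.341
```

The second case previews the per-outcome results below: when every move leads to the same outcome with similar probability, conditioning adds almost nothing over 1/N.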
```diff
 
 PAWN's input sequence begins with an outcome token (`WHITE_CHECKMATES`,
 `STALEMATE`, `PLY_LIMIT`, etc.). This leaks information about the game's
 trajectory, making some moves more predictable:
 
-- **Checkmate games**: The final move must deliver checkmate. Knowing this
-  raises the ceiling at the last ply from ~5% to ~14%.
+- **Checkmate games**: The final move must deliver checkmate — constraining
+  on the last few plies.
 - **Ply limit games**: Knowing the game lasts 255 plies constrains the move
   distribution slightly.
 - **Stalemate games**: The final position has no legal moves but isn't check
-  — very constraining on late moves.
+  — constraining on late moves.
+
+#### Bias bracket
+
+The MC conditional ceiling is reported as a bracket, not a point estimate.
+The naive Monte Carlo estimator (max of noisy P̂(outcome | m_i) estimates)
+is **biased upward** because the `max` operator preferentially selects
+whichever estimate had favorable noise (Jensen's inequality:
+E[max X̂_i] ≥ max E[X̂_i]).
+
+To bound this bias, we use a **split-half** correction: rollouts are split
+into two independent halves (A and B). Half A selects the argmax move;
+half B evaluates it. This breaks the selection-evaluation feedback loop,
+producing an estimate that is **biased downward** (sometimes A picks the
+wrong argmax). The true ceiling lies between the two:
+
+    corrected (biased down) ≤ true ceiling ≤ naive (biased up)
 
-## Adjusted accuracy
+With 128 rollouts per move, the bracket is **0.66pp** wide. The corrected
+and naive 95% CIs do not overlap, confirming the bias is real and
+non-negligible at this rollout count.
```
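Both biases can be reproduced with a toy simulation. This is an assumed setup (one position with known per-move outcome probabilities, Bernoulli rollouts), not the project's code; with a close runner-up move, the naive mean typically lands above the true ceiling and the corrected mean below it:

```python
# Toy demonstration of the bias bracket. Four legal moves with known true
# P(outcome | move); each is estimated from R rollouts split into halves A/B.
import random

random.seed(0)
p_true = [0.30, 0.28, 0.10, 0.05]
true_ceiling = max(p_true) / sum(p_true)  # Bayes-optimal with exact knowledge

def estimate(p, n):
    """Fraction of n random rollouts that reach the known outcome."""
    return sum(random.random() < p for _ in range(n)) / n

R, trials = 32, 4000
naive = corrected = 0.0
for _ in range(trials):
    a = [estimate(p, R // 2) for p in p_true]   # half A: selects the argmax
    b = [estimate(p, R // 2) for p in p_true]   # half B: evaluates it
    combined = [(x + y) / 2 for x, y in zip(a, b)]
    denom = sum(combined) or 1.0
    naive += max(combined) / denom              # max over noisy estimates: biased up
    best = max(range(len(a)), key=a.__getitem__)
    corrected += b[best] / denom                # P̂_B[argmax(P̂_A)] / Σ P̂_combined

print(f"corrected {corrected/trials:.3f}  true {true_ceiling:.3f}  naive {naive/trials:.3f}")
```

Increasing `R` shrinks both biases, which is exactly why the document recommends a higher-rollout run to narrow the bracket.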
````diff
 
-| Metric | Value |
-|--------|-------|
-| Unconditional ceiling (E[1/N_legal]) | 6.43% |
-| Naive conditional ceiling (1-ply filter) | 6.44% |
-| MCTS conditional ceiling (32 rollouts) | 7.92% |
-| Conditioning boost (naive) | 1.00x |
-| Conditioning boost (MCTS) | 1.23x |
+## Summary
+
+| Metric | Value | 95% CI |
+|--------|-------|--------|
+| Unconditional ceiling (E[1/N_legal]) | 6.52% | [6.44, 6.61] |
+| Naive conditional ceiling (1-ply filter) | 6.53% | [6.44, 6.61] |
+| MC conditional ceiling (naive est.) | 7.34% | [7.24, 7.43] |
+| MC conditional ceiling (corrected) | 6.67% | [6.58, 6.76] |
+| Bias bracket width | 0.66pp | |
+| Conditioning boost (naive) | 1.12x | |
+| Conditioning boost (corrected) | 1.02x | |
 
 For a model with top-1 accuracy A:
 
-- **Adjusted (unconditional)** = A / 6.43% — measures how much the model
-  has learned about chess legality. Values > 100% mean it has learned
-  structure beyond just legal moves.
-- **Adjusted (naive conditional)** = A / 6.44% — essentially the same as
-  unconditional; confirms that 1-ply lookahead explains almost none of the
-  outcome conditioning benefit.
-- **Adjusted (MCTS conditional)** = A / 7.92% — measures how close the
+- **Adjusted (unconditional)** = A / 6.52% — measures how much the model
+  has learned beyond predicting uniformly over legal moves. Values > 100%
+  mean the model exploits the outcome token to make non-uniform predictions.
+- **Adjusted (MC conditional)** = A / ceiling — measures how close the
   model is to the Bayes-optimal predictor with perfect outcome knowledge.
-  This is the tighter bound.
+  Report against both the naive and corrected estimates for transparency.
 
 ### Final model results (100K steps)
 
-| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
-|---------|-------|-----------|---------------|--------------|
-| large (68M) | 6.94% | 108% | 108% | 88% |
-| base (36M) | 6.86% | 107% | 107% | 87% |
-| small (10M) | 6.73% | 105% | 105% | 85% |
+| Variant | Top-1 | vs Uncond | vs MC Naive | vs MC Corrected |
+|---------|-------|-----------|-------------|-----------------|
+| large (68M) | 6.94% | 106% | 95% | 104% |
+| base (36M) | 6.86% | 105% | 94% | 103% |
+| small (10M) | 6.73% | 103% | 92% | 101% |
 
-All models exceed the unconditional and naive conditional ceilings,
-confirming they learn chess structure beyond move legality. The large and
-base models reach 87-88% of the MCTS conditional ceiling.
+All models exceed the unconditional ceiling, confirming they exploit the
+outcome token. Against the MC corrected ceiling, all models appear to be
+at or slightly above the estimate — expected, since the corrected estimate
+is biased downward. The true ceiling lies somewhere in the bracket, and
+the models are close to it.
 
 ## Per-outcome breakdown
 
-| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
-|---------|--------|------------|-----------|-----------|
-| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
-| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
-| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
-| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
-| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |
-
-The naive conditional ceiling equals the unconditional ceiling across all
-outcome types — the 1-ply filter never fires in practice. The MCTS ceiling
-shows the real conditioning benefit: decisive outcomes (checkmate, stalemate,
-insufficient material) get a 2.6x boost, while ply limit games — the vast
-majority — show only 1.07x because knowing the game goes the distance
-provides minimal per-move information.
+| Outcome | Uncond | MC Naive | MC Corrected | Bracket | n |
+|---------|--------|----------|--------------|---------|---|
+| White checkmated | 5.44% | 9.63% | 6.34% | 3.29pp | 2,167 |
+| Black checkmated | 5.05% | 9.25% | 6.07% | 3.18pp | 2,382 |
+| Stalemate | 7.98% | 14.05% | 8.50% | 5.55pp | 1,029 |
+| Insufficient material | 7.18% | 12.35% | 8.03% | 4.32pp | 1,651 |
+| Ply limit | 6.59% | 6.87% | 6.63% | 0.24pp | 52,740 |
+
+The bias bracket is narrow (0.24pp) for ply-limit games, which make up 88%
+of positions — most legal moves lead to ply-limit regardless, so outcome
+probabilities are all near 0.9 and the max/sum ratio is close to 1/N. The
+bracket is wide (3-5pp) for decisive outcomes, where a few moves have high
+P(outcome | m_i) and the rest are near zero, making the max sensitive to
+noise.
+
+The conditioning benefit for decisive outcomes is real but modest. Taking
+the corrected estimates at face value:
+
+| Outcome | Corrected / Unconditional |
+|---------|--------------------------|
+| White checkmated | 1.17x |
+| Black checkmated | 1.20x |
+| Stalemate | 1.07x |
+| Insufficient material | 1.12x |
+| Ply limit | 1.01x |
+
+However, the per-outcome corrected estimates are noisy for rare outcomes
+(especially stalemate, n=1,029) and should be interpreted cautiously.
+
+## Limitations
+
+- **The bias bracket is too wide for strong quantitative claims.** At 128
+  rollouts, we can say the model is between 92-104% of the Bayes-optimal
+  ceiling (depending on which estimate is used), but not pin it down more
+  precisely. A 256- or 512-rollout run would narrow this.
+- **The MC ceiling is an estimate, not exact.** Both the naive and corrected
+  estimators have known biases. The true ceiling lies between them, but the
+  exact value is unknown without infinite rollouts.
+- **Per-outcome estimates are noisy for rare outcomes.** Checkmate and
+  stalemate positions have large per-position variance with 128 rollouts.
+  Stratified sampling (oversampling rare outcomes) would improve precision.
+- **The ceiling assumes perfect outcome knowledge.** The model must *learn*
+  P(outcome | move, history) from data, so achievable accuracy for a finite
+  model is somewhat below the theoretical ceiling.
+- **Other sources of signal are not accounted for.** The model may exploit
+  sequential structure in random games (e.g., position-dependent move
+  popularity, game-length correlations) beyond what the outcome token
+  provides. The ceiling analysis does not isolate this.
 
 ## Reproducing
 
 ```bash
-# Default: 2000 games, 32 rollouts/move, 2% sample rate
-uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069
+# Moderate precision: 5000 games, 128 rollouts/move (64 per half), 5% sample rate
+uv run python scripts/compute_theoretical_ceiling.py \
+    --n-games 5000 --rollouts 128 --sample-rate 0.05 \
+    --model-accuracy 0.069
 
-# Higher precision (slower)
-uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
+# Quick check (low precision, ~2 min)
+uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069
 ```
 
-Results are saved to `data/theoretical_ceiling.json`.
-
-## Caveats
-
-- The MCTS ceiling is an estimate, not exact. With more rollouts and higher
-  sample rates, the estimate improves but computation time increases
-  quadratically.
-- The ceiling assumes the model has perfect knowledge of P(outcome | move,
-  history). In practice, the model must learn this from data, so the
-  achievable accuracy for a finite model is somewhat below the ceiling.
-- Game length information is implicit in the outcome token (e.g., PLY_LIMIT
-  implies 255 plies). A model could theoretically use position in the
-  sequence to estimate remaining game length, further improving predictions.
+Results are saved to `data/theoretical_ceiling.json` and include bootstrap
+95% CIs clustered by game. Runtime: ~38 min on 16-core CPU for the moderate
+configuration.
+
+## Methodology
+
+The computation (implemented in Rust in `engine/src/random.rs`) works as
+follows:
+
+1. Generate N random games (uniform legal move selection).
+2. Sample a fraction of positions across all games.
+3. At each sampled position with K legal moves and known outcome O:
+   - **Unconditional**: ceiling = 1/K.
+   - **Naive conditional**: try each move; if it immediately terminates with
+     outcome ≠ O, prune it. Ceiling = 1/(K - pruned).
+   - **MC conditional**: for each legal move, play R/2 random continuations
+     from two independent seeds. Estimate P(O | m_i) from each half.
+     Naive estimate = max(P̂_combined) / Σ P̂_combined.
+     Corrected = P̂_B[argmax(P̂_A)] / Σ P̂_combined.
+4. Report means with bootstrap 95% CIs (resampled by game to account for
+   within-game correlation).
+
+See [CEILING_POSTMORTEM.md](CEILING_POSTMORTEM.md) for a detailed comparison
+with the original computation and discussion of implications.
````
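The game-clustered bootstrap described for the CIs can be sketched as follows. This is an illustrative Python version on toy data, not the project's script; `game_idx` mirrors the field added to `PositionCeiling`, while the function name and toy data generator are hypothetical:

```python
# Game-clustered bootstrap: positions from the same game are correlated,
# so resample whole games with replacement rather than individual positions.
import random

def clustered_bootstrap_ci(ceilings, game_idx, n_boot=2000, seed=0):
    """95% percentile CI for the mean per-position ceiling, clustered by game."""
    rng = random.Random(seed)
    by_game = {}
    for c, g in zip(ceilings, game_idx):
        by_game.setdefault(g, []).append(c)
    games = list(by_game.values())
    means = []
    for _ in range(n_boot):
        # Draw as many games as observed, keeping each game's positions together.
        sample = [v for _ in games for v in rng.choice(games)]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Toy data: 50 games, a few sampled positions each, with within-game correlation.
rng = random.Random(1)
ceilings, game_idx = [], []
for g in range(50):
    game_mean = rng.uniform(0.05, 0.09)
    for _ in range(rng.randint(2, 6)):
        ceilings.append(min(1.0, max(0.0, rng.gauss(game_mean, 0.01))))
        game_idx.append(g)

lo, hi = clustered_bootstrap_ci(ceilings, game_idx)
print(f"95% CI: [{lo:.4f}, {hi:.4f}]")
```

Resampling positions individually would understate the variance here, since each game contributes several correlated observations; clustering by game is what makes the reported CIs honest.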
