* Fix accuracy ceiling methodology: split-half bias correction, CIs, rename MCTS→MC
The old MC conditional ceiling (7.92% with 32 rollouts) was substantially
inflated by the max-over-noisy-estimates bias. This commit:
Rust (engine/src/random.rs, lib.rs):
- Split-half bias correction: rollouts are divided into two independent
halves (A selects the argmax move, B evaluates it). This yields a
downward-biased estimate which, together with the upward-biased naive
estimate, brackets the true ceiling.
- Add game_idx field to PositionCeiling for clustered bootstrap.
- Add progress logging via atomic counter (~5% increments to stderr).
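The split-half idea can be sketched in Python for a single position. This is a minimal illustration, not the Rust code in engine/src/random.rs. The function name `ceiling_estimates`, the array layout, and the per-position formula (a uniform prior over the K legal moves, so the ceiling is the sum over outcomes of max_m P(outcome | move m), divided by K) are assumptions made for this example.

```python
import numpy as np


def ceiling_estimates(outcomes, n_outcomes):
    """Return (naive, corrected) ceiling estimates for one position.

    outcomes: int array of shape (n_moves, n_rollouts); entry [m, r] is the
    outcome id of rollout r after playing candidate move m (hypothetical layout).
    """
    n_moves, n_rollouts = outcomes.shape
    half = n_rollouts // 2

    def outcome_probs(block):
        # Empirical P(outcome | move), one row per move.
        counts = np.stack([np.bincount(row, minlength=n_outcomes) for row in block])
        return counts / block.shape[1]

    p_all = outcome_probs(outcomes)
    p_a = outcome_probs(outcomes[:, :half])          # half A: selection
    p_b = outcome_probs(outcomes[:, half:2 * half])  # half B: evaluation

    # Naive: the max is taken over noisy estimates, so it is biased upward.
    naive = p_all.max(axis=0).sum() / n_moves

    # Split-half: A picks the best move per outcome, B scores that move.
    # B's estimate is independent of the selection, so the result is biased
    # downward relative to the true maximum.
    best = p_a.argmax(axis=0)
    corrected = p_b[best, np.arange(n_outcomes)].sum() / n_moves
    return naive, corrected


# Usage: 20 moves whose outcome distributions are identical, so the true
# ceiling is 1/20; the naive estimate overshoots it on average.
rng = np.random.default_rng(0)
naive, corrected = ceiling_estimates(rng.integers(0, 3, size=(20, 32)), 3)
print(f"naive={naive:.4f} corrected={corrected:.4f}")
```

Averaged over many positions, the two estimates should sit on opposite sides of the true value, which is what the bracket in the results reports.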
Python (scripts/compute_theoretical_ceiling.py):
- Bootstrap 95% CIs clustered by game (not position) to account for
within-game correlation.
- Report both naive and corrected estimates with bias bracket width.
- Rename MCTS→MC throughout.
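A bootstrap clustered by game can be sketched as follows. This is an illustrative version under stated assumptions, not the code in scripts/compute_theoretical_ceiling.py: the function name `clustered_bootstrap_ci`, its parameters, and the percentile method for the interval are hypothetical choices. The point it shows is that whole games are resampled, so positions from the same game always enter or leave a resample together.

```python
import numpy as np


def clustered_bootstrap_ci(values, game_idx, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`, resampling whole games.

    values:   per-position ceiling estimates
    game_idx: id of the game each position came from (same length as values)
    """
    values = np.asarray(values, dtype=float)
    games, inverse = np.unique(np.asarray(game_idx), return_inverse=True)

    # Per-game sums and position counts, so each resample is a cheap ratio.
    sums = np.bincount(inverse, weights=values, minlength=len(games))
    counts = np.bincount(inverse, minlength=len(games))

    rng = np.random.default_rng(seed)
    # Each row draws len(games) games with replacement.
    picks = rng.integers(0, len(games), size=(n_boot, len(games)))
    boot_means = sums[picks].sum(axis=1) / counts[picks].sum(axis=1)

    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi


# Usage with made-up numbers: six positions drawn from three games.
mean, lo, hi = clustered_bootstrap_ci(
    [0.05, 0.07, 0.06, 0.08, 0.04, 0.06], [0, 0, 1, 1, 2, 2]
)
print(f"{mean:.4f} [{lo:.4f}, {hi:.4f}]")
```

Resampling positions independently would treat correlated positions as separate observations and would tend to produce intervals that are too narrow.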
Results (5000 games, 128 rollouts, 5% sample rate, 59,969 positions):
- Unconditional: 6.52% [6.44, 6.61]
- MC naive (biased up): 7.34% [7.24, 7.43] (was 7.92% at 32 rollouts)
- MC corrected (biased down): 6.67% [6.58, 6.76]
- Bracket width: 0.66pp
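For reference, the unconditional figure is described in the CLAUDE.md text as E[1/N_legal]: the expected accuracy of a predictor that can do no better than pick one legal move uniformly. A minimal sketch, with a hypothetical function name and made-up legal-move counts:

```python
def unconditional_ceiling(legal_counts):
    """Mean of 1/N_legal over sampled positions."""
    if not legal_counts:
        raise ValueError("need at least one position")
    return sum(1.0 / n for n in legal_counts) / len(legal_counts)


# Four illustrative positions with 20, 20, 25 and 10 legal moves.
print(round(unconditional_ceiling([20, 20, 25, 10]), 4))
```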
Docs: rewrite ACCURACY_CEILING.md with final numbers, honest limitations,
bracket notation. Add CEILING_POSTMORTEM.md with full comparison. Update
model cards, README, CLAUDE.md.
* Remove fallback defaults in load_ceilings, move monitoring script to local/
load_ceilings() now fails loudly if conditional_corrected_ceiling or
n_rollouts are missing from the JSON artifact, consistent with the
project's policy of never silently falling back to misleading numbers.
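The fail-loudly behaviour might look roughly like the sketch below. Only the two key names come from the commit message; the flat JSON layout, the helper `validate_ceilings`, and the choice of `KeyError` are assumptions for illustration, and the real `load_ceilings()` may differ.

```python
import json
from pathlib import Path

# Keys named in the commit message; everything else here is hypothetical.
REQUIRED_KEYS = ("conditional_corrected_ceiling", "n_rollouts")


def validate_ceilings(data):
    """Raise instead of substituting defaults when a required key is absent."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"ceiling artifact is missing required keys: {missing}")
    return data


def load_ceilings(path):
    """Read the JSON artifact and validate it; no silent fallbacks."""
    return validate_ceilings(json.loads(Path(path).read_text()))
```

An artifact produced by an older run, without the corrected ceiling, then stops the caller immediately instead of being reported with stale numbers.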
CLAUDE.md (+1 -1: 1 addition & 1 deletion)
@@ -160,7 +160,7 @@ MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eva
 uv run python scripts/compute_theoretical_ceiling.py
 ```

-Computes upper bounds on top-1 accuracy for random games: unconditional (E[1/N_legal] = 6.43%), naive-conditioned (1-ply filter = 6.44%), MCTS-conditioned (32 rollouts = 7.92%). CPU-intensive.
+Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
README.md (+1 -1: 1 addition & 1 deletion)
@@ -122,7 +122,7 @@ Despite training exclusively on random games, PAWN develops rich internal repres
 The model also achieves >99.8% legal move rate on the base and large variants, correctly identifying legal moves from move history alone.

-The [theoretical accuracy ceiling](docs/ACCURACY_CEILING.md) for random game prediction is 6.43% (unconditional) to 7.92% (MCTS-conditioned on outcome). All three models exceed the unconditional ceiling, confirming they learn structure beyond move legality.
+The [theoretical accuracy ceiling](docs/ACCURACY_CEILING.md) for random game prediction is 6.52% (unconditional). The MC-conditioned ceiling (Bayes-optimal with outcome knowledge) is estimated at [6.67%, 7.34%] via split-half bias correction. All three models exceed the unconditional ceiling, confirming they exploit the outcome token to make non-uniform predictions.
cards/hf_model_card.md.j2 (+2 -3: 2 additions & 3 deletions)
@@ -88,13 +88,12 @@ This is the **{{ variant_label }}** variant ({{ params }} parameters). PAWN is d
 ### Accuracy Ratios

-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
cards/model/pawn-base.md (+3 -4: 3 additions & 4 deletions)
@@ -88,13 +88,12 @@ This is the **base (default)** variant (~35.8M parameters). PAWN is designed as
 ### Accuracy Ratios

-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
cards/model/pawn-large.md (+3 -4: 3 additions & 4 deletions)
@@ -88,13 +88,12 @@ This is the **large** variant (~68.4M parameters). PAWN is designed as a frozen
 ### Accuracy Ratios

-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
cards/model/pawn-small.md (+3 -4: 3 additions & 4 deletions)
@@ -88,13 +88,12 @@ This is the **small** variant (~9.5M parameters). PAWN is designed as a frozen b
 ### Accuracy Ratios

-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.