
Commit 6fcb11a

Fix accuracy ceiling: split-half bias correction, bootstrap CIs (#19)
* Fix accuracy ceiling methodology: split-half bias correction, CIs, rename MCTS→MC

  The old MC conditional ceiling (7.92% with 32 rollouts) was substantially inflated by the max-over-noisy-estimates bias. This commit:

  Rust (engine/src/random.rs, lib.rs):
  - Split-half bias correction: rollouts are divided into two independent halves — A selects the argmax move, B evaluates it — producing a downward-biased estimate that brackets the true ceiling with the upward-biased naive estimate.
  - Add game_idx field to PositionCeiling for clustered bootstrap.
  - Add progress logging via atomic counter (~5% increments to stderr).

  Python (scripts/compute_theoretical_ceiling.py):
  - Bootstrap 95% CIs clustered by game (not position) to account for within-game correlation.
  - Report both naive and corrected estimates with bias bracket width.
  - Rename MCTS→MC throughout.

  Results (5000 games, 128 rollouts, 5% sample rate, 59,969 positions):
  - Unconditional: 6.52% [6.44, 6.61]
  - MC naive (biased up): 7.34% [7.24, 7.43] (was 7.92% at 32 rollouts)
  - MC corrected (biased down): 6.67% [6.58, 6.76]
  - Bracket width: 0.66pp

  Docs: rewrite ACCURACY_CEILING.md with final numbers, honest limitations, bracket notation. Add CEILING_POSTMORTEM.md with full comparison. Update model cards, README, CLAUDE.md.

* Remove fallback defaults in load_ceilings, move monitoring script to local/

  load_ceilings() now fails loudly if conditional_corrected_ceiling or n_rollouts are missing from the JSON artifact, consistent with the project's policy of never silently falling back to misleading numbers.
1 parent a05ef6e commit 6fcb11a

File tree

12 files changed (+624, -137 lines)


CLAUDE.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -160,7 +160,7 @@ MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eva
 uv run python scripts/compute_theoretical_ceiling.py
 ```
 
-Computes upper bounds on top-1 accuracy for random games: unconditional (E[1/N_legal] = 6.43%), naive-conditioned (1-ply filter = 6.44%), MCTS-conditioned (32 rollouts = 7.92%). CPU-intensive.
+Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
 
 ### Export to HuggingFace
````

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -122,7 +122,7 @@ Despite training exclusively on random games, PAWN develops rich internal repres
 
 The model also achieves >99.8% legal move rate on the base and large variants, correctly identifying legal moves from move history alone.
 
-The [theoretical accuracy ceiling](docs/ACCURACY_CEILING.md) for random game prediction is 6.43% (unconditional) to 7.92% (MCTS-conditioned on outcome). All three models exceed the unconditional ceiling, confirming they learn structure beyond move legality.
+The [theoretical accuracy ceiling](docs/ACCURACY_CEILING.md) for random game prediction is 6.52% (unconditional). The MC-conditioned ceiling (Bayes-optimal with outcome knowledge) is estimated at [6.67%, 7.34%] via split-half bias correction. All three models exceed the unconditional ceiling, confirming they exploit the outcome token to make non-uniform predictions.
 
 ## Adapter Methods
```

cards/hf_model_card.md.j2

Lines changed: 2 additions & 3 deletions

```diff
@@ -88,13 +88,12 @@ This is the **{{ variant_label }}** variant ({{ params }} parameters). PAWN is d
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
 | Unconditioned (E\[1/N_legal\] = {{ "%.2f"|format(uncond_ceiling) }}%) | {{ uncond_ratio }}% |
-| Naive-conditioned (1-ply filter = {{ "%.2f"|format(naive_ceiling) }}%) | {{ naive_ratio }}% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = {{ "%.2f"|format(mcts_ceiling) }}%) | {{ mcts_ratio }}% |
+| Bayes-optimal conditioned (MC, {{ n_rollouts }} rollouts = \[{{ "%.2f"|format(mc_corrected_ceiling) }}, {{ "%.2f"|format(mc_naive_ceiling) }}\]%) | {{ mc_corrected_ratio }}–{{ mc_naive_ratio }}% |
 {% if probes %}
 
 ## Probe Results
```

cards/model/pawn-base.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -88,13 +88,12 @@ This is the **base (default)** variant (~35.8M parameters). PAWN is designed as
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.43%) | 109% |
-| Naive-conditioned (1-ply filter = 6.44%) | 109% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 89% |
+| Unconditioned (E\[1/N_legal\] = 6.52%) | 105% |
+| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 94–103% |
 
 
 ## Probe Results
```

cards/model/pawn-large.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -88,13 +88,12 @@ This is the **large** variant (~68.4M parameters). PAWN is designed as a frozen
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.43%) | 108% |
-| Naive-conditioned (1-ply filter = 6.44%) | 108% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 88% |
+| Unconditioned (E\[1/N_legal\] = 6.52%) | 106% |
+| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 95–104% |
 
 
 ## Probe Results
```

cards/model/pawn-small.md

Lines changed: 3 additions & 4 deletions

```diff
@@ -88,13 +88,12 @@ This is the **small** variant (~9.5M parameters). PAWN is designed as a frozen b
 
 ### Accuracy Ratios
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves. See [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).
+PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
 
 | Ceiling | Ratio |
 |---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.43%) | 105% |
-| Naive-conditioned (1-ply filter = 6.44%) | 105% |
-| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 85% |
+| Unconditioned (E\[1/N_legal\] = 6.52%) | 103% |
+| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 92–101% |
 
 
 ## Probe Results
```

docs/ACCURACY_CEILING.md

Lines changed: 141 additions & 68 deletions
```diff
@@ -6,17 +6,17 @@ ceiling — no model, however large, can exceed it.
 
 ## Three ceilings
 
-### Unconditional ceiling: E[1/N_legal] = 6.43%
+### Unconditional ceiling: E[1/N_legal] = 6.52%
 
 At each position, the move is drawn uniformly from N legal moves. The best
-a predictor can do without any context is pick one at random: accuracy = 1/N.
-Averaged over all positions in random games, this gives **6.43%**.
+a predictor can do is pick one at random: accuracy = 1/N. Averaged over all
+positions in random games, this gives **6.52%** (95% CI [6.44, 6.61]).
 
-A model that exceeds this ceiling has learned something beyond just "which
-moves are legal" — it has learned to estimate the number of legal moves at
-each position and bias predictions toward positions with fewer options.
+A model that exceeds this ceiling has learned to use the outcome token to
+make non-uniform predictions — assigning higher probability to moves that
+are more consistent with the known game outcome.
 
-### Naive conditional ceiling: 6.44%
+### Naive conditional ceiling: 6.53%
 
 A zero-cost analytical estimate of outcome conditioning. At each position,
 legal moves that lead to an immediate terminal state with a *different*
@@ -26,95 +26,168 @@ This barely exceeds the unconditional ceiling (1.00x boost) because
 immediate terminal states are rare — most moves at most positions lead to
 non-terminal continuations, so the filter has almost nothing to exclude.
```
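The 1-ply filter described above can be sketched in a few lines of Python. This is an illustrative stand-in for the project's Rust implementation; the function name and its input encoding (immediate outcome per move, `None` when the game continues) are hypothetical:

```python
# Hypothetical sketch of the naive 1-ply filter (not the project's Rust code).
# For each legal move we only know its *immediate* result: None if the game
# continues, or a terminal outcome string if the move ends the game at once.

def naive_conditional_ceiling(immediate_outcomes, known_outcome):
    """Ceiling at one position: prune moves that terminate immediately with
    an outcome different from the known one, then pick uniformly."""
    k = len(immediate_outcomes)
    pruned = sum(
        1 for o in immediate_outcomes
        if o is not None and o != known_outcome
    )
    survivors = k - pruned
    # Assumed fallback: if everything were pruned, fall back to uniform.
    return 1.0 / survivors if survivors else 1.0 / k

# A position with 10 legal moves, one of which mates immediately; knowing
# the game actually ran to the ply limit prunes that move: 1/9 instead of 1/10.
moves = [None] * 9 + ["WHITE_CHECKMATES"]
print(round(naive_conditional_ceiling(moves, "PLY_LIMIT"), 4))  # 0.1111
```

As the document notes, immediate terminal states are rare in practice, so the pruned count is almost always zero and this ceiling barely exceeds 1/K.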
```diff
 
-### MCTS conditional ceiling: 7.92%
+### MC conditional ceiling: [6.67%, 7.34%]
 
 The full Monte Carlo estimate. At each sampled position, every legal move is
-tried and 32 random continuations are played out to estimate
+tried and random continuations are played out to estimate
 P(outcome | move, history). The Bayes-optimal predictor picks the move most
-consistent with the known outcome.
+consistent with the known outcome:
+
+    P(m_i | outcome, history) = P(outcome | m_i, history) / Σ_j P(outcome | m_j, history)
+
+The ceiling at each position is max_i P(m_i | outcome, history), and the
+overall ceiling is the mean over all positions.
```
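Per position, this computation reduces to a max-over-sum of the estimated outcome probabilities. A minimal sketch (illustrative only; the real computation lives in `engine/src/random.rs`, and the zero-total fallback is an assumption):

```python
# Per-position MC ceiling: posterior over moves with a uniform prior is
# p_i / sum_j p_j, so the Bayes-optimal hit rate is max(p) / sum(p).

def mc_position_ceiling(p_outcome_given_move):
    """max_i P(m_i | outcome, history) from estimates of P(outcome | m_i)."""
    total = sum(p_outcome_given_move)
    if total == 0.0:  # no rollout ever reached the known outcome (assumed fallback)
        return 1.0 / len(p_outcome_given_move)
    return max(p_outcome_given_move) / total

# Decisive outcome: one move is far more consistent with the known result.
print(round(mc_position_ceiling([0.9, 0.05, 0.05]), 3))  # 0.9
# Ply-limit-like outcome: all moves nearly equally consistent, so close to 1/N.
print(round(mc_position_ceiling([0.9, 0.88, 0.92]), 3))  # 0.341
```

The second case previews the per-outcome results below: when every move leads to the same outcome with similar probability, conditioning adds almost nothing over 1/N.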
```diff
 
 PAWN's input sequence begins with an outcome token (`WHITE_CHECKMATES`,
 `STALEMATE`, `PLY_LIMIT`, etc.). This leaks information about the game's
 trajectory, making some moves more predictable:
 
-- **Checkmate games**: The final move must deliver checkmate. Knowing this
-  raises the ceiling at the last ply from ~5% to ~14%.
+- **Checkmate games**: The final move must deliver checkmate — constraining
+  on the last few plies.
 - **Ply limit games**: Knowing the game lasts 255 plies constrains the move
   distribution slightly.
 - **Stalemate games**: The final position has no legal moves but isn't check
-  — very constraining on late moves.
+  — constraining on late moves.
+
+#### Bias bracket
+
+The MC conditional ceiling is reported as a bracket, not a point estimate.
+The naive Monte Carlo estimator (max of noisy P̂(outcome | m_i) estimates)
+is **biased upward** because the `max` operator preferentially selects
+whichever estimate had favorable noise (Jensen's inequality:
+E[max X̂_i] ≥ max E[X̂_i]).
+
+To bound this bias, we use a **split-half** correction: rollouts are split
+into two independent halves (A and B). Half A selects the argmax move;
+half B evaluates it. This breaks the selection-evaluation feedback loop,
+producing an estimate that is **biased downward** (sometimes A picks the
+wrong argmax). The true ceiling lies between the two:
+
+    corrected (biased down) ≤ true ceiling ≤ naive (biased up)
 
-## Adjusted accuracy
+With 128 rollouts per move, the bracket is **0.66pp** wide. The corrected
+and naive 95% CIs do not overlap, confirming the bias is real and
+non-negligible at this rollout count.
```
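Both biases can be reproduced with a toy simulation. This is an assumed setup (one position with known per-move outcome probabilities, Bernoulli rollouts), not the project's code; with a close runner-up move, the naive mean typically lands above the true ceiling and the corrected mean below it:

```python
# Toy demonstration of the bias bracket. Four legal moves with known true
# P(outcome | move); each is estimated from R rollouts split into halves A/B.
import random

random.seed(0)
p_true = [0.30, 0.28, 0.10, 0.05]
true_ceiling = max(p_true) / sum(p_true)  # Bayes-optimal with exact knowledge

def estimate(p, n):
    """Fraction of n random rollouts that reach the known outcome."""
    return sum(random.random() < p for _ in range(n)) / n

R, trials = 32, 4000
naive = corrected = 0.0
for _ in range(trials):
    a = [estimate(p, R // 2) for p in p_true]   # half A: selects the argmax
    b = [estimate(p, R // 2) for p in p_true]   # half B: evaluates it
    combined = [(x + y) / 2 for x, y in zip(a, b)]
    denom = sum(combined) or 1.0
    naive += max(combined) / denom              # max over noisy estimates: biased up
    best = max(range(len(a)), key=a.__getitem__)
    corrected += b[best] / denom                # P̂_B[argmax(P̂_A)] / Σ P̂_combined

print(f"corrected {corrected/trials:.3f}  true {true_ceiling:.3f}  naive {naive/trials:.3f}")
```

Increasing `R` shrinks both biases, which is exactly why the document recommends a higher-rollout run to narrow the bracket.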
````diff
 
-| Metric | Value |
-|--------|-------|
-| Unconditional ceiling (E[1/N_legal]) | 6.43% |
-| Naive conditional ceiling (1-ply filter) | 6.44% |
-| MCTS conditional ceiling (32 rollouts) | 7.92% |
-| Conditioning boost (naive) | 1.00x |
-| Conditioning boost (MCTS) | 1.23x |
+## Summary
+
+| Metric | Value | 95% CI |
+|--------|-------|--------|
+| Unconditional ceiling (E[1/N_legal]) | 6.52% | [6.44, 6.61] |
+| Naive conditional ceiling (1-ply filter) | 6.53% | [6.44, 6.61] |
+| MC conditional ceiling (naive est.) | 7.34% | [7.24, 7.43] |
+| MC conditional ceiling (corrected) | 6.67% | [6.58, 6.76] |
+| Bias bracket width | 0.66pp | |
+| Conditioning boost (naive) | 1.12x | |
+| Conditioning boost (corrected) | 1.02x | |
 
 For a model with top-1 accuracy A:
 
-- **Adjusted (unconditional)** = A / 6.43% — measures how much the model
-  has learned about chess legality. Values > 100% mean it has learned
-  structure beyond just legal moves.
-- **Adjusted (naive conditional)** = A / 6.44% — essentially the same as
-  unconditional; confirms that 1-ply lookahead explains almost none of the
-  outcome conditioning benefit.
-- **Adjusted (MCTS conditional)** = A / 7.92% — measures how close the
+- **Adjusted (unconditional)** = A / 6.52% — measures how much the model
+  has learned beyond predicting uniformly over legal moves. Values > 100%
+  mean the model exploits the outcome token to make non-uniform predictions.
+- **Adjusted (MC conditional)** = A / ceiling — measures how close the
   model is to the Bayes-optimal predictor with perfect outcome knowledge.
-  This is the tighter bound.
+  Report against both the naive and corrected estimates for transparency.
 
 ### Final model results (100K steps)
 
-| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
-|---------|-------|-----------|---------------|--------------|
-| large (68M) | 6.94% | 108% | 108% | 88% |
-| base (36M) | 6.86% | 107% | 107% | 87% |
-| small (10M) | 6.73% | 105% | 105% | 85% |
+| Variant | Top-1 | vs Uncond | vs MC Naive | vs MC Corrected |
+|---------|-------|-----------|-------------|-----------------|
+| large (68M) | 6.94% | 106% | 95% | 104% |
+| base (36M) | 6.86% | 105% | 94% | 103% |
+| small (10M) | 6.73% | 103% | 92% | 101% |
 
-All models exceed the unconditional and naive conditional ceilings,
-confirming they learn chess structure beyond move legality. The large and
-base models reach 87-88% of the MCTS conditional ceiling.
+All models exceed the unconditional ceiling, confirming they exploit the
+outcome token. Against the MC corrected ceiling, all models appear to be
+at or slightly above the estimate — expected, since the corrected estimate
+is biased downward. The true ceiling lies somewhere in the bracket, and
+the models are close to it.
 
 ## Per-outcome breakdown
 
-| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
-|---------|--------|------------|-----------|-----------|
-| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
-| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
-| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
-| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
-| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |
-
-The naive conditional ceiling equals the unconditional ceiling across all
-outcome types — the 1-ply filter never fires in practice. The MCTS ceiling
-shows the real conditioning benefit: decisive outcomes (checkmate, stalemate,
-insufficient material) get a 2.6x boost, while ply limit games — the vast
-majority — show only 1.07x because knowing the game goes the distance
-provides minimal per-move information.
+| Outcome | Uncond | MC Naive | MC Corrected | Bracket | n |
+|---------|--------|----------|--------------|---------|---|
+| White checkmated | 5.44% | 9.63% | 6.34% | 3.29pp | 2,167 |
+| Black checkmated | 5.05% | 9.25% | 6.07% | 3.18pp | 2,382 |
+| Stalemate | 7.98% | 14.05% | 8.50% | 5.55pp | 1,029 |
+| Insufficient material | 7.18% | 12.35% | 8.03% | 4.32pp | 1,651 |
+| Ply limit | 6.59% | 6.87% | 6.63% | 0.24pp | 52,740 |
+
+The bias bracket is narrow (0.24pp) for ply-limit games, which make up 88%
+of positions — most legal moves lead to ply-limit regardless, so outcome
+probabilities are all near 0.9 and the max/sum ratio is close to 1/N. The
+bracket is wide (3-5pp) for decisive outcomes, where a few moves have high
+P(outcome | m_i) and the rest are near zero, making the max sensitive to
+noise.
+
+The conditioning benefit for decisive outcomes is real but modest. Taking
+the corrected estimates at face value:
+
+| Outcome | Corrected / Unconditional |
+|---------|--------------------------|
+| White checkmated | 1.17x |
+| Black checkmated | 1.20x |
+| Stalemate | 1.07x |
+| Insufficient material | 1.12x |
+| Ply limit | 1.01x |
+
+However, the per-outcome corrected estimates are noisy for rare outcomes
+(especially stalemate, n=1,029) and should be interpreted cautiously.
+
+## Limitations
+
+- **The bias bracket is too wide for strong quantitative claims.** At 128
+  rollouts, we can say the model is between 92-104% of the Bayes-optimal
+  ceiling (depending on which estimate is used), but not pin it down more
+  precisely. A 256- or 512-rollout run would narrow this.
+- **The MC ceiling is an estimate, not exact.** Both the naive and corrected
+  estimators have known biases. The true ceiling lies between them, but the
+  exact value is unknown without infinite rollouts.
+- **Per-outcome estimates are noisy for rare outcomes.** Checkmate and
+  stalemate positions have large per-position variance with 128 rollouts.
+  Stratified sampling (oversampling rare outcomes) would improve precision.
+- **The ceiling assumes perfect outcome knowledge.** The model must *learn*
+  P(outcome | move, history) from data, so achievable accuracy for a finite
+  model is somewhat below the theoretical ceiling.
+- **Other sources of signal are not accounted for.** The model may exploit
+  sequential structure in random games (e.g., position-dependent move
+  popularity, game-length correlations) beyond what the outcome token
+  provides. The ceiling analysis does not isolate this.
 
 ## Reproducing
 
 ```bash
-# Default: 2000 games, 32 rollouts/move, 2% sample rate
-uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069
+# Moderate precision: 5000 games, 128 rollouts/move (64 per half), 5% sample rate
+uv run python scripts/compute_theoretical_ceiling.py \
+    --n-games 5000 --rollouts 128 --sample-rate 0.05 \
+    --model-accuracy 0.069
 
-# Higher precision (slower)
-uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
+# Quick check (low precision, ~2 min)
+uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069
 ```
 
-Results are saved to `data/theoretical_ceiling.json`.
-
-## Caveats
-
-- The MCTS ceiling is an estimate, not exact. With more rollouts and higher
-  sample rates, the estimate improves but computation time increases
-  quadratically.
-- The ceiling assumes the model has perfect knowledge of P(outcome | move,
-  history). In practice, the model must learn this from data, so the
-  achievable accuracy for a finite model is somewhat below the ceiling.
-- Game length information is implicit in the outcome token (e.g., PLY_LIMIT
-  implies 255 plies). A model could theoretically use position in the
-  sequence to estimate remaining game length, further improving predictions.
+Results are saved to `data/theoretical_ceiling.json` and include bootstrap
+95% CIs clustered by game. Runtime: ~38 min on 16-core CPU for the moderate
+configuration.
+
+## Methodology
+
+The computation (implemented in Rust in `engine/src/random.rs`) works as
+follows:
+
+1. Generate N random games (uniform legal move selection).
+2. Sample a fraction of positions across all games.
+3. At each sampled position with K legal moves and known outcome O:
+   - **Unconditional**: ceiling = 1/K.
+   - **Naive conditional**: try each move; if it immediately terminates with
+     outcome ≠ O, prune it. Ceiling = 1/(K - pruned).
+   - **MC conditional**: for each legal move, play R/2 random continuations
+     from two independent seeds. Estimate P(O | m_i) from each half.
+     Naive estimate = max(P̂_combined) / Σ P̂_combined.
+     Corrected = P̂_B[argmax(P̂_A)] / Σ P̂_combined.
+4. Report means with bootstrap 95% CIs (resampled by game to account for
+   within-game correlation).
+
+See [CEILING_POSTMORTEM.md](CEILING_POSTMORTEM.md) for a detailed comparison
+with the original computation and discussion of implications.
````
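The game-clustered bootstrap described for the CIs can be sketched as follows. This is an illustrative Python version on toy data, not the project's script; `game_idx` mirrors the field added to `PositionCeiling`, while the function name and toy data generator are hypothetical:

```python
# Game-clustered bootstrap: positions from the same game are correlated,
# so resample whole games with replacement rather than individual positions.
import random

def clustered_bootstrap_ci(ceilings, game_idx, n_boot=2000, seed=0):
    """95% percentile CI for the mean per-position ceiling, clustered by game."""
    rng = random.Random(seed)
    by_game = {}
    for c, g in zip(ceilings, game_idx):
        by_game.setdefault(g, []).append(c)
    games = list(by_game.values())
    means = []
    for _ in range(n_boot):
        # Draw as many games as observed, keeping each game's positions together.
        sample = [v for _ in games for v in rng.choice(games)]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Toy data: 50 games, a few sampled positions each, with within-game correlation.
rng = random.Random(1)
ceilings, game_idx = [], []
for g in range(50):
    game_mean = rng.uniform(0.05, 0.09)
    for _ in range(rng.randint(2, 6)):
        ceilings.append(min(1.0, max(0.0, rng.gauss(game_mean, 0.01))))
        game_idx.append(g)

lo, hi = clustered_bootstrap_ci(ceilings, game_idx)
print(f"95% CI: [{lo:.4f}, {hi:.4f}]")
```

Resampling positions individually would understate the variance here, since each game contributes several correlated observations; clustering by game is what makes the reported CIs honest.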
