---
library_name: pawn
license: apache-2.0
base_model:
  - thomas-schweich/pawn-small
  - thomas-schweich/pawn-base
  - thomas-schweich/pawn-large
tags:
  - chess
  - transformer
  - world-model
  - causal-lm
  - next-token-prediction
  - representation-learning
  - pytorch
  - rust
model_name: PAWN-{{ variant_name }}
pipeline_tag: other
citation: |
  {% raw %}@software{schweich2026pawn,
    author = {Schweich, Thomas},
    title = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
    year = {2026},
    url = {https://github.com/thomas-schweich/PAWN},
    license = {Apache-2.0}
  }{% endraw %}
model_params: {{ params_num }}
d_model: {{ d_model }}
n_layers: {{ n_layers }}
n_heads: {{ n_heads }}
d_ff: {{ d_ff }}
context_length: 256
vocab_size: 4284
datasets:
  - random-chess-games
language:
  - en
metrics:
  - accuracy
model-index:
  - name: PAWN-{{ variant_name }}
    results:
      - task:
          type: next-token-prediction
          name: Chess Move Prediction (Random Games)
        metrics:
{% if legal_rate is not none %}
          - name: Legal Move Rate
            type: accuracy
            value: {{ "%.4f"|format(legal_rate / 100) }}
{% endif %}
          - name: Top-1 Accuracy
            type: accuracy
            value: {{ "%.4f"|format(top1 / 100) }}
{% if top5 is not none %}
          - name: Top-5 Accuracy
            type: accuracy
            value: {{ "%.4f"|format(top5 / 100) }}
{% endif %}
          - name: Val Loss
            type: loss
            value: {{ "%.4f"|format(val_loss) }}
          - name: Games Seen
            type: other
            value: 25600000
---

# PAWN-{{ variant_name }}

**PAWN** (Playstyle-Agnostic World-model Network for Chess) is a causal transformer trained on random chess games. It learns legal moves, board-state representations, and game dynamics purely from uniformly random legal move sequences -- no strategic play, no hand-crafted features, no external game databases.

This is the **{{ variant_label }}** variant ({{ params }} parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.

**[GitHub Repository](https://github.com/thomas-schweich/PAWN)** -- full source code, training scripts, adapter implementations, and documentation.

## All Variants

| Variant | Parameters | Link |
|---------|------------|------|
| PAWN-Small | ~9.5M | [thomas-schweich/pawn-small](https://huggingface.co/thomas-schweich/pawn-small) |
| PAWN (Base) | ~35.8M | [thomas-schweich/pawn-base](https://huggingface.co/thomas-schweich/pawn-base) |
| PAWN-Large | ~68.4M | [thomas-schweich/pawn-large](https://huggingface.co/thomas-schweich/pawn-large) |

## Headline Metrics

| Metric | Value |
|--------|-------|
{% if legal_rate is not none %}| Legal move rate | {{ "%.2f"|format(legal_rate) }}% |
{% endif %}| Top-1 accuracy | {{ "%.2f"|format(top1) }}% |
{% if top5 is not none %}| Top-5 accuracy | {{ "%.2f"|format(top5) }}% |
{% endif %}| Val loss | {{ "%.3f"|format(val_loss) }} |

### Accuracy Ratios

PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. A ratio above 100% of the unconditioned ceiling indicates that the model has learned structure beyond simply identifying legal moves. See the [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md).

| Ceiling | Ratio |
|---------|-------|
| Unconditioned (E\[1/N_legal\] = {{ "%.2f"|format(uncond_ceiling) }}%) | {{ uncond_ratio }}% |
| Naive-conditioned (1-ply filter = {{ "%.2f"|format(naive_ceiling) }}%) | {{ naive_ratio }}% |
| Bayes-optimal conditioned (MCTS, 32 rollouts = {{ "%.2f"|format(mcts_ceiling) }}%) | {{ mcts_ratio }}% |
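The unconditioned ceiling in the first row follows from a simple argument: under uniformly random play, a position with N legal moves gives any predictor at most a 1/N chance of naming the move actually played, so the best achievable top-1 accuracy is E[1/N_legal] averaged over positions. A stdlib-only sketch of that computation (the move counts below are hypothetical placeholders, not the real distribution, which the repo estimates by sampling random games):

```python
# Why E[1/N_legal] caps top-1 accuracy on uniformly random games:
# with N equally likely legal moves, picking any single move is
# correct with probability exactly 1/N; average that over positions.
from statistics import mean

n_legal_counts = [20, 20, 22, 30, 35, 31, 28, 40, 33, 25]  # hypothetical
ceiling = mean(1.0 / n for n in n_legal_counts)
print(f"unconditioned top-1 ceiling ~ {ceiling:.2%}")
```

The conditioned ceilings in the other rows are higher because conditioning on the game's continuation (a 1-ply legality filter, or MCTS rollouts) rules out some moves, shrinking the effective N.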
{% if probes %}

## Probe Results

Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features.

| Probe | Accuracy | Description |
|-------|----------|-------------|
{% for probe in probes -%}
| {{ probe.name }} | {{ probe.result }} | {{ probe.description }} |
{% endfor %}
{% endif %}
{% if diagnostics %}

## Diagnostic Results

Edge-case diagnostics measure the model's legal move rate in specific tactical situations.

| Category | Positions | Legal Rate |
|----------|-----------|------------|
{% for diag in diagnostics -%}
| {{ diag.name }} | {{ diag.n }} | {{ diag.value }} |
{% endfor %}
{% endif %}

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | Decoder-only transformer |
| d_model | {{ d_model }} |
| Layers | {{ n_layers }} |
| Attention heads | {{ n_heads }} |
| Head dimension | {{ head_dim }} |
| d_ff | {{ d_ff }} |
| Parameters | {{ params }} |
| Vocabulary | 4,284 tokens |
| Context length | 256 tokens |
| Normalization | Pre-norm RMSNorm |
| FFN | SwiGLU (4x expansion) |
| Positional encoding | Rotary (RoPE, base 10000) |
| Embeddings | Factored (src + dst + promo) |
| Dropout | 0.0 |

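The SwiGLU row in the table above refers to the gated feed-forward variant from Shazeer (2020): a SiLU-gated up-projection multiplied elementwise with a linear up-projection, then projected back down. A minimal PyTorch sketch of the idea (dimensions are placeholders; the repo's actual module and parameter names may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).

    Sketch only -- the real d_model/d_ff come from the variant's CLMConfig.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU(d_model=256, d_ff=1024)
y = ffn(torch.randn(2, 8, 256))
print(y.shape)  # torch.Size([2, 8, 256])
```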
## Training Details

| Parameter | Value |
|-----------|-------|
| Training data | On-the-fly uniformly random legal games (no external dataset) |
| Objective | Next-token cross-entropy (non-padding positions only) |
| Total steps | 100,000 |
| Batch size | 256 |
| Games seen | 25,600,000 |
| Learning rate | 3e-4 (cosine decay with 1,000-step warmup) |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | Mixed (AMP) |
| Hardware | NVIDIA H200 |

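The learning-rate row above (3e-4 peak, cosine decay, 1,000-step warmup over 100,000 steps) can be sketched as a small stdlib function; exact endpoint handling in the repo's trainer may differ slightly:

```python
import math

def lr_at(step: int, max_steps: int = 100_000, warmup: int = 1_000,
          peak: float = 3e-4, floor: float = 0.0) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor` (sketch)."""
    if step < warmup:
        return peak * step / warmup           # linear ramp from 0
    progress = (step - warmup) / (max_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(1_000), lr_at(100_000))  # 0.0, 3e-4, ~0.0
```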
## Usage

### Loading the model

```python
import torch
from safetensors.torch import load_file
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

cfg = CLMConfig.{{ variant_factory }}()
model = PAWNCLM(cfg).cuda().eval()
weights = load_file("model.safetensors", device="cuda")
model.load_state_dict(weights)
```

Or load directly from HuggingFace:

```python
from pawn.checkpoint import load_backbone_weights
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

weights, config = load_backbone_weights("thomas-schweich/pawn-{{ variant_key }}")
cfg = CLMConfig.{{ variant_factory }}()
model = PAWNCLM(cfg).eval()
model.load_state_dict(weights)
```

### Finetuning with an adapter

```bash
uv run python scripts/train_bottleneck.py \
    --checkpoint thomas-schweich/pawn-{{ variant_key }} \
    --pgn thomas-schweich/pawn-lichess-full \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
```
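The `--bottleneck-dim 32` flag corresponds to a Houlsby-style bottleneck adapter (see Acknowledgments): a small down-project / nonlinearity / up-project residual branch trained while the PAWN backbone stays frozen. A minimal sketch of the idea -- class and attribute names here are illustrative, not the repo's API:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style bottleneck adapter (illustrative sketch).

    Down-project the hidden state to a small dimension, apply a
    nonlinearity, project back up, and add the result residually.
    Only these two small matrices train; the backbone stays frozen.
    """

    def __init__(self, d_model: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, d_model)
        # Zero-init the up-projection so the adapter starts as an
        # identity map and finetuning begins from pretrained behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

adapter = BottleneckAdapter(d_model=128, bottleneck_dim=32)
x = torch.randn(4, 128)
print(torch.allclose(adapter(x), x))  # True at init (identity start)
```

With d_model = 128 and bottleneck 32, the adapter adds roughly 2 x 128 x 32 weights per insertion point -- a small fraction of any PAWN variant's parameter count.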

## Acknowledgments

PAWN builds on ideas and tools from the following projects and publications:

| Component | Reference |
|-----------|-----------|
| Transformer | [Vaswani et al., "Attention Is All You Need", NeurIPS 2017](https://arxiv.org/abs/1706.03762) |
| RMSNorm | [Zhang & Sennrich, "Root Mean Square Layer Normalization", NeurIPS 2019](https://arxiv.org/abs/1910.07467) |
| RoPE | [Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", 2021](https://arxiv.org/abs/2104.09864) |
| SwiGLU | [Shazeer, "GLU Variants Improve Transformer", 2020](https://arxiv.org/abs/2002.05202) |
| AdamW | [Loshchilov & Hutter, "Decoupled Weight Decay Regularization", ICLR 2019](https://arxiv.org/abs/1711.05101) |
| Cosine schedule | [Loshchilov & Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts", ICLR 2017](https://arxiv.org/abs/1608.03983) |
| Mixed precision | [Micikevicius et al., "Mixed Precision Training", ICLR 2018](https://arxiv.org/abs/1710.03740) |
| Bottleneck adapters | [Houlsby et al., "Parameter-Efficient Transfer Learning for NLP", ICML 2019](https://arxiv.org/abs/1902.00751) |
| LoRA | [Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", ICLR 2022](https://arxiv.org/abs/2106.09685) |
| FiLM | [Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer", AAAI 2018](https://arxiv.org/abs/1709.07871) |
| RoSA | [Nikdan et al., "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation", 2024](https://arxiv.org/abs/2401.04679) |
| Linear probes | [Alain & Bengio, "Understanding Intermediate Layers Using Linear Classifier Probes", ICLR Workshop 2017](https://arxiv.org/abs/1610.01644) |
| Intrinsic dimensionality | [Aghajanyan et al., "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning", ACL 2021](https://arxiv.org/abs/2012.13255) |
| MAIA | [McIlroy-Young et al., "Aligning Superhuman AI with Human Behavior: Chess as a Model System", KDD 2020](https://arxiv.org/abs/2006.01855) |
| AlphaZero | [Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play", Science 2018](https://arxiv.org/abs/1712.01815) |
| Leela Chess Zero | [github.com/LeelaChessZero/lc0](https://github.com/LeelaChessZero/lc0) |
| shakmaty | [github.com/niklasf/shakmaty](https://github.com/niklasf/shakmaty) |
| PyO3 | [github.com/PyO3/pyo3](https://github.com/PyO3/pyo3) |
| Lichess | [lichess.org](https://lichess.org/) / [database.lichess.org](https://database.lichess.org/) |

## Citation

{% raw %}
```bibtex
@software{schweich2026pawn,
  author = {Schweich, Thomas},
  title = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
  year = {2026},
  url = {https://github.com/thomas-schweich/PAWN},
  license = {Apache-2.0}
}
```
{% endraw %}

## License

Apache 2.0. See [LICENSE](https://github.com/thomas-schweich/PAWN/blob/main/LICENSE).