A small causal transformer that learns legal moves, board state representations, and game dynamics purely from sequences of random legal moves, with no strategic play anywhere in its training data.
I've found PAWN to be a viable testbed for finetuning and augmentation methods at small scale. Since it is entirely unopinionated, it's a blank slate ready to be adapted, augmented, and finetuned into arbitrary player models with unique playstyles.
Finetuning PAWN has proven significantly more parameter-efficient than training new models from scratch and requires minimal compute resources.
Feel free to use PAWN in your own experiments. Note that PAWN was developed as a personal project by a single developer and his imaginary friend (Claude) and has not been published or audited. If you spot a bug or inaccuracy, please help out by creating an issue or PR.
PAWN is under active development and is not yet stable. All results are preliminary.
**Important:** I am actively re-training the model with:
- A new vocabulary borrowed from Google DeepMind's searchless_chess project (Amortized Planning with Large-Scale Transformers: A Case Study on Chess), which doesn't include impossible moves.
- A wider 512-token context window.
The information below applies to the existing models, which use the previous architecture. The last commit prior to these changes is tagged `pre-vocab-transition`; view the repository at that tag to see the implementation of the previous architecture.
Three sizes, trained for 100K steps on random games (~25.6M games each):
| Variant | d_model | Layers | Heads | Params | Top-1 | Legal Rate | Download |
|---|---|---|---|---|---|---|---|
| PAWN-Small | 256 | 8 | 4 | ~9.5M | 6.75% | 99.19% | |
| PAWN (Base) | 512 | 8 | 8 | ~35.8M | 7.02% | 99.87% | |
| PAWN-Large | 640 | 10 | 8 | ~68.4M | 6.95% | 99.89% | |
All variants share the same architecture: RMSNorm, SwiGLU FFN, RoPE, factored move embeddings, and a 4278-token vocabulary covering:
- all possible (src, dst) pairs for an 8x8 grid (the chess board),
- promotion moves: 4 piece types (queen, bishop, rook, knight) x 44 eligible (source square, destination square) pairs for pawns reaching the 1st & 8th ranks,
- a token for each game outcome (`WHITE_CHECKMATES`, `BLACK_CHECKMATES`, `STALEMATE`, `DRAW_BY_RULE`, `PLY_LIMIT`),
- and a padding token.
PAWN learns to avoid impossible moves like `a1a1` and `b1a5` since they don't appear in its training examples.
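The arithmetic behind the 4278-token count can be checked directly. This is a sketch of the counting only; the actual index layout of PAWN's tokenizer is not shown here:

```python
# Sanity-check the 4278-token vocabulary described above.
src_dst_pairs = 64 * 64          # every (source, destination) pair on an 8x8 board

# Promotion pairs: per side, a promoting pawn on each of the 8 files can push
# straight (8 options) or capture diagonally (1 each on the a/h files, 2 on
# the six inner files = 14 options), giving 22 (src, dst) pairs per side.
promo_pairs = 2 * (8 + 14)       # 44 eligible pairs across both ranks
promo_tokens = 4 * promo_pairs   # queen, bishop, rook, knight promotions

outcome_tokens = 5               # WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE,
                                 # DRAW_BY_RULE, PLY_LIMIT
pad_tokens = 1

vocab_size = src_dst_pairs + promo_tokens + outcome_tokens + pad_tokens
print(vocab_size)  # → 4278
```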
Each token is best thought of as a move in UCI notation -- a coordinate pair. Tokens carry no information about the piece type, the side to play, or the board state.
For example, e2e4 could double push the king's pawn, but the same token would be used for moving a rook from e2 to e4 in the late game. The model learns to track which type of piece is on each square at any given ply.
It also isn't told what piece types exist, what movement patterns they follow, or indeed the concept of a piece. All of that understanding comes from observation and can be isolated via linear probes (Alain & Bengio, 2016).
```shell
# Clone and build
git clone https://github.com/thomas-schweich/PAWN.git && cd PAWN

# Build the Rust chess engine (required -- handles all game logic)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python dependencies
uv sync --extra cu128  # NVIDIA GPU (or --extra rocm for AMD)
```

Weights and data can be loaded directly from HuggingFace:
```shell
uv run python scripts/train_bottleneck.py \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
```

Random games are generated on-the-fly; no dataset required:
```shell
uv run python scripts/train.py --variant base --local-checkpoints

# Or train all three variants simultaneously on shared data
uv run python scripts/train_all.py --local-checkpoints
```

```shell
uv run python scripts/eval_probes.py --log-dir logs --device cuda
uv run python -m pawn.dashboard --log-dir logs  # real-time monitoring
```

These datasets are for adapter training (behavioral cloning), not for pretraining PAWN itself. PAWN is pretrained exclusively on random legal games generated on-the-fly -- it never sees human or engine games during pretraining. The datasets below provide real gameplay data for finetuning the frozen PAWN backbone into player models that mimic specific playstyles or skill levels.
| Dataset | Games | Description | Link |
|---|---|---|---|
| Lichess Full | ~289M train + 50K val + 50K test | Rated games from Q1 2025 (all Elos), holdout from Jan 2026 | pawn-lichess-full |
| Stockfish nodes=1 | 900K train + 50K val + 50K test | NNUE self-play, 1 node/move | stockfish-nodes1 |
All datasets use the PAWN token format: pre-tokenized `list[int16]` move sequences, ready for training without any parsing. The Lichess dataset also includes clock annotations, Stockfish eval annotations (~8% of games), player hashes, Elo ratings, and game metadata.
Datasets load directly from HuggingFace via Polars lazy scan -- predicate pushdown on columns like `white_elo` and `date` lets you efficiently filter to specific Elo bands or time periods without downloading the full dataset.
More info: docs/ARCHITECTURE.md
PAWN is a standard decoder-only transformer trained with next-token prediction on chess move sequences. Each training example is:
```
[outcome] [ply_1] [ply_2] ... [ply_N] [PAD] ... [PAD]
```
Ply tokens use a factored embedding: each move is decomposed into source square + destination square + promotion piece, with embeddings summed. This gives the model explicit spatial structure while keeping the vocabulary compact. The context window of all variants is 256 tokens.
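A minimal sketch of what a factored move embedding looks like, assuming illustrative table shapes and rank-major square indexing (`a1 = 0`); this is not PAWN's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# One table per factor; a move token is decomposed into (src, dst, promo)
# indices and the three per-factor embeddings are summed.
src_emb = rng.normal(size=(64, d_model))    # source squares
dst_emb = rng.normal(size=(64, d_model))    # destination squares
promo_emb = rng.normal(size=(5, d_model))   # none / queen / bishop / rook / knight

def embed_move(src: int, dst: int, promo: int = 0) -> np.ndarray:
    """Factored embedding: sum of the three per-factor vectors."""
    return src_emb[src] + dst_emb[dst] + promo_emb[promo]

# e2e4 under rank-major indexing: e2 = 12, e4 = 28, no promotion.
vec = embed_move(12, 28)
print(vec.shape)  # → (512,)
```

Summing factor embeddings keeps the vocabulary's spatial structure explicit: moves sharing a source or destination square share parameters by construction.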
The model's predictions are not masked to legal moves during training; it has to determine what moves are currently legal based on the sequence of moves so far.
No attempt is made to provide the model with any explicit piece information. In other words, it only thinks in moves. There is no equivalent of the multi-plane 8x8xN board representation used by e.g. AlphaZero (Silver et al., 2018) and Lc0, so any and all state representation is learned by the model internally.
Despite training exclusively on random games, PAWN develops rich internal representations. Linear probes on the base model's hidden states decode:
| Probe | Accuracy |
|---|---|
| Side to move | 100.0% |
| En passant square | 99.7% |
| Castling rights | 96.6% |
| Game phase | 90.7% |
| Piece type at square | 89.7% |
| Is check | 94.2% |
| Material count (MAE) | 6.1 |
The base and large variants also achieve a >99.8% legal move rate, correctly identifying legal moves from move history alone.
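As an illustration of the linear-probe methodology (not PAWN's actual probe code), the sketch below fits a least-squares probe on synthetic "hidden states" in which a binary property, standing in for something like side to move, is linearly encoded:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen transformer activations: 2000 vectors of
# width 64, with a binary property encoded along a random direction.
n, d = 2000, 64
hidden = rng.normal(size=(n, d))
direction = rng.normal(size=d)
labels = (hidden @ direction > 0).astype(float)

# A linear probe is just a least-squares fit from frozen activations
# to the property; the model itself is never updated.
w, *_ = np.linalg.lstsq(hidden, labels - 0.5, rcond=None)
preds = (hidden @ w > 0).astype(float)
accuracy = (preds == labels).mean()
print(accuracy)  # well above chance (0.5) when the property is linearly decodable
```

High probe accuracy indicates the property is linearly readable from the activations, which is the sense in which PAWN's board state is said to be "decoded" above.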
The theoretical accuracy ceiling for random game prediction is 6.52% (unconditional). The MC-conditioned ceiling (Bayes-optimal with outcome knowledge) is estimated at [6.67%, 7.34%] via split-half bias correction. All three models exceed the unconditional ceiling, confirming they exploit the outcome token to make non-uniform predictions.
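The unconditional ceiling follows from the fact that at a position with b legal moves, a uniformly random next move can be predicted correctly with probability at most 1/b, so the ceiling is the expectation of 1/b over positions reached by random play. A toy sketch with made-up branching factors (not PAWN's measured distribution):

```python
# Illustrative branching factors for positions along random games.
branching = [20, 30, 35, 31, 25, 18, 40, 22]

# Best achievable top-1 accuracy at each position is 1/b; average over positions.
ceiling = sum(1 / b for b in branching) / len(branching)
print(round(ceiling, 4))  # → 0.0388
```

With the real distribution of branching factors over random games, this expectation comes out to the 6.52% figure quoted above.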
More info: docs/ADAPTERS.md
PAWN ships with six adapter implementations for fine-tuning the frozen backbone on human game data:
| Method | Params (typical) | Accuracy (1800 Elo) | Description |
|---|---|---|---|
| Bottleneck | 131K | 41.7% | Houlsby-style residual MLP adapters |
| RoSA | configurable | -- | Gradient-informed sparse + LoRA |
| Sparse | 503K-2.7M | 40.2-44.7% | Random binary mask on frozen weights |
| LoRA | ~65K | 34.1% | Low-rank attention projection adapters |
| Hybrid | ~65K | 34.1% | LoRA + FiLM combined |
| FiLM | ~17K | 30.3% | Per-channel affine modulation* |
A 524K bottleneck adapter achieves 42.2% accuracy predicting moves by 1800-rated Lichess players, vs. 30.9% for a standalone model with the same architecture and parameter count -- an ~11 percentage point "free" accuracy lift from the frozen backbone.
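A minimal sketch of the Houlsby-style bottleneck pattern, with illustrative dimensions, a ReLU nonlinearity, and a zero-initialized up-projection; PAWN's actual adapter code may differ in all of these choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 512, 32

# Bottleneck adapter: down-project, nonlinearity, up-project, then add back
# to the (frozen) residual stream. Only W_down and W_up are trained.
W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
W_up = np.zeros((bottleneck, d_model))  # zero-init: adapter starts as a no-op

def adapter(h: np.ndarray) -> np.ndarray:
    z = np.maximum(h @ W_down, 0.0)  # ReLU bottleneck (nonlinearity is assumed)
    return h + z @ W_up              # residual connection

h = rng.normal(size=(10, d_model))   # a batch of hidden states
out = adapter(h)
print(np.allclose(out, h))  # → True: at initialization the backbone is unchanged
```

The zero-initialized up-projection is a common choice because it lets finetuning start exactly from the pretrained backbone's behavior and drift away only as the adapter learns.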
```
pawn/
├── pawn/                  # Core Python package
│   ├── config.py          # Model configs (small/base/large)
│   ├── model.py           # PAWN transformer
│   ├── data.py            # Random game data pipeline
│   ├── lichess_data.py    # Lichess/Parquet data pipeline
│   ├── trainer.py         # Pretraining loop
│   ├── gpu.py             # GPU auto-detection
│   ├── adapters/          # Bottleneck, LoRA, FiLM, sparse, hybrid, RoSA
│   ├── eval_suite/        # Probes, generation tests, diagnostics
│   └── dashboard/         # Solara training dashboard
├── engine/                # Rust chess engine (PyO3 bindings via shakmaty)
├── scripts/               # Training, evaluation, and data extraction
├── deploy/                # Docker, RunPod deployment, serverless handler
├── tests/                 # Unit tests
└── docs/                  # Architecture, training, adapter docs
```
PAWN includes a bundled Rust chess engine (engine/) that handles all game simulation, move generation, legal move computation, tokenization, and PGN parsing. The engine uses shakmaty under the hood, with PyO3 bindings to Python. No Python chess libraries are used.
The engine generates training data on-the-fly via chess_engine.generate_random_games(), producing well over 100 million random games per hour. It also includes enriched PGN parsing (extracting clock annotations, Stockfish evals, and headers in a single pass) and UCI engine self-play generation.
- Architecture -- model design, embeddings, training objective
- Training -- pretraining, adapter training, deployment
- Adapters -- adapter methods, results, quick start
- Accuracy Ceiling -- theoretical limits for random game prediction
*None of the existing experiments use FiLM to condition on an external signal; they instead ask how FiLM performs when all of its parameters are learned directly.
PAWN builds on ideas and tools from the following projects and publications:
```bibtex
@software{schweich2026pawn,
  author  = {Schweich, Thomas},
  title   = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
  year    = 2026,
  url     = {https://github.com/thomas-schweich/PAWN},
  license = {Apache-2.0}
}
```

Apache 2.0. See LICENSE.