Lightweight AlphaZero-style pipeline for a custom 4×4 stacking / tower control game: self-play generation, PUCT MCTS (with Dirichlet noise), policy–value network, supervised updates, iterative gating.
- Board: 4×4 (16 cells).
- Pieces per player: 5 squares, 5 circles, 5 arrows (arrow direction chosen on placement: up/right/down/left).
- Placement constraints (depend on the previous move; see the mask sketch after the encoding list below):
  - Previous was a square at (r,c): the next move must go to one of its 4 orthogonal neighbours.
  - Previous was an arrow at (r,c) with direction d: the next move must lie anywhere along the ray from (r,c) in direction d (inclusive) until the edge.
  - Previous was a circle at (r,c): the next move must (if still legal) be on the same cell.
  - Fallback: if no cell is legal under these constraints, you may place on any empty cell (zero pieces). If none exists, the game ends.
- Cell constraints: each cell holds at most 3 pieces; piece types are unique within a cell (≤1 square, ≤1 circle, ≤1 arrow).
- Tower & scoring: when a cell reaches 3 pieces it becomes a tower. Ownership: player with strictly more pieces there (2–1 or 3–0). Final score = number of owned towers. Outcome z ∈ {+1, 0, -1} from the current player’s perspective.
- Action encoding (96 total):
  - 0–15: place square on cell i
  - 16–31: place circle on cell i
  - 32–95: place arrow on cell i = (a−32) // 4; direction = (a−32) % 4 (0 up, 1 right, 2 down, 3 left)
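A minimal sketch of how these rules become a 96-entry legal-action mask (hypothetical helper and field names; the first move is assumed unconstrained, and per-player remaining-piece counts are omitted for brevity):

```python
import numpy as np

DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # 0 up, 1 right, 2 down, 3 left

def legal_mask(board, prev):
    """board[r][c]: set of piece kinds on that cell; prev: (kind, r, c, dir) or None."""
    if prev is None:                                      # assumed: first move unconstrained
        cells = [(r, c) for r in range(4) for c in range(4)]
    else:
        kind, r, c, d = prev
        if kind == "square":                              # 4 orthogonal neighbours
            cells = [(r + dr, c + dc) for dr, dc in DIRS]
        elif kind == "arrow":                             # ray from (r, c), inclusive
            dr, dc = DIRS[d]
            cells = [(r + k * dr, c + k * dc) for k in range(4)]
        else:                                             # circle: same cell again
            cells = [(r, c)]
        cells = [(i, j) for i, j in cells if 0 <= i < 4 and 0 <= j < 4]
    mask = np.zeros(96, dtype=bool)
    for i, j in cells:
        stack, cell = board[i][j], 4 * i + j
        if len(stack) >= 3:                               # full cell: already a tower
            continue
        mask[cell] = "square" not in stack                # ≤1 of each type per cell
        mask[16 + cell] = "circle" not in stack
        mask[32 + 4 * cell: 32 + 4 * cell + 4] = "arrow" not in stack
    if not mask.any():                                    # fallback: any empty cell
        for i in range(4):
            for j in range(4):
                if not board[i][j]:
                    cell = 4 * i + j
                    mask[cell] = mask[16 + cell] = True
                    mask[32 + 4 * cell: 32 + 4 * cell + 4] = True
    return mask
```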
Training runs:
- 1st: batch size 256, lr 1e-3, 1000 epochs, from random init weights
- 2nd: batch size 256, lr 1e-3, 1000 epochs, preload `tr01_best.pt`
- 3rd: batch size 512, lr 1e-3, 1000 epochs, preload `tr02_best.pt`

Pipeline: 3-step training loop. Commands are single-line; add or adjust flags (e.g. `--model`, `--mcts-sims`, temperature) as you iterate.
- Self-play (produce `data/sp.npz` with `(s, π, z)` triples):

  ```bash
  python -m src.self_play --games 100 --mcts-sims 200 --out data/sp.npz --seed 42
  ```

  Generates trajectories using MCTS (PUCT) per move; visit counts become the policy target, the final outcome the value target (see the temperature sketch after this list).
- Train (fit policy & value heads):

  ```bash
  python -m src.train --data data/sp.npz --epochs 1000 --batch-size 512 --save ckpt/tr.pt --log ckpt/tr.log --seed 42
  ```

  Produces a final checkpoint plus a lowest-loss `*_best.pt` variant. Use `--model ckpt/best.pt` to continue from the previous best, or tweak AMP / device flags if needed.
- Arena (gating candidate vs best):

  ```bash
  python -m src.arena --candidate ckpt/tr.pt --best ckpt/best.pt --eval-games 50 --mcts-sims 400 --accept-rate 0.55
  ```

  Deterministic matches (no temperature / noise). Promote the candidate if its win rate meets the threshold, then loop back to step 1 with the new `best.pt`.
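The visit-count → π conversion with temperature used in the self-play step, as a toy sketch (not the repo's exact code):

```python
import numpy as np

def visits_to_pi(visits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Turn root visit counts N(s, ·) into a policy target π with temperature tau."""
    if tau == 0:                                   # late moves: deterministic argmax
        pi = np.zeros_like(visits, dtype=np.float64)
        pi[np.argmax(visits)] = 1.0
        return pi
    x = visits.astype(np.float64) ** (1.0 / tau)   # tau=1 -> proportional to visits
    return x / x.sum()
```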
Optional: plot the training curve as a sanity check.

```bash
python ckpt/visual.py --log ckpt/train.log --smooth 7 --out curve.png
```

Log format: `epoch N: loss=... policy=... value=... time=...s`.
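For a quick look without the plotting script, lines in that format parse with a one-off snippet (a sketch assuming the log format above):

```python
import re

PAT = re.compile(r"epoch (\d+): loss=([\d.eE+-]+) policy=([\d.eE+-]+) value=([\d.eE+-]+)")
with open("ckpt/train.log") as f:
    rows = [tuple(map(float, m.groups())) for m in map(PAT.search, f) if m]
print(rows[-1])   # (epoch, loss, policy, value) of the last logged epoch
```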
Play against the current model:

```bash
python -m tests.battle --model ckpt/best.pt --mcts-sims 400 --device cuda
```

Controls:
- Mouse: click a cell (highlight)
- Keys: `s` square, `c` circle, `a` arrow (press `a` repeatedly to cycle the direction 0→1→2→3)
- Preview: green outline / arrow before confirming
- Enter / Space: place; `q` / Esc: quit

Notes: `--mcts-sims` sets AI strength; without `--model` you play a randomly initialised net (weak). Lower the sims (e.g. 100) for speed; add `--delay` (if present) to slow the display.
NPZ file fields:
- `s`: float32, shape (N, C, 4, 4)
- `p`: float32, shape (N, 96)
- `z`: float32, shape (N,)
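Loading and sanity-checking the archive (field names from the schema above):

```python
import numpy as np

data = np.load("data/sp.npz")
s, p, z = data["s"], data["p"], data["z"]
print(s.shape, p.shape, z.shape)                 # (N, C, 4, 4) (N, 96) (N,)
assert len(s) == len(p) == len(z) and p.shape[1] == 96
assert np.allclose(p.sum(axis=1), 1.0)           # each π should be a distribution
```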
Core learning triple: (s, π, z).
- Observation: stacked planes (own/opponent occupancy per type, arrow direction one-hot, remaining piece counts, side-to-move).
- Network (PolicyValueNet): light residual CNN → 96 policy logits + scalar value v ∈ [−1,1].
- MCTS: PUCT selection Q+U; root Dirichlet noise for exploration; illegal actions masked then renormalised; value signs flipped up the path (see the selection sketch after this list).
- Self-Play: run N simulations per move, convert visit counts to π; use temperature sampling for early moves then argmax; game end produces z.
- Training (train.py): minimise L = CE(policy_logits, π_target) + MSE(v, z); AdamW + optional AMP. Iteration is performed manually: generate new self-play data → train → arena test.
- Stability: strict legality masking, temperature cooling; (optional) you can maintain a simple replay buffer externally by concatenating past NPZ files before training to reduce distribution shift.
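The PUCT selection from the MCTS bullet, as a minimal sketch (hypothetical node fields `N`, `W`, `P`; not the repo's exact code):

```python
import math

def select_child(node, c_puct: float = 1.5):
    """argmax over children of Q(s,a) + c_puct * P(a) * sqrt(N_total) / (1 + N(a))."""
    n_total = sum(ch.N for ch in node.children.values())
    best, best_score = None, -float("inf")
    for a, ch in node.children.items():
        q = ch.W / ch.N if ch.N else 0.0                       # mean backed-up value
        u = c_puct * ch.P * math.sqrt(n_total) / (1 + ch.N)    # exploration bonus
        if q + u > best_score:
            best, best_score = a, q + u
    return best, node.children[best]
```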
Technical notes / principles:
- PUCT: U ∝ P[a] * sqrt(N_total) / (1 + N[a]), giving a principled exploration–exploitation trade-off.
- Dirichlet root noise: prevents premature policy collapse; can be disabled for deterministic evaluation / arena.
- Value sign inversion: propagates evaluation from leaf to root with alternating perspective (zero-sum consistency).
- Manual iteration + gating: user-driven loop (self-play → train → arena) promotes a candidate only if its arena win rate ≥ threshold, preventing regressions without requiring an orchestration script.
- Loss structure: policy cross-entropy + value MSE; clean separation enables later auxiliary heads (e.g. tower ownership) without entangling core optimisation.
- Determinism hooks: a unified `--seed` seeds Python / NumPy / Torch; helps reproduce acceptance decisions and debugging runs.
- Mixed precision (AMP): halves memory & speeds up math on GPU; an automatic fallback keeps the CPU path simple.
- Gradient safety: norm clipping + scaler help prevent exploding updates and NaN cascades.
- Illegal action masking: logits for invalid moves are removed and the rest renormalised; guarantees π is a valid distribution and stabilises training (see the masked-softmax sketch after this list).
- Data schema: the NPZ fields (`s`, `p`, `z`) are minimal yet extensible (extra arrays can be appended without breaking existing loaders).
- Evaluation independence: the arena uses deterministic argmax (no temperature / noise) to measure pure policy quality separately from exploration heuristics.
- Extensibility: modular files (rules, search, model, data gen, training, loop) allow swapping individual components (e.g. alternative network or search tweaks) without global refactors.
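The masking-and-renormalisation step, sketched as a masked softmax over the 96 logits (illustrative, not the repo's exact code):

```python
import numpy as np

def masked_policy(logits: np.ndarray, legal: np.ndarray) -> np.ndarray:
    """Softmax over legal actions only; illegal entries get exactly zero mass.
    Assumes at least one legal action (the rules' fallback guarantees this
    unless the game has ended)."""
    x = np.where(legal, logits, -np.inf)
    x = x - x.max()                      # stabilise before exponentiation
    e = np.exp(x)                        # exp(-inf) == 0 for illegal entries
    return e / e.sum()
```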
Potential extensions (roughly ascending sophistication):
- 8-fold symmetry augmentation (rotations / reflections; see the permutation sketch after this list).
- Replay sampling strategies: stochastic or prioritized (PER).
- Policy regularisation: KL to previous policy or temperature ramps.
- Auxiliary heads: tower ownership or remaining-move prediction.
- Search efficiency: root reuse / batched GPU inference / partial tree persistence.
- Distributed self-play: multi-process or multi-node with parameter server.
- Advanced search tuning: dynamic c_puct, progressive widening.
- Evaluation suite: Elo tracking, long-horizon stability, symmetry consistency checks.
- Network scaling: deeper residual stacks, Squeeze-Excitation / attention, mixed-head designs.
- Reliability: NaN/Inf watchdog & gradient explosion fuse.
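As a starting point for the symmetry item: policy vectors transform by an index permutation derived from the 96-action encoding. A sketch for one symmetry (helper names hypothetical; direction one-hot planes in `s` additionally need the matching channel permutation, which is repo-specific and omitted here):

```python
import numpy as np

def policy_perm(cell_map: np.ndarray, dir_map: np.ndarray) -> np.ndarray:
    """perm[a_new] = a_old; cell_map[new_cell] = old_cell, dir_map[old_dir] = new_dir."""
    inv_dir = np.argsort(dir_map)                 # new dir -> old dir
    perm = np.empty(96, dtype=np.int64)
    for n in range(16):
        o = int(cell_map[n])
        perm[n], perm[16 + n] = o, 16 + o         # squares, circles
        for d_new in range(4):                    # arrows carry a direction
            perm[32 + 4 * n + d_new] = 32 + 4 * o + inv_dir[d_new]
    return perm

# Example: 90° counter-clockwise rotation of the 4×4 board.
idx = np.arange(16).reshape(4, 4)
cell_map = np.rot90(idx).ravel()                  # old cell id at each new position
dir_map = np.array([3, 0, 1, 2])                  # up->left, right->up, down->right, left->down
pi = np.random.rand(96); pi /= pi.sum()           # stand-in policy vector
pi_rot = pi[policy_perm(cell_map, dir_map)]
```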
PRs / experiments are welcome.
Concise, reproducible, extensible. Have fun. 🧠