Fix legal grid alignment, add game completion eval, benchmark improvements #54
Conversation
Legal grid alignment fix:
- `legal_grid` from `compute_legal_move_masks` is aligned with `move_ids` (legal moves at the position *before* each move), but the trainer checks it against `targets`, which are shifted by one (`target[ply] = move_ids[ply+1]`). Shift the grid by one ply in `create_validation_set` so it aligns with targets. This was causing `legal_move_rate` to always report 0%.

Game completion eval:
- New `compute_game_completion()` walks each game ply-by-ply checking whether the model's argmax prediction is legal. Reports: `game_completion_rate` (fraction of games without any illegal move), `avg_pct_completion` (mean fraction completed before forfeit), `avg_plies_to_forfeit`.
- Computed on 64 val games at each `eval_interval` using dense token masks.

Benchmark improvements:
- CPU/RAM reporting now checks cgroup limits (v1 and v2) before falling back to `/proc`, so containers report their actual allocation instead of the host's full resources.
- Default warmup iterations bumped from 3 to 10; `torch.compile` needs more iterations to fully optimize, inflating timed results otherwise.

Theoretical ceiling script:
- Add `--max-ply` flag (was hardcoded to 255).
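The `compute_game_completion()` walk described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the model forward pass is replaced by precomputed argmax predictions, and the names `game_completion`, `pred_moves`, and `legal_masks` are placeholders.

```python
# Sketch of a ply-by-ply game-completion check: a game "completes" if
# every argmax prediction is legal; otherwise it forfeits at the first
# illegal ply. Shapes and names here are illustrative assumptions.
import numpy as np

def game_completion(pred_moves: np.ndarray, legal_masks: np.ndarray):
    """pred_moves:  (plies,) argmax move id at each ply
    legal_masks: (plies, vocab) bool, legal moves for that ply's position
    Returns (completed, plies_completed, pct_completion)."""
    n = len(pred_moves)
    for ply in range(n):
        if not legal_masks[ply, pred_moves[ply]]:
            return False, ply, ply / n  # forfeit at first illegal move
    return True, n, 1.0  # completed games count their full length

# Toy game: 4 plies, vocab of 6 move ids, exactly one legal move per ply.
masks = np.zeros((4, 6), dtype=bool)
masks[0, 2] = masks[1, 3] = masks[2, 1] = masks[3, 0] = True
preds = np.array([2, 3, 5, 0])  # prediction at ply 2 is illegal
completed, plies, pct = game_completion(preds, masks)
# → completed=False, plies=2, pct=0.5
```

Aggregating over the 64 validation games, `game_completion_rate` would be the mean of `completed` and `avg_pct_completion` the mean of `pct`.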
The legal grid in `create_validation_set` is now shifted by one ply to align with targets. The test was using `input_ids` as predictions, which matched the old unshifted grid. Switch to `targets`.
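The one-ply shift can be sketched with the `np.roll` + zero-fill pattern mentioned in the review (the helper name `shift_legal_mask` matches the one later extracted into `pawn/data.py`; the shapes are illustrative assumptions, with ply as the second-to-last axis and move vocabulary as the last):

```python
# Align per-ply legal masks with next-move targets: since
# target[ply] = move_ids[ply + 1], the mask for target[ply] is the one
# computed at ply + 1. Roll the ply axis back by one and zero the final
# slot, which has no corresponding target.
import numpy as np

def shift_legal_mask(legal_grid: np.ndarray) -> np.ndarray:
    shifted = np.roll(legal_grid, -1, axis=-2)  # shifted[ply] = legal_grid[ply + 1]
    shifted[..., -1, :] = 0                     # last ply has no target; zero-fill
    return shifted

# Toy grid: 2 games, 4 plies, 8 move ids.
grid = np.zeros((2, 4, 8), dtype=bool)
grid[:, 0, 3] = True  # legal before move 0
grid[:, 1, 5] = True  # legal before move 1
aligned = shift_legal_mask(grid)
# aligned[:, 0, 5] is now True (mask for target[0] = move 1);
# aligned[:, -1] is all zeros.
```

Without this shift, a legality check against `targets` compares each target move to the mask of the previous position, which is why `legal_move_rate` pinned at 0%.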
Claude finished @thomas-schweich's task in 2m 2s

Code Review
Overall this is a clean, well-motivated PR. The legal grid fix is correct, the new eval metric is useful, and the benchmark improvements are solid. A few minor issues below.
- Remove dead `gc_targets` variable and unused `n_checked` counter
- Rename `avg_plies_to_forfeit` → `avg_plies_completed` (completed games contribute their full `game_length` to the average)
- Free all GPU tensors in game completion eval cleanup
- Move `chess_engine` import to top of `trainer.py`
- Extract `shift_legal_mask()` into `pawn/data.py` to deduplicate the `np.roll` + zero-fill pattern between `data.py` and `trainer.py`
- Use `math.ceil` for fractional CPU counts in cgroup detection
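The cgroup-aware CPU count with `math.ceil` for fractional quotas might look like the sketch below. It assumes the standard cgroup mount points (`/sys/fs/cgroup/cpu.max` for v2, `cpu.cfs_quota_us`/`cpu.cfs_period_us` for v1); the function name and fallback behavior are illustrative, not taken from the PR.

```python
# Report the container's CPU allocation rather than the host's: check
# cgroup v2, then cgroup v1, then fall back to os.cpu_count(). A quota
# of 0.5 CPU rounds up to 1 via math.ceil.
import math
import os

def effective_cpu_count() -> int:
    # cgroup v2: cpu.max holds "<quota> <period>" or "max <period>"
    try:
        quota_s, period_s = open("/sys/fs/cgroup/cpu.max").read().split()
        if quota_s != "max":
            return math.ceil(int(quota_s) / int(period_s))
    except (OSError, ValueError):
        pass
    # cgroup v1: quota of -1 means unlimited
    try:
        quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
        period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
        if quota > 0:
            return math.ceil(quota / period)
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1
```

On an unconstrained host both cgroup branches fall through and the function reports the full core count, so the benchmark's output only changes inside containers.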
Claude finished @thomas-schweich's task in 1m 6s

Review feedback already addressed

All three actionable items from the previous review have been implemented in the "Address PR review feedback" commit; nothing left to do.
The PR looks good to merge once CI passes.
Summary
- `legal_grid` from `compute_legal_move_masks` was aligned with `move_ids` (position before each move), but the trainer checks it against `targets` (shifted by one). Legal/late_legal metrics were always reporting 0%. Fixed by shifting the grid in `create_validation_set`.
- `compute_game_completion()` walks each game ply-by-ply checking if the model's argmax prediction is legal. Reports game completion rate, average % completion, and average plies to first forfeit. Computed on 64 val games at each eval interval.
- CPU/RAM reporting now checks cgroup limits (v1 and v2) before falling back to `/proc`, so containers report their actual allocation instead of the host's full resources.
- Default warmup iterations bumped from 3 to 10; `torch.compile` needs more iterations to fully optimize.
- Add `--max-ply` flag (was hardcoded to 255). 512-ply ceiling is 8.29% unconditional (vs 6.43% at 255 plies).

Test plan
complete 0.750 | avg_ply 296

pawn/trainer.py