Fix legal grid alignment, add game completion eval, benchmark improvements #54
Conversation
Legal grid alignment fix:
- `legal_grid` from `compute_legal_move_masks` is aligned with `move_ids` (legal moves at the position *before* each move), but the trainer checks it against `targets`, which are shifted by one (`target[ply] = move_ids[ply+1]`). Shift the grid by one ply in `create_validation_set` so it aligns with targets. This was causing `legal_move_rate` to always report 0%.

Game completion eval:
- New `compute_game_completion()` walks each game ply-by-ply checking whether the model's argmax prediction is legal. Reports: `game_completion_rate` (fraction of games without any illegal move), `avg_pct_completion` (mean fraction completed before forfeit), `avg_plies_to_forfeit`.
- Computed on 64 val games at each `eval_interval` using dense token masks.

Benchmark improvements:
- CPU/RAM reporting now checks cgroup limits (v1 and v2) before falling back to `/proc`, so containers report their actual allocation instead of the host's full resources.
- Default warmup iterations bumped from 3 to 10; `torch.compile` needs more iterations to fully optimize, inflating timed results otherwise.

Theoretical ceiling script:
- Add `--max-ply` flag (was hardcoded to 255).
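The `compute_game_completion()` walk described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the model forward pass is replaced by precomputed argmax predictions, and the names `game_completion`, `pred_moves`, and `legal_masks` are placeholders.

```python
# Sketch of a ply-by-ply game-completion check: a game "completes" if
# every argmax prediction is legal; otherwise it forfeits at the first
# illegal ply. Shapes and names here are illustrative assumptions.
import numpy as np

def game_completion(pred_moves: np.ndarray, legal_masks: np.ndarray):
    """pred_moves:  (plies,) argmax move id at each ply
    legal_masks: (plies, vocab) bool, legal moves for that ply's position
    Returns (completed, plies_completed, pct_completion)."""
    n = len(pred_moves)
    for ply in range(n):
        if not legal_masks[ply, pred_moves[ply]]:
            return False, ply, ply / n  # forfeit at first illegal move
    return True, n, 1.0  # completed games count their full length

# Toy game: 4 plies, vocab of 6 move ids, exactly one legal move per ply.
masks = np.zeros((4, 6), dtype=bool)
masks[0, 2] = masks[1, 3] = masks[2, 1] = masks[3, 0] = True
preds = np.array([2, 3, 5, 0])  # prediction at ply 2 is illegal
completed, plies, pct = game_completion(preds, masks)
# → completed=False, plies=2, pct=0.5
```

Aggregating over the 64 validation games, `game_completion_rate` would be the mean of `completed` and `avg_pct_completion` the mean of `pct`.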
The legal grid in `create_validation_set` is now shifted by one ply to align with targets. The test was using `input_ids` as predictions, which matched the old unshifted grid. Switch to `targets`.
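The one-ply shift can be sketched with the `np.roll` + zero-fill pattern mentioned in the review (the helper name `shift_legal_mask` matches the one later extracted into `pawn/data.py`; the shapes are illustrative assumptions, with ply as the second-to-last axis and move vocabulary as the last):

```python
# Align per-ply legal masks with next-move targets: since
# target[ply] = move_ids[ply + 1], the mask for target[ply] is the one
# computed at ply + 1. Roll the ply axis back by one and zero the final
# slot, which has no corresponding target.
import numpy as np

def shift_legal_mask(legal_grid: np.ndarray) -> np.ndarray:
    shifted = np.roll(legal_grid, -1, axis=-2)  # shifted[ply] = legal_grid[ply + 1]
    shifted[..., -1, :] = 0                     # last ply has no target; zero-fill
    return shifted

# Toy grid: 2 games, 4 plies, 8 move ids.
grid = np.zeros((2, 4, 8), dtype=bool)
grid[:, 0, 3] = True  # legal before move 0
grid[:, 1, 5] = True  # legal before move 1
aligned = shift_legal_mask(grid)
# aligned[:, 0, 5] is now True (mask for target[0] = move 1);
# aligned[:, -1] is all zeros.
```

Without this shift, a legality check against `targets` compares each target move to the mask of the previous position, which is why `legal_move_rate` pinned at 0%.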
Claude finished @thomas-schweich's task in 2m 2s

Code Review
Overall this is a clean, well-motivated PR. The legal grid fix is correct, the new eval metric is useful, and the benchmark improvements are solid. A few minor issues below.
- Remove dead `gc_targets` variable and unused `n_checked` counter
- Rename `avg_plies_to_forfeit` → `avg_plies_completed` (completed games contribute their full `game_length` to the average)
- Free all GPU tensors in game completion eval cleanup
- Move `chess_engine` import to top of `trainer.py`
- Extract `shift_legal_mask()` into `pawn/data.py` to deduplicate the `np.roll` + zero-fill pattern between `data.py` and `trainer.py`
- Use `math.ceil` for fractional CPU counts in cgroup detection
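The cgroup-aware CPU count with `math.ceil` for fractional quotas might look like the sketch below. It assumes the standard cgroup mount points (`/sys/fs/cgroup/cpu.max` for v2, `cpu.cfs_quota_us`/`cpu.cfs_period_us` for v1); the function name and fallback behavior are illustrative, not taken from the PR.

```python
# Report the container's CPU allocation rather than the host's: check
# cgroup v2, then cgroup v1, then fall back to os.cpu_count(). A quota
# of 0.5 CPU rounds up to 1 via math.ceil.
import math
import os

def effective_cpu_count() -> int:
    # cgroup v2: cpu.max holds "<quota> <period>" or "max <period>"
    try:
        quota_s, period_s = open("/sys/fs/cgroup/cpu.max").read().split()
        if quota_s != "max":
            return math.ceil(int(quota_s) / int(period_s))
    except (OSError, ValueError):
        pass
    # cgroup v1: quota of -1 means unlimited
    try:
        quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
        period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
        if quota > 0:
            return math.ceil(quota / period)
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1
```

On an unconstrained host both cgroup branches fall through and the function reports the full core count, so the benchmark's output only changes inside containers.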
Claude finished @thomas-schweich's task in 1m 6s

Review feedback already addressed

All three actionable items from the previous review have been implemented in the "Address PR review feedback" commit; nothing left to do.
The PR looks good to merge once CI passes.
Summary
- `legal_grid` from `compute_legal_move_masks` was aligned with `move_ids` (position before each move), but the trainer checks it against `targets` (shifted by one). Legal/late_legal metrics were always reporting 0%. Fixed by shifting the grid in `create_validation_set`.
- `compute_game_completion()` walks each game ply-by-ply checking if the model's argmax prediction is legal. Reports game completion rate, average % completion, and average plies to first forfeit. Computed on 64 val games at each eval interval.
- CPU/RAM reporting now checks cgroup limits (v1 and v2) before falling back to `/proc`, so containers report their actual allocation instead of the host's full resources.
- Default warmup iterations bumped from 3 to 10; `torch.compile` needs more iterations to fully optimize.
- Add `--max-ply` flag (was hardcoded to 255). 512-ply ceiling is 8.29% unconditional (vs 6.43% at 255 plies).

Test plan
complete 0.750 | avg_ply 296

pawn/trainer.py