
Fix legal grid alignment, add game completion eval, benchmark improvements#54

Merged
thomas-schweich merged 3 commits into main from
fix-legal-grid-alignment-and-eval
Apr 10, 2026

Conversation

@thomas-schweich
Owner

Summary

  • Legal grid off-by-one fix: legal_grid from compute_legal_move_masks was aligned with move_ids (position before each move), but the trainer checks it against targets (shifted by one). Legal/late_legal metrics were always reporting 0%. Fixed by shifting the grid in create_validation_set.
  • Game completion eval: New compute_game_completion() walks each game ply-by-ply checking if the model's argmax prediction is legal. Reports game completion rate, average % completion, and average plies to first forfeit. Computed on 64 val games at each eval interval.
  • Benchmark container awareness: CPU/RAM reporting now checks cgroup limits (v1 and v2) before falling back to /proc, so containers report their actual allocation instead of the host's full resources.
  • Benchmark warmup: Default warmup iterations bumped from 3 to 10 — torch.compile needs more iterations to fully optimize.
  • Theoretical ceiling script: Add --max-ply flag (was hardcoded to 255). 512-ply ceiling is 8.29% unconditional (vs 6.43% at 255 plies).

Test plan

  • Ground-truth targets get 100% game completion rate, random predictions get 0%
  • Legal move rate now reads ~99.4% on a running 512-ctx training run (was 0%)
  • Game completion metrics appear in val output: complete 0.750 | avg_ply 296
  • Pyright clean on pawn/trainer.py
  • CI passes

Fix legal grid alignment, add game completion eval, benchmark improvements

Legal grid alignment fix:
- legal_grid from compute_legal_move_masks is aligned with move_ids (legal
  moves at position *before* each move), but the trainer checks it against
  targets which are shifted by one (target[ply] = move_ids[ply+1]). Shift
  the grid by one ply in create_validation_set so it aligns with targets.
  This was causing legal_move_rate to always report 0%.
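The shift described above can be sketched as follows. This is an illustrative reconstruction, not the repo's exact code: the function name `shift_legal_grid` and the `(games, plies, moves)` shape are assumptions, but the `np.roll` plus zero-fill pattern matches what the review later quotes.

```python
import numpy as np

def shift_legal_grid(legal_grid: np.ndarray) -> np.ndarray:
    """Align legal_grid[ply] with targets[ply] = move_ids[ply + 1].

    legal_grid is assumed to be a (games, plies, moves) bool array
    aligned with move_ids (the position *before* each move).
    """
    shifted = np.roll(legal_grid, -1, axis=1)  # take the legal set of the next ply
    shifted[:, -1, :] = False                  # last ply has no target; zero it out
    return shifted
```

With the unshifted grid, a correct prediction at ply `p` was being checked against the legal set for ply `p`'s pre-move position rather than its target's, which is why the metric pinned at 0%.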

Game completion eval:
- New compute_game_completion() walks each game ply-by-ply checking whether
  the model's argmax prediction is legal. Reports: game_completion_rate
  (fraction of games without any illegal move), avg_pct_completion (mean
  fraction completed before forfeit), avg_plies_to_forfeit.
- Computed on 64 val games at each eval_interval using dense token masks.
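The ply-by-ply walk can be sketched like this. Names, shapes, and the plain-Python containers are illustrative assumptions (the trainer works on dense token masks and GPU tensors); the sketch only shows how the three reported metrics relate, including the convention that a fully completed game contributes its full game length.

```python
def game_completion_stats(preds, legal_sets, game_lengths):
    """preds[g][p]: argmax move id at ply p; legal_sets[g][p]: set of legal ids."""
    completed, pct, plies = 0, [], []
    for g, gl in enumerate(game_lengths):
        forfeit_ply = gl  # completed games count their full length
        for p in range(gl):
            if preds[g][p] not in legal_sets[g][p]:
                forfeit_ply = p  # first illegal prediction forfeits the game
                break
        else:
            completed += 1
        pct.append(forfeit_ply / gl)
        plies.append(float(forfeit_ply))
    n = len(game_lengths)
    return {
        "game_completion_rate": completed / n,
        "avg_pct_completion": sum(pct) / n,
        "avg_plies_completed": sum(plies) / n,
    }
```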

Benchmark improvements:
- CPU/RAM reporting now checks cgroup limits (v1 and v2) before falling
  back to /proc, so containers report their actual allocation instead of
  the host's full resources.
- Default warmup iterations bumped from 3 to 10 — torch.compile needs
  more iterations to fully optimize, inflating timed results otherwise.
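The cgroup fallback chain for CPU detection might look roughly like this. The file paths are the standard kernel interface locations, but the function name and minimal error handling are assumptions made for the sketch, not the benchmark script's actual implementation.

```python
import os
from pathlib import Path

def detect_cpu_limit() -> float:
    """Container-aware CPU count: cgroup v2, then v1, then os.cpu_count()."""
    try:  # cgroup v2: cpu.max holds "quota period" or "max period"
        quota_s, period_s = Path("/sys/fs/cgroup/cpu.max").read_text().split()
        if quota_s != "max":
            return int(quota_s) / int(period_s)
    except (OSError, ValueError):
        pass
    try:  # cgroup v1: a quota of -1 means unlimited
        quota = int(Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read_text())
        period = int(Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read_text())
        if quota > 0:
            return quota / period
    except (OSError, ValueError):
        pass
    return float(os.cpu_count() or 1)  # host fallback outside any limit
```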

Theoretical ceiling script:
- Add --max-ply flag (was hardcoded to 255).

Test fix:
- The legal grid in create_validation_set is now shifted by one ply to
  align with targets. The test was using input_ids as predictions, which
  matched the old unshifted grid. Switch to targets.
@thomas-schweich
Owner Author

@claude

@claude

claude bot commented Apr 10, 2026

Claude finished @thomas-schweich's task in 2m 2s


Code Review


Overall this is a clean, well-motivated PR. The legal grid fix is correct, the new eval metric is useful, and the benchmark improvements are solid. A few minor issues below.


pawn/data.py — legal grid shift

The fix is correct. np.roll(legal_grid, -1, axis=1) followed by zeroing the last column properly aligns legal_grid[ply] with targets[ply]. The inline comment explaining the alignment is helpful and should stay.


pawn/trainer.py — game completion eval

Bug: gc_targets is dead code

trainer.py:588 assigns gc_targets but it is never read anywhere in the game completion block:

gc_targets = self.val_data["targets"][:gc_n].to(self.device)  # never used

This allocates a GPU tensor unnecessarily on every eval step.

Minor: n_checked is tracked but never used

In compute_game_completion, n_checked is incremented at line 282 but never referenced in the return value or any conditional. Either use it (e.g. as the denominator for a coverage metric) or remove it.

Semantics of avg_plies_to_forfeit for completed games

When a game completes without forfeiting, float(gl) is appended to plies_to_forfeit, so avg_plies_to_forfeit is really "average plies completed (counting full-game completions as gl)": a reasonable survival-analysis interpretation, but the name implies it only measures forfeits. Worth either renaming to avg_plies_completed or noting in the docstring that completed games contribute game_length to the average.

Incomplete cleanup

gc_input, gc_loss_mask, and gc_game_lengths are not freed after the eval. The del line at 612 only covers legal_mask_t, gc_logits, and gc_preds. Not a correctness issue, but the GPU tensors linger until the next GC pass.

Late import

import chess_engine as engine_mod inside the validation block is inconsistent with top-level imports. Not a bug (Python caches module imports), but consider hoisting it to the top of the file alongside the other engine imports for readability — or add a comment explaining why it's local.

np.roll duplication

The -1 roll + zero-fill logic now appears identically in both data.py:269–270 and trainer.py:603–604. If the alignment convention ever changes, it needs to be updated in two places. Low risk for now, but a small shared helper (e.g. _shift_legal_grid(arr)) would prevent drift.


scripts/benchmark.py — cgroup detection

Clean implementation. A couple of minor observations:

  • Integer truncation for fractional CPUs: int(int(quota_s) / int(period_s)) truncates down (e.g. 2.5 CPUs → 2). math.ceil would be slightly more accurate for reporting, though it doesn't affect benchmark correctness.
  • Cpuset parsing is correct for ranges like "0-3,8-11". The if text: guard before parsing handles empty files.
  • Warmup bump (3→10) is well-justified given torch.compile trace behavior.
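The cpuset parsing the review signs off on can be sketched as below. The function name `parse_cpuset` and the count-only return value are assumptions for illustration; the shape (empty-file guard, then comma-separated ranges) follows the description above.

```python
def parse_cpuset(text: str) -> int:
    """Count CPUs in a cpuset list like "0-3,8-11"; empty means no restriction."""
    text = text.strip()
    if not text:  # guard before parsing: an empty file yields no CPUs
        return 0
    count = 0
    for part in text.split(","):
        if "-" in part:  # inclusive range, e.g. "0-3" is 4 CPUs
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:            # single CPU id
            count += 1
    return count
```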

scripts/compute_theoretical_ceiling.py

Clean. Default of 255 preserves backward compatibility.


tests/model/test_512_token.py

The test update correctly reflects the new alignment — using targets as ground-truth predictions is semantically right now that legal_grid is shifted.


Summary: The core fix and new metrics are correct. The three actionable items are: (1) remove the dead gc_targets variable, (2) remove or use n_checked, (3) clarify avg_plies_to_forfeit semantics. Everything else is minor polish.

- Remove dead gc_targets variable and unused n_checked counter
- Rename avg_plies_to_forfeit → avg_plies_completed (completed games
  contribute their full game_length to the average)
- Free all GPU tensors in game completion eval cleanup
- Move chess_engine import to top of trainer.py
- Extract shift_legal_mask() into pawn/data.py to deduplicate the
  np.roll + zero-fill pattern between data.py and trainer.py
- Use math.ceil for fractional CPU counts in cgroup detection
@thomas-schweich
Owner Author

@claude

@claude

claude bot commented Apr 10, 2026

Claude finished @thomas-schweich's task in 1m 6s


Review feedback already addressed

All three actionable items from the previous review have been implemented in the "Address PR review feedback" commit — nothing left to do:

  • Dead gc_targets variable — removed (not present in current trainer.py)
  • Unused n_checked variable — removed (not present in current trainer.py)
  • avg_plies_to_forfeit renamed to avg_plies_completed — done (lines 302/607), with docstring clarifying that completed games contribute their full game_length
  • GPU cleanup extended to include gc_input, gc_loss_mask, gc_game_lengths — done (line 609)

The PR looks good to merge once CI passes.

@thomas-schweich thomas-schweich merged commit e73f6ff into main Apr 10, 2026
1 check passed
@thomas-schweich thomas-schweich deleted the fix-legal-grid-alignment-and-eval branch April 10, 2026 04:39