Dashboard style overhaul + expanded game-completion eval & stopping criteria #55
Conversation
Dashboard style overhaul + expanded game-completion eval & stopping criteria

Dashboard:
- New `pawn/dashboard/theme.py` centralizes color palette, layout, and styling so charts and the Solara shell share one coherent system.
- `charts.py` refactored to pull from the theme; titles bolded; log-scale `error_rate_chart` added with optional log-linear fit overlays (desaturated dashed lines, half-life shown in legend).
- `sol.py` reorganized into sections, with the Game Integrity section pairing the error rate chart against the patience chart.
- `val_accuracy_chart` trimmed to Top-1/Top-5 (legal/late-legal moved to the error rate chart so the accuracy scale is readable).

Game completion eval:
- Fully vectorized via `_game_completion_chunk` + `_aggregate_game_completion`: no Python per-game loop; processes the full val set in `batch_size` chunks, peak memory independent of `val_games`.
- Adds min/max/median forfeit-ply statistics across games that actually forfeited (0 if none). Surfaced in the val log line as `forfeit [min-max med N]`.
- Runs over the full validation set (was limited to 64 games).

Stopping criteria:
- Patience now also resets on improvements to `game_completion_rate` and `avg_plies_completed`, not just `val_loss` and `late_legal_move_rate`.
- `best_game_completion` and `best_avg_plies_completed` persisted in checkpoint state so they survive resume.
- Trainer logs `patience` and `legality_late_ply` into the training-config record so downstream consumers (dashboard) can see them.

Tests updated for the theme refactor (`layer_color` moved to the theme module, titles wrapped in `<b>...</b>` for bold rendering).
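The log-linear fit overlay described above reduces to fitting a line through log error rate vs. step and converting the slope to a half-life. A minimal sketch, assuming an ordinary least-squares fit in base-10 log space (the function name and signature are illustrative, not the chart code's actual API):

```python
import math

def log_linear_fit(steps, rates):
    """Least-squares fit of log10(rate) = a + b*step; returns (a, b, half_life)."""
    xs = list(steps)
    ys = [math.log10(r) for r in rates]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    # Steps for the rate to halve: the rate falls by log10(2) in log space.
    half_life = math.log10(2) / -b if b < 0 else float("inf")
    return a, b, half_life

# An error rate that halves every 40,000 steps:
_, slope, hl = log_linear_fit([0, 40_000, 80_000], [0.08, 0.04, 0.02])
# slope is about -7.5e-06 per step; half-life is about 40,000 steps
```

With a base-10 fit, a legend slope on the order of `-7e-06` corresponds to a half-life in the tens of thousands of steps, which is the scale of run shown on the chart.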
Manage-pod skill & docs:
- Compound early-stopping note now lists all four reset criteria (`val_loss`, `late_legal`, `game_completion_rate`, `avg_plies_completed`), matching the trainer change in this PR.
- New "Benchmark the Pod" startup step before launching trials, so the agent has ground-truth step times, compile speedup, and concurrency scaling for *this* pod when planning.
- Always use `torch.compile` by default; the warmup is cheap relative to the 1.5-2.2x speedup, even for short runs.
- VRAM caveat removed (the skill is pod-focused; pod GPUs aren't VRAM-constrained).
- `max_seq_len` default updated to 512.
- Tools reference: add `lab_resume`, document the tag filter on `lab_results`, the `health_warning` event type, and graceful-checkpoint behavior on `lab_kill`.
- Drop the stale 15-30 min compile overhead figure; replace with the measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers.
- Note that `uv run` works in dev images post #53.

`.dockerignore`: un-ignore `.claude/skills` so the manage-pod skill ships with the dev image (the rest of `.claude` stays excluded).
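The four-criterion reset rule can be summarized as: patience resets whenever any tracked metric improves. A minimal tracker sketch (the class and its interface are illustrative, not the trainer's actual implementation; metric names follow the PR):

```python
class CompoundPatience:
    """Reset the patience counter when any tracked metric improves."""

    def __init__(self, patience: int):
        self.patience = patience
        self.counter = 0
        self.best = {}  # metric name -> (best value, lower_is_better)

    def update(self, metrics: dict[str, tuple[float, bool]]) -> bool:
        """metrics: name -> (value, lower_is_better). Returns True when we should stop."""
        improved = False
        for name, (value, lower) in metrics.items():
            best = self.best.get(name)
            if best is None or (value < best[0] if lower else value > best[0]):
                self.best[name] = (value, lower)
                improved = True
        self.counter = 0 if improved else self.counter + 1
        return self.counter >= self.patience

stopper = CompoundPatience(patience=2)
# val_loss plateaus but game_completion_rate keeps improving -> patience keeps resetting
stopper.update({"val_loss": (2.9, True), "game_completion_rate": (0.90, False)})
stopper.update({"val_loss": (2.9, True), "game_completion_rate": (0.94, False)})
```

This captures why the change matters: a run whose `val_loss` has flatlined no longer stops while a compound-legality metric is still climbing.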
@claude Please review
Claude finished @thomas-schweich's task in 2m 55s — PR Review
**Summary**

This is a solid, well-organized PR. The vectorized game completion eval is correct and a clear improvement over the Python loop. The dashboard overhaul is cohesive.

**Bug**
```python
# sol.py:446
avg_ply = _last(val, "val/avg_plies_to_forfeit")  # ← key doesn't exist
```

The trainer writes `val/avg_plies_completed`, so this sub-label always reads "—".

**Correctness Concerns**
```python
# Before (safe):  flat["model.d_model"] = ...
# After (unsafe): flat["d_model"] = ...  # overwrites if training also has d_model
```

Worth either keeping the dot-notation or at minimum asserting no collisions in a test.
```python
card(...) if val else (lambda: None),
```
**Performance**
**Minor Issues**
```python
del gc_input, gc_loss_mask, gc_game_lengths, gc_logits, gc_preds, legal_mask_t
# ↑ legal_tokens (the numpy array from engine.compute_legal_token_masks) is missing
```

It'll be freed when overwritten in the next iteration, so this isn't a memory leak, but it's inconsistent with the intent of the explicit `del` block.
CSS keyframe hardcodes `EMERALD` instead of using a variable.
**What's Good**
Address PR review nit — the numpy `legal_tokens` array was missing from the explicit `del` block, so it leaked into the next loop iteration until overwritten. No functional impact (Python's GC handles it on rebind), but this matches the intent of the explicit cleanup.
- Fix Game Completion KPI: metric key was `val/avg_plies_to_forfeit` but the trainer logs `val/avg_plies_completed`, so the sub-label always read "—".
- Restore dot-notation in `_flatten_config` to prevent silent collisions between nested sections sharing a field name; update the highlight order to match.
- Drop the `lambda: None` column placeholder — "Game Integrity & Stopping" now lives inside `if val:`, so nothing renders when val is empty.
- Dashboard loads metrics once per tick and passes `data=` to `ConfigSummary` and `MetricsCharts`, eliminating the 3x `load_metrics` calls per refresh.
- Split `val_loss_chart` (now always validation loss) from a new `perplexity_chart` used by the pawn pretraining branch.
- Rewrite `.pawn-pulse` + `@keyframes pawn-pulse` with `color-mix` so the badge background, border, and animation all derive from `--pawn-success` / `currentColor` instead of hardcoded emerald rgba.
- Error-rate description corrected: late illegal is plies ≥ context/2 (second half of the context window), not the second half of games.
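On the val-loss / perplexity split: perplexity is the exponential of mean cross-entropy, so the new chart's series derives directly from the logged loss. A sketch of the relationship itself, not the chart code:

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity of a language model is exp of its mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)

# e.g. the val loss from the log line elsewhere in this PR:
perplexity(2.8940)  # ≈ 18.1
```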
@claude please provide a follow-up review.
Claude finished @thomas-schweich's task in 2m 59s — Follow-up Review
**All Previous Issues: Addressed ✓**

All 7 issues called out in the first review were addressed.
**Residual Observations**
After the split, the three test cases in the dashboard chart tests still only assert that a figure is returned:
```python
fig = charts.val_loss_chart(SAMPLE_VAL, "step", "pawn")
assert fig is not None
```
**What Was Done Well**
## Summary
### Dashboard style overhaul
Another agent did a cohesive styling pass on the Solara dashboard:
- `pawn/dashboard/theme.py` centralizes the color palette, Plotly layout, and typography so the chart grid and the shell share a single design system.
- `charts.py` refactored to pull from the theme; titles bolded; log-scale `error_rate_chart` added with optional log-linear fit overlays (desaturated dashed lines, slope & half-life shown in the legend, e.g. `s=-7.2e-06; hl=40,000`).
- `sol.py` reorganized into labeled sections with a dedicated Game Integrity & Stopping row pairing the error rate chart against the patience chart.
- `val_accuracy_chart` trimmed to Top-1 / Top-5 — legal and late-legal now live on the error rate chart where their small dynamic range is actually visible.

### Game completion eval — vectorized, full val set
- `_game_completion_chunk` + `_aggregate_game_completion` replace the Python per-game loop with a fully vectorized first-illegal-ply search using `torch.gather` + `argmax`.
- Processes the full val set in `batch_size` chunks; peak VRAM is independent of `val_games`.
- Adds `min_forfeit_ply`, `max_forfeit_ply`, and `median_forfeit_ply` statistics across games that actually forfeited (0 if none). Surfaced in the val log line as `forfeit [min-max med N]`.
- Example val log line: `val: loss 2.8940 ... complete 0.941 | avg_ply 331 | forfeit [58-256 med 122]` — revealing that all forfeits cluster in the first half of games while late-game legality is effectively perfect.

### Expanded stopping criteria
- Patience now resets on improvement to any of `val_loss`, `late_legal_move_rate`, `game_completion_rate`, or `avg_plies_completed`. The new metrics often move even when val loss has plateaued, so this prevents premature early-stopping on runs where compound legality is still improving.
- `best_game_completion` and `best_avg_plies_completed` are persisted in checkpoint state so they survive resume.
- The trainer logs `patience` and `legality_late_ply` into the training-config record so the dashboard can read the actual patience limit instead of falling back to a hardcoded default.

### Test plan
- `pyright pawn/trainer.py pawn/dashboard/{charts,sol,theme}.py tests/lab/test_dashboard_charts.py` — clean
- `pytest tests/ -m "not gpu"` — 1,339 passed
- Verified that `compute_game_completion` produces expected completion rates
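For reference, the first-illegal-ply search that the eval vectorizes can be sketched in numpy (the real implementation is torch-based via `torch.gather` + `argmax`; names, shapes, and the padding convention here are illustrative):

```python
import numpy as np

def game_completion_stats(pred_legal: np.ndarray, game_lengths: np.ndarray):
    """pred_legal: bool [n_games, max_plies], True where the predicted move is legal.
    Returns (completion_rate, forfeit_plies) with no per-game Python loop."""
    n_games, max_plies = pred_legal.shape
    # Mask out padding past each game's true length (padded plies count as legal).
    in_game = np.arange(max_plies)[None, :] < game_lengths[:, None]
    illegal = ~pred_legal & in_game
    forfeited = illegal.any(axis=1)
    # argmax on a boolean row returns the index of its first True,
    # i.e. the first illegal ply = the forfeit ply.
    first_illegal = illegal.argmax(axis=1)
    forfeit_plies = first_illegal[forfeited]
    completion_rate = 1.0 - forfeited.mean()
    return completion_rate, forfeit_plies

legal = np.array([[True, True, True],
                  [True, False, True],
                  [True, True, False]])
lengths = np.array([3, 3, 2])   # game 2 only has 2 plies: its last slot is padding
rate, plies = game_completion_stats(legal, lengths)
# rate ≈ 0.667; one forfeit, at ply 1
```

The min/max/median forfeit-ply statistics in the val log line then follow directly from `forfeit_plies` (with 0 reported when no game forfeited).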