Dashboard style overhaul + expanded game-completion eval & stopping criteria#55

Merged
thomas-schweich merged 7 commits into main from dashboard-style-and-stopping-criteria on Apr 10, 2026

Conversation

@thomas-schweich
Owner

Summary

Dashboard style overhaul

Another agent did a cohesive styling pass on the Solara dashboard:

  • New pawn/dashboard/theme.py centralizes the color palette, Plotly layout, and typography so the chart grid and the shell share a single design system.
  • charts.py refactored to pull from the theme; titles bolded; log-scale error_rate_chart added with optional log-linear fit overlays (desaturated dashed lines, slope & half-life shown in the legend, e.g. s=-7.2e-06; hl=40,000).
  • sol.py reorganized into labeled sections with a dedicated Game Integrity & Stopping row pairing the error rate chart against the patience chart.
  • val_accuracy_chart trimmed to Top-1 / Top-5 — legal and late-legal now live on the error rate chart where their small dynamic range is actually visible.
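The log-linear fit overlay can be sketched as a small standalone routine (a minimal sketch; the function and argument names here are illustrative, not the dashboard's actual API). It fits log10(error_rate) against step over the second half of the curve and converts the slope into a half-life, matching the legend format above (e.g. s=-7.2e-06 gives hl ≈ 42,000 steps):

```python
import numpy as np

def log_linear_fit(steps, error_rates):
    """Fit log10(error_rate) = slope * step + intercept over the second
    half of the curve; return (slope, half_life_in_steps)."""
    steps = np.asarray(steps, dtype=float)
    rates = np.asarray(error_rates, dtype=float)
    # Fit only the second half of the run, where the trend is cleanest.
    half = len(steps) // 2
    s, r = steps[half:], rates[half:]
    mask = r > 0  # log of zero is undefined; drop zero-error points
    slope, intercept = np.polyfit(s[mask], np.log10(r[mask]), 1)
    # Steps for the error rate to halve: 10^(slope * d) = 1/2.
    half_life = np.log10(2) / -slope if slope < 0 else float("inf")
    return slope, half_life
```

The second-half window is the same heuristic the review later calls out as reasonable for a training curve: early-training transients would otherwise dominate the fit.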

Game completion eval — vectorized, full val set

  • New _game_completion_chunk + _aggregate_game_completion replace the Python per-game loop with a fully vectorized first-illegal-ply search using torch.gather + argmax.
  • Runs over the entire validation set (previously limited to 64 games) by batching in batch_size chunks; peak VRAM is independent of val_games.
  • Adds min_forfeit_ply, max_forfeit_ply, and median_forfeit_ply statistics across games that actually forfeited (0 if none). Surfaced in the val log line as forfeit [min-max med N].
  • Example line from the live run: val: loss 2.8940 ... complete 0.941 | avg_ply 331 | forfeit [58-256 med 122] — revealing that all forfeits cluster in the first half of games while late-game legality is effectively perfect.
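The core of the vectorized search can be sketched like this (shapes and names are assumptions for illustration; the real _game_completion_chunk also clamps out-of-vocab predictions before the gather and adds a separate in-range guard):

```python
import torch

def first_illegal_ply(preds, legal_mask, game_lengths):
    """Find each game's first illegal predicted ply without a Python loop.

    preds:        (B, T) long tensor of predicted move tokens
    legal_mask:   (B, T, V) bool tensor, True where a token is legal
    game_lengths: (B,) long tensor of real game lengths (to mask padding)
    """
    B, T = preds.shape
    # Legality of each predicted token at each ply: (B, T) bool.
    legal = legal_mask.gather(2, preds.unsqueeze(-1)).squeeze(-1)
    ply = torch.arange(T, device=preds.device).unsqueeze(0)
    in_game = ply < game_lengths.unsqueeze(1)  # ignore padding plies
    illegal = (~legal) & in_game
    # argmax returns the index of the first True; it returns 0 when the
    # row is all-False, so has_forfeit must be checked before using it.
    first = illegal.int().argmax(dim=1)
    has_forfeit = illegal.any(dim=1)
    return first, has_forfeit
```

The min/max/median forfeit-ply statistics then reduce over first[has_forfeit] across all chunks; batching the (B, T, V) gather in batch_size chunks is what keeps peak VRAM independent of val_games.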

Expanded stopping criteria

  • Patience now resets on improvements to any of: val_loss, late_legal_move_rate, game_completion_rate, or avg_plies_completed. The new metrics often move even when val loss has plateaued, so this prevents premature early-stopping on runs where compound legality is still improving.
  • best_game_completion and best_avg_plies_completed are persisted in checkpoint state so they survive resume.
  • The trainer now logs patience and legality_late_ply into the training-config record so the dashboard can read the actual patience limit instead of falling back to a hardcoded default.
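The compound patience logic can be sketched as follows (a minimal sketch with hypothetical field names; lower-is-better for val_loss, higher-is-better for the three legality metrics, mirroring the criteria listed above):

```python
def update_patience(state, metrics):
    """Reset patience when ANY tracked metric improves; otherwise decay.
    Returns True while training should continue. Field names are
    illustrative, not the trainer's actual state layout."""
    improved = False
    if metrics["val_loss"] < state["best_val_loss"]:
        state["best_val_loss"] = metrics["val_loss"]
        improved = True
    for key in ("late_legal_move_rate", "game_completion_rate",
                "avg_plies_completed"):
        if metrics[key] > state["best_" + key]:
            state["best_" + key] = metrics[key]  # persisted in checkpoints
            improved = True
    if improved:
        state["patience_left"] = state["patience"]
    else:
        state["patience_left"] -= 1
    return state["patience_left"] > 0
```

Because the best_* values live in the same state dict that is checkpointed, a resumed run cannot spuriously "improve" against a reset baseline.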

Test plan

  • pyright pawn/trainer.py pawn/dashboard/{charts,sol,theme}.py tests/lab/test_dashboard_charts.py clean
  • pytest tests/ -m "not gpu" — 1,339 passed
  • Ground-truth / random / mixed sanity checks on compute_game_completion produce expected completion rates
  • Legal rate (~99.6%) and completion rate (~94.1% on full val set) verified against the live 512-ctx pretraining run
  • CI passes

… criteria

Dashboard:
- New `pawn/dashboard/theme.py` centralizes color palette, layout, and
  styling so charts and the Solara shell share one coherent system.
- charts.py refactored to pull from the theme; titles bolded; log-scale
  error_rate_chart added with optional log-linear fit overlays (desaturated
  dashed lines, half-life shown in legend).
- sol.py reorganized into sections, with the Game Integrity section
  pairing the error rate chart against the patience chart.
- val_accuracy_chart trimmed to Top-1/Top-5 (legal/late-legal moved to
  the error rate chart so accuracy scale is readable).

Game completion eval:
- Fully vectorized via `_game_completion_chunk` + `_aggregate_game_completion`:
  no Python per-game loop, processes the full val set in batch_size chunks,
  peak memory independent of val_games.
- Adds min/max/median forfeit-ply statistics across games that actually
  forfeited (0 if none). Surfaced in val log line as `forfeit [min-max med N]`.
- Runs over the full validation set (was limited to 64 games).

Stopping criteria:
- Patience now also resets on improvements to game_completion_rate and
  avg_plies_completed, not just val_loss and late_legal_move_rate.
- best_game_completion and best_avg_plies_completed persisted in
  checkpoint state so they survive resume.
- Trainer logs `patience` and `legality_late_ply` into the training-config
  record so downstream consumers (dashboard) can see them.

Tests updated for the theme refactor (layer_color moved to theme module,
titles wrapped in <b>...</b> for bold rendering).
- Compound early stopping note now lists all four reset criteria
  (val_loss, late_legal, game_completion_rate, avg_plies_completed)
  matching the trainer change in this PR.
- New "Benchmark the Pod" startup step before launching trials, so
  the agent has ground-truth step times, compile speedup, and
  concurrency scaling for *this* pod when planning.
- Always use torch.compile by default; the warmup is cheap relative
  to the 1.5-2.2x speedup, even for short runs.
- VRAM caveat removed (skill is pod-focused; pod GPUs aren't VRAM-constrained).
- max_seq_len default updated to 512.
- Tools reference: add lab_resume, document tag filter on lab_results,
  health_warning event type, and graceful-checkpoint behavior on lab_kill.
- Drop the stale 15-30 min compile overhead figure; replace with the
  measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers.
- Note that uv run works in dev images post #53.

.dockerignore: un-ignore .claude/skills so the manage-pod skill ships
with the dev image (the rest of .claude stays excluded).
@thomas-schweich
Owner Author

@claude Please review

@claude

claude bot commented Apr 10, 2026

Claude finished @thomas-schweich's task in 2m 55s


PR Review

  • Gather context and understand PR scope
  • Read pawn/trainer.py changes
  • Read pawn/dashboard/theme.py (new file)
  • Read pawn/dashboard/charts.py changes
  • Read pawn/dashboard/sol.py changes
  • Read test changes
  • Synthesize and post review

Summary

This is a solid, well-organized PR. The vectorized game completion eval is correct and a clear improvement over the Python loop. The dashboard overhaul is cohesive and the theme.py extraction is the right architecture. One confirmed bug and a few smaller issues are called out below.


Bug

sol.py:446 — wrong metric key in KpiRow, Game Completion tile always shows "—"

# sol.py:446
avg_ply = _last(val, "val/avg_plies_to_forfeit")   # ← key doesn't exist

The trainer writes "val/avg_plies_completed" (trainer.py:683); the "val/avg_plies_to_forfeit" key is never logged, so the KPI always renders "—".


Correctness Concerns

_flatten_config drops parent key prefix, introducing potential silent collisions

sol.py:616-626 — the old implementation used f"{k}.{k2}" for nested keys; the new one uses only k2. If model and training ever share a key name (e.g. both had a dropout field), the second would silently overwrite the first. In practice PAWN's model/training dicts don't collide today, but it's a regression in correctness:

# Before (safe):  flat["model.d_model"] = ...
# After (unsafe): flat["d_model"] = ...  (overwrites if training also has d_model)

Worth either keeping the dot-notation or at minimum asserting no collisions in a test.
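The safe dot-notation flattening could look like this minimal sketch (not the actual _flatten_config implementation; names are illustrative):

```python
def flatten_config(config, prefix=""):
    """Flatten nested config dicts into dot-notation keys so sections
    that share a field name (e.g. model.dropout vs training.dropout)
    cannot silently overwrite each other."""
    flat = {}
    for k, v in config.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            flat.update(flatten_config(v, key))  # recurse with prefix
        else:
            flat[key] = v
    return flat
```

With the prefix preserved, a collision is structurally impossible rather than merely absent from today's configs.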

_row passes lambda: None as a column item when val is empty

sol.py in the "Game Integrity" section:

card(...) if val else (lambda: None),

_row calls every item as item(). lambda: None returns None and Solara will render an empty column cell, leaving a visual gap. If val is empty there's also nothing to show in the error rate chart — consider only rendering the _row when val is non-empty.


Performance

load_metrics is called 3× per refresh tick

sol.py:634, sol.py:749, sol.py:979 — Dashboard, ConfigSummary, and MetricsCharts each call load_metrics independently with identical dependencies. The new KpiRow gets its data from the Dashboard call, but the other two components still do their own separate loads. For large metrics.jsonl files this is a meaningful overhead. A shared reactive state at the Dashboard level (passing data as a prop) would eliminate the duplication.
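Framework details aside, the single-load pattern being suggested reduces to hoisting the load into the parent and threading the result down (all names here are hypothetical stand-ins for the Solara components):

```python
# Hypothetical sketch: the parent loads once per refresh tick and passes
# the result down, instead of each child re-reading metrics.jsonl.
call_count = {"loads": 0}

def load_metrics(run_dir):
    call_count["loads"] += 1  # stand-in for parsing metrics.jsonl
    return {"run": run_dir, "val": [{"step": 100, "val/loss": 2.9}]}

def config_summary(data):   # receives data as a prop, no load of its own
    return f"run={data['run']}"

def metrics_charts(data):
    return len(data["val"])

def dashboard(run_dir):
    data = load_metrics(run_dir)  # the only load per tick
    return config_summary(data), metrics_charts(data)
```

In a reactive framework the equivalent is holding the loaded data in one memoized parent state and passing it as a prop, so children share a single subscription.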


Minor Issues

legal_tokens not in the del block inside the game-completion chunk loop

trainer.py:674:

del gc_input, gc_loss_mask, gc_game_lengths, gc_logits, gc_preds, legal_mask_t
# ↑ legal_tokens (the numpy array from engine.compute_legal_token_masks) is missing

It'll be freed when overwritten in the next iteration, so this isn't a memory leak, but it's inconsistent with the intent of the explicit del block.

val_loss_chart misleadingly named for pawn runs

charts.py:248-257 — for run_type == "pawn" the function now returns a perplexity chart, not a loss chart. The dashboard description says "Validation perplexity" which is clear, but the function named val_loss_chart returning a "Perplexity" plot will confuse anyone calling it.

CSS keyframe hardcodes EMERALD instead of using a variable

sol.py GLOBAL_CSS, @keyframes pawn-pulse block hardcodes rgba(52,211,153,…) rather than referencing --pawn-success. If EMERALD is ever changed in theme.py the animation color won't follow.


What's Good

  • _game_completion_chunk vectorization — clean and correct. The argmax on an all-False tensor returning 0 is correctly masked by has_forfeit downstream, and the preds_clamped clamp before gather with the separate in_range guard is the right way to handle out-of-vocab predictions.
  • theme.py — excellent extraction. CSS custom properties, semantic color aliases, and a single PLOTLY_LAYOUT dict gives real coherence across the chart grid and the shell.
  • error_rate_chart with log-linear fit — diagnostically valuable. The second-half only window for the fit is a reasonable heuristic for a training curve; slope + half-life in the legend is immediately interpretable.
  • Compound patience — the expanded stopping criteria (patience resets on any of loss / late-legality / completion / avg-plies) directly addresses the described use case where legality metrics keep improving after val loss plateaus. Persisting best_game_completion and best_avg_plies_completed in checkpoint state is the right call.
  • _row + Section pattern — much cleaner than the nested with solara.Columns blocks it replaces.
  • _escape() in Runner output — correct placement. Runner output comes from an external process and is the right boundary to sanitize before injecting into HTML.
  • Test update — moving layer_color import to theme and relaxing the title assertion to "Loss" in ... are both correct.

Address PR review nit — the numpy legal_tokens array was missing from
the explicit del block, so it lingered into the next loop iteration until
overwritten. No functional impact (Python frees it on rebind), but this
matches the intent of the explicit cleanup.
- Fix Game Completion KPI: metric key was val/avg_plies_to_forfeit but
  trainer logs val/avg_plies_completed, so the sub-label always read "—".
- Restore dot-notation in _flatten_config to prevent silent collisions
  between nested sections sharing a field name; update the highlight
  order to match.
- Drop lambda:None column placeholder — "Game Integrity & Stopping" now
  lives inside if val:, so nothing renders when val is empty.
- Dashboard loads metrics once per tick and passes data= to
  ConfigSummary and MetricsCharts, eliminating the 3x load_metrics calls
  per refresh.
- Split val_loss_chart (now always validation loss) from a new
  perplexity_chart used by the pawn pretraining branch.
- Rewrite .pawn-pulse + @keyframes pawn-pulse with color-mix so the
  badge background, border, and animation all derive from --pawn-success
  / currentColor instead of hardcoded emerald rgba.
- Error-rate description corrected: late illegal is plies ≥ context/2
  (second half of the context window), not the second half of games.
@thomas-schweich
Owner Author

@claude please provide a follow-up review.

@claude

claude bot commented Apr 10, 2026

Claude finished @thomas-schweich's task in 2m 59s


Follow-up Review

  • Gather context from previous review
  • Read current state of changed files
  • Check which issues from the first review were addressed
  • Synthesize and post follow-up review

All 7 issues called out in the first review were addressed in commits 73e5cfd and bae3555. Summary of what was fixed and a couple of residual observations:


All Previous Issues: Addressed ✓

Issue → Fix
  • Bug: wrong val/avg_plies_to_forfeit key in KpiRow → ✓ corrected to val/avg_plies_completed
  • _flatten_config dropped parent prefix (silent collision risk) → ✓ dot-notation restored; _CONFIG_HIGHLIGHT_ORDER updated to match
  • lambda: None placeholder passed as a column item → ✓ "Game Integrity & Stopping" section moved inside the if val: block
  • load_metrics called 3× per refresh tick → ✓ called once in Dashboard, passed as a data= prop to ConfigSummary and MetricsCharts
  • legal_tokens missing from the del block → ✓ added in 73e5cfd
  • val_loss_chart returning perplexity for pawn runs (misleading name) → ✓ split into val_loss_chart (unchanged) and a new perplexity_chart
  • CSS keyframe hardcoded EMERALD rgba → ✓ rewritten with color-mix(in srgb, currentColor …, transparent)

Residual Observations

val_loss_chart is now dead code

After the split, perplexity_chart is called for pawn runs (sol.py:904), but val_loss_chart is not called anywhere in sol.py. The adapter dashboard branches (hybrid, rosa, sparse, bottleneck, lora, film, tiny) all show training loss via loss_chart and have no validation chart section. The function's docstring says "Adapter runs only", but no adapter branch calls it.

The three test cases in TestValCharts are the only callers. This is purely a dead-code issue — no runtime impact — but it's worth either wiring it into adapter validation sections or removing it.

TestValCharts::test_val_loss_pawn_with_data is stale

charts.py:305:

fig = charts.val_loss_chart(SAMPLE_VAL, "step", "pawn")
assert fig is not None

SAMPLE_VAL contains "val/loss" (and "val/perplexity"), but the current val_loss_chart looks for the key "val_loss". The chart is generated successfully — make_chart handles missing keys gracefully — but it silently has no data. The test was written to verify that pawn produces a perplexity chart; that use case has moved to perplexity_chart. A corresponding test_perplexity_chart_with_data would better cover the code path that's actually live.


What Was Done Well

The color-mix() rewrite is clean — color-mix(in srgb, currentColor 55%, transparent) correctly derives the glow from currentColor (which is --pawn-success when active and --pawn-text-muted when idle), so the animation now follows the token color automatically. The param_count and formulation additions to _CONFIG_HIGHLIGHT_ORDER correctly reference top-level keys (not nested), which matches how trainer.py:477 emits them. The single-load data= prop threading is tidy — Dashboard owns the memo, ConfigSummary and MetricsCharts receive it, no duplicated reactive subscriptions.


The dead val_loss_chart and the stale test are minor. Everything else looks good — this PR is ready to merge.

@thomas-schweich thomas-schweich merged commit 0f96df2 into main Apr 10, 2026
1 check passed
@thomas-schweich thomas-schweich deleted the dashboard-style-and-stopping-criteria branch April 10, 2026 19:52