Dashboard style overhaul + expanded game-completion eval & stopping criteria#55

Merged
thomas-schweich merged 7 commits into main from dashboard-style-and-stopping-criteria on Apr 10, 2026

Conversation

@thomas-schweich
Owner

Summary

Dashboard style overhaul

Another agent did a cohesive styling pass on the Solara dashboard:

  • New pawn/dashboard/theme.py centralizes the color palette, Plotly layout, and typography so the chart grid and the shell share a single design system.
  • charts.py refactored to pull from the theme; titles bolded; log-scale error_rate_chart added with optional log-linear fit overlays (desaturated dashed lines, slope & half-life shown in the legend, e.g. s=-7.2e-06; hl=40,000).
  • sol.py reorganized into labeled sections with a dedicated Game Integrity & Stopping row pairing the error rate chart against the patience chart.
  • val_accuracy_chart trimmed to Top-1 / Top-5 — legal and late-legal now live on the error rate chart where their small dynamic range is actually visible.
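The log-linear fit overlay can be sketched as a small standalone routine (a minimal sketch; the function and argument names here are illustrative, not the dashboard's actual API). It fits log10(error_rate) against step over the second half of the curve and converts the slope into a half-life, matching the legend format above (e.g. s=-7.2e-06 gives hl ≈ 42,000 steps):

```python
import numpy as np

def log_linear_fit(steps, error_rates):
    """Fit log10(error_rate) = slope * step + intercept over the second
    half of the curve; return (slope, half_life_in_steps)."""
    steps = np.asarray(steps, dtype=float)
    rates = np.asarray(error_rates, dtype=float)
    # Fit only the second half of the run, where the trend is cleanest.
    half = len(steps) // 2
    s, r = steps[half:], rates[half:]
    mask = r > 0  # log of zero is undefined; drop zero-error points
    slope, intercept = np.polyfit(s[mask], np.log10(r[mask]), 1)
    # Steps for the error rate to halve: 10^(slope * d) = 1/2.
    half_life = np.log10(2) / -slope if slope < 0 else float("inf")
    return slope, half_life
```

The second-half window is the same heuristic the review later calls out as reasonable for a training curve: early-training transients would otherwise dominate the fit.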

Game completion eval — vectorized, full val set

  • New _game_completion_chunk + _aggregate_game_completion replace the Python per-game loop with a fully vectorized first-illegal-ply search using torch.gather + argmax.
  • Runs over the entire validation set (previously limited to 64 games) by batching in batch_size chunks; peak VRAM is independent of val_games.
  • Adds min_forfeit_ply, max_forfeit_ply, and median_forfeit_ply statistics across games that actually forfeited (0 if none). Surfaced in the val log line as forfeit [min-max med N].
  • Example line from the live run: val: loss 2.8940 ... complete 0.941 | avg_ply 331 | forfeit [58-256 med 122] — revealing that all forfeits cluster in the first half of games while late-game legality is effectively perfect.
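The core of the vectorized search can be sketched like this (shapes and names are assumptions for illustration; the real _game_completion_chunk also clamps out-of-vocab predictions before the gather and adds a separate in-range guard):

```python
import torch

def first_illegal_ply(preds, legal_mask, game_lengths):
    """Find each game's first illegal predicted ply without a Python loop.

    preds:        (B, T) long tensor of predicted move tokens
    legal_mask:   (B, T, V) bool tensor, True where a token is legal
    game_lengths: (B,) long tensor of real game lengths (to mask padding)
    """
    B, T = preds.shape
    # Legality of each predicted token at each ply: (B, T) bool.
    legal = legal_mask.gather(2, preds.unsqueeze(-1)).squeeze(-1)
    ply = torch.arange(T, device=preds.device).unsqueeze(0)
    in_game = ply < game_lengths.unsqueeze(1)  # ignore padding plies
    illegal = (~legal) & in_game
    # argmax returns the index of the first True; it returns 0 when the
    # row is all-False, so has_forfeit must be checked before using it.
    first = illegal.int().argmax(dim=1)
    has_forfeit = illegal.any(dim=1)
    return first, has_forfeit
```

The min/max/median forfeit-ply statistics then reduce over first[has_forfeit] across all chunks; batching the (B, T, V) gather in batch_size chunks is what keeps peak VRAM independent of val_games.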

Expanded stopping criteria

  • Patience now resets on improvements to any of: val_loss, late_legal_move_rate, game_completion_rate, or avg_plies_completed. The new metrics often move even when val loss has plateaued, so this prevents premature early-stopping on runs where compound legality is still improving.
  • best_game_completion and best_avg_plies_completed are persisted in checkpoint state so they survive resume.
  • The trainer now logs patience and legality_late_ply into the training-config record so the dashboard can read the actual patience limit instead of falling back to a hardcoded default.
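The compound patience logic can be sketched as follows (a minimal sketch with hypothetical field names; lower-is-better for val_loss, higher-is-better for the three legality metrics, mirroring the criteria listed above):

```python
def update_patience(state, metrics):
    """Reset patience when ANY tracked metric improves; otherwise decay.
    Returns True while training should continue. Field names are
    illustrative, not the trainer's actual state layout."""
    improved = False
    if metrics["val_loss"] < state["best_val_loss"]:
        state["best_val_loss"] = metrics["val_loss"]
        improved = True
    for key in ("late_legal_move_rate", "game_completion_rate",
                "avg_plies_completed"):
        if metrics[key] > state["best_" + key]:
            state["best_" + key] = metrics[key]  # persisted in checkpoints
            improved = True
    if improved:
        state["patience_left"] = state["patience"]
    else:
        state["patience_left"] -= 1
    return state["patience_left"] > 0
```

Because the best_* values live in the same state dict that is checkpointed, a resumed run cannot spuriously "improve" against a reset baseline.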

Test plan

  • pyright pawn/trainer.py pawn/dashboard/{charts,sol,theme}.py tests/lab/test_dashboard_charts.py clean
  • pytest tests/ -m "not gpu" — 1,339 passed
  • Ground-truth / random / mixed sanity checks on compute_game_completion produce expected completion rates
  • Legal rate (~99.6%) and completion rate (~94.1% on full val set) verified against the live 512-ctx pretraining run
  • CI passes

… criteria

Dashboard:
- New `pawn/dashboard/theme.py` centralizes color palette, layout, and
  styling so charts and the Solara shell share one coherent system.
- charts.py refactored to pull from the theme; titles bolded; log-scale
  error_rate_chart added with optional log-linear fit overlays (desaturated
  dashed lines, half-life shown in legend).
- sol.py reorganized into sections, with the Game Integrity section
  pairing the error rate chart against the patience chart.
- val_accuracy_chart trimmed to Top-1/Top-5 (legal/late-legal moved to
  the error rate chart so accuracy scale is readable).

Game completion eval:
- Fully vectorized via `_game_completion_chunk` + `_aggregate_game_completion`:
  no Python per-game loop, processes the full val set in batch_size chunks,
  peak memory independent of val_games.
- Adds min/max/median forfeit-ply statistics across games that actually
  forfeited (0 if none). Surfaced in val log line as `forfeit [min-max med N]`.
- Runs over the full validation set (was limited to 64 games).

Stopping criteria:
- Patience now also resets on improvements to game_completion_rate and
  avg_plies_completed, not just val_loss and late_legal_move_rate.
- best_game_completion and best_avg_plies_completed persisted in
  checkpoint state so they survive resume.
- Trainer logs `patience` and `legality_late_ply` into the training-config
  record so downstream consumers (dashboard) can see them.

Tests updated for the theme refactor (layer_color moved to theme module,
titles wrapped in <b>...</b> for bold rendering).
- Compound early stopping note now lists all four reset criteria
  (val_loss, late_legal, game_completion_rate, avg_plies_completed)
  matching the trainer change in this PR.
- New "Benchmark the Pod" startup step before launching trials, so
  the agent has ground-truth step times, compile speedup, and
  concurrency scaling for *this* pod when planning.
- Always use torch.compile by default; the warmup is cheap relative
  to the 1.5-2.2x speedup, even for short runs.
- VRAM caveat removed (skill is pod-focused; pod GPUs aren't VRAM-constrained).
- max_seq_len default updated to 512.
- Tools reference: add lab_resume, document tag filter on lab_results,
  health_warning event type, and graceful-checkpoint behavior on lab_kill.
- Drop the stale 15-30 min compile overhead figure; replace with the
  measured 10-30 s (NVIDIA) / 1-2 min (AMD) numbers.
- Note that uv run works in dev images post #53.

.dockerignore: un-ignore .claude/skills so the manage-pod skill ships
with the dev image (the rest of .claude stays excluded).
@thomas-schweich
Owner Author

@claude Please review

@claude

claude bot commented Apr 10, 2026

Claude finished @thomas-schweich's task in 2m 55s


PR Review

  • Gather context and understand PR scope
  • Read pawn/trainer.py changes
  • Read pawn/dashboard/theme.py (new file)
  • Read pawn/dashboard/charts.py changes
  • Read pawn/dashboard/sol.py changes
  • Read test changes
  • Synthesize and post review

Summary

This is a solid, well-organized PR. The vectorized game completion eval is correct and a clear improvement over the Python loop. The dashboard overhaul is cohesive and the theme.py extraction is the right architecture. One confirmed bug and a few smaller issues are called out below.


Bug

sol.py:446 — wrong metric key in KpiRow, Game Completion tile always shows "—"

# sol.py:446
avg_ply = _last(val, "val/avg_plies_to_forfeit")   # ← key doesn't exist

The trainer writes "val/avg_plies_completed" (trainer.py:683); the "val/avg_plies_to_forfeit" key is never logged, so the KPI always renders "—".


Correctness Concerns

_flatten_config drops parent key prefix, introducing potential silent collisions

sol.py:616-626 — the old implementation used f"{k}.{k2}" for nested keys; the new one uses only k2. If model and training ever share a key name (e.g. both had a dropout field), the second would silently overwrite the first. In practice PAWN's model/training dicts don't collide today, but it's a regression in correctness:

# Before (safe):  flat["model.d_model"] = ...
# After (unsafe): flat["d_model"] = ...  (overwrites if training also has d_model)

Worth either keeping the dot-notation or at minimum asserting no collisions in a test.
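The safe dot-notation flattening could look like this minimal sketch (not the actual _flatten_config implementation; names are illustrative):

```python
def flatten_config(config, prefix=""):
    """Flatten nested config dicts into dot-notation keys so sections
    that share a field name (e.g. model.dropout vs training.dropout)
    cannot silently overwrite each other."""
    flat = {}
    for k, v in config.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            flat.update(flatten_config(v, key))  # recurse with prefix
        else:
            flat[key] = v
    return flat
```

With the prefix preserved, a collision is structurally impossible rather than merely absent from today's configs.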

_row passes lambda: None as a column item when val is empty

sol.py in the "Game Integrity" section:

card(...) if val else (lambda: None),

_row calls every item as item(). lambda: None returns None and Solara will render an empty column cell, leaving a visual gap. If val is empty there's also nothing to show in the error rate chart — consider only rendering the _row when val is non-empty.


Performance

load_metrics is called 3× per refresh tick

sol.py:634, sol.py:749, sol.py:979 — Dashboard, ConfigSummary, and MetricsCharts each call load_metrics independently with identical dependencies. The new KpiRow gets its data from the Dashboard call, but the other two components still do their own separate loads. For large metrics.jsonl files this is a meaningful overhead. A shared reactive state at the Dashboard level (passing data as a prop) would eliminate the duplication.
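Framework details aside, the single-load pattern being suggested reduces to hoisting the load into the parent and threading the result down (all names here are hypothetical stand-ins for the Solara components):

```python
# Hypothetical sketch: the parent loads once per refresh tick and passes
# the result down, instead of each child re-reading metrics.jsonl.
call_count = {"loads": 0}

def load_metrics(run_dir):
    call_count["loads"] += 1  # stand-in for parsing metrics.jsonl
    return {"run": run_dir, "val": [{"step": 100, "val/loss": 2.9}]}

def config_summary(data):   # receives data as a prop, no load of its own
    return f"run={data['run']}"

def metrics_charts(data):
    return len(data["val"])

def dashboard(run_dir):
    data = load_metrics(run_dir)  # the only load per tick
    return config_summary(data), metrics_charts(data)
```

In a reactive framework the equivalent is holding the loaded data in one memoized parent state and passing it as a prop, so children share a single subscription.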


Minor Issues

legal_tokens not in the del block inside the game-completion chunk loop

trainer.py:674:

del gc_input, gc_loss_mask, gc_game_lengths, gc_logits, gc_preds, legal_mask_t
# ↑ legal_tokens (the numpy array from engine.compute_legal_token_masks) is missing

It'll be freed when overwritten in the next iteration, so this isn't a memory leak, but it's inconsistent with the intent of the explicit del block.

val_loss_chart misleadingly named for pawn runs

charts.py:248-257 — for run_type == "pawn" the function now returns a perplexity chart, not a loss chart. The dashboard description says "Validation perplexity" which is clear, but the function named val_loss_chart returning a "Perplexity" plot will confuse anyone calling it.

CSS keyframe hardcodes EMERALD instead of using a variable

sol.py GLOBAL_CSS, @keyframes pawn-pulse block hardcodes rgba(52,211,153,…) rather than referencing --pawn-success. If EMERALD is ever changed in theme.py the animation color won't follow.


What's Good

  • _game_completion_chunk vectorization — clean and correct. The argmax on an all-False tensor returning 0 is correctly masked by has_forfeit downstream, and the preds_clamped clamp before gather with the separate in_range guard is the right way to handle out-of-vocab predictions.
  • theme.py — excellent extraction. CSS custom properties, semantic color aliases, and a single PLOTLY_LAYOUT dict gives real coherence across the chart grid and the shell.
  • error_rate_chart with log-linear fit — diagnostically valuable. The second-half only window for the fit is a reasonable heuristic for a training curve; slope + half-life in the legend is immediately interpretable.
  • Compound patience — the expanded stopping criteria (patience resets on any of loss / late-legality / completion / avg-plies) directly addresses the described use case where legality metrics keep improving after val loss plateaus. Persisting best_game_completion and best_avg_plies_completed in checkpoint state is the right call.
  • _row + Section pattern — much cleaner than the nested with solara.Columns blocks it replaces.
  • _escape() in Runner output — correct placement. Runner output comes from an external process and is the right boundary to sanitize before injecting into HTML.
  • Test update — moving layer_color import to theme and relaxing the title assertion to "Loss" in ... are both correct.

Address PR review nit — the numpy legal_tokens array was missing from
the explicit del block, so it lingered into the next loop iteration until
overwritten. No functional impact (Python frees it on rebind), but this
matches the intent of the explicit cleanup.
- Fix Game Completion KPI: metric key was val/avg_plies_to_forfeit but
  trainer logs val/avg_plies_completed, so the sub-label always read "—".
- Restore dot-notation in _flatten_config to prevent silent collisions
  between nested sections sharing a field name; update the highlight
  order to match.
- Drop lambda:None column placeholder — "Game Integrity & Stopping" now
  lives inside if val:, so nothing renders when val is empty.
- Dashboard loads metrics once per tick and passes data= to
  ConfigSummary and MetricsCharts, eliminating the 3x load_metrics calls
  per refresh.
- Split val_loss_chart (now always validation loss) from a new
  perplexity_chart used by the pawn pretraining branch.
- Rewrite .pawn-pulse + @keyframes pawn-pulse with color-mix so the
  badge background, border, and animation all derive from --pawn-success
  / currentColor instead of hardcoded emerald rgba.
- Error-rate description corrected: late illegal is plies ≥ context/2
  (second half of the context window), not the second half of games.
@thomas-schweich
Owner Author

@claude please provide a follow-up review.

@claude

claude bot commented Apr 10, 2026

Claude finished @thomas-schweich's task in 2m 59s


Follow-up Review

  • Gather context from previous review
  • Read current state of changed files
  • Check which issues from the first review were addressed
  • Synthesize and post follow-up review

All 7 issues called out in the first review were addressed in commits 73e5cfd and bae3555. Summary of what was fixed and a couple of residual observations:


All Previous Issues: Addressed ✓

Issue → Fix
  • Bug: wrong val/avg_plies_to_forfeit key in KpiRow → ✓ corrected to val/avg_plies_completed
  • _flatten_config dropped parent prefix (silent collision risk) → ✓ dot-notation restored; _CONFIG_HIGHLIGHT_ORDER updated to match
  • lambda: None placeholder passed as a column item → ✓ "Game Integrity & Stopping" section moved inside the if val: block
  • load_metrics called 3× per refresh tick → ✓ called once in Dashboard, passed as a data= prop to ConfigSummary and MetricsCharts
  • legal_tokens missing from the del block → ✓ added in 73e5cfd
  • val_loss_chart returning perplexity for pawn runs (misleading name) → ✓ split into val_loss_chart (unchanged) and a new perplexity_chart
  • CSS keyframe hardcoded EMERALD rgba → ✓ rewritten with color-mix(in srgb, currentColor …, transparent)

Residual Observations

val_loss_chart is now dead code

After the split, perplexity_chart is called for pawn runs (sol.py:904), but val_loss_chart is not called anywhere in sol.py. The adapter dashboard branches (hybrid, rosa, sparse, bottleneck, lora, film, tiny) all show training loss via loss_chart and have no validation chart section. The function's docstring says "Adapter runs only", but no adapter branch calls it.

The three test cases in TestValCharts are the only callers. This is purely a dead-code issue — no runtime impact — but it's worth either wiring it into adapter validation sections or removing it.

TestValCharts::test_val_loss_pawn_with_data is stale

charts.py:305:

fig = charts.val_loss_chart(SAMPLE_VAL, "step", "pawn")
assert fig is not None

SAMPLE_VAL contains "val/loss" (and "val/perplexity"), but the current val_loss_chart looks for the key "val_loss". The chart is generated successfully — make_chart handles missing keys gracefully — but it silently has no data. The test was written to verify that pawn produces a perplexity chart; that use case has moved to perplexity_chart. A corresponding test_perplexity_chart_with_data would better cover the code path that's actually live.


What Was Done Well

The color-mix() rewrite is clean — color-mix(in srgb, currentColor 55%, transparent) correctly derives the glow from currentColor (which is --pawn-success when active and --pawn-text-muted when idle), so the animation now follows the token color automatically. The param_count and formulation additions to _CONFIG_HIGHLIGHT_ORDER correctly reference top-level keys (not nested), which matches how trainer.py:477 emits them. The single-load data= prop threading is tidy — Dashboard owns the memo, ConfigSummary and MetricsCharts receive it, no duplicated reactive subscriptions.


The dead val_loss_chart and the stale test are minor. Everything else looks good — this PR is ready to merge.

@thomas-schweich thomas-schweich merged commit 0f96df2 into main Apr 10, 2026
1 check passed
@thomas-schweich thomas-schweich deleted the dashboard-style-and-stopping-criteria branch April 10, 2026 19:52