Surface forfeit-rate log-linear fit in lab_status for pretraining runs #56
thomas-schweich merged 2 commits into main
Conversation
The log-linear fit on forfeit rate has become the most informative
late-stage convergence signal — it keeps moving after val_loss, accuracy,
legal_move_rate, and late_legal_move_rate have all plateaued. The
dashboard chart shows it, but the MCP status tool didn't, so a
monitoring agent couldn't see the trend without reading the val log lines
by hand.
New helper `read_pretrain_val_summary(trial)` in monitor.py scans the
trial's metrics.jsonl for val records with game_completion_rate and returns:
- latest: all key val fields from the most recent val record
- forfeit_fit: log-linear regression over the most recent half of
(step, forfeit_rate) points, reported as slope_per_step,
half_life_steps, current_forfeit, n_points. Matches the dashboard's
fit exactly (same second-half-of-history window).
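The forfeit_fit computation described above can be sketched as follows. This is an illustrative reconstruction, not the actual monitor.py code: the name `forfeit_fit_sketch` and its argument shapes are assumptions; only the second-half window, the log-linear OLS, and the reported fields come from this PR.

```python
import math

def forfeit_fit_sketch(steps, forfeit_rates):
    """Illustrative log-linear fit over the most recent half of the history.

    Fits ln(forfeit_rate) = a + b*step by ordinary least squares and
    reports b as slope_per_step. Hypothetical sketch -- the real helper
    lives in monitor.py.
    """
    # Same window as the dashboard: the second half of the history.
    half = len(steps) // 2
    xs, ys = [], []
    for s, f in zip(steps[half:], forfeit_rates[half:]):
        if f > 0:  # log is undefined at zero, so zero-forfeit points are skipped
            xs.append(s)
            ys.append(math.log(f))
    if len(xs) < 4:  # too few points for a meaningful fit
        return None
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # per-step change in ln(forfeit_rate)
    return {
        "slope_per_step": slope,
        # Steps for the forfeit rate to halve; defined only while falling.
        "half_life_steps": math.log(2) / -slope if slope < 0 else None,
        "current_forfeit": forfeit_rates[-1],  # last of the *full* series
        "n_points": n,
    }
```

Filtering out exact zeros before taking logs is why a run whose forfeit rate has hit 0 everywhere yields no forfeit_fit at all, which matches the behavior the test list below pins down.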
`status()` in runner.py attaches the summary under `pretrain` on running
trial rows when run_type == "pretrain". Metrics.jsonl is the source of
truth — no new state tracked on the Trial object, so recovery and resume
Just Work.
Updated lab_status docstring so agents know the new field exists.
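The attachment logic can be pictured with a small sketch. `read_pretrain_val_summary` is the PR's helper; the row dictionary shape and the function name `attach_pretrain_summary` are assumptions for illustration, not the actual runner.py code.

```python
# Hypothetical sketch of how status() might attach the summary to a
# running pretrain row. The row keys here are assumed, not the real schema.
def attach_pretrain_summary(row, trial, read_pretrain_val_summary):
    if row.get("state") == "running" and row.get("run_type") == "pretrain":
        # The helper returns None when there is nothing to report
        # (missing run dir, missing metrics.jsonl, or no val records).
        summary = read_pretrain_val_summary(trial)
        if summary is not None:
            row["pretrain"] = summary
    return row
```

Because the summary is recomputed from metrics.jsonl on each call, nothing extra needs to survive a restart for this to keep working after recovery or resume.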
Claude finished @thomas-schweich's task in 1m 48s — View job
Code Review
Overall this is clean and well-scoped. The implementation is straightforward, the docstrings are good, and the integration into status() is minimal.
1. Redundant local import
   # Line 8 — already there
   import math
   # Line 192 — redundant
   import math as _math
   Replace all uses of _math with the module-level math.
2. No test coverage for read_pretrain_val_summary
   The PR description mentions "Verified on live trial 23", which is good, but that's not a regression guard.
3. current_forfeit comes from the full series, not the fit window
   summary["forfeit_fit"] = {
       ...
       "current_forfeit": forfeit_rates[-1],  # full-series last, not half-window last
   }
   This is consistent and probably what you want (most recent observed value regardless of window). However, as the model approaches 100% completion, the last record's forfeit rate can reach exactly 0, which the log fit excludes, so current_forfeit and the fitted trend can diverge.
4. Docstring omits current_forfeit
   Minor, but the docstring should list it alongside the other forfeit_fit fields.
5. Performance: full file scan on every status() call
   Informational.
Summary: Items 1 and 4 are small and easy to fix; item 2 (missing tests) is the only thing I'd consider blocking given that the regression logic is non-trivial. Items 3 and 5 are informational. The overall design — reading from metrics.jsonl rather than tracking new state on the Trial object — is sound.
1. Drop the redundant local 'import math as _math' — math is already
imported at module level.
2. Add test coverage for read_pretrain_val_summary:
- returns None when run_dir is missing
- returns None when metrics.jsonl is missing
- returns None when there are no val records
- returns None when val records exist but lack game_completion_rate
(adapter runs)
- returns only 'latest' when fewer than 4 val records are present
- returns correct slope and half_life for a synthetic exponential
decay series (OLS recovers the known k within 1e-6)
- omits forfeit_fit entirely when all forfeit rates are exactly 0
- latest dict carries all documented fields
- current_forfeit tracks the unfiltered last-of-full-series value
Plus runner integration tests:
- runner.status() attaches 'pretrain' block to pretrain run_type rows
- runner.status() omits 'pretrain' block for adapter run_type rows
3. Clarify forfeit_fit fields in the docstring: list current_forfeit
and explain that it's the last value of the *full* forfeit series
(unlike slope/half_life, which come from the half-window OLS).
@claude please re-review
Claude finished @thomas-schweich's task in 1m 40s — View job
Code Review
All four actionable items from the first review are resolved. Here's the full accounting:
1. Redundant import — FIXED. The local re-import is gone.
2. Missing tests — FIXED.
3. current_forfeit semantics — FIXED. The docstring now explains that it is the last value of the full series.
4. Docstring missing current_forfeit — FIXED.
5. Performance: full file scan — no change (informational, acceptable). Still a full scan from byte 0 on every status() call.
One minor nit not worth blocking on: …
LGTM. All blocking issues resolved, test coverage is solid, implementation is correct.
Summary
The log-linear fit on forfeit rate has become the most informative late-stage convergence signal for the 512-ctx pretraining runs — it keeps moving after val_loss, accuracy, and legal/late_legal rates have all plateaued. The dashboard chart surfaces it, but the `lab_status` MCP tool didn't, so a monitoring agent couldn't see the trend without manually parsing val log lines.
This PR adds a `pretrain` block to each running `run_type=pretrain` row in `lab_status`, carrying:
- `latest`: the key val fields from the most recent val record
- `forfeit_fit`: the log-linear fit over the most recent half of the (step, forfeit_rate) history — `slope_per_step`, `half_life_steps`, `current_forfeit`, `n_points`
Example output on the live 512-ctx run at step 272,000:
```json
{
"latest": {
"step": 272000,
"val_loss": 2.887,
"game_completion_rate": 0.961,
"avg_plies_completed": 335.3,
"forfeit_ply_min": 28,
"forfeit_ply_max": 438,
"forfeit_ply_median": 120
},
"forfeit_fit": {
"slope_per_step": -1.11e-05,
"half_life_steps": 62368,
"n_points": 66,
"current_forfeit": 0.0391
}
}
```
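As a sanity check on the payload above: for any log-linear fit, the half-life is `ln(2) / |slope_per_step|`, so the two fit fields can be cross-checked against each other. Recomputing from the rounded slope lands close to, but not exactly on, the reported value:

```python
import math

# Rounded slope taken from the example payload above.
slope_per_step = -1.11e-05
half_life = math.log(2) / -slope_per_step
print(round(half_life))  # → 62446; the reported 62368 comes from the unrounded slope
```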
Implementation
Test plan