
Surface forfeit-rate log-linear fit in lab_status for pretraining runs #56

Merged
thomas-schweich merged 2 commits into main from lab-pretrain-forfeit-fit on Apr 11, 2026

Conversation

@thomas-schweich
Owner

Summary

The log-linear fit on forfeit rate has become the most informative late-stage convergence signal for the 512-ctx pretraining runs — it keeps moving after val_loss, accuracy, and legal/late_legal rates have all plateaued. The dashboard chart surfaces it, but the `lab_status` MCP tool didn't, so a monitoring agent couldn't see the trend without manually parsing val log lines.

This PR adds a `pretrain` block to each running `run_type=pretrain` row in `lab_status`, carrying:

  • `latest`: game_completion_rate, avg_plies_completed, forfeit_ply min/max/median, legal/late_legal, val_loss from the most recent val record
  • `forfeit_fit`: log-linear regression over the most recent half of the (step, forfeit_rate) series — reports slope_per_step, half_life_steps, current_forfeit, and n_points. The fit window matches the dashboard's second-half-of-history heuristic.

Example output on the live 512-ctx run at step 272,000:

```json
{
  "latest": {
    "step": 272000,
    "val_loss": 2.887,
    "game_completion_rate": 0.961,
    "avg_plies_completed": 335.3,
    "forfeit_ply_min": 28,
    "forfeit_ply_max": 438,
    "forfeit_ply_median": 120
  },
  "forfeit_fit": {
    "slope_per_step": -1.11e-05,
    "half_life_steps": 62368,
    "n_points": 66,
    "current_forfeit": 0.0391
  }
}
```
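For reference, the fit behind these numbers can be sketched as a standalone function. This is illustrative only, not the actual `monitor.py` implementation; the half-window and positive-rate filter mirror the description above, and the function name is hypothetical:

```python
import math

def forfeit_fit(steps, rates):
    """OLS fit of log(forfeit_rate) vs. step over the most recent half
    of the series, matching the dashboard's second-half-of-history window.
    Returns None when fewer than two positive-rate points remain."""
    half = len(steps) // 2
    # Zero rates can't be log-transformed, so the window keeps positives only.
    pts = [(s, r) for s, r in zip(steps[half:], rates[half:]) if r > 0]
    if len(pts) < 2:
        return None
    xs = [s for s, _ in pts]
    ys = [math.log(r) for _, r in pts]
    n = len(pts)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Time for the rate to halve, defined only while the rate is decaying.
    half_life = math.log(2) / -slope if slope < 0 else None
    return {
        "slope_per_step": slope,
        "half_life_steps": half_life,
        "n_points": n,
        "current_forfeit": rates[-1],  # last of the full series, not the window
    }
```

On a clean exponential decay the OLS slope recovers the decay constant exactly, which is what makes `half_life_steps` a meaningful summary of the trend.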

Implementation

  • New `read_pretrain_val_summary(trial)` helper in `pawn/lab/monitor.py` scans `metrics.jsonl` directly — metrics.jsonl is the source of truth, so there's no new `Trial` state to track and no drift across resume/recovery.
  • `runner.status()` attaches the summary under `pretrain` only when `run_type == "pretrain"` and when val records with `game_completion_rate` exist (adapter runs and pre-first-eval pretraining trials skip it silently).
  • `lab_status` tool docstring updated so the agent knows the new field exists.
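The scanning step can be sketched roughly as follows. This is a simplified stand-in, not the real helper: the function name `read_val_records` is hypothetical, and deriving forfeit rate as the complement of `game_completion_rate` is an assumption made for illustration:

```python
import json
from pathlib import Path

def read_val_records(run_dir):
    """Collect (step, forfeit_rate) pairs from val records in metrics.jsonl.

    Only records carrying game_completion_rate count: train records and
    adapter-run val records lack the field and are skipped silently.
    """
    path = Path(run_dir) / "metrics.jsonl"
    if not path.is_file():
        return []
    records = []
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        gc = rec.get("game_completion_rate")
        if gc is None:
            continue  # not a pretraining val record
        records.append((rec["step"], 1.0 - gc))
    return records
```

Because the file is re-read from disk on each call, the summary always reflects whatever the trainer last appended, with no per-Trial state to invalidate across resume or recovery.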

Test plan

  • `pyright pawn/lab/{monitor,runner,server}.py` clean
  • `pytest tests/lab` — 250 passed
  • Verified on the live trial 23 metrics file: helper returns the expected shape and the fit matches the dashboard (hl ≈ 62K at step 272K)
  • CI passes

The log-linear fit on forfeit rate has become the most informative
late-stage convergence signal — it keeps moving after val_loss, accuracy,
legal_move_rate, and late_legal_move_rate have all plateaued. The
dashboard chart shows it, but the MCP status tool didn't, so a
monitoring agent couldn't see the trend without reading the val log lines
by hand.

New helper `read_pretrain_val_summary(trial)` in monitor.py scans the
trial's metrics.jsonl for val records with game_completion_rate and returns:
  - latest: all key val fields from the most recent val record
  - forfeit_fit: log-linear regression over the most recent half of
    (step, forfeit_rate) points, reported as slope_per_step,
    half_life_steps, current_forfeit, n_points. Matches the dashboard's
    fit exactly (same second-half-of-history window).

`status()` in runner.py attaches the summary under `pretrain` on running
trial rows when run_type == "pretrain". Metrics.jsonl is the source of
truth — no new state tracked on the Trial object, so recovery and resume
Just Work.

Updated lab_status docstring so agents know the new field exists.
@thomas-schweich
Owner Author

@claude


claude bot commented Apr 11, 2026

Claude finished @thomas-schweich's task in 1m 48s


Code Review

Overall this is clean and well-scoped. The implementation is straightforward, the docstrings are good, and the integration into status() is minimal. A few things worth flagging:


1. Redundant local import math as _math (monitor.py:192)

math is already imported at the top of the file (line 8). The local re-import inside the function body is unnecessary.

```python
# Line 8 — already there
import math

# Line 192 — redundant
import math as _math
```

Replace all `_math.log` / `_math.log(2)` calls with `math.log` / `math.log(2)`.


2. No test coverage for read_pretrain_val_summary

tests/lab/test_monitor.py doesn't import or test the new function. TestStatus in test_runner.py also has no test for the pretrain block being attached to pretraining rows. For a function doing manual OLS regression over file data, at minimum you'd want:

  • returns None with no run_dir
  • returns None with no metrics.jsonl
  • returns only latest when n < 4
  • forfeit_fit fields are correct for a known (step, gc) series
  • all-zero forfeit rates (100% completion) → no forfeit_fit
  • runner.status() includes pretrain key when run_type == "pretrain" and omits it for adapter runs

The PR description mentions "Verified on live trial 23" which is good, but that's not a regression guard.


3. current_forfeit comes from forfeit_rates[-1] (full series), not the fit window

```python
summary["forfeit_fit"] = {
    ...
    "current_forfeit": forfeit_rates[-1],   # full-series last, not half-window last
}
```

This is consistent and probably what you want (most recent observed value regardless of window). However, as the model approaches 100% completion — when the last overall record has forfeit_rate == 0.0 but the second-half window has positive values — the pos filter will exclude that zero and the fit would still be computed, but current_forfeit reports 0. That's actually fine and informative. Just worth being explicit about in the docstring.


4. Docstring lists forfeit_fit fields as {slope, half_life_steps, n_points} but the actual dict also has current_forfeit

Minor, but the docstring in read_pretrain_val_summary omits current_forfeit from the listed fields. The lab_status tool docstring in server.py does list it correctly.


5. Performance: full file scan on every status() call

read_pretrain_val_summary opens and reads metrics.jsonl from byte 0 every time lab_status is called. read_metrics avoids this via offset tracking. At 272K steps with eval every ~2K steps, that's ~130+ lines to scan each time. For a monitoring tool called by an agent every few seconds this is probably fine, but it's worth noting the asymmetry. If it becomes a problem, the summary could be cached on the Trial object and invalidated by the existing offset mechanism.
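If that optimization is ever needed, one cheap variant that avoids new per-Trial bookkeeping is to key the cache on file size, which is a valid staleness signal only under the assumption that metrics.jsonl is append-only. A sketch (class and attribute names are hypothetical, not from the PR):

```python
import os

class SummaryCache:
    """Recompute a file-derived summary only when the file has grown.

    Correct only for append-only files: size monotonically increases,
    so an unchanged size means an unchanged summary.
    """

    def __init__(self, compute):
        self._compute = compute  # callable(path) -> summary
        self._size = -1
        self._value = None

    def get(self, path):
        size = os.path.getsize(path) if os.path.exists(path) else 0
        if size != self._size:
            self._value = self._compute(path)
            self._size = size
        return self._value
```

This trades one `stat()` call per `status()` for a full re-scan, at the cost of holding the last summary in memory.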


Summary: Items 1 and 4 are small and easy to fix; item 2 (missing tests) is the only thing I'd consider blocking given that the regression logic is non-trivial. Items 3 and 5 are informational. The overall design — reading from metrics.jsonl directly as the source of truth, attaching the block only for pretraining runs with sufficient val history — is correct.

1. Drop the redundant local 'import math as _math' — math is already
   imported at module level.

2. Add test coverage for read_pretrain_val_summary:
   - returns None when run_dir is missing
   - returns None when metrics.jsonl is missing
   - returns None when there are no val records
   - returns None when val records exist but lack game_completion_rate
     (adapter runs)
   - returns only 'latest' when fewer than 4 val records are present
   - returns correct slope and half_life for a synthetic exponential
     decay series (OLS recovers the known k within 1e-6)
   - omits forfeit_fit entirely when all forfeit rates are exactly 0
   - latest dict carries all documented fields
   - current_forfeit tracks the unfiltered last-of-full-series value

   Plus runner integration tests:
   - runner.status() attaches 'pretrain' block to pretrain run_type rows
   - runner.status() omits 'pretrain' block for adapter run_type rows

3. Clarify forfeit_fit fields in the docstring: list current_forfeit
   and explain that it's the last value of the *full* forfeit series
   (unlike slope/half_life, which come from the half-window OLS).
@thomas-schweich
Owner Author

@claude please re-review


claude bot commented Apr 11, 2026

Claude finished @thomas-schweich's task in 1m 40s


Code Review

All four actionable items from the first review are resolved. Here's the full accounting:


1. Redundant import math as _math — FIXED

The local re-import is gone. monitor.py uses the module-level import math throughout read_pretrain_val_summary (lines 199, 207).


2. Missing tests — FIXED

TestReadPretrainValSummary in test_monitor.py now covers every case I flagged:

  • test_no_run_dir_returns_none / test_missing_metrics_file_returns_none
  • test_no_val_records_returns_none / test_val_records_without_game_completion_return_none (adapter silencing)
  • test_returns_latest_without_fit_when_too_few_records (n < 4)
  • test_returns_fit_on_known_series — exact exponential decay, slope verified to rel=1e-6
  • test_all_zero_forfeit_omits_fit (100% completion → no forfeit_fit)
  • test_latest_records_carries_all_fields
  • test_current_forfeit_is_last_of_full_series

test_runner.py adds test_status_attaches_pretrain_block_for_pretrain_runs and test_status_omits_pretrain_block_for_adapter_runs. Both are correctly wired.


3. current_forfeit from full series vs. fit window — Documented + tested

The docstring (monitor.py:131–135) now explicitly calls this out. The new test_current_forfeit_is_last_of_full_series test constructs a series ending in forfeit=0.0 and asserts that current_forfeit reports 0.0 even when the OLS window excluded that point. Behavior is intentional and verified.


4. Docstring missing current_forfeit — FIXED

monitor.py:125–128 now lists {slope_per_step, half_life_steps, n_points, current_forfeit} correctly.


5. Performance: full file scan — No change (informational, acceptable)

Still a full scan from byte 0 on every status() call, as before. Fine at current scale; caching on Trial remains a valid future optimization if it becomes a bottleneck.


One minor nit not worth blocking on: test_current_forfeit_is_last_of_full_series wraps the key assertion in if "forfeit_fit" in summary:, which means the test passes vacuously if the fit is skipped. Given the input is constructed so that the fit will be computed (11 records, 5 positive-forfeit points in the half-window), this is fine in practice — but an unconditional assert "forfeit_fit" in summary before the field check would make the test intention explicit. Not blocking.
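The suggested tightening amounts to asserting presence before reading the field. A sketch of the assertion shape (here `summary` stands in for the fixture-built return value, and the helper name is hypothetical):

```python
def check_current_forfeit(summary):
    # Fail loudly if the fit was skipped, instead of passing vacuously
    # when the "forfeit_fit" key is absent.
    assert "forfeit_fit" in summary
    assert summary["forfeit_fit"]["current_forfeit"] == 0.0
```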


LGTM. All blocking issues resolved, test coverage is solid, implementation is correct.

@thomas-schweich thomas-schweich merged commit 40b7195 into main Apr 11, 2026
1 check passed
@thomas-schweich thomas-schweich deleted the lab-pretrain-forfeit-fit branch April 11, 2026 01:41