Resume recovery loss continuity#563

Open
Sualeh77 wants to merge 14 commits into staging from resume_recovery-loss_continuity

Conversation


@Sualeh77 Sualeh77 commented Mar 8, 2026

Description

Implements LossContinuityGuard — a lightweight guard that detects loss discontinuities after checkpoint resume. Loss jumps after resume indicate that some training state (optimizer moments, LR scheduler, model weights) was not correctly restored. This guard records a rolling window of optimizer-step losses before each checkpoint, saves the window statistics into the checkpoint metadata (client_state), and automatically verifies continuity for the first window_size steps after resume.

Changes:

  • src/llm/loss_continuity_guard.py (new): LossContinuityGuard class with observe(), state_dict(), restore(), and verify(). Uses a dual σ-based + relative-difference (20%) threshold to tolerate natural loss noise while catching real jumps. Handles distributed training via dist.all_reduce (guarded by dist.is_initialized()).
  • src/llm/pretrainer.py (modified): Integrates the guard with minimal surface area — instantiated in init, restored in _resume(), observed after _optimizer_step(), and persisted in _save_checkpoint() via client_state["loss_guard"].
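A minimal sketch of what such a guard could look like. The method names (observe(), state_dict(), restore(), verify()), the rolling window, and the dual σ-based plus 20%-relative threshold come from the description above; the default window_size, the sigma factor, the exact way the two criteria combine, and all method bodies are assumptions for illustration, not the PR's actual code (the real version also all-reduces across ranks, omitted here):

```python
# Illustrative sketch only; defaults and threshold combination are assumptions.
from collections import deque
import statistics


class LossContinuityGuard:
    def __init__(self, window_size=50, sigma_factor=3.0, rel_threshold=0.20):
        self.window_size = window_size
        self.sigma_factor = sigma_factor
        self.rel_threshold = rel_threshold
        self.window = deque(maxlen=window_size)  # rolling optimizer-step losses
        self.saved_mean = None                   # pre-resume statistics
        self.saved_std = None
        self.steps_since_resume = None           # None => not in verification mode

    def observe(self, loss: float):
        """Record one loss; auto-verify after window_size post-resume steps."""
        self.window.append(float(loss))
        if self.steps_since_resume is not None:
            self.steps_since_resume += 1
            if self.steps_since_resume >= self.window_size:
                self.steps_since_resume = None
                return self.verify()  # True = continuous, False = jump detected
        return None

    def state_dict(self):
        """Window statistics to persist into checkpoint client_state."""
        if len(self.window) >= 2:
            return {"mean": statistics.mean(self.window),
                    "std": statistics.stdev(self.window)}
        return {}

    def restore(self, state):
        """Load pre-resume statistics; tolerates an empty dict (fresh run)."""
        if state:
            self.saved_mean = state["mean"]
            self.saved_std = state["std"]
            self.window.clear()
            self.steps_since_resume = 0

    def verify(self):
        """Compare post-resume mean against the saved pre-resume statistics."""
        if self.saved_mean is None or len(self.window) < 2:
            return True
        jump = statistics.mean(self.window) - self.saved_mean
        # Flag only if the jump exceeds BOTH the sigma-based bound AND the
        # 20% relative bound, so natural loss noise is tolerated.
        sigma_ok = jump <= self.sigma_factor * max(self.saved_std, 1e-8)
        rel_ok = jump <= self.rel_threshold * abs(self.saved_mean)
        return sigma_ok or rel_ok
```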

How it works at resume:

1. guard.restore(client_state["loss_guard"])   ← loads pre-resume mean/std
2. training loop: guard.observe(loss) × N      ← collects post-resume losses
3. after window_size steps: guard.verify()     ← auto-triggers, logs WARNING if jump detected

Negligible latency impact: observe() is a single Python list.append(float), which is invisible next to GPU-bound fused Triton kernel time.
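As a rough, machine-dependent sanity check of that overhead claim (not part of the PR), a single float append costs on the order of tens of nanoseconds:

```python
# Back-of-envelope timing of an observe()-style list append.
import timeit

# One million appends to a plain Python list.
cost = timeit.timeit("w.append(1.234)", setup="w = []", number=1_000_000)
print(f"~{cost / 1_000_000 * 1e9:.0f} ns per append")
```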


Checklist

  • I have added tests that prove my fix is effective or that my feature works.
    • tests/test_loss_continuity_guard.py — 11 unit tests covering normal resume, optimizer reset detection, LR reset detection, RNG drift tolerance, edge cases (empty restore, window rollover, relative-check fallback)
    • tests/test_loss_continuity_guard_integration.py — 2 end-to-end integration tests using gpt2 + wikitext-2-raw-v1 via HuggingFace, verifying a clean resume passes and a corrupted-weights resume is detected
  • I have added necessary documentation (if applicable).
  • LossContinuityGuard is fully documented with docstrings, including a usage example, argument descriptions, and inline comments explaining the threshold choices
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project's Contribution Guidelines
    • Branch: resume_recovery-loss_continuity
    • New module placed in src/llm/ alongside existing modules (loss_spike_recovery.py, etc.)
    • Integration follows the client_state dict pattern already established in checkpoint.py and pretrainer.py

