Training stability loss spike recovery#558

Open
Sualeh77 wants to merge 19 commits into staging from
training_stability-loss_spike_recovery
Conversation


@Sualeh77 Sualeh77 commented Mar 6, 2026

Pull Request Template

Description

Implements loss spike detection and automatic recovery for the pretraining pipeline, addressing training stability during LLM pretraining where sudden loss spikes can waste compute or cause divergence.

Key changes:

Two-signal detection:
1. Loss spikes: Sliding window z-score detection after the forward pass — flags when loss > mean + K*std or loss > ratio * mean, with a minimum absolute delta guard
2. Gradient norm explosion: L2 norm threshold check after the backward pass, using DeepSpeed's global grad norm (ZeRO-safe) with a local fallback
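For reviewers skimming without the diff open, here is a minimal sketch of what the sliding-window loss-spike check could look like. Class and parameter names (`SpikeDetector`, `k_std`, `ratio`, `min_delta`) are illustrative assumptions, not the PR's actual API:

```python
from collections import deque
import statistics

class SpikeDetector:
    """Sliding-window loss spike detector (sketch; names are assumptions)."""

    def __init__(self, window=100, k_std=4.0, ratio=1.5, min_delta=0.1):
        self.window = deque(maxlen=window)  # recent non-spike losses
        self.k_std = k_std                  # z-score threshold
        self.ratio = ratio                  # mean-ratio threshold
        self.min_delta = min_delta          # minimum absolute delta guard

    def is_spike(self, loss: float) -> bool:
        spike = False
        if len(self.window) >= 2:
            mean = statistics.mean(self.window)
            std = statistics.pstdev(self.window)
            # Guard: tiny fluctuations never count, whatever the z-score says
            if loss - mean >= self.min_delta:
                spike = loss > mean + self.k_std * std or loss > self.ratio * mean
        if not spike:
            # Spike values are excluded so they don't inflate the statistics
            self.window.append(loss)
        return spike
```

Note the design choice implied by the description: flagged losses should not enter the window, otherwise one spike raises the mean/std and masks the next one.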

Automatic escalating recovery (production-default):

  • spike_count <= 3 → skip batch
  • spike_count <= 10 → reduce LR + skip batch
  • spike_count > 10 → rollback to last checkpoint + skip 200 batches (PaLM-style)
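The escalation policy above is simple enough to state as a pure function; this sketch mirrors the three tiers from the description (function and action names are assumptions, not the PR's identifiers):

```python
def recovery_action(spike_count: int) -> str:
    """Escalating recovery policy (sketch of the PR's three tiers)."""
    if spike_count <= 3:
        return "skip_batch"
    if spike_count <= 10:
        return "reduce_lr_and_skip_batch"
    # PaLM-style: restart from the last checkpoint and skip past the bad data
    return "rollback_checkpoint_and_skip_200_batches"
```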

Supporting mechanisms:

  • Cooldown (50 steps) prevents cascading alerts after any spike action
  • Spike detector window reset on checkpoint rollback (stale statistics invalidated)
  • Embedding weight norm tracking (token_embed, lm_head, Kronecker projection) throttled to every 50 steps
  • Loss all-reduced across ranks before detection to prevent collective deadlocks
  • Interactive stdin mode as opt-in fallback (auto_recover=False), guarded against multi-GPU
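The cooldown mechanism is the piece most likely to hide subtle off-by-one behavior, so a sketch may help frame the review. This is an assumed shape, not the PR's actual class:

```python
class Cooldown:
    """Suppress further spike actions for `steps` training steps
    after any recovery action fires (sketch; names are assumptions)."""

    def __init__(self, steps=50):
        self.steps = steps
        self.last_action_step = None

    def active(self, step: int) -> bool:
        # True while we are within `steps` steps of the last action
        return (self.last_action_step is not None
                and step - self.last_action_step < self.steps)

    def trigger(self, step: int) -> None:
        self.last_action_step = step
```

A reviewer question worth settling: does the cooldown suppress detection entirely, or only the recovery action (so spikes are still logged during cooldown)?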

Reviewers should focus on:

  • Integration points in pretrainer.py — two-stage detection (lines 118-168), while-loop conversion for rollback support
  • Escalation policy and cooldown logic in loss_spike_recovery.py
  • LossSpikeConfig defaults in config.py — are thresholds reasonable for our training regime?

References

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
    • 40 unit tests in llm/tests/test_loss_spike_recovery.py covering detector, cooldown, escalation policy, config defaults, factory, grad norm, and embedding norms
    • Local integration test in llm/tests/local_spike_recovery_test.py — trains a small GPT on wikitext-2 with injected spikes, verified on M1 Mac (MPS)
  • I have added necessary documentation (if applicable)
    • llm/src/llm/LOSS_SPIKE_RECOVERY.md — conceptual design, configuration reference, and code change details

Reviewers

  • Reviewer 1: Rahul Uniyal
  • Reviewer 2: Shyamant Achar

Note: Every pull request requires at least 2 reviewers/approvers before it can be merged.

firekind and others added 19 commits February 27, 2026 09:59
…557)

- Add support for lower sequence length

Co-authored-by: Hemanth Reddy K <h.kamireddy@yuvohealth.com>
…ding norm tracking for the task : Embedding norms tracked separately (embeddings can diverge)
…ining made automatic action-taking the default choice over user intervention
…and action triggering mechanism with wikitext data and small gpt model