
Resume recovery optimizor scheduler #559

Open
Sualeh77 wants to merge 2 commits into refactor/consolidation from
resume_recovery-optimizor_scheduler

Conversation


@Sualeh77 Sualeh77 commented Mar 7, 2026

Pull Request Template

Description

Task: Optimizer state restored correctly (momentum, variance).

When training is paused and resumed from a checkpoint, DeepSpeed's load_checkpoint restores optimizer and scheduler states automatically. However, there was no verification that this restore actually succeeded — a silent failure (empty buffers, reset LR) would cause training to diverge without any warning.

This PR adds:

verify_optimizer_scheduler_restored() in llm/src/llm/utils.py — a standalone, testable function that checks:

  • Optimizer state is not empty after restore
  • Adam exp_avg (momentum) and exp_avg_sq (variance) buffers are non-zero (with a guard for early-step checkpoints at step <= 2)
  • Scheduler last_epoch and current LR are consistent with the restored global step
PreTrainer integration in llm/src/llm/pretrainer.py — calls the verification function in _resume() after checkpoint load, logs a summary on rank 0, and prints a warning if scheduler state appears unrestored.
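The checks above could look roughly like the following sketch. The function name matches the PR, but the signature, field names, and the use of plain Python lists in place of torch tensors are illustrative assumptions, not the actual llm/src/llm/utils.py implementation:

```python
# Hypothetical sketch of the verification logic; buffers are plain lists of
# floats here, whereas the real code operates on torch tensors.

EARLY_STEP_GUARD = 2  # checkpoints at step <= 2 may legitimately hold zero buffers

def verify_optimizer_scheduler_restored(opt_state, sched_state, global_step):
    """Raise RuntimeError if the restored state looks like a silent failure."""
    # Check 1: optimizer state must not be empty after restore.
    if not opt_state:
        raise RuntimeError("optimizer state is empty after restore")

    # Check 2: Adam exp_avg / exp_avg_sq buffers must be non-zero, except at
    # very early steps where all-zero buffers can be legitimate.
    if global_step > EARLY_STEP_GUARD:
        for name in ("exp_avg", "exp_avg_sq"):
            total = sum(abs(v) for param in opt_state.values()
                        for v in param.get(name, []))
            if total == 0.0:
                raise RuntimeError(f"all-zero {name} buffers at step {global_step}")

    # Check 3: scheduler last_epoch must match the restored global step
    # (a None scheduler is accepted).
    if sched_state is not None and sched_state.get("last_epoch") != global_step:
        raise RuntimeError("scheduler last_epoch inconsistent with global step")
    return True
```

Keeping this standalone (state in, exception out) rather than buried in the trainer is what makes the unit tests below possible without spinning up DeepSpeed.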

Unit tests in llm/tests/test_checkpoint_restore.py — 6 tests exercising the actual production function:

  • Round-trip save → load passes verification
  • Raises on empty optimizer state
  • Raises on all-zero momentum/variance buffers (step > 2)
  • Tolerates zero buffers at early steps (step <= 2)
  • LR preserved after restore
  • Accepts None scheduler
Integration test in llm/tests/local_checkpoint_restore_test.py — trains a ~28M-param SmallGPT on wikitext-2 (HuggingFace), saves a checkpoint, loads it into a fresh model, calls the production verification function, then continues training. Runs on CPU/MPS/CUDA without DeepSpeed.
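The round-trip test could take roughly this shape. The real tests exercise the production verify_optimizer_scheduler_restored() on actual torch optimizers; here a pickled dict stands in for the checkpoint so the save → load → verify flow stays self-contained, and all names are assumptions:

```python
# Hypothetical round-trip test sketch: a pickled dict plays the role of the
# optimizer/scheduler checkpoint.
import pickle

def save_checkpoint(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def test_round_trip_preserves_optimizer_state(tmp_path):
    state = {"exp_avg": [0.1, -0.2], "exp_avg_sq": [0.01, 0.04], "lr": 3e-4}
    ckpt = tmp_path / "ckpt.pt"
    save_checkpoint(state, ckpt)
    restored = load_checkpoint(ckpt)
    assert restored == state          # momentum/variance buffers survive
    assert restored["lr"] == 3e-4     # LR preserved after restore
```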

Key files changed

  • llm/src/llm/utils.py — new verify_optimizer_scheduler_restored() function
  • llm/src/llm/pretrainer.py — _resume() now calls verification, new _verify_optimizer_scheduler_restored() method delegates to utils
  • llm/tests/test_checkpoint_restore.py — unit tests (new file)
  • llm/tests/local_checkpoint_restore_test.py — integration test (new file)
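The _resume() wiring described above could look roughly like this. The engine object, method names, and attributes below are stand-ins for the DeepSpeed engine used in llm/src/llm/pretrainer.py, kept abstract so the sketch runs without DeepSpeed installed:

```python
# Hypothetical sketch of the resume-time verification hook; not the actual
# PreTrainer code.
import logging

logger = logging.getLogger(__name__)

class PreTrainer:
    def __init__(self, engine, rank=0):
        self.engine = engine  # DeepSpeed-like engine exposing load_checkpoint()
        self.rank = rank

    def _resume(self, ckpt_dir):
        # DeepSpeed's load_checkpoint restores optimizer/scheduler state and
        # returns (load_path, client_state).
        load_path, client_state = self.engine.load_checkpoint(ckpt_dir)
        step = client_state["global_step"]
        # Verify the restore actually took effect instead of failing silently.
        self._verify_optimizer_scheduler_restored(step)
        if self.rank == 0:
            logger.info("resumed from %s at step %d", load_path, step)
        return step

    def _verify_optimizer_scheduler_restored(self, step):
        # Delegates to the standalone function in llm/src/llm/utils.py
        # (stubbed to the empty-state check here so the sketch is runnable).
        if not getattr(self.engine, "optimizer_state", None):
            raise RuntimeError("optimizer state empty after restore")
```

Raising (rather than only logging) on a failed restore means a diverging run is caught at resume time instead of hours later in the loss curve.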

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if applicable).
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project ([Contribution Guidelines](https://github.com/The-School-of-AI/LLM/tree/main/experiments/)).

