
Resume recovery optimizor scheduler #559

Open
Sualeh77 wants to merge 2 commits into refactor/consolidation from
resume_recovery-optimizor_scheduler

Conversation


@Sualeh77 Sualeh77 commented Mar 7, 2026

Pull Request Template

Description

Task: Optimizer state restored correctly (momentum, variance).

When training is paused and resumed from a checkpoint, DeepSpeed's load_checkpoint restores optimizer and scheduler states automatically. However, there was no verification that this restore actually succeeded — a silent failure (empty buffers, reset LR) would cause training to diverge without any warning.

This PR adds:

verify_optimizer_scheduler_restored() in llm/src/llm/utils.py — a standalone, testable function that checks:

  • Optimizer state is not empty after restore
  • Adam exp_avg (momentum) and exp_avg_sq (variance) buffers are non-zero (with a guard for early-step checkpoints at step <= 2)
  • Scheduler last_epoch and current LR are consistent with the restored global step
PreTrainer integration in llm/src/llm/pretrainer.py — calls the verification function in _resume() after checkpoint load, logs a summary on rank 0, and prints a warning if scheduler state appears unrestored.
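The checks above could look roughly like the following sketch. The function name matches the PR, but the signature, field names, and the use of plain Python lists in place of torch tensors are illustrative assumptions, not the actual llm/src/llm/utils.py implementation:

```python
# Hypothetical sketch of the verification logic; buffers are plain lists of
# floats here, whereas the real code operates on torch tensors.

EARLY_STEP_GUARD = 2  # checkpoints at step <= 2 may legitimately hold zero buffers

def verify_optimizer_scheduler_restored(opt_state, sched_state, global_step):
    """Raise RuntimeError if the restored state looks like a silent failure."""
    # Check 1: optimizer state must not be empty after restore.
    if not opt_state:
        raise RuntimeError("optimizer state is empty after restore")

    # Check 2: Adam exp_avg / exp_avg_sq buffers must be non-zero, except at
    # very early steps where all-zero buffers can be legitimate.
    if global_step > EARLY_STEP_GUARD:
        for name in ("exp_avg", "exp_avg_sq"):
            total = sum(abs(v) for param in opt_state.values()
                        for v in param.get(name, []))
            if total == 0.0:
                raise RuntimeError(f"all-zero {name} buffers at step {global_step}")

    # Check 3: scheduler last_epoch must match the restored global step
    # (a None scheduler is accepted).
    if sched_state is not None and sched_state.get("last_epoch") != global_step:
        raise RuntimeError("scheduler last_epoch inconsistent with global step")
    return True
```

Keeping this standalone (state in, exception out) rather than buried in the trainer is what makes the unit tests below possible without spinning up DeepSpeed.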

Unit tests in llm/tests/test_checkpoint_restore.py — 6 tests exercising the actual production function:

  • Round-trip save → load passes verification
  • Raises on empty optimizer state
  • Raises on all-zero momentum/variance buffers (step > 2)
  • Tolerates zero buffers at early steps (step <= 2)
  • LR preserved after restore
  • Accepts None scheduler
Integration test in llm/tests/local_checkpoint_restore_test.py — trains a ~28M-param SmallGPT on wikitext-2 (HuggingFace), saves a checkpoint, loads it into a fresh model, calls the production verification function, then continues training. Runs on CPU/MPS/CUDA without DeepSpeed.
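The round-trip test could take roughly this shape. The real tests exercise the production verify_optimizer_scheduler_restored() on actual torch optimizers; here a pickled dict stands in for the checkpoint so the save → load → verify flow stays self-contained, and all names are assumptions:

```python
# Hypothetical round-trip test sketch: a pickled dict plays the role of the
# optimizer/scheduler checkpoint.
import pickle

def save_checkpoint(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def test_round_trip_preserves_optimizer_state(tmp_path):
    state = {"exp_avg": [0.1, -0.2], "exp_avg_sq": [0.01, 0.04], "lr": 3e-4}
    ckpt = tmp_path / "ckpt.pt"
    save_checkpoint(state, ckpt)
    restored = load_checkpoint(ckpt)
    assert restored == state          # momentum/variance buffers survive
    assert restored["lr"] == 3e-4     # LR preserved after restore
```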

Key files changed

  • llm/src/llm/utils.py — new verify_optimizer_scheduler_restored() function
  • llm/src/llm/pretrainer.py — _resume() now calls verification, new _verify_optimizer_scheduler_restored() method delegates to utils
  • llm/tests/test_checkpoint_restore.py — unit tests (new file)
  • llm/tests/local_checkpoint_restore_test.py — integration test (new file)
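The _resume() wiring described above could look roughly like this. The engine object, method names, and attributes below are stand-ins for the DeepSpeed engine used in llm/src/llm/pretrainer.py, kept abstract so the sketch runs without DeepSpeed installed:

```python
# Hypothetical sketch of the resume-time verification hook; not the actual
# PreTrainer code.
import logging

logger = logging.getLogger(__name__)

class PreTrainer:
    def __init__(self, engine, rank=0):
        self.engine = engine  # DeepSpeed-like engine exposing load_checkpoint()
        self.rank = rank

    def _resume(self, ckpt_dir):
        # DeepSpeed's load_checkpoint restores optimizer/scheduler state and
        # returns (load_path, client_state).
        load_path, client_state = self.engine.load_checkpoint(ckpt_dir)
        step = client_state["global_step"]
        # Verify the restore actually took effect instead of failing silently.
        self._verify_optimizer_scheduler_restored(step)
        if self.rank == 0:
            logger.info("resumed from %s at step %d", load_path, step)
        return step

    def _verify_optimizer_scheduler_restored(self, step):
        # Delegates to the standalone function in llm/src/llm/utils.py
        # (stubbed to the empty-state check here so the sketch is runnable).
        if not getattr(self.engine, "optimizer_state", None):
            raise RuntimeError("optimizer state empty after restore")
```

Raising (rather than only logging) on a failed restore means a diverging run is caught at resume time instead of hours later in the loss curve.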

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if applicable).
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project ([Contribution Guidelines](https://github.com/The-School-of-AI/LLM/tree/main/experiments/)).

