Resume recovery - RNG state manager#564

Open
Sualeh77 wants to merge 1 commit into refactor/consolidation from resume_recovery-rng_state

Conversation


@Sualeh77 Sualeh77 commented Mar 8, 2026

Description

RNG State Restoration for Reproducible Training Resume

When training resumes from a checkpoint, random number generator states were not being saved or restored. This caused data shuffling order, dropout masks, and other random operations to diverge from the original run, making training non-reproducible after resume.

This PR adds an RNGStateManager module that captures and restores RNG states across all libraries (Python random, NumPy, PyTorch CPU, PyTorch CUDA) and integrates it into the checkpoint save/load pipeline with minimal changes to existing code.

What changed

New files:

  • llm/src/llm/rng_state_manager.py — RNGStateManager class with capture() and restore() static methods. Handles all 4 RNG sources, guards against CUDA device count mismatch on world-size changes (warns instead of crashing), and gracefully skips CUDA RNG when no GPU is available.
  • llm/tests/test_rng_state_manager.py — 12 unit tests covering capture key structure, round-trip restore for all RNG sources, partial/empty state handling, and checkpoint dict integration.
  • llm/tests/test_rng_state_e2e.py — 2 end-to-end tests that train sshleifer/tiny-gpt2 on wikitext-2-raw-v1, checkpoint the RNG state mid-training, deliberately clobber the live RNG state, restore it from the checkpoint, and verify that losses match the uninterrupted baseline exactly. Includes a negative test proving losses diverge without the restore.
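Based on the description above, the capture()/restore() API might look roughly like the sketch below. The class and method names come from the PR; everything else (key names, warning text, internal structure) is an assumption:

```python
import random
import warnings

import numpy as np
import torch


class RNGStateManager:
    """Capture/restore RNG state for Python, NumPy, and PyTorch (CPU + CUDA)."""

    @staticmethod
    def capture() -> dict:
        state = {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch_cpu": torch.get_rng_state(),
        }
        if torch.cuda.is_available():
            # One state tensor per visible CUDA device.
            state["torch_cuda"] = torch.cuda.get_rng_state_all()
        return state

    @staticmethod
    def restore(state: dict) -> None:
        if not state:
            return  # nothing captured, e.g. a pre-PR checkpoint
        if "python" in state:
            random.setstate(state["python"])
        if "numpy" in state:
            np.random.set_state(state["numpy"])
        if "torch_cpu" in state:
            torch.set_rng_state(state["torch_cpu"])
        cuda_states = state.get("torch_cuda")
        if cuda_states is not None and torch.cuda.is_available():
            if len(cuda_states) != torch.cuda.device_count():
                # World-size change: warn and skip instead of crashing.
                warnings.warn(
                    "Saved CUDA RNG states do not match current device count; "
                    "skipping CUDA RNG restore."
                )
            else:
                torch.cuda.set_rng_state_all(cuda_states)
```

A round trip (capture, draw, restore, draw again) should yield identical values from all three CPU-side sources, which is what the unit tests exercise.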

Modified files:

  • llm/src/llm/pretrainer.py — 2 minimal changes:
      • _save_checkpoint(): adds "rng_state": RNGStateManager.capture() to client_state
      • _resume(): calls RNGStateManager.restore(rng_state) before training resumes
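The shape of that integration might look like the standalone sketch below. The real hooks live inside the DeepSpeed checkpoint pipeline, which is not shown in the PR; here a plain torch.save/torch.load round trip stands in for it, and the function names and fields are illustrative:

```python
import random

import torch


def save_checkpoint(path: str) -> None:
    client_state = {
        # ...existing client_state fields would go here unchanged...
        "rng_state": {  # new key inside the same dict: no checkpoint format change
            "python": random.getstate(),
            "torch_cpu": torch.get_rng_state(),
        },
    }
    torch.save(client_state, path)


def resume(path: str) -> None:
    client_state = torch.load(path, weights_only=False)
    rng = client_state.get("rng_state")
    if rng is not None:  # tolerate checkpoints written before this change
        random.setstate(rng["python"])
        torch.set_rng_state(rng["torch_cpu"])
```

Because the state rides inside client_state, old checkpoints without the key still load; the restore is simply skipped.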

Key design decisions

  • Static methods, no instance state — drop-in single-call API
  • RNG state stored inside existing client_state dict — no checkpoint format changes
  • Per-rank by default since DeepSpeed saves client_state per-rank in each shard
  • Restore happens in _resume() before the training loop iterates, which is correct because both DistributedSampler and DataLoader(shuffle=True) consume RNG at iterator creation time, not at construction
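The last point can be checked in isolation: a shuffle=True DataLoader draws its permutation seed from the global torch RNG when the iterator is created, so restoring the captured state before iterating reproduces the original order. A minimal demo, independent of the PR code:

```python
import torch
from torch.utils.data import DataLoader

data = list(range(8))
loader = DataLoader(data, batch_size=1, shuffle=True)

torch.manual_seed(0)
state = torch.get_rng_state()        # "checkpointed" CPU RNG state
order_a = [int(b) for b in loader]   # iter() draws the shuffle seed here

order_b = [int(b) for b in loader]   # fresh draw, so (almost surely) a new order

torch.set_rng_state(state)           # restore, as _resume() would
order_c = [int(b) for b in loader]   # same permutation as order_a
```

Had the restore happened after the first iterator was created, the shuffle seed would already have been consumed and the orders would diverge.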
