Conversation

@Kuonirad

- Modified `save_train_state` to save a dictionary containing model state, optimizer states, and training step.
- Updated `load_checkpoint` to handle the new dict format while maintaining backward compatibility with old weight-only checkpoints.
- Updated `create_model` to broadcast loaded checkpoint metadata (optimizers/step) from rank 0 to all ranks and restore optimizer states.
- Updated `init_train_state` to resume training step from checkpoint.
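A minimal sketch of what the new checkpoint format could look like. The function names `save_train_state` and `load_checkpoint` come from the PR description; the dict keys, the list-of-optimizers convention, and the legacy-fallback return value are assumptions for illustration. `weights_only=False` is passed explicitly because the checkpoint now contains optimizer state objects, as the later update notes.

```python
import torch


def save_train_state(path, model, optimizers, step):
    """Save a dict checkpoint: model weights, optimizer states, and training step."""
    torch.save({
        "model": model.state_dict(),
        "optimizers": [opt.state_dict() for opt in optimizers],
        "step": step,
    }, path)


def load_checkpoint(path, model):
    """Load the new dict format, falling back to old weight-only checkpoints."""
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    if isinstance(ckpt, dict) and "model" in ckpt:
        # New format: caller restores optimizer states and step from the dict.
        model.load_state_dict(ckpt["model"])
        return ckpt
    # Legacy format: the file is a bare state_dict with no metadata.
    model.load_state_dict(ckpt)
    return {"model": ckpt, "optimizers": None, "step": 0}
```

A caller resuming training would then feed `ckpt["optimizers"]` into `optimizer.load_state_dict(...)` per optimizer and start the loop at `ckpt["step"]`.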
Kuonirad closed this Dec 17, 2025
Kuonirad reopened this Dec 17, 2025
- Modified `save_train_state` to save a dictionary containing model state, optimizer states, training step, and optional EMA state.
- Updated `load_checkpoint` to handle the new dict format while maintaining backward compatibility.
- Updated `create_model` to broadcast loaded checkpoint metadata (optimizers/step) from rank 0 to all ranks and restore optimizer states.
- Updated `init_train_state` to return loaded checkpoint data and resume training step.
- Updated `launch` to load EMA state if available and save the online state (plus EMA helper) instead of just the EMA weights, ensuring correct resumption.
- Implemented automatic checkpoint detection: scans `checkpoint_path` for the latest `step_X` file if no specific checkpoint is provided.
- Added full RNG state persistence (torch, cuda, numpy, random) to checkpoints to ensure deterministic resumption.
- Modified `save_train_state` and `load_checkpoint` to handle the augmented state dictionary.
- Updated `torch.load` usage to allow complex objects (`weights_only=False`) required for optimizer/RNG states.
- Cleaned up dataset configuration placeholder.
- Verified bitwise-identical resumption via script.
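The last two mechanisms above can be sketched as follows. The `step_X` filename pattern and `checkpoint_path` come from the description; the helper names `find_latest_checkpoint`, `get_rng_state`, and `set_rng_state` are hypothetical, and the exact set of RNG streams captured mirrors the list in the bullet (torch, cuda, numpy, random).

```python
import os
import random
import re

import numpy as np
import torch


def find_latest_checkpoint(checkpoint_path):
    """Scan checkpoint_path for step_X files and return the one with the largest X."""
    pattern = re.compile(r"^step_(\d+)$")
    candidates = []
    for name in os.listdir(checkpoint_path):
        m = pattern.match(name)
        if m:
            candidates.append((int(m.group(1)), name))
    if not candidates:
        return None
    _, latest = max(candidates)  # highest step number wins
    return os.path.join(checkpoint_path, latest)


def get_rng_state():
    """Capture every RNG stream so a resumed run is bitwise-identical."""
    state = {
        "torch": torch.get_rng_state(),
        "numpy": np.random.get_state(),
        "random": random.getstate(),
    }
    if torch.cuda.is_available():
        state["cuda"] = torch.cuda.get_rng_state_all()
    return state


def set_rng_state(state):
    """Restore the RNG streams captured by get_rng_state."""
    torch.set_rng_state(state["torch"])
    np.random.set_state(state["numpy"])
    random.setstate(state["random"])
    if "cuda" in state and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(state["cuda"])
```

At save time the `get_rng_state()` dict would be stored alongside the model/optimizer states in the checkpoint; at load time, restoring it before the first resumed batch is what makes the bitwise-identical verification possible.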
Kuonirad (author) left a comment:


Saved full training state (optimizers, step) in checkpoints.

