@shunjiad (Contributor) commented Aug 8, 2025

Megatron-LM supports saving checkpoints to, and loading them from, object storage via MSC (the Multi-Storage Client). Megatron-Bridge adds three new files to the checkpoint directory, so these files must also be handled through MSC:

  • latest_train_state.pt
  • iter_{iteration}/run_config.yaml
  • iter_{iteration}/train_state.pt
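
As a rough illustration of where these three files sit relative to a checkpoint root, here is a minimal sketch. The helper name `bridge_checkpoint_files` is hypothetical and not part of Megatron-Bridge. The `iter_{iteration}` directory name is taken literally from the list above, so the real naming (for example, zero-padding of the iteration number) may differ.

```python
from typing import List


def bridge_checkpoint_files(checkpoint_dir: str, iteration: int) -> List[str]:
    """Build the paths of the three extra files for one iteration.

    Hypothetical helper for illustration only. It does plain string
    joining, so it works the same for a local directory and for an
    msc:// URL.
    """
    base = checkpoint_dir.rstrip("/")
    # Assumption: directory name follows the literal template iter_{iteration}.
    iter_dir = f"{base}/iter_{iteration}"
    return [
        f"{base}/latest_train_state.pt",
        f"{iter_dir}/run_config.yaml",
        f"{iter_dir}/train_state.pt",
    ]


if __name__ == "__main__":
    for path in bridge_checkpoint_files("msc://s3-profile/checkpoints", 100):
        print(path)
```

One file lives at the checkpoint root and two live inside each iteration directory, which is why all three need to go through the same storage layer as the rest of the checkpoint.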

To save checkpoints to object storage, specify an MSC URL in the checkpoint section of the configuration.

checkpoint:
  # Save checkpoints to S3 via MSC
  save: msc://s3-profile/checkpoints
  load: msc://s3-profile/checkpoints
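
To make the shape of these URLs concrete, the sketch below splits a value such as the `save` setting into a profile name and a path, and tells it apart from a plain local directory. It uses only the Python standard library. `MscLocation` and `parse_msc_url` are hypothetical names, and reading the URL as `msc://<profile>/<path>` is an assumption based on the example above, not the actual code in this PR.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class MscLocation(NamedTuple):
    """Profile and path taken from an msc:// URL (hypothetical type)."""

    profile: str
    path: str


def parse_msc_url(value: str) -> Optional[MscLocation]:
    """Return the profile and path of an msc:// URL, or None otherwise.

    Assumption: the URL has the form msc://<profile>/<path>, as in the
    config example above.
    """
    parts = urlsplit(value)
    if parts.scheme != "msc":
        # Not an MSC URL, e.g. a local filesystem path.
        return None
    return MscLocation(profile=parts.netloc, path=parts.path.lstrip("/"))


if __name__ == "__main__":
    print(parse_msc_url("msc://s3-profile/checkpoints"))
    print(parse_msc_url("/tmp/checkpoints"))
```

A check like this is one way a caller could decide whether a configured path should be routed through MSC or through the regular filesystem.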

Closes #229

copy-pr-bot bot commented Aug 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub (Contributor)

/ok to test 9bcf8f4

Signed-off-by: Shunjia Ding <shunjiad@nvidia.com>
@ananthsub ananthsub merged commit 661dbf2 into NVIDIA-NeMo:main Aug 12, 2025
23 of 24 checks passed
yfw pushed a commit that referenced this pull request Aug 19, 2025
Signed-off-by: Shunjia Ding <shunjiad@nvidia.com>
Development

Successfully merging this pull request may close these issues.

Multistorage client support

3 participants