Conversation
@DNXie DNXie commented Oct 15, 2025

Enable resuming GRPO runs from a specific step.

Our current implementation technically already supports resuming from a checkpoint, but it is not obvious to users how to use it.

There are two possible designs we can use.


Design 1 (minimal changes)

We can force users to always use initial_load_path:

trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint_v2
    initial_load_path: ./checkpoint/step-200
    initial_load_in_hf: false    
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

But the tricky part is this line:

folder: ./checkpoint_v2

folder must not already exist, otherwise Titan ignores initial_load_path. So if users want to resume from a saved checkpoint, they have to start from step 0 and use a new folder to save the checkpoints.
We probably want to add a comment in the config to make this clear.

Risk:

If we later save the replay_buffer and dataloader checkpoints, this would cause a version misalignment problem, because training would always start from step 1.


Design 2 (this PR)

With the current design, to “resume,” users had to point initial_load_path at weights and also create a new folder because the Titan checkpointer ignores initial_load_path if the checkpoint folder already exists. This forced runs to restart at step 0, breaking version alignment. This PR introduces load_step to resume from an exact step without folder shenanigans or step resets.

With this PR, when load_step > 0, we:

  • Materialize the trainer’s weights at load_step in TorchStore.
  • Update the Generator (and optionally ReferenceModel) to that same version.
  • Start rollouts/training so new episodes are tagged with policy_version == load_step, unblocking ReplayBuffer.sample(curr_policy_version=...).

Key changes (see the sketch after this list)

  • New config knob: trainer.checkpoint.load_step (int).
  • On startup (after ts.initialize(...)):
    • trainer.push_weights(load_step) → ensure weights exist at that version in TorchStore.
    • policy.update_weights(load_step) (optional: ref_model.update_weights(load_step)).
  • training_step now starts at max(load_step, 0).
  • Added inline comments in the config with necessary explanation.
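
Below is a toy, self-contained Python sketch of this flow. ToyTrainer, ToyGenerator, and the WEIGHT_STORE dict are stand-ins for the real trainer/policy actors and TorchStore; only the ordering of the calls and the version bookkeeping mirror the actual change.

import asyncio

# Toy stand-ins for the real actors and TorchStore; they only model the
# version bookkeeping, not the actual weight transfer.
WEIGHT_STORE: dict[int, dict] = {}  # version -> weights


class ToyTrainer:
    def __init__(self) -> None:
        # Pretend these weights were just restored from ./checkpoint/step-200.
        self.weights = {"w": 1.0}

    async def push_weights(self, version: int) -> None:
        # Materialize the trainer's weights at this version in the store.
        WEIGHT_STORE[version] = dict(self.weights)


class ToyGenerator:
    version: int = 0

    async def update_weights(self, version: int) -> None:
        # Fast-forward the generator to the same version, so new episodes
        # are tagged with policy_version == version.
        self.weights = WEIGHT_STORE[version]
        self.version = version


async def main(load_step: int = 200) -> None:
    trainer, generator = ToyTrainer(), ToyGenerator()

    if load_step > 0:
        await trainer.push_weights(load_step)
        await generator.update_weights(load_step)

    # The step counter resumes from the loaded step instead of resetting to 0.
    training_step = max(load_step, 0)
    assert generator.version == training_step == 200


asyncio.run(main())

In the actual PR, the push/update pair runs right after ts.initialize(...), as listed above.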

Example:

Resume from ./checkpoint/step-200

trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint  # Directory to save or resume checkpoints (default: ./checkpoints)
    load_step: 200         # Step to load from; cannot be hf ckpt; -1 means load from initial_load_path. (default: -1)
    initial_load_path: hf://${model} # Optional: path or HF identifier to load model weights initially, will be ignored if `folder` exists
    initial_load_in_hf: true      # If true, interpret initial_load_path as a HuggingFace model repo
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

Start from scratch and load from initial_load_path

trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint  # Directory to save or resume checkpoints (default: ./checkpoints)
    load_step: -1         # Step to load from; cannot be hf ckpt; -1 means load from initial_load_path. (default: -1)
    initial_load_path: hf://${model} # Optional: path or HF identifier to load model weights initially, will be ignored if `folder` exists
    initial_load_in_hf: true      # If true, interpret initial_load_path as a HuggingFace model repo
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

Tests

  • Resume at load_step=200; verify first rollouts carry generator_version == 200.
  • Run with checkpoint.interval=10; confirm a new checkpoint is saved at ./checkpoint/step-210 (see the check sketched below).
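
A minimal, hypothetical check for the second item (assumes the resumed run with checkpoint.interval=10 has already completed and saved into ./checkpoint; the test name is illustrative, not part of this PR):

import os


def test_resumed_checkpoint_saved_at_step_210() -> None:
    # After resuming at load_step=200 with interval=10, the next save should
    # land at step-210 rather than step-10.
    assert os.path.isdir(os.path.join("./checkpoint", "step-210"))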

TODO:

  • wandb log still starts from step 0
  • Update other yaml/py files

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 15, 2025
@DNXie DNXie requested a review from ebsmothers October 15, 2025 21:34
@DNXie DNXie commented Oct 15, 2025

Will update all the other configs once I get preliminary approval on this PR.

@ebsmothers ebsmothers left a comment

At a high level this makes sense to me. Two main comments:

(1) Is there some way that we can test this in CI? It would be very helpful given that checkpoint resume bugs are a huge pain to catch and reproduce
(2) Imo our checkpoint config is starting to get a bit unintuitive. We should think about ways we can consolidate/simplify some of the fields

@DNXie DNXie commented Oct 16, 2025

@ebsmothers

(2) Imo our checkpoint config is starting to get a bit unintuitive. We should think about ways we can consolidate/simplify some of the fields

Yes, I agree. But I am designing this based on Titan's implementation, and I agree it can be a bit confusing. An alternative design is to remove load_step and force users to always use initial_load_path:

trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint_v2
    initial_load_path: ./checkpoint/step-200
    initial_load_in_hf: false    
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

But the caveat is that folder (./checkpoint_v2 here) must not already exist, otherwise Titan would ignore initial_load_path. So if users want to resume from a saved checkpoint, they have to start from step 0 and use a new folder to save the checkpoints.

But the risk is: if we later save the replay_buffer and dataloader checkpoints, there could be a version misalignment problem, because training would always start from step 1 while the version number starts from load_step.

cc @allenwang28 @joecummings

@allenwang28 allenwang28 commented

Suggestions:

  1. Start with a load_step: -1 default and inherit the same comments in our config in the same way that Titan does. That way we're not re-inventing the logic for how checkpointing is loaded - it will inherit the same behavior from Titan
  2. Instead of trainer.load_weights => trainer.push_weights() to initialize the generator, is it possible to leverage our existing use_dcp path in Generator to also load the DCP checkpoint? Then both trainer and generator can initialize simultaneously and we don't have to wait for the weight sync to happen (faster startup)

@casteryh casteryh commented

Suggestions:

  1. Start with a load_step: -1 default and inherit the same comments in our config in the same way that Titan does. That way we're not re-inventing the logic for how checkpointing is loaded - it will inherit the same behavior from Titan
  2. Instead of trainer.load_weights => trainer.push_weights() to initialize the generator, is it possible to leverage our existing use_dcp path in Generator to also load the DCP checkpoint? Then both trainer and generator can initialize simultaneously and we don't have to wait for the weight sync to happen (faster startup)

use_dcp path requires the checkpoint to be in hf format (is titan's checkpoint in hf format?)
also I thought we were deprecating the dcp path?

@DNXie DNXie commented Oct 16, 2025

@casteryh

use_dcp path requires the checkpoint to be in hf format (is titan's checkpoint in hf format?)

No. Titan's checkpoints are not in hf format.
