apps/grpo/qwen3_1_7b.yaml (5 changes: 3 additions & 2 deletions)

@@ -74,8 +74,9 @@ trainer:
   disable_loss_parallel: true
   checkpoint:
     enable: true
-    initial_load_path: hf://${model}
-    initial_load_in_hf: true
+    folder: ./checkpoint # The folder to save checkpoints to.
+    initial_load_path: hf://${model} # The path to load the initial checkpoint from. Ignored if `folder` exists.
+    initial_load_in_hf: true # If true, interpret initial_load_path as a HuggingFace model repo
     last_save_in_hf: true
     interval: 500
     async_mode: "disabled"
@JenniferWang (Contributor) commented on Oct 17, 2025:

I think we control these config fields, and we should be opinionated about exposing RL-friendly config field names and re-mapping them to TorchTitan fields internally.

Right now, some TorchTitan config names are really confusing: e.g. the ref model logically should not need checkpointing, yet it still requires `checkpoint.enable = true`.
A maintainer (Contributor) replied:

Yeah, I was discussing this with @joecummings too... Ultimately we probably want to provide some kind of internal config mapping that we execute as a training-script post-init step, before we do any of the actual setup of actors (e.g. we could even bake it into our `config.parse` decorator).
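As a rough sketch of what that remapping could look like (purely illustrative: the function name and the RL-friendly `load_from` field are assumptions, not the repo's actual API):

def remap_checkpoint_config(cfg: dict) -> dict:
    # Hypothetical post-init hook: translate an RL-friendly checkpoint
    # section into TorchTitan's field names before any actors are set up.
    ckpt = cfg.get("checkpoint", {})
    if "load_from" in ckpt:  # assumed RL-friendly spelling, e.g. "hf://Qwen/Qwen3-8B"
        path = ckpt.pop("load_from")
        ckpt["enable"] = True  # TorchTitan currently requires this even for a ref model
        ckpt["initial_load_path"] = path
        ckpt["initial_load_in_hf"] = path.startswith("hf://")
    cfg["checkpoint"] = ckpt
    return cfg

Executed as a post-init step (or baked into the `config.parse` decorator, as suggested above), users would only ever see the RL-friendly names while TorchTitan keeps its own schema.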

apps/grpo/qwen3_32b.yaml (5 changes: 3 additions & 2 deletions)

@@ -77,8 +77,9 @@ trainer:
   disable_loss_parallel: true
   checkpoint:
     enable: true
-    initial_load_path: hf://${model}
-    initial_load_in_hf: true
+    folder: ./checkpoint # The folder to save checkpoints to.
+    initial_load_path: hf://${model} # The path to load the initial checkpoint from. Ignored if `folder` exists.
+    initial_load_in_hf: true # If true, interpret initial_load_path as a HuggingFace model repo
     last_save_in_hf: true
     interval: 500
     async_mode: "disabled"
apps/grpo/qwen3_8b.yaml (5 changes: 3 additions & 2 deletions)

@@ -70,8 +70,9 @@ trainer:
   disable_loss_parallel: true
   checkpoint:
     enable: true
-    initial_load_path: hf://${model}
-    initial_load_in_hf: true
+    folder: ./checkpoint # The folder to save checkpoints to.
+    initial_load_path: hf://${model} # The path to load the initial checkpoint from. Ignored if `folder` exists.
+    initial_load_in_hf: true # If true, interpret initial_load_path as a HuggingFace model repo
     last_save_in_hf: true
     interval: 500
     async_mode: "disabled"
apps/sft/llama3_8b.yaml (5 changes: 3 additions & 2 deletions)

@@ -45,8 +45,9 @@ parallelism:

 checkpoint:
   enable: true
-  initial_load_path: hf://${model_name}
-  initial_load_in_hf: true
+  folder: ./checkpoint # The folder to save checkpoints to.
+  initial_load_path: hf://${model} # The path to load the initial checkpoint from. Ignored if `folder` exists.
+  initial_load_in_hf: true # If true, interpret initial_load_path as a HuggingFace model repo
   last_save_in_hf: true
   interval: 500
   async_mode: "disabled"
apps/sft/qwen3_8b.yaml (5 changes: 3 additions & 2 deletions)

@@ -44,8 +44,9 @@ parallelism:

 checkpoint:
   enable: true
-  initial_load_path: hf://${model_name}
-  initial_load_in_hf: true
+  folder: ./checkpoint # The folder to save checkpoints to.
+  initial_load_path: hf://${model} # The path to load the initial checkpoint from. Ignored if `folder` exists.
+  initial_load_in_hf: true # If true, interpret initial_load_path as a HuggingFace model repo
   last_save_in_hf: true
   interval: 500
   async_mode: "disabled"
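All five files gain the same three fields. As a minimal sketch of the precedence the new comments describe (assuming a hypothetical helper; TorchTitan's actual resolution logic may differ):

import os

def resolve_load_source(folder: str, initial_load_path: str) -> str:
    # Illustrates the documented behavior: a populated checkpoint `folder`
    # takes precedence, and `initial_load_path` is only used on a fresh run.
    if os.path.isdir(folder) and os.listdir(folder):
        return folder  # resume from local checkpoints
    return initial_load_path  # e.g. hf://Qwen/Qwen3-8B on the first run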