Conversation

@DNXie (Member) commented Oct 16, 2025

Made it clear in the config where to specify the checkpoint-saving folder, and added comments to clarify the behavior.

The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 16, 2025.
@codecov-commenter commented:
Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.63%. Comparing base (633b219) to head (8b9d30c).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #444      +/-   ##
==========================================
- Coverage   64.69%   64.63%   -0.06%     
==========================================
  Files          79       79              
  Lines        7775     7788      +13     
==========================================
+ Hits         5030     5034       +4     
- Misses       2745     2754       +9     


The review thread below is attached to this checkpoint config snippet:

```yaml
enable: true
initial_load_path: hf://${model}
initial_load_in_hf: true
folder: ./checkpoint  # The folder to save checkpoints to.
```
@JenniferWang (Contributor) commented Oct 17, 2025:
I think we control these config fields, so we should be opinionated about exposing RL-friendly config field names and re-mapping them to TorchTitan fields internally.

Right now, some TorchTitan config names are really confusing: e.g., the ref model logically should not need checkpointing, yet it still requires checkpoint.enable = true.

A contributor replied:
Yeah, I was discussing this with @joecummings too. Ultimately we probably want to provide some kind of internal config mapping that we execute as a training-script post-init step, before we do any of the actual setup of actors (e.g., we could even bake it into our config.parse decorator).
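A minimal sketch of what such an internal mapping layer could look like. All field names, the `FIELD_REMAP` table, and the `remap_config` helper are hypothetical illustrations of the idea, not actual TorchTitan or forge APIs; the only behavior taken from the thread is that user-facing names are translated to TorchTitan-style nested keys after parsing, and that opinionated defaults (like implying `checkpoint.enable`) are applied so users never touch confusing internal fields directly.

```python
# Hypothetical remapping pass, run right after config parsing and before
# any actor setup. RL-friendly flat keys -> TorchTitan-style dotted keys.
FIELD_REMAP = {
    "checkpoint_dir": "checkpoint.folder",            # illustrative names
    "resume_from": "checkpoint.initial_load_path",
}


def _set_nested(cfg: dict, dotted_key: str, value) -> None:
    """Set cfg['a']['b'] = value for dotted_key 'a.b', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        cfg = cfg.setdefault(part, {})
    cfg[leaf] = value


def remap_config(user_cfg: dict) -> dict:
    """Translate user-facing fields into a TorchTitan-style nested config."""
    titan_cfg: dict = {}
    for key, value in user_cfg.items():
        _set_nested(titan_cfg, FIELD_REMAP.get(key, key), value)
    # Opinionated default: configuring a checkpoint section implies
    # checkpointing is enabled, so users never set `enable` by hand.
    if "checkpoint" in titan_cfg:
        titan_cfg["checkpoint"].setdefault("enable", True)
    return titan_cfg


cfg = remap_config({"checkpoint_dir": "./checkpoint", "resume_from": "hf://my-model"})
print(cfg)
# -> {'checkpoint': {'folder': './checkpoint',
#                    'initial_load_path': 'hf://my-model', 'enable': True}}
```

This kind of pass could indeed live inside a `config.parse`-style decorator, since it only needs the parsed dict and runs once before anything else reads the config.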

@DNXie DNXie merged commit 25d6098 into meta-pytorch:main Oct 17, 2025
9 checks passed
