feat: Added save_optimizer flag to control if saving optimizer in checkpointing #1843
odedovadia wants to merge 5 commits into NVIDIA-NeMo:main
Conversation
📝 Walkthrough
Introduces a new `save_optimizer` flag in the checkpointing configuration to control whether optimizer state is saved alongside model checkpoints.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks
❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/utils/checkpoint.py (1)
52-63: Add `save_optimizer` to exemplar YAML files and move the default from code to YAML.
The `save_optimizer` key is not reflected in any exemplar YAML files under `examples/configs/`, violating the guideline that config defaults live in YAML. Remove the inline default comment `# Default: True` from line 62 and add `save_optimizer: true` to the checkpointing sections in all exemplar YAMLs (e.g., `examples/configs/dpo.yaml`, `examples/configs/grpo_math_1B.yaml`). Additionally, update the docstring to explicitly document the recommended default value and that the field accepts boolean values.
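For concreteness, a sketch of the suggested YAML addition (the file choice and surrounding layout are illustrative; only the `save_optimizer` key comes from this review):

```yaml
# examples/configs/dpo.yaml, checkpointing section (other keys omitted)
checkpointing:
  save_optimizer: true  # matches current behavior; set to false to shrink checkpoints
```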
🤖 Fix all issues with AI agents
In `@nemo_rl/utils/checkpoint.py`:
- Line 110: Replace the in-code default for `save_optimizer` with a strict read from the config so the YAML remains the source of truth: read the key without a non-None default (e.g., use `config["save_optimizer"]`, or `config.get("save_optimizer")` plus a presence check) in the constructor where `self.save_optimizer` is set (currently `self.save_optimizer = config.get("save_optimizer", True)`), and raise a clear error or validation message if the key is missing so callers know to set it in YAML.
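A minimal sketch of the strict read this prompt describes (the class skeleton and error wording are assumptions, not the PR's actual code):

```python
class CheckpointManager:
    def __init__(self, config: dict) -> None:
        # Strict read: no in-code default, so the YAML stays the source of truth.
        if "save_optimizer" not in config:
            raise ValueError(
                "CheckpointingConfig is missing 'save_optimizer'; "
                "set it explicitly in your YAML (e.g. save_optimizer: true)."
            )
        self.save_optimizer: bool = config["save_optimizer"]
```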
🧹 Nitpick comments (1)
nemo_rl/utils/checkpoint.py (1)
125-132: Add `stacklevel` to the warning for better caller attribution.
This warning points at the helper instead of the caller.
🔧 Suggested tweak
```diff
 warnings.warn(
     f"Optimizer state not found at {optimizer_path}. "
-    "Optimizer will be freshly initialized."
-)
+    "Optimizer will be freshly initialized.",
+    stacklevel=2,
+)
```
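For context, a standalone sketch of what `stacklevel=2` buys (the helper name here is hypothetical, not the PR's actual function):

```python
import warnings

def load_optimizer_state(optimizer_path: str) -> None:
    # Hypothetical helper. With stacklevel=2, the warning is attributed to the
    # line that called this function rather than to the warn() call itself.
    warnings.warn(
        f"Optimizer state not found at {optimizer_path}. "
        "Optimizer will be freshly initialized.",
        stacklevel=2,
    )

load_optimizer_state("results/ckpts/step_100/optimizer")  # warning cites this line
```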
Signed-off-by: Oded Ovadia <odedov@dreamgroup.com>
Force-pushed from b44f4c8 to 140822d
Signed-off-by: Oded Ovadia <odedov@dreamgroup.com>
@odedov-dream is there any unit or functional test we can add to guard the functionality of the `save_optimizer` flag?
Signed-off-by: Oded Ovadia <odedov@dreamgroup.com>
Sure, added 2 unit tests.
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: Oded Ovadia <oded7272@gmail.com>
What does this PR do ?
Adds a `save_optimizer` configuration flag to control whether optimizer state is saved with checkpoints, greatly reducing checkpoint size when optimizer state is not needed for resumption.
- Adds a `save_optimizer` field to `CheckpointingConfig` (default: `True` for backward compatibility).
- Adds a `CheckpointManager.get_resume_paths()` helper to centralize checkpoint path resolution logic.
Issues
Closes #1828
Usage
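The PR description leaves this section empty; a hedged sketch of what disabling the flag might look like (only `save_optimizer` is confirmed by this PR; the surrounding keys are illustrative):

```yaml
checkpointing:
  enabled: true                  # illustrative surrounding key
  checkpoint_dir: results/ckpts  # illustrative surrounding key
  save_optimizer: false          # new in this PR: skip optimizer state to reduce checkpoint size
```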
Before your PR is "Ready for review"
Pre checks:
Summary by CodeRabbit
Release Notes
New Features
- Added a `save_optimizer` checkpointing option to control whether optimizer state is saved with checkpoints.
Refactor
- Centralized checkpoint resume-path resolution in `CheckpointManager.get_resume_paths()`.