Avoid deepcopy megatron model config #349
Conversation
Signed-off-by: Chenhan Yu <[email protected]>
Walkthrough
Removed the deepcopy of model[0].config.__dict__ in save_sharded_modelopt_state, using the config dict directly because some attributes are not deep-copyable. Added explanatory comments. No other logic changed; YAML writing and version tagging remain the same.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
Actionable comments posted: 0
🧹 Nitpick comments (3)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (3)
157-160: Deepcopy removal is correct and safer for Megatron configs.
Using the live __dict__ and parsing it into a fresh dict avoids deepcopy failures and does not mutate the source. Nice. Two small hardening nits below.
- Ensure all supported Megatron configs actually expose .config.__dict__ (or are dataclasses) in your CI matrix. If any don't, we should guard accordingly. Do you want me to add a defensive fallback?
161-162: Use safe_dump and stable key ordering; ensure the directory exists.
Safer YAML emission and deterministic file diffs; this also avoids surprises if the checkpoint dir wasn't pre-created.
+    # Ensure the parent directory exists before opening, then write stable, safe YAML
+    Path(run_config_name).parent.mkdir(parents=True, exist_ok=True)
     with open(run_config_name, "w") as f:
-        yaml.dump(config_dict, f, default_flow_style=False)
+        yaml.safe_dump(config_dict, f, default_flow_style=False, sort_keys=True)
171-173: Double-check deepcopy necessity/safety for modelopt_state.
We still deepcopy the modelopt_state. If that object can contain handles similar to Megatron configs, deepcopy may be fragile or slow. If the goal is only to avoid mutating the original, consider building a sanitized copy instead of a full deepcopy, or confirm deepcopy is safe here.
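A minimal sketch of the sanitized-copy idea mentioned above; the recursion and the fallback behavior are assumptions about what a safe copy could look like, not the library's actual handling of modelopt_state:

```python
import copy

def sanitized_copy(obj):
    """Copy plain containers recursively; keep references to objects that
    may not survive deepcopy (handles, locks, process groups)."""
    if isinstance(obj, dict):
        return {k: sanitized_copy(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(sanitized_copy(v) for v in obj)
    try:
        return copy.deepcopy(obj)
    except Exception:
        # Fall back to the original reference rather than failing the save.
        return obj
```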
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1 hunk)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, EAGLE1_DEFAULT_CFG in config.py already uses deepcopy(default_eagle_config) in its definition, making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, the EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG configurations in config.py are already deepcopied, so additional deepcopy calls are not needed when using ALGO_TO_CONFIG mapping in tests.
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, both EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG in config.py already use deepcopied configurations (either directly via deepcopy() or through variables that were already deepcopied), making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.
📚 Learning: 2025-09-15T20:46:29.252Z
Learnt from: realAsma
PR: NVIDIA/TensorRT-Model-Optimizer#318
File: modelopt/torch/quantization/plugins/transformers_trainer.py:170-189
Timestamp: 2025-09-15T20:46:29.252Z
Learning: In modelopt/torch/quantization/plugins/transformers_trainer.py, the restore_from_modelopt_state function can accept modelopt_state["modelopt_state_dict"] directly without needing to wrap it in a full dict structure or include modelopt_version.
Applied to files:
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:

@@           Coverage Diff           @@
##             main     #349   +/-   ##
=======================================
  Coverage   73.84%   73.84%
=======================================
  Files         172      172
  Lines       17453    17453
=======================================
  Hits        12888    12888
  Misses       4565     4565

☔ View full report in Codecov by Sentry.
-    config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
+    # We avoid deepcopy here since some attributes in the Megatron-Bridge config cannot be
+    # deepcopied.
+    config_dict = _parse_transformer_config(model[0].config.__dict__)
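For context, a standalone illustration of the failure mode this change sidesteps; the lock attribute below is a stand-in for the non-deep-copyable handles mentioned in the PR, not an actual Megatron-Bridge field:

```python
import copy
import threading

class FakeConfig:
    def __init__(self):
        self.hidden_size = 1024
        self._init_lock = threading.Lock()  # stand-in for a non-deep-copyable handle

cfg = FakeConfig()
try:
    copy.deepcopy(cfg.__dict__)
except TypeError as err:
    print(f"deepcopy failed: {err}")  # e.g. cannot pickle '_thread.lock' object

# Reading the live __dict__ directly, as the PR now does, avoids the copy entirely.
print({k: v for k, v in cfg.__dict__.items() if not k.startswith("_")})
```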
n00b question: what's the reason to save this at all? wouldn't it be redundant with the existing MLM args saved in the checkpoint? will there also be a read API that reloads from this?
I ask because _parse_transformer_config is very simple but doesn't handle enums/lambdas/functional inputs in the dataclass for restoring, as is done here: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/utils/yaml_utils.py
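A rough sketch of the kind of conversion the linked yaml_utils performs; the function name and the type handling below are illustrative, not the actual Megatron-Bridge implementation:

```python
import enum

def to_yaml_friendly(config_dict: dict) -> dict:
    """Map a config __dict__ onto values plain YAML can represent."""
    out = {}
    for key, value in config_dict.items():
        if isinstance(value, enum.Enum):
            out[key] = value.name  # store the enum member name
        elif callable(value):
            out[key] = getattr(value, "__name__", repr(value))  # record only the callable's name
        elif isinstance(value, (bool, int, float, str, type(None))):
            out[key] = value
        else:
            out[key] = str(value)  # last-resort string form
    return out
```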
What does this PR do?
Type of change: Bug fix
Overview:
As suggested by @ananthsub in #341 (comment), some attributes of the Megatron model config cannot be deep-copied, so the deepcopy of model[0].config.__dict__ is removed and the live config dict is used directly.
Usage
# Add a code snippet demonstrating how to use this
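A hypothetical usage sketch; the import path matches this module, but the exact signature, the `model` object, and the checkpoint path are assumptions:

```python
# Hypothetical sketch: save the sharded modelopt state alongside a Megatron checkpoint.
from modelopt.torch.opt.plugins import save_sharded_modelopt_state

# `model` is assumed to be the usual Megatron list of (virtual-pipeline) model chunks.
save_sharded_modelopt_state(model, "/path/to/checkpoint_dir")
```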
Testing
Before your PR is "Ready for review"
Additional Information