
Conversation

ChenhanYu
Collaborator

@ChenhanYu ChenhanYu commented Sep 19, 2025

What does this PR do?

Type of change: Bug fix

Overview:

As suggested by @ananthsub in #341 (comment), some attributes cannot be deepcopied.
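
For illustration only (none of this is code from the PR; FakeTransformerConfig is a made-up stand-in): a minimal sketch of the failure mode, assuming the config holds a handle such as a thread lock that deepcopy cannot pickle.

import copy
import threading
from dataclasses import dataclass, field

# Hypothetical stand-in for a config object whose attributes include a
# non-deep-copyable handle (here a thread lock).
@dataclass
class FakeTransformerConfig:
    hidden_size: int = 1024
    _lock: object = field(default_factory=threading.Lock)

cfg = FakeTransformerConfig()

try:
    copy.deepcopy(cfg.__dict__)
except TypeError as err:
    # e.g. "cannot pickle '_thread.lock' object"
    print(f"deepcopy failed: {err}")

# Passing cfg.__dict__ directly (as this PR does with model[0].config.__dict__)
# sidesteps the copy while leaving the original config untouched, as long as the
# downstream parser builds a fresh dict instead of mutating this one.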

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@ChenhanYu ChenhanYu requested a review from a team as a code owner September 19, 2025 21:30
@ChenhanYu ChenhanYu requested a review from realAsma September 19, 2025 21:30

coderabbitai bot commented Sep 19, 2025

Walkthrough

Removed the deepcopy of model[0].config.__dict__ in save_sharded_modelopt_state, using the config dict directly because it contains non-deep-copyable attributes. Added explanatory comments. No other logic changed; YAML writing and version tagging remain the same.

Changes

Cohort / File(s): Checkpoint Config Handling — modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
Summary of Changes: Replaced the deepcopy of the config with a direct dict reference when constructing the transformer run config; added comments explaining that the Megatron-Bridge config contains non-deep-copyable attributes. No API/signature changes; the rest of the save logic and YAML emission is unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

I nibbled at a copy, took a lighter leap,
No burrowing deep—some fields refuse to sleep.
A YAML moon, a versioned night,
I hop past mirrors, travel light.
With configs close and carrots near,
The checkpoint’s clear—no cloning fear.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check ✅ Passed — The title "Avoid deepcopy megatron model config" concisely and accurately summarizes the primary change in the changeset (removing a deepcopy of the Megatron model config in mcore_dist_checkpointing.py), is specific to the main change, and is short and clear for a teammate scanning history.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch chenhany/fix_avoid_deep_copy_config

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ChenhanYu ChenhanYu changed the title from "fix: avoid deepcopy megatron model config" to "Avoid deepcopy megatron model config" Sep 19, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (3)

157-160: Deepcopy removal is correct and safer for Megatron configs.

Using the live __dict__ and parsing into a fresh dict avoids deepcopy failures and does not mutate the source. Nice. Two small hardening nits below.

  • Ensure all supported Megatron configs actually expose .config.__dict__ (or are dataclasses) in your CI matrix. If any don't, we should guard accordingly; a sketch of one possible guard follows. Do you want me to add a defensive fallback?
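
A minimal sketch of such a guard, reusing the names from the file under review (model, _parse_transformer_config); the helper itself is hypothetical and not part of this PR:

import dataclasses

def _config_as_dict(config):
    # Hypothetical fallback: prefer the live __dict__, otherwise build a shallow
    # field view for slotted/dataclass configs. dataclasses.asdict() is avoided
    # on purpose because it deep-copies values recursively.
    if hasattr(config, "__dict__"):
        return config.__dict__
    if dataclasses.is_dataclass(config):
        return {f.name: getattr(config, f.name) for f in dataclasses.fields(config)}
    raise TypeError(f"Unsupported config type: {type(config).__name__}")

# config_dict = _parse_transformer_config(_config_as_dict(model[0].config))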

161-162: Use safe_dump and stable key ordering; ensure dir exists.

Safer YAML emission and deterministic file diffs; also avoids surprises if the checkpoint dir wasn’t pre-created.

 with open(run_config_name, "w") as f:
-    yaml.dump(config_dict, f, default_flow_style=False)
+    # Ensure parent exists and write a stable, safe YAML
+    Path(run_config_name).parent.mkdir(parents=True, exist_ok=True)
+    yaml.safe_dump(config_dict, f, default_flow_style=False, sort_keys=True)

171-173: Double‑check deepcopy necessity/safety for modelopt_state.

We still deepcopy the modelopt_state. If that object can contain handles similar to Megatron configs, deepcopy may be fragile or slow. If the goal is only to avoid mutating the original, consider building a sanitized copy instead of full deepcopy—or confirm deepcopy is safe here.
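
If avoiding mutation is the only goal, one possible shape of a sanitized copy is sketched below (hypothetical helper, not part of this PR; it assumes modelopt_state is a nesting of dicts/lists/tuples around leaf values):

def _sanitized_state_copy(obj):
    # Copy the container structure so callers can add or remove keys safely,
    # but keep leaf values (tensors, handles, configs) by reference rather than
    # cloning them, avoiding the fragility and cost of a full deepcopy.
    if isinstance(obj, dict):
        return {k: _sanitized_state_copy(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_sanitized_state_copy(v) for v in obj)
    return obj

# state_for_saving = _sanitized_state_copy(modelopt_state)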

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c60baae and c8da426.

📒 Files selected for processing (1)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, EAGLE1_DEFAULT_CFG in config.py already uses deepcopy(default_eagle_config) in its definition, making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, the EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG configurations in config.py are already deepcopied, so additional deepcopy calls are not needed when using ALGO_TO_CONFIG mapping in tests.
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, both EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG in config.py already use deepcopied configurations (either directly via deepcopy() or through variables that were already deepcopied), making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.
📚 Learning: 2025-09-15T20:46:29.252Z
Learnt from: realAsma
PR: NVIDIA/TensorRT-Model-Optimizer#318
File: modelopt/torch/quantization/plugins/transformers_trainer.py:170-189
Timestamp: 2025-09-15T20:46:29.252Z
Learning: In modelopt/torch/quantization/plugins/transformers_trainer.py, the restore_from_modelopt_state function can accept modelopt_state["modelopt_state_dict"] directly without needing to wrap it in a full dict structure or include modelopt_version.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
📚 Learning: 2025-09-05T19:10:36.393Z
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, the EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG configurations in config.py are already deepcopied, so additional deepcopy calls are not needed when using ALGO_TO_CONFIG mapping in tests.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
📚 Learning: 2025-09-05T19:10:36.393Z
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, both EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG in config.py already use deepcopied configurations (either directly via deepcopy() or through variables that were already deepcopied), making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
📚 Learning: 2025-09-05T19:10:36.393Z
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, EAGLE1_DEFAULT_CFG in config.py already uses deepcopy(default_eagle_config) in its definition, making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality

@ChenhanYu ChenhanYu self-assigned this Sep 19, 2025

codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.84%. Comparing base (c60baae) to head (c8da426).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #349   +/-   ##
=======================================
  Coverage   73.84%   73.84%           
=======================================
  Files         172      172           
  Lines       17453    17453           
=======================================
  Hits        12888    12888           
  Misses       4565     4565           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu requested review from kevalmorabia97 and removed request for yueshen2016 September 19, 2025 21:45
The hunk under discussion in modelopt/torch/opt/plugins/mcore_dist_checkpointing.py:

-config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
+# We avoid deepcopy here since some attributes in Megatron-Bridge config cannot be
+# deepcopy.
+config_dict = _parse_transformer_config(model[0].config.__dict__)


n00b question: what's the reason to save this at all? wouldn't it be redundant with the existing MLM args saved in the checkpoint? will there also be a read API that reloads from this?

i ask because _parse_transformer_config is very simple but doesn't handle enums/lambdas/functional inputs in the dataclass for restoring as is done here: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/utils/yaml_utils.py
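
For reference, a rough sketch of the kind of enum/callable handling being asked about (illustrative only; this is neither the existing _parse_transformer_config nor the Megatron-Bridge yaml_utils implementation):

import enum
from dataclasses import fields, is_dataclass

def _to_yaml_friendly(value):
    # Illustrative conversion of config values before YAML emission.
    if isinstance(value, enum.Enum):
        return value.name  # enum member -> its name string
    if is_dataclass(value) and not isinstance(value, type):
        return {f.name: _to_yaml_friendly(getattr(value, f.name)) for f in fields(value)}
    if callable(value):
        return getattr(value, "__name__", repr(value))  # lambdas/functions -> name or repr
    if isinstance(value, dict):
        return {k: _to_yaml_friendly(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_to_yaml_friendly(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return str(value)  # last resort: stringify unknown objects

config_dict = {"activation_func": lambda x: x, "num_layers": 4}
print({k: _to_yaml_friendly(v) for k, v in config_dict.items()})
# -> {'activation_func': '<lambda>', 'num_layers': 4}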

@ChenhanYu ChenhanYu enabled auto-merge (squash) September 19, 2025 22:23
@ChenhanYu ChenhanYu merged commit 74061f5 into main Sep 19, 2025
26 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/fix_avoid_deep_copy_config branch September 19, 2025 23:06
Labels: none yet
Projects: none yet
Linked issues: none
3 participants