
Conversation

ChenhanYu
Collaborator

@ChenhanYu ChenhanYu commented Sep 19, 2025

What does this PR do?

Type of change: Bug fix

Overview:

As suggested by @ananthsub in #341 (comment), some attributes cannot be deepcopied.
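
For illustration only (none of this is code from the PR; FakeTransformerConfig is a made-up stand-in): a minimal sketch of the failure mode, assuming the config holds a handle such as a thread lock that deepcopy cannot pickle.

import copy
import threading
from dataclasses import dataclass, field

# Hypothetical stand-in for a config object whose attributes include a
# non-deep-copyable handle (here a thread lock).
@dataclass
class FakeTransformerConfig:
    hidden_size: int = 1024
    _lock: object = field(default_factory=threading.Lock)

cfg = FakeTransformerConfig()

try:
    copy.deepcopy(cfg.__dict__)
except TypeError as err:
    # e.g. "cannot pickle '_thread.lock' object"
    print(f"deepcopy failed: {err}")

# Passing cfg.__dict__ directly (as this PR does with model[0].config.__dict__)
# sidesteps the copy while leaving the original config untouched, as long as the
# downstream parser builds a fresh dict instead of mutating this one.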

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@ChenhanYu ChenhanYu requested a review from a team as a code owner September 19, 2025 21:30
@ChenhanYu ChenhanYu requested a review from realAsma September 19, 2025 21:30

coderabbitai bot commented Sep 19, 2025

Walkthrough

Removed the deepcopy of model[0].config.__dict__ in save_sharded_modelopt_state, using the config dict directly because it contains non-deep-copyable attributes. Added explanatory comments. No other logic changed; YAML writing and version tagging remain the same.

Changes

Cohort / File(s): Checkpoint Config Handling — modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
Summary of Changes: Replaced the deepcopy of the config with a direct dict reference when constructing the transformer run config; added comments explaining that the Megatron-Bridge config contains non-deep-copyable attributes. No API/signature changes; the rest of the save logic and YAML emission is unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

I nibbled at a copy, took a lighter leap,
No burrowing deep—some fields refuse to sleep.
A YAML moon, a versioned night,
I hop past mirrors, travel light.
With configs close and carrots near,
The checkpoint’s clear—no cloning fear.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check ✅ Passed — The title "Avoid deepcopy megatron model config" concisely and accurately summarizes the primary change in the changeset (removing a deepcopy of the Megatron model config in mcore_dist_checkpointing.py), is specific to the main change, and is short and clear for a teammate scanning history.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch chenhany/fix_avoid_deep_copy_config

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ChenhanYu ChenhanYu changed the title from "fix: avoid deepcopy megatron model config" to "Avoid deepcopy megatron model config" Sep 19, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (3)

157-160: Deepcopy removal is correct and safer for Megatron configs.

Using the live __dict__ and parsing into a fresh dict avoids deepcopy failures and does not mutate the source. Nice. Two small hardening nits below.

  • Ensure all supported Megatron configs actually expose .config.__dict__ (or are dataclasses) in your CI matrix. If any don't, we should guard accordingly; a sketch of one possible guard follows. Do you want me to add a defensive fallback?
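
A minimal sketch of such a guard, reusing the names from the file under review (model, _parse_transformer_config); the helper itself is hypothetical and not part of this PR:

import dataclasses

def _config_as_dict(config):
    # Hypothetical fallback: prefer the live __dict__, otherwise build a shallow
    # field view for slotted/dataclass configs. dataclasses.asdict() is avoided
    # on purpose because it deep-copies values recursively.
    if hasattr(config, "__dict__"):
        return config.__dict__
    if dataclasses.is_dataclass(config):
        return {f.name: getattr(config, f.name) for f in dataclasses.fields(config)}
    raise TypeError(f"Unsupported config type: {type(config).__name__}")

# config_dict = _parse_transformer_config(_config_as_dict(model[0].config))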

161-162: Use safe_dump and stable key ordering; ensure dir exists.

Safer YAML emission and deterministic file diffs; also avoids surprises if the checkpoint dir wasn’t pre-created.

 with open(run_config_name, "w") as f:
-    yaml.dump(config_dict, f, default_flow_style=False)
+    # Ensure parent exists and write a stable, safe YAML
+    Path(run_config_name).parent.mkdir(parents=True, exist_ok=True)
+    yaml.safe_dump(config_dict, f, default_flow_style=False, sort_keys=True)

171-173: Double‑check deepcopy necessity/safety for modelopt_state.

We still deepcopy the modelopt_state. If that object can contain handles similar to Megatron configs, deepcopy may be fragile or slow. If the goal is only to avoid mutating the original, consider building a sanitized copy instead of full deepcopy—or confirm deepcopy is safe here.
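
If avoiding mutation is the only goal, one possible shape of a sanitized copy is sketched below (hypothetical helper, not part of this PR; it assumes modelopt_state is a nesting of dicts/lists/tuples around leaf values):

def _sanitized_state_copy(obj):
    # Copy the container structure so callers can add or remove keys safely,
    # but keep leaf values (tensors, handles, configs) by reference rather than
    # cloning them, avoiding the fragility and cost of a full deepcopy.
    if isinstance(obj, dict):
        return {k: _sanitized_state_copy(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_sanitized_state_copy(v) for v in obj)
    return obj

# state_for_saving = _sanitized_state_copy(modelopt_state)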

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c60baae and c8da426.

📒 Files selected for processing (1)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, EAGLE1_DEFAULT_CFG in config.py already uses deepcopy(default_eagle_config) in its definition, making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, the EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG configurations in config.py are already deepcopied, so additional deepcopy calls are not needed when using ALGO_TO_CONFIG mapping in tests.
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, both EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG in config.py already use deepcopied configurations (either directly via deepcopy() or through variables that were already deepcopied), making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.
📚 Learning: 2025-09-15T20:46:29.252Z
Learnt from: realAsma
PR: NVIDIA/TensorRT-Model-Optimizer#318
File: modelopt/torch/quantization/plugins/transformers_trainer.py:170-189
Timestamp: 2025-09-15T20:46:29.252Z
Learning: In modelopt/torch/quantization/plugins/transformers_trainer.py, the restore_from_modelopt_state function can accept modelopt_state["modelopt_state_dict"] directly without needing to wrap it in a full dict structure or include modelopt_version.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
📚 Learning: 2025-09-05T19:10:36.393Z
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, the EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG configurations in config.py are already deepcopied, so additional deepcopy calls are not needed when using ALGO_TO_CONFIG mapping in tests.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
📚 Learning: 2025-09-05T19:10:36.393Z
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, both EAGLE1_DEFAULT_CFG and EAGLE3_DEFAULT_CFG in config.py already use deepcopied configurations (either directly via deepcopy() or through variables that were already deepcopied), making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
📚 Learning: 2025-09-05T19:10:36.393Z
Learnt from: yeyu-nvidia
PR: NVIDIA/TensorRT-Model-Optimizer#295
File: tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:35-39
Timestamp: 2025-09-05T19:10:36.393Z
Learning: In the TensorRT-Model-Optimizer codebase, EAGLE1_DEFAULT_CFG in config.py already uses deepcopy(default_eagle_config) in its definition, making additional deepcopy calls unnecessary when using ALGO_TO_CONFIG mapping in tests.

Applied to files:

  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality

@ChenhanYu ChenhanYu self-assigned this Sep 19, 2025

codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.84%. Comparing base (c60baae) to head (c8da426).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #349   +/-   ##
=======================================
  Coverage   73.84%   73.84%           
=======================================
  Files         172      172           
  Lines       17453    17453           
=======================================
  Hits        12888    12888           
  Misses       4565     4565           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu requested review from kevalmorabia97 and removed request for yueshen2016 September 19, 2025 21:45
The hunk under discussion in modelopt/torch/opt/plugins/mcore_dist_checkpointing.py:

-config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
+# We avoid deepcopy here since some attributes in Megatron-Bridge config cannot be
+# deepcopy.
+config_dict = _parse_transformer_config(model[0].config.__dict__)


n00b question: what's the reason to save this at all? wouldn't it be redundant with the existing MLM args saved in the checkpoint? will there also be a read API that reloads from this?

i ask because _parse_transformer_config is very simple but doesn't handle enums/lambdas/functional inputs in the dataclass for restoring as is done here: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/utils/yaml_utils.py
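
For reference, a rough sketch of the kind of enum/callable handling being asked about (illustrative only; this is neither the existing _parse_transformer_config nor the Megatron-Bridge yaml_utils implementation):

import enum
from dataclasses import fields, is_dataclass

def _to_yaml_friendly(value):
    # Illustrative conversion of config values before YAML emission.
    if isinstance(value, enum.Enum):
        return value.name  # enum member -> its name string
    if is_dataclass(value) and not isinstance(value, type):
        return {f.name: _to_yaml_friendly(getattr(value, f.name)) for f in fields(value)}
    if callable(value):
        return getattr(value, "__name__", repr(value))  # lambdas/functions -> name or repr
    if isinstance(value, dict):
        return {k: _to_yaml_friendly(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_to_yaml_friendly(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return str(value)  # last resort: stringify unknown objects

config_dict = {"activation_func": lambda x: x, "num_layers": 4}
print({k: _to_yaml_friendly(v) for k, v in config_dict.items()})
# -> {'activation_func': '<lambda>', 'num_layers': 4}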

@ChenhanYu ChenhanYu enabled auto-merge (squash) September 19, 2025 22:23
@ChenhanYu ChenhanYu merged commit 74061f5 into main Sep 19, 2025
26 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/fix_avoid_deep_copy_config branch September 19, 2025 23:06
Labels: none yet
Projects: none yet
Linked issues: none
3 participants