fix: parallel state initialization error in Megatron to HF model conversion #1120
terrykong merged 10 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Stan Kirdey <stan@inflection.ai>
…text fix: temporary distributed context to handle pipeline-parallel megatron checkpoint
Walkthrough
Introduces an import-time fallback for temporary_distributed_context from megatron.bridge.training. Wraps model export in a CPU "gloo" distributed context, calling bridge.load_megatron_model with skip_temp_dist_context=True, then bridge.save_hf_pretrained. Existing output path checks and mcore state reset remain unchanged.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Exporter as community_import.py
    participant Bridge as megatron.bridge
    participant Dist as temporary_distributed_context (gloo)
    User->>Exporter: export(input_path, output_path)
    Exporter->>Exporter: validate output path
    alt training bridge available
        Exporter->>Dist: enter context (backend="gloo")
        activate Dist
        Note right of Dist: CPU-based distributed context
        Exporter->>Bridge: load_megatron_model(input_path, skip_temp_dist_context=True)
        Bridge-->>Exporter: MegatronModel
        Exporter->>Bridge: save_hf_pretrained(model, output_path)
        Bridge-->>Exporter: saved
        Exporter->>Dist: exit context
        deactivate Dist
    else missing training bridge
        Exporter-->>User: ImportError("megatron.bridge.training is not available.")
    end
    Exporter->>Exporter: reset mcore state
    Exporter-->>User: done
```
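The flow in the diagram can be sketched with a stand-in context manager. This is an illustrative pattern only: the real `temporary_distributed_context` lives in `megatron.bridge.training.model_load_save` and sets up an actual torch.distributed "gloo" process group; here it is simulated so the ordering of enter, load, save, and exit is visible.

```python
from contextlib import contextmanager

events = []  # records the order of operations for illustration

@contextmanager
def temporary_distributed_context(backend="gloo"):
    # Stand-in for the bridge's context manager: it initializes a
    # CPU process group on enter and tears it down on exit.
    events.append(f"init distributed ({backend})")
    try:
        yield
    finally:
        events.append("destroy distributed")

def export(input_path, output_path):
    # Both load and save run inside ONE context, so the (simulated)
    # pipeline-parallel group stays alive across the whole conversion.
    with temporary_distributed_context(backend="gloo"):
        events.append(f"load_megatron_model({input_path})")
        events.append(f"save_hf_pretrained -> {output_path}")

export("ckpt/megatron", "out/hf")
print(events)
```

The key point the diagram encodes is that "destroy distributed" only happens after both the load and the save have completed.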
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/models/megatron/community_import.py (1)
15-16: Make overwrite=True actually overwrite to avoid partial writes.
Currently, overwrite=True doesn't clear the target dir; save may fail or mix old/new files.
Apply:
```diff
-import os
+import os
+import shutil
@@
-    if os.path.exists(output_path) and not overwrite:
-        raise FileExistsError(
-            f"HF checkpoint already exists at {output_path}. Delete it to run or set overwrite=True."
-        )
+    if os.path.exists(output_path):
+        if overwrite:
+            shutil.rmtree(output_path)
+        else:
+            raise FileExistsError(
+                f"HF checkpoint already exists at {output_path}. Delete it to run or set overwrite=True."
+            )
```

Also applies to: 102-106, 115-120
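The suggested overwrite behaviour can be exercised standalone with stdlib only. The helper name `prepare_output_dir` and the paths are illustrative, not part of the repo's API; the logic mirrors the diff above.

```python
import os
import shutil
import tempfile

def prepare_output_dir(output_path, overwrite=False):
    # Mirrors the suggested check: with overwrite=True the stale
    # directory is removed first, avoiding mixed old/new files.
    if os.path.exists(output_path):
        if overwrite:
            shutil.rmtree(output_path)
        else:
            raise FileExistsError(
                f"HF checkpoint already exists at {output_path}. "
                "Delete it to run or set overwrite=True."
            )
    os.makedirs(output_path)

base = tempfile.mkdtemp()
target = os.path.join(base, "hf_ckpt")
prepare_output_dir(target)                      # fresh directory
open(os.path.join(target, "stale.bin"), "w").close()
prepare_output_dir(target, overwrite=True)      # stale file is gone
print(os.listdir(target))                       # prints: []
```

Without `overwrite=True`, the second call raises FileExistsError instead of silently mixing checkpoints.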
🧹 Nitpick comments (2)
nemo_rl/models/megatron/community_import.py (2)
107-111: Chain ImportError to preserve the root cause (Ruff B904).
Attach the original ImportError so debugging isn't opaque.
Apply:
```diff
-    try:
-        from megatron.bridge.training.model_load_save import temporary_distributed_context
-    except ImportError:
-        raise ImportError("megatron.bridge.training is not available.")
+    try:
+        from megatron.bridge.training.model_load_save import temporary_distributed_context
+    except ImportError as err:
+        raise ImportError("megatron.bridge.training is not available.") from err
```
114-121: Always reset Megatron state even if export fails.
If load/save throws, rerun_state_machine.destroy_rerun_state_machine() won't run. Use finally.
Apply:
```diff
-    # Export performs on CPU with proper distributed context
-    with temporary_distributed_context(backend="gloo"):
-        # Load the Megatron model
-        megatron_model = bridge.load_megatron_model(input_path, skip_temp_dist_context=True)
-
-        # Save in HuggingFace format
-        bridge.save_hf_pretrained(megatron_model, output_path)
-
-    # resetting mcore state
-    import megatron.core.rerun_state_machine
-
-    megatron.core.rerun_state_machine.destroy_rerun_state_machine()
+    try:
+        # Export performs on CPU with proper distributed context
+        with temporary_distributed_context(backend="gloo"):
+            # Load the Megatron model
+            megatron_model = bridge.load_megatron_model(
+                input_path, skip_temp_dist_context=True
+            )
+            # Save in HuggingFace format
+            bridge.save_hf_pretrained(megatron_model, output_path)
+    finally:
+        # resetting mcore state
+        import megatron.core.rerun_state_machine
+        megatron.core.rerun_state_machine.destroy_rerun_state_machine()
```

Also applies to: 122-126
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/models/megatron/community_import.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_rl/models/megatron/community_import.py (1)
nemo_rl/models/policy/megatron_policy_worker.py (1)
prepare_for_lp_inference(1694-1697)
🪛 Ruff (0.12.2)
nemo_rl/models/megatron/community_import.py
110-110: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
110-110: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
nemo_rl/models/megatron/community_import.py (2)
115-120: Good fix: wrapping load+save in one temp distributed context.
This keeps the PP group initialized across both steps and addresses the original AssertionError. The explicit gloo backend and skip_temp_dist_context=True look correct.
95-101: hf_tokenizer_path is unused: wire it through or drop it.
In nemo_rl/models/megatron/community_import.py the hf_tokenizer_path parameter is unused (call at line ~120: bridge.save_hf_pretrained(megatron_model, output_path)). Repo search did not find save_hf_pretrained/load_megatron_model definitions; confirm whether bridge.save_hf_pretrained accepts a tokenizer_path.
Option A (preferred if supported by bridge API): pass tokenizer path.
```diff
-    bridge.save_hf_pretrained(megatron_model, output_path)
+    bridge.save_hf_pretrained(megatron_model, output_path, tokenizer_path=hf_tokenizer_path)
```

Option B: if unsupported, remove the parameter from the function signature and all call sites.
```diff
-def export_model_from_megatron(
-    hf_model_name: str,
-    input_path: str,
-    output_path: str,
-    hf_tokenizer_path: str,
-    overwrite: bool = False,
-):
+def export_model_from_megatron(
+    hf_model_name: str,
+    input_path: str,
+    output_path: str,
+    overwrite: bool = False,
+):
```
@yaoyu-33 can you please review
ZhiyuLi-Nvidia
left a comment
LGTM! Thank you @skirdey-inflection for your contribution.
Hi @skirdey-inflection! Thanks for the contribution. Can you update your branch and apply the pre-commit hooks to lint?
lint Signed-off-by: Stan Kirdey <stan@inflection.ai>
typo fix Signed-off-by: Stan Kirdey <stan@inflection.ai>
…ersion (#1120) Signed-off-by: Stan Kirdey <stan@inflection.ai> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
…ersion (NVIDIA-NeMo#1120) Signed-off-by: Stan Kirdey <stan@inflection.ai>
…ersion (NVIDIA-NeMo#1120) Signed-off-by: Stan Kirdey <stan@inflection.ai>
…ersion (NVIDIA-NeMo#1120) Signed-off-by: Stan Kirdey <stan@inflection.ai> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Error:

```
[rank0]:     pp_rank = parallel_state.get_pipeline_model_parallel_rank()
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/parallel_state.py", line 1474, in get_pipeline_model_parallel_rank
[rank0]:     return torch.distributed.get_rank(group=get_pipeline_model_parallel_group())
[rank0]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "RL/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/parallel_state.py", line 1288, in get_pipeline_model_parallel_group
[rank0]:     _PIPELINE_MODEL_PARALLEL_GROUP is not None
[rank0]: AssertionError: pipeline_model parallel group is not initialized
```
Fix parallel state initialization error in Megatron to HF model conversion
Wraps both model loading and HF saving operations within the same temporary distributed context to ensure the pipeline parallel group remains initialized throughout the conversion process, preventing the "pipeline_model parallel group is not initialized" AssertionError.
Tested on PP megatron checkpoint to HF conversion.