Conversation
cuichenx left a comment:
LGTM, comments and questions below
📝 Walkthrough
This pull request introduces comprehensive support for Nemotron 3 Nano models with Mixture of Experts (MoE) across the Megatron-Bridge framework. Changes include a new Nemotron3NanoProvider with MoE configurations, recipe functions for pretrain/finetune workflows, example training scripts, model parameter mappings for HuggingFace conversion, float32 expert bias preservation in checkpointing, and extensive test coverage for conversion and recipe execution.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/functional_tests/models/deepseek/test_deepseek_conversion.py (1)
96-97: Duplicate `save_pretrained` call. The model is saved twice consecutively on lines 96 and 97. This appears to be unintentional and wastes I/O.
🐛 Proposed fix
```diff
  # Save model and config
  model.save_pretrained(model_dir, safe_serialization=True)
- model.save_pretrained(model_dir, safe_serialization=True)
```
src/megatron/bridge/training/utils/train_utils.py (1)
565-580: Add defensive check for `hybrid_override_pattern` in hybrid model case. The code correctly counts "E" occurrences in `hybrid_override_pattern` for MoE layer tracking. However, line 567 calls `.count("E")` without validation. While current Nemotron-H configs properly set `hybrid_override_pattern` when `is_hybrid_model=True`, the Optional type definition in the base class permits None, and there's no validation preventing misconfiguration. To align with the defensive pattern used in `flop_utils.py`, add a guard:
Suggested fix
```python
if config.model.is_hybrid_model:
    if config.model.hybrid_override_pattern:
        layers = config.model.hybrid_override_pattern.count("E")
    else:
        # Fallback or raise error if pattern is missing
        layers = config.model.num_layers
else:
    layers = config.model.num_layers
```
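For illustration only, the same guard can be exercised against a small stub; the `ModelStub` dataclass and `count_layers` helper below are assumptions for the sketch, not Megatron-Bridge code:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stub mirroring only the fields the guard reads.
@dataclass
class ModelStub:
    is_hybrid_model: bool
    hybrid_override_pattern: Optional[str]
    num_layers: int

def count_layers(model: ModelStub) -> int:
    # Same guard as the suggested fix: only count "E" when a pattern exists
    if model.is_hybrid_model:
        if model.hybrid_override_pattern:
            return model.hybrid_override_pattern.count("E")
        return model.num_layers  # fallback when the pattern is missing
    return model.num_layers

print(count_layers(ModelStub(True, "M-M-E*-E*-", 10)))  # 2
print(count_layers(ModelStub(True, None, 10)))          # 10, no AttributeError
print(count_layers(ModelStub(False, None, 12)))         # 12
```

Without the guard, the second call would crash with `AttributeError: 'NoneType' object has no attribute 'count'`.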
🤖 Fix all issues with AI agents
In `@examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py`:
- Around line 66-69: The call to pretrain_config is passing an unsupported
tokenizer_model argument which will raise a TypeError; remove the
tokenizer_model=args.tokenizer_model argument from the pretrain_config(...)
invocation (keep per_split_data_args_path=args.per_split_data_args_path as-is)
so the function uses its hardcoded tokenizer (or NullTokenizer behavior)
instead; locate the call to pretrain_config in pretrain_nemotron_3_nano.py and
delete the tokenizer_model parameter.
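The failure mode can be reproduced standalone; the `pretrain_config` below is a hypothetical stand-in that, like the real recipe helper, accepts `per_split_data_args_path` but not `tokenizer_model`:

```python
# Hypothetical stand-in, not the actual recipe code: same shape of
# signature as pretrain_config, with no tokenizer_model parameter.
def pretrain_config(per_split_data_args_path=None):
    return {"per_split_data_args_path": per_split_data_args_path}

# Passing the unsupported keyword fails at call time.
try:
    pretrain_config(per_split_data_args_path="splits.json", tokenizer_model="tok.model")
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Dropping tokenizer_model fixes the invocation.
cfg = pretrain_config(per_split_data_args_path="splits.json")
print(cfg["per_split_data_args_path"])  # splits.json
```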
In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 41-42: The docstring for the TypedDict Nemotron3NanoCommonKwargs
incorrectly references "Nemotron Next 3B v2"; update the docstring to correctly
describe that these are typed options for the Nemotron 3 Nano recipe helper
functions (reference the TypedDict name Nemotron3NanoCommonKwargs) so the
documentation matches the actual model; keep the wording concise and accurate
(e.g., "Typed options accepted by Nemotron 3 Nano recipe helper functions.").
- Around line 186-187: The docstring incorrectly documents non-existent
parameters tokenizer_model and vocab_size; update the function's docstring to
remove those parameters and instead describe how the tokenizer is configured via
the use_null_tokenizer boolean flag (and any real parameters present in the
function signature), ensuring the docstring names match the actual signature
(e.g., reference use_null_tokenizer) and clarifies expected behavior when
use_null_tokenizer is True/False.
In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py`:
- Around line 84-86: The loop is iterating model.named_parameters() so the
buffer e_score_correction_bias is skipped; change the iteration to
model.named_buffers() and perform the same dtype conversion for entries where
the buffer name contains "e_score_correction_bias" (i.e., replace the loop using
named_parameters() with one using named_buffers() and keep the v.data =
v.data.to(torch.float32) logic for the matching buffer).
In `@tests/functional_tests/recipes/test_nemotronh_recipes_pretrain.py`:
- Around line 49-64: The model_overrides dict in
NEMOTRON_3_NANO_PRETRAIN_RECIPES sets "n_routed_experts": 16 which doesn't match
the actual attribute name on the model; change that key to "num_moe_experts": 16
so the override applies to the Nemotron3NanoProvider's model (see
Nemotron3NanoProvider and the NEMOTRON_3_NANO_PRETRAIN_RECIPES constant and
ensure the model_overrides uses num_moe_experts instead of n_routed_experts).
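The silent miss can be shown with a minimal stub; `Nemotron3NanoProviderSketch` is a hypothetical stand-in for the real provider, where only the field name matters:

```python
from dataclasses import dataclass, fields

# Hypothetical stand-in for Nemotron3NanoProvider.
@dataclass
class Nemotron3NanoProviderSketch:
    num_moe_experts: int = 8

# The wrong key from the review: not a field on the provider, so an
# override using it never reaches the model config.
overrides = {"n_routed_experts": 16}
valid_fields = {f.name for f in fields(Nemotron3NanoProviderSketch)}
print(set(overrides) - valid_fields)  # {'n_routed_experts'}

# With the corrected key the override applies as intended.
provider = Nemotron3NanoProviderSketch(**{"num_moe_experts": 16})
print(provider.num_moe_experts)  # 16
```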
🧹 Nitpick comments (8)
src/megatron/bridge/training/utils/wandb_utils.py (1)
51-52: Consider including exception details in the error message for easier debugging. The broad `Exception` catch is acceptable here for robustness (and matches the pattern at line 77), but capturing the exception details would help diagnose wandb issues without requiring additional investigation.
♻️ Suggested improvement
```diff
- except Exception:
-     print_rank_last(f" failed to log checkpoint {checkpoint_path} to wandb")
+ except Exception as e:
+     print_rank_last(f" failed to log checkpoint {checkpoint_path} to wandb: {e}")
```
Note: The same improvement could be applied to `on_load_checkpoint_success` at lines 77-78 for consistency.
src/megatron/bridge/recipes/utils/optimizer_utils.py (1)
200-208: Consider adding scheduler parameters to the sample-based variant for consistency. The `distributed_fused_adam_with_cosine_annealing_samples` function still uses hardcoded scheduler values while the iteration-based variant now accepts configurable parameters. If sample-based training ever needs different weight decay or LR decay styles, this would require a code change.
src/megatron/bridge/recipes/nemotronh/__init__.py (1)
15-26: Minor comment placement inconsistency. The comment `# Nemotron Nano v2 models` on line 15 is now followed by the Nemotron 3 Nano import block, while the actual v2 import is on lines 21-26. Consider reordering to keep comments adjacent to their related imports:
♻️ Suggested reordering
```diff
-# Nemotron Nano v2 models
 # Nemotron 3 Nano models
 from megatron.bridge.recipes.nemotronh.nemotron_3_nano import (
     nemotron_3_nano_finetune_config,
     nemotron_3_nano_pretrain_config,
 )
+
+# Nemotron Nano v2 models
 from megatron.bridge.recipes.nemotronh.nemotron_nano_v2 import (
```
tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py (1)
278-279: Consider revising the assertion for dataclass attributes. Since `result` is a `NemotronHModelProvider` dataclass instance, `hasattr(result, "num_moe_experts")` will return `True` if the field exists in the class definition (even if not set during construction). This assertion may not accurately test the intended behavior. Consider checking against the expected default value or verifying the field wasn't explicitly set:
```diff
- assert not hasattr(result, "num_moe_experts") or result.num_moe_experts is None
+ # When n_routed_experts is 0, MoE configs should not be populated.
+ # The provider should either not have the field or it should be None/default
+ assert getattr(result, "num_moe_experts", None) is None
```
tests/functional_tests/models/nemotronh/test_nemotron_h_conversion.py (3)
404-405: Minor: Loop variable naming. The variable `keys` suggests plural, but it iterates one key at a time. Consider renaming to `key` for clarity.
```diff
- for keys in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
-     assert config_data[keys] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[keys]
+ for key in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
+     assert config_data[key] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[key]
```
433-434: Replace `assert False` with `pytest.fail()` for clearer test failures. While `assert False` works in practice (pytest doesn't use `-O`), using `pytest.fail()` is more explicit and won't be removed by Python optimization. The same applies to line 501.
Suggested fix
```diff
- assert False, f"Failed to load created Nemotron-3-Nano toy model: {e}"
+ pytest.fail(f"Failed to load created Nemotron-3-Nano toy model: {e}")
```
And at line 501:
```diff
- assert False, f"Nemotron-3-Nano {test_name} conversion failed with return code {result.returncode}"
+ pytest.fail(f"Nemotron-3-Nano {test_name} conversion failed with return code {result.returncode}")
```
526-529: Same loop variable naming issue as line 404.
```diff
- for keys in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
-     assert saved_config[keys] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[keys], (
-         f"{keys} should match toy config"
+ for key in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
+     assert saved_config[key] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[key], (
+         f"{key} should match toy config"
      )
```
471-491: Consider extracting shared model configuration logic. The model provider instantiation (lines 471-486) and DeepEP configuration (lines 488-491) are nearly identical to the pretrain helper (lines 208-228). Extracting these into a shared helper would reduce duplication and ensure consistency.
♻️ Example helper extraction
```python
def _create_nemotron_3_nano_model_config(
    model_provider: type[Nemotron3NanoProvider],
    tensor_model_parallel_size: int,
    pipeline_model_parallel_size: int,
    pipeline_parallelism_dtype: Optional[torch.dtype],
    virtual_pipeline_parallelism: Optional[int],
    context_parallel_size: int,
    sequence_parallelism: bool,
    expert_tensor_parallelism: int,
    expert_model_parallelism: int,
    enable_deepep: bool,
) -> Nemotron3NanoProvider:
    model_cfg = model_provider(
        tensor_model_parallel_size=tensor_model_parallel_size,
        pipeline_model_parallel_size=pipeline_model_parallel_size,
        pipeline_dtype=pipeline_parallelism_dtype,
        virtual_pipeline_model_parallel_size=virtual_pipeline_parallelism,
        context_parallel_size=context_parallel_size,
        sequence_parallel=sequence_parallelism,
        expert_tensor_parallel_size=expert_tensor_parallelism,
        expert_model_parallel_size=expert_model_parallelism,
        apply_rope_fusion=False,
        async_tensor_model_parallel_allreduce=True,
        attention_backend="fused",
        gradient_accumulation_fusion=True,
        init_method_std=0.0173,
        use_fused_weighted_squared_relu=True,
    )
    if enable_deepep:
        model_cfg.moe_token_dispatcher_type = "flex"
        model_cfg.moe_shared_expert_overlap = False
        model_cfg.moe_flex_dispatcher_backend = "deepep"
    return model_cfg
```
cuichenx left a comment:
LGTM!
If you could add some loss curves to the PR description that would be great
Add support for Nemotron-3-Nano model #1858
HF repo: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
These checklist items are required for all models in Megatron Bridge
[x] Model providers
[x] Model bridge for HF conversion
[x] Unit tests (config and bridge)
[x] Model conversion functional tests
[x] Optimal pretraining recipe
[x] Optimal finetuning recipe
[x] Recipe unit tests
[x] Recipe functional tests
Training curves
Pretraining: (loss curve image)
SFT: (loss curve image)
LoRA: (loss curve image)