Nemotron-3-Nano Model Support#1914

Merged
liding-nv merged 21 commits into main from liding/nano-v3-pr on Jan 27, 2026
Conversation

@liding-nv (Contributor) commented Jan 12, 2026

Add support for Nemotron-3-Nano model #1858

HF repo: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

These checklist items are required for all models in Megatron Bridge:

[x] Model providers
[x] Model bridge for HF conversion
[x] Unit tests (config and bridge)
[x] Model conversion functional tests
[x] Optimal pretraining recipe
[x] Optimal finetuning recipe
[x] Recipe unit tests
[x] Recipe functional tests

Training curves

Pretraining: [W&B loss curve chart, Jan 26, 2026]

SFT: [W&B loss curve chart, Jan 26, 2026]

LoRA: [W&B loss curve chart, Jan 26, 2026]

copy-pr-bot bot commented Jan 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@liding-nv (Contributor, Author)

/ok to test 58617fc

@liding-nv (Contributor, Author)

/ok to test 98f2c6b

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv (Contributor, Author)

/ok to test 2109f36

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv liding-nv marked this pull request as ready for review January 21, 2026 23:43
@liding-nv liding-nv requested a review from yaoyu-33 January 21, 2026 23:43
@liding-nv (Contributor, Author)

/ok to test b8f986b

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv (Contributor, Author)

/ok to test 7c70aa2

@chtruong814 (Contributor)

/ok to test f5deffd

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv (Contributor, Author)

/ok to test f9b45fc

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv (Contributor, Author)

/ok to test d2a4b32

@cuichenx (Contributor) left a comment

LGTM, comments and questions below

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

This pull request introduces comprehensive support for Nemotron 3 Nano models with Mixture of Experts (MoE) across the Megatron-Bridge framework. Changes include a new Nemotron3NanoProvider with MoE configurations, recipe functions for pretrain/finetune workflows, example training scripts, model parameter mappings for HuggingFace conversion, float32 expert bias preservation in checkpointing, and extensive test coverage for conversion and recipe execution.

Changes

- Model Provider & Bridge Core (src/megatron/bridge/models/__init__.py, src/megatron/bridge/models/nemotronh/__init__.py, src/megatron/bridge/models/nemotronh/nemotron_h_provider.py): Added the Nemotron3NanoProvider dataclass (3B model with 18 MoE/architecture parameters) and extended NemotronHModelProvider with 9 new MoE configuration fields (aux_loss_coeff, router settings, grouped_gemm, etc.). Exposed both in the public API.
- Model Bridge MoE Mapping (src/megatron/bridge/models/nemotronh/nemotron_h_bridge.py): Introduced HuggingFace-to-Megatron MoE parameter mappings for routed/shared experts, router gates, and e_score_correction_bias; populates MoE configs when hf_config.n_routed_experts > 0.
- Model Float32 Expert Bias (src/megatron/bridge/models/model_provider.py): Added iteration over model submodules to call _maintain_float32_expert_bias() during FP16/BF16 wrapping, preserving expert bias in float32 within Float16Module.
- Recipe Configurations (src/megatron/bridge/recipes/nemotronh/__init__.py, src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py): New nemotron_3_nano.py module with public nemotron_3_nano_pretrain_config() and nemotron_3_nano_finetune_config() functions, internal _common() helpers, and TypedDicts for parameter management; supports DeepEP MoE, PEFT (LoRA), tokenizer selection, and distributed training configuration.
- Optimizer Scheduler Parameters (src/megatron/bridge/recipes/utils/optimizer_utils.py): Extended distributed_fused_adam_with_cosine_annealing() with start_weight_decay, end_weight_decay, weight_decay_incr_style, and lr_decay_style parameters; SchedulerConfig now uses these instead of hardcoded values.
- Example Training Scripts (examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py, examples/recipes/nemotron_3/finetune_nemotron_3_nano.py): Two new scripts (105 and 106 lines) orchestrating Nemotron 3 Nano pretraining and finetuning; both include CLI argument parsing, OmegaConf config merging, Hydra-style override support, and distributed process cleanup.
- Checkpointing & Metadata (src/megatron/bridge/training/checkpointing.py): Added a _clean_metadata_for_serialization() call when saving distributed checkpoint content_metadata to ensure proper serialization.
- Training Metrics & Logging (src/megatron/bridge/training/utils/train_utils.py, src/megatron/bridge/training/utils/wandb_utils.py): MoE layer counting in training_log now differentiates hybrid vs. standard models; wrapped wandb on_save_checkpoint_success in try/except to prevent exception propagation.
- Model Conversion Float32 Casting (tests/functional_tests/models/deepseek/test_deepseek_conversion.py, tests/functional_tests/models/glm/test_glm45_conversion.py): Added casting of e_score_correction_bias parameters to float32 during model serialization in both the Deepseek and GLM conversion flows.
- Provider Unit Tests (tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py): Added the Nemotron3NanoProvider import and unit tests validating default configuration, MoE settings, overridability, and inheritance relationships.
- Bridge Unit Tests (tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py): Added MoE configuration mapping tests validating presence/absence of MoE configs based on the n_routed_experts flag; extended registry tests for HF-to-Megatron MoE parameter mappings.
- Conversion Functional Tests (tests/functional_tests/models/nemotronh/test_nemotron_h_conversion.py): Added a TestNemotron3NanoConversion class with an HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES constant, a toy-model creation fixture, and parameterized conversion tests across TP/PP configurations; replaced direct dynamic module calls with the dynamic_module_utils namespace.
- Pretrain Recipe Functional Tests (tests/functional_tests/recipes/test_nemotronh_recipes_pretrain.py): Added a TestNemotron3NanoRecipes class with a NEMOTRON_3_NANO_PRETRAIN_RECIPES constant and a parameterized test for pretrain recipe execution with specified parallelism/model overrides.
- Finetune Recipe Functional Tests (tests/functional_tests/recipes/test_nemotronh_recipes_finetune.py): Added a TestNemotron3NanoFinetuneRecipes class with toy model/checkpoint fixtures, LoRA/DoRA/SFT test variants, MoE overrides (HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES and MEGATRON_NEMOTRON_3_NANO_OVERRIDES), and a _get_finetune_config() adapter method.
- Recipe Unit Tests (tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py): A 377-line test suite covering pretrain/finetune configurations with defaults, the DeepEP toggle, custom parallelism, data paths, precision variants, and CommOverlapConfig validation.
- Recipe Test Utilities (tests/functional_tests/recipes/utils.py, tests/functional_tests/models/nemotronh/test_nemotron_h_provider.py): Removed a duplicate global_batch_size assignment in run_pretrain_recipe_test; added Nemotron3NanoProvider to the HF_MODEL_ID_TO_BRIDGE_MODEL_PROVIDER mapping.
- Comparison Precision Fix (examples/conversion/compare_hf_and_megatron/compare.py): Changed the hf_logits initialization dtype from bfloat16 to float32 in the 1-step HF vs. Megatron comparison path.
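The optimizer-utils change summarized above (configurable scheduler parameters replacing hardcoded values) can be sketched as follows. This is a minimal illustration, not the actual Megatron-Bridge code: the SchedulerConfig stand-in and its defaults are assumptions; only the four new parameter names come from the change summary.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    # Illustrative stand-in for Megatron-Bridge's SchedulerConfig.
    start_weight_decay: float = 0.1
    end_weight_decay: float = 0.1
    weight_decay_incr_style: str = "constant"
    lr_decay_style: str = "cosine"


def distributed_fused_adam_with_cosine_annealing(
    start_weight_decay: float = 0.1,
    end_weight_decay: float = 0.1,
    weight_decay_incr_style: str = "constant",
    lr_decay_style: str = "cosine",
) -> SchedulerConfig:
    # The scheduler config now receives the caller's values
    # instead of hardcoded constants.
    return SchedulerConfig(
        start_weight_decay=start_weight_decay,
        end_weight_decay=end_weight_decay,
        weight_decay_incr_style=weight_decay_incr_style,
        lr_decay_style=lr_decay_style,
    )


cfg = distributed_fused_adam_with_cosine_annealing(
    start_weight_decay=0.033, end_weight_decay=0.033
)
```

Recipes that need a non-default decay schedule (such as the Nemotron 3 Nano recipe) can now pass these values through instead of patching the scheduler after construction.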

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. The PR introduces major changes (Nemotron-3-Nano, MoE configuration, new recipes) but the PR description lacks concrete test results, metrics, and convergence validation evidence. Resolution: add test execution summaries, convergence validation results, MoE configuration verification across parallelism settings, and performance metrics to the PR description.

✅ Passed checks (3)
  • Title check — ✅ Passed. The PR title 'Nemotron-3-Nano Model Support' accurately and concisely describes the main objective of the changeset: adding support for a new Nemotron model variant.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 98.39%, above the required threshold of 80.00%.
  • Description Check — ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/functional_tests/models/deepseek/test_deepseek_conversion.py (1)

96-97: Duplicate save_pretrained call.

The model is saved twice consecutively on lines 96 and 97. This appears to be unintentional and wastes I/O.

🐛 Proposed fix
         # Save model and config
         model.save_pretrained(model_dir, safe_serialization=True)
-        model.save_pretrained(model_dir, safe_serialization=True)
src/megatron/bridge/training/utils/train_utils.py (1)

565-580: Add defensive check for hybrid_override_pattern in hybrid model case.

The code correctly counts "E" occurrences in hybrid_override_pattern for MoE layer tracking. However, line 567 calls .count("E") without validation. While current Nemotron-H configs properly set hybrid_override_pattern when is_hybrid_model=True, the Optional type definition in the base class permits None, and there's no validation preventing misconfiguration.

To align with the defensive pattern used in flop_utils.py, add a guard:

Suggested fix
if config.model.is_hybrid_model:
    if config.model.hybrid_override_pattern:
        layers = config.model.hybrid_override_pattern.count("E")
    else:
        # Fallback or raise error if pattern is missing
        layers = config.model.num_layers
else:
    layers = config.model.num_layers
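The guarded counting logic suggested above can be written as a self-contained sketch. The function name and the pattern letters here are illustrative (the real code reads these fields from config.model); the only grounded detail is that each "E" in hybrid_override_pattern marks an MoE layer.

```python
from typing import Optional


def count_moe_layers(
    is_hybrid_model: bool,
    hybrid_override_pattern: Optional[str],
    num_layers: int,
) -> int:
    """Count MoE layers, guarding against a missing override pattern."""
    if is_hybrid_model and hybrid_override_pattern:
        # In hybrid models, each "E" in the pattern marks an MoE layer.
        return hybrid_override_pattern.count("E")
    # Fallback when the pattern is absent or the model is not hybrid.
    return num_layers


print(count_moe_layers(True, "M*EM*E", 6))   # hybrid pattern: 2 MoE layers
print(count_moe_layers(True, None, 6))       # missing pattern: falls back to 6
print(count_moe_layers(False, None, 12))     # standard model: 12
```

The guard turns a potential AttributeError on a misconfigured provider into a well-defined fallback, matching the defensive pattern the reviewer cites from flop_utils.py.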
🤖 Fix all issues with AI agents
In `@examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py`:
- Around line 66-69: The call to pretrain_config is passing an unsupported
tokenizer_model argument which will raise a TypeError; remove the
tokenizer_model=args.tokenizer_model argument from the pretrain_config(...)
invocation (keep per_split_data_args_path=args.per_split_data_args_path as-is)
so the function uses its hardcoded tokenizer (or NullTokenizer behavior)
instead; locate the call to pretrain_config in pretrain_nemotron_3_nano.py and
delete the tokenizer_model parameter.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 41-42: The docstring for the TypedDict Nemotron3NanoCommonKwargs
incorrectly references "Nemotron Next 3B v2"; update the docstring to correctly
describe that these are typed options for the Nemotron 3 Nano recipe helper
functions (reference the TypedDict name Nemotron3NanoCommonKwargs) so the
documentation matches the actual model; keep the wording concise and accurate
(e.g., "Typed options accepted by Nemotron 3 Nano recipe helper functions.").
- Around line 186-187: The docstring incorrectly documents non-existent
parameters tokenizer_model and vocab_size; update the function's docstring to
remove those parameters and instead describe how the tokenizer is configured via
the use_null_tokenizer boolean flag (and any real parameters present in the
function signature), ensuring the docstring names match the actual signature
(e.g., reference use_null_tokenizer) and clarifies expected behavior when
use_null_tokenizer is True/False.

In `@tests/functional_tests/models/deepseek/test_deepseek_conversion.py`:
- Around line 84-86: The loop is iterating model.named_parameters() so the
buffer e_score_correction_bias is skipped; change the iteration to
model.named_buffers() and perform the same dtype conversion for entries where
the buffer name contains "e_score_correction_bias" (i.e., replace the loop using
named_parameters() with one using named_buffers() and keep the v.data =
v.data.to(torch.float32) logic for the matching buffer).

In `@tests/functional_tests/recipes/test_nemotronh_recipes_pretrain.py`:
- Around line 49-64: The model_overrides dict in
NEMOTRON_3_NANO_PRETRAIN_RECIPES sets "n_routed_experts": 16 which doesn't match
the actual attribute name on the model; change that key to "num_moe_experts": 16
so the override applies to the Nemotron3NanoProvider's model (see
Nemotron3NanoProvider and the NEMOTRON_3_NANO_PRETRAIN_RECIPES constant and
ensure the model_overrides uses num_moe_experts instead of n_routed_experts).
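The named_parameters()-vs-named_buffers() distinction behind the Deepseek-test fix above can be demonstrated with a minimal torch module. The Router class below is illustrative, not the actual model code; it only assumes that e_score_correction_bias is registered as a buffer rather than a parameter, which is what the review comment states.

```python
import torch
import torch.nn as nn


class Router(nn.Module):
    """Toy stand-in for an MoE router that keeps a bias as a buffer."""

    def __init__(self, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(8, num_experts)
        # Registered as a buffer, so named_parameters() will not yield it.
        self.register_buffer(
            "e_score_correction_bias",
            torch.zeros(num_experts, dtype=torch.bfloat16),
        )


model = Router(4)

# A cast written against named_parameters() never touches the buffer:
assert all("e_score_correction_bias" not in n for n, _ in model.named_parameters())

# Iterating named_buffers() instead, as the fix suggests, reaches it:
for name, buf in model.named_buffers():
    if "e_score_correction_bias" in name:
        buf.data = buf.data.to(torch.float32)

print(model.e_score_correction_bias.dtype)  # torch.float32
```

This is why the original loop silently left the bias in bfloat16: the loop ran without error but matched nothing.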
🧹 Nitpick comments (8)
src/megatron/bridge/training/utils/wandb_utils.py (1)

51-52: Consider including exception details in the error message for easier debugging.

The broad Exception catch is acceptable here for robustness (and matches the pattern at line 77), but capturing the exception details would help diagnose wandb issues without requiring additional investigation.

♻️ Suggested improvement
-        except Exception:
-            print_rank_last(f"  failed to log checkpoint {checkpoint_path} to wandb")
+        except Exception as e:
+            print_rank_last(f"  failed to log checkpoint {checkpoint_path} to wandb: {e}")

Note: The same improvement could be applied to on_load_checkpoint_success at line 77-78 for consistency.

src/megatron/bridge/recipes/utils/optimizer_utils.py (1)

200-208: Consider adding scheduler parameters to the sample-based variant for consistency.

The distributed_fused_adam_with_cosine_annealing_samples function still uses hardcoded scheduler values while the iteration-based variant now accepts configurable parameters. If sample-based training ever needs different weight decay or LR decay styles, this would require a code change.

src/megatron/bridge/recipes/nemotronh/__init__.py (1)

15-26: Minor comment placement inconsistency.

The comment # Nemotron Nano v2 models on line 15 is now followed by the Nemotron 3 Nano import block, while the actual v2 import is on lines 21-26. Consider reordering to keep comments adjacent to their related imports:

♻️ Suggested reordering
-# Nemotron Nano v2 models
 # Nemotron 3 Nano models
 from megatron.bridge.recipes.nemotronh.nemotron_3_nano import (
     nemotron_3_nano_finetune_config,
     nemotron_3_nano_pretrain_config,
 )
+
+# Nemotron Nano v2 models
 from megatron.bridge.recipes.nemotronh.nemotron_nano_v2 import (
tests/unit_tests/models/nemotronh/test_nemotron_h_bridge.py (1)

278-279: Consider revising the assertion for dataclass attributes.

Since result is a NemotronHModelProvider dataclass instance, hasattr(result, "num_moe_experts") will return True if the field exists in the class definition (even if not set during construction). This assertion may not accurately test the intended behavior.

Consider checking against the expected default value or verifying the field wasn't explicitly set:

-        assert not hasattr(result, "num_moe_experts") or result.num_moe_experts is None
+        # When n_routed_experts is 0, MoE configs should not be populated
+        # The provider should either not have the field or it should be None/default
+        assert getattr(result, "num_moe_experts", None) is None
tests/functional_tests/models/nemotronh/test_nemotron_h_conversion.py (3)

404-405: Minor: Loop variable naming.

The variable keys suggests plural, but it iterates one key at a time. Consider renaming to key for clarity.

-        for keys in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
-            assert config_data[keys] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[keys]
+        for key in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
+            assert config_data[key] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[key]

433-434: Replace assert False with pytest.fail() for clearer test failures.

While assert False works in practice (pytest doesn't run under -O), pytest.fail() is more explicit and won't be removed by Python optimization. The same applies to line 501.

Suggested fix
-            assert False, f"Failed to load created Nemotron-3-Nano toy model: {e}"
+            pytest.fail(f"Failed to load created Nemotron-3-Nano toy model: {e}")

And at line 501:

-                assert False, f"Nemotron-3-Nano {test_name} conversion failed with return code {result.returncode}"
+                pytest.fail(f"Nemotron-3-Nano {test_name} conversion failed with return code {result.returncode}")

526-529: Same loop variable naming issue as line 404.

-            for keys in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
-                assert saved_config[keys] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[keys], (
-                    f"{keys} should match toy config"
+            for key in HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES.keys():
+                assert saved_config[key] == HF_NEMOTRON_3_NANO_TOY_MODEL_OVERRIDES[key], (
+                    f"{key} should match toy config"
                 )
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py (1)

471-491: Consider extracting shared model configuration logic.

The model provider instantiation (lines 471-486) and DeepEP configuration (lines 488-491) are nearly identical to the pretrain helper (lines 208-228). Extracting these into a shared helper would reduce duplication and ensure consistency.

♻️ Example helper extraction
def _create_nemotron_3_nano_model_config(
    model_provider: type[Nemotron3NanoProvider],
    tensor_model_parallel_size: int,
    pipeline_model_parallel_size: int,
    pipeline_parallelism_dtype: Optional[torch.dtype],
    virtual_pipeline_parallelism: Optional[int],
    context_parallel_size: int,
    sequence_parallelism: bool,
    expert_tensor_parallelism: int,
    expert_model_parallelism: int,
    enable_deepep: bool,
) -> Nemotron3NanoProvider:
    model_cfg = model_provider(
        tensor_model_parallel_size=tensor_model_parallel_size,
        pipeline_model_parallel_size=pipeline_model_parallel_size,
        pipeline_dtype=pipeline_parallelism_dtype,
        virtual_pipeline_model_parallel_size=virtual_pipeline_parallelism,
        context_parallel_size=context_parallel_size,
        sequence_parallel=sequence_parallelism,
        expert_tensor_parallel_size=expert_tensor_parallelism,
        expert_model_parallel_size=expert_model_parallelism,
        apply_rope_fusion=False,
        async_tensor_model_parallel_allreduce=True,
        attention_backend="fused",
        gradient_accumulation_fusion=True,
        init_method_std=0.0173,
        use_fused_weighted_squared_relu=True,
    )
    if enable_deepep:
        model_cfg.moe_token_dispatcher_type = "flex"
        model_cfg.moe_shared_expert_overlap = False
        model_cfg.moe_flex_dispatcher_backend = "deepep"
    return model_cfg

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv (Contributor, Author)

/ok to test 10b7f9b

cuichenx previously approved these changes Jan 26, 2026

@cuichenx (Contributor) left a comment

LGTM!

If you could add some loss curves to the PR description, that would be great.

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv (Contributor, Author)

/ok to test 0c5ef6c

@ko3n1g (Contributor)

ko3n1g commented Jan 27, 2026

/ok to test 07824cc
