
Add refactored recipes for finetuning #2268

Open

athitten wants to merge 32 commits into main from athitten/sft_recipe_refactor

Conversation

@athitten
Contributor

@athitten athitten commented Feb 6, 2026

What does this PR do ?

  1. Splits the finetuning recipes for all models in src/megatron/bridge/recipes into separate _sft_config and _peft_config functions.
  2. Extends the parameterless API introduced in PR Add refactored recipe files for pretrain configs of LLMs #2067 to the SFT and PEFT recipes (see the usage sketch below).
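
For illustration, usage of the new parameterless entry points looks roughly like this (a hedged sketch: llama32_1b_peft_config is assumed by analogy with llama32_1b_sft_config, which is referenced later in this page, and the checkpoint path is a placeholder):

# Hedged usage sketch of the parameterless recipes; the customization fields
# shown are ones quoted in the review below.
from megatron.bridge.recipes.llama.llama3 import llama32_1b_peft_config, llama32_1b_sft_config

cfg = llama32_1b_sft_config()  # no arguments: returns a fully populated ConfigContainer
cfg.checkpoint.pretrained_checkpoint = "/path/to/pretrained/ckpt"  # placeholder path
cfg.model.tensor_model_parallel_size = 1

peft_cfg = llama32_1b_peft_config()  # defaults to LoRA via default_peft_config("lora")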

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added SFT (Supervised Fine-Tuning) configuration functions for multiple model families across recipes.
    • Introduced PEFT (Parameter-Efficient Fine-Tuning) configuration builders supporting LoRA and other efficient fine-tuning methods.
  • Refactor

    • Unified model configuration interfaces for improved consistency across recipes.
    • Simplified configuration API for easier model setup and customization.

Signed-off-by: Abhishree <abhishreetm@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

This PR refactors recipe configurations across multiple model families by introducing centralized helper functions (_sft_common(), _peft_common()) in common.py and systematically replacing legacy finetune-based config builders with new SFT and PEFT-oriented factory functions.
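
As a rough sketch of that pattern (the helper import path follows the file location listed below; whether _sft_common() takes arguments is an assumption, and the per-model fields are ones quoted elsewhere in this review):

# Illustrative only: shared defaults first, then the model-specific delta.
from megatron.bridge.recipes.common import _sft_common  # assumed signature: no required args

def example_model_sft_config():
    cfg = _sft_common()                        # dataset, optimizer, scheduler, logging, checkpoint defaults
    cfg.model.tensor_model_parallel_size = 1   # per-model parallelism delta
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.sequence_parallel = False
    return cfg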

Changes

  • Common Recipe Helpers (src/megatron/bridge/recipes/common.py): Added _sft_common() and _peft_common() helper functions returning ConfigContainer with base SFT- and PEFT-specific defaults, including dataset, optimizer, scheduler, logging, checkpointing, and LoRA configuration.
  • Gemma Recipe Refactoring (src/megatron/bridge/recipes/gemma/gemma2.py, gemma3.py): Replaced finetune-based configs with new SFT/PEFT factories for model sizes (2B, 9B, 27B for Gemma2; 1B for Gemma3); removed TypedDict-based kwargs; introduced _adjust_gemma_vocab_size() helpers for tokenizer alignment.
  • LLM OSS Recipe Refactoring (src/megatron/bridge/recipes/gpt_oss/gpt_oss.py, src/megatron/bridge/recipes/llama/llama3.py): Systematically replaced finetune configs with SFT/PEFT factories across GPT-OSS (20B, 120B) and Llama variants (Llama 3.2 1B/3B, Llama 3 8B/70B, Llama 3.1 8B/70B/405B); removed legacy finetune TypedDicts; extended the public API with multiple SFT/PEFT entry points.
  • GLM Recipe Updates (src/megatron/bridge/recipes/glm/glm45.py): Refactored GLM-4.5 variants (355B, AIR-106B) to use the _pretrain_common, _sft_common, and _peft_common base builders; replaced TypedDict definitions with direct ConfigContainer construction and customization.
  • Qwen Recipe Consolidation (src/megatron/bridge/recipes/qwen/qwen2.py, qwen3.py, qwen3_moe.py, qwen3_next.py): Introduced extensive SFT/PEFT config factories for Qwen2/2.5 (500M–72B), Qwen3 (600M–32B), Qwen3-MoE (30B-A3B, 235B-A22B), and Qwen3-Next variants; removed finetune-based builders and legacy TypedDict argument wrappers.
  • Nemotron Recipe Modernization (src/megatron/bridge/recipes/nemotron/nemotron_3_nano.py, nemotron_nano_v2.py, nemotronh.py): Refactored all Nemotron variants (3-Nano, Nano-V2 9B/12B, NemotronH 4B–56B) to use the centralized SFT/PEFT helpers; removed finetune-specific config functions and TypedDict definitions; updated __all__ exports to include the new SFT/PEFT entry points.
  • OLMoE & Moonlight Recipe Updates (src/megatron/bridge/recipes/olmoe/olmoe_7b.py, src/megatron/bridge/recipes/moonlight/moonlight_16b.py): Added new SFT/PEFT config factories for OLMoE-7B and Moonlight-16B; consolidated configuration via the base helpers instead of inline TypedDict definitions; maintained model provider and MoE-specific settings in the new builders.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

r0.3.0

Suggested reviewers

  • yaoyu-33
  • ko3n1g
  • ananthsub
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: PR with major refactoring affecting 16+ recipe files lacks the required test results, convergence validation, and performance metrics documentation before marking ready for review. Resolution: run the functional test suite for all affected models and document results confirming successful training and convergence for the SFT and PEFT paths.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: The title clearly and concisely describes the main change: refactoring and adding new recipes for finetuning with SFT and PEFT separation.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/recipes/common.py`:
- Around line 297-304: The PEFT base config validation fails because
CheckpointConfig.pretrained_checkpoint is set to None while
ConfigContainer.validate() requires it when peft is enabled; update the
CheckpointConfig initialization in the recipes/common.py snippet (the
checkpoint=CheckpointConfig(...) block) to default pretrained_checkpoint to a
non-null placeholder (for example use the same checkpoint_dir or checkpoint.load
value) or ensure the config helper explicitly sets
checkpoint.pretrained_checkpoint when peft is set so ConfigContainer.validate()
passes.

In `@src/megatron/bridge/recipes/gemma/gemma3.py`:
- Around line 331-382: The DDP config contradicts the optimizer config: PEFT
sets cfg.optimizer.use_distributed_optimizer = False but
cfg.ddp.use_distributed_optimizer remains True; update the DDP setting to match
the optimizer by setting cfg.ddp.use_distributed_optimizer to the optimizer
value (e.g., cfg.ddp.use_distributed_optimizer =
cfg.optimizer.use_distributed_optimizer) or explicitly to False so distributed
optimizer is disabled consistently (refer to
cfg.optimizer.use_distributed_optimizer and cfg.ddp.use_distributed_optimizer).
- Around line 134-145: The function _adjust_gemma3_vocab_size currently calls
AutoTokenizer.from_pretrained without handling load errors and lacks type hints;
update its signature to include type hints (e.g., model_cfg: Any or a more
specific config type, hf_path: str) and an explicit return type (-> None), wrap
the tokenizer load and length check in a try/except that catches Exception
around AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True) and on
failure simply log or silently skip the vocab adjustment (do not re-raise), then
only adjust model_cfg.vocab_size when tokenizer is successfully loaded and its
length is larger; reference symbols: _adjust_gemma3_vocab_size, model_cfg,
hf_path, AutoTokenizer.from_pretrained.

In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py`:
- Line 15: Replace the old typing Union usage by removing the redundant import
"from typing import Union" and switching all occurrences of Union[str, PEFT] to
the new Python 3.10 union syntax "str | PEFT"; update the import list
accordingly (remove Union) and change the type annotations in gpt_oss.py where
found (around the import area and the annotations near the symbols referenced at
lines ~509-511 and ~640-642) so type hints use "str | PEFT" instead of
Union[str, PEFT].

In `@src/megatron/bridge/recipes/llama/llama3.py`:
- Around line 1011-1013: The assignment cfg.mixed_precision = bf16_mixed() is
dead code because it appears after return in llama31_8b_pretrain_config; remove
that line (or if you intended to override the default, move the assignment
before the return). Prefer removing the line to match other 8B configs that rely
on _pretrain_common()'s default, unless you specifically want to force
bf16_mixed().

In `@src/megatron/bridge/recipes/qwen/qwen2.py`:
- Around line 1182-1189: In the SFT configs (e.g., qwen2_72b_sft_config,
qwen25_32b_sft_config, qwen25_72b_sft_config) you left cfg.model.pipeline_dtype
= None while setting cfg.model.pipeline_model_parallel_size > 1; update those
config objects to set cfg.model.pipeline_dtype = torch.bfloat16 (same as
qwen2_72b_pretrain_config) whenever pipeline_model_parallel_size > 1 to match
the pretrain requirement and avoid runtime misconfiguration.

In `@src/megatron/bridge/recipes/qwen/qwen3_moe.py`:
- Around line 257-259: In the SFT/PEFT config functions
qwen3_30b_a3b_sft_config, qwen3_235b_a22b_sft_config, and
qwen3_235b_a22b_peft_config ensure that when
cfg.model.pipeline_model_parallel_size > 1 you set cfg.model.pipeline_dtype =
torch.bfloat16 (instead of None); mirror the pretrain configs
(qwen3_30b_a3b_pretrain_config and qwen3_235b_a22b_pretrain_config) which set
pipeline_dtype=torch.bfloat16 for multi-stage PP so the pipeline uses bfloat16
when pipeline parallelism is enabled.

In `@src/megatron/bridge/recipes/qwen/qwen3_next.py`:
- Line 15: Remove the unused "from typing import Union" import and update any
type annotations that use Union[str, PEFT] to the modern union syntax str |
PEFT; search for occurrences in this file (e.g., in the function/class
signatures that reference PEFT) and replace Union[str, PEFT] with str | PEFT so
the code follows the guideline and the typing remains correct.

In `@src/megatron/bridge/recipes/qwen/qwen3.py`:
- Around line 946-950: The SFT config sets
cfg.model.pipeline_model_parallel_size = 2 but leaves cfg.model.pipeline_dtype =
None which will cause a runtime error for PP>1; update the SFT config so
cfg.model.pipeline_dtype is set to torch.bfloat16 (same as the pretrain 32B
setting) or conditionally set cfg.model.pipeline_dtype = torch.bfloat16 whenever
cfg.model.pipeline_model_parallel_size > 1; adjust any imports if needed and
ensure the assignment occurs near the existing parallelism settings (references:
cfg.model.pipeline_model_parallel_size and cfg.model.pipeline_dtype).
🟡 Minor comments (16)
src/megatron/bridge/recipes/llama/llama3.py-1222-1237 (1)

1222-1237: ⚠️ Potential issue | 🟡 Minor

Parallelism settings are assigned after the CP-dependent dataset check.

At line 1227, cfg.model.context_parallel_size is read (from whatever AutoBridge defaults to), and then at line 1236, it's explicitly set to 1. If AutoBridge ever defaults CP to something other than 1, pad_seq_to_mult would be incorrectly configured. This ordering appears across all Llama SFT and PEFT configs.

Consider moving the parallelism block (lines 1230–1237) before the packed sequence check (lines 1226–1228) so the CP check reads the intended value.

Proposed reordering (showing llama32_1b_sft_config)
     cfg.dataset.packed_sequence_specs.packed_sequence_size = 4096

-    # Packed sequence settings
-    if cfg.model.context_parallel_size > 1:
-        cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
-
     # Parallelism settings
     cfg.model.pipeline_model_parallel_layout = None
     cfg.model.tensor_model_parallel_size = 1
     cfg.model.pipeline_model_parallel_size = 1
     cfg.model.pipeline_dtype = None
     cfg.model.virtual_pipeline_model_parallel_size = None
     cfg.model.context_parallel_size = 1
     cfg.model.sequence_parallel = False

+    # Packed sequence settings
+    if cfg.model.context_parallel_size > 1:
+        cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
+
src/megatron/bridge/recipes/qwen/qwen3_next.py-22-22 (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Unused import: default_peft_config.

default_peft_config is imported but never used since qwen3_next_80b_a3b_peft_config raises NotImplementedError without calling it. This is also flagged by Flake8 (F401).

Proposed fix
-from megatron.bridge.recipes.utils.finetune_utils import default_peft_config, default_squad_config
+from megatron.bridge.recipes.utils.finetune_utils import default_squad_config
src/megatron/bridge/recipes/gpt_oss/gpt_oss.py-282-293 (1)

282-293: ⚠️ Potential issue | 🟡 Minor

CP guard is checked before context_parallel_size is explicitly assigned.

At Line 282, cfg.model.context_parallel_size is read from the AutoBridge-loaded model provider defaults. The explicit assignment cfg.model.context_parallel_size = 1 doesn't happen until Line 291 — nine lines later. If AutoBridge returns a model with CP > 1, the guard would fire and set pad_seq_to_mult, but then CP gets overridden to 1 anyway, leaving a stale padding config.

The same issue exists in gpt_oss_120b_sft_config (Lines 404 vs 413) and both PEFT configs (Lines 540 vs 549, Lines 671 vs 680).

Move the parallelism assignments before the CP guard to match the pattern used in the Qwen recipes, or simply assign CP first.

Proposed fix (showing 20b SFT)
+    # Parallelism settings (MoE-specific)
+    cfg.model.pipeline_model_parallel_layout = None
+    cfg.model.tensor_model_parallel_size = 1
+    cfg.model.pipeline_model_parallel_size = 1
+    cfg.model.pipeline_dtype = torch.bfloat16
+    cfg.model.virtual_pipeline_model_parallel_size = None
+    cfg.model.context_parallel_size = 1
+    cfg.model.expert_model_parallel_size = 8
+    cfg.model.sequence_parallel = False
+
     # Packed sequence settings
     if cfg.model.context_parallel_size > 1:
         cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
 
-    # Parallelism settings (MoE-specific)
-    cfg.model.pipeline_model_parallel_layout = None
-    cfg.model.tensor_model_parallel_size = 1
-    ...
src/megatron/bridge/recipes/qwen/qwen3.py-1486-1510 (1)

1486-1510: ⚠️ Potential issue | 🟡 Minor

32B PEFT with TP=1, PP=1 may be tight on 80 GB GPUs.

The 32B model requires ~64 GB in bf16 for base weights alone. With TP=1 / PP=1 each GPU must hold the full model replica, leaving very little headroom for activations on 80 GB devices (A100/H100). The pretrain and SFT 32B configs both use TP=8 / PP=2.

Consider documenting a minimum VRAM requirement or providing an alternate config with higher TP for users on 80 GB hardware.
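
For reference, the ~64 GB figure follows directly from bf16 storage at 2 bytes per parameter:

# Back-of-envelope check (illustrative)
params = 32e9
bytes_per_param = 2  # bf16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~64 GB for base weights alone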

src/megatron/bridge/recipes/gemma/gemma2.py-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Remove unused CommOverlapConfig import.

Flake8 flags it as unused.

🛠️ Suggested fix
-from megatron.bridge.training.comm_overlap import CommOverlapConfig
src/megatron/bridge/recipes/moonlight/moonlight_16b.py-519-526 (1)

519-526: ⚠️ Potential issue | 🟡 Minor

Sort __all__ to satisfy RUF022.

🛠️ Suggested fix
-__all__ = [
-    # Pretrain config
-    "moonlight_16b_pretrain_config",
-    # SFT config
-    "moonlight_16b_sft_config",
-    # PEFT config
-    "moonlight_16b_peft_config",
-]
+__all__ = [
+    "moonlight_16b_peft_config",
+    "moonlight_16b_pretrain_config",
+    "moonlight_16b_sft_config",
+]
src/megatron/bridge/recipes/gemma/gemma3.py-25-25 (1)

25-25: ⚠️ Potential issue | 🟡 Minor

Remove unused CommOverlapConfig import.

Flake8 flags it as unused.

🛠️ Suggested fix
-from megatron.bridge.training.comm_overlap import CommOverlapConfig
src/megatron/bridge/recipes/moonlight/moonlight_16b.py-450-455 (1)

450-455: ⚠️ Potential issue | 🟡 Minor

Validate PEFT schemes instead of assigning raw strings.

If a caller passes an unsupported string (e.g., "none"), cfg.peft becomes a str and PEFT application will break. default_peft_config already normalizes/validates string inputs.

🛠️ Suggested fix
-    if isinstance(peft_scheme, str) and peft_scheme.lower() in ["lora", "dora"]:
-        cfg.peft = default_peft_config(peft_scheme)
-    else:
-        cfg.peft = peft_scheme
+    cfg.peft = default_peft_config(peft_scheme)
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py-466-473 (1)

466-473: ⚠️ Potential issue | 🟡 Minor

Sort __all__ to satisfy RUF022.

🛠️ Suggested fix
-__all__ = [
-    # Pretrain config
-    "nemotron_3_nano_pretrain_config",
-    # SFT config
-    "nemotron_3_nano_sft_config",
-    # PEFT config
-    "nemotron_3_nano_peft_config",
-]
+__all__ = [
+    "nemotron_3_nano_peft_config",
+    "nemotron_3_nano_pretrain_config",
+    "nemotron_3_nano_sft_config",
+]
src/megatron/bridge/recipes/olmoe/olmoe_7b.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 union syntax instead of Union[X, Y].

Remove the Union import from line 15 and replace Union[str, PEFT] with str | PEFT in the peft_scheme parameter at line 331.

🛠️ Suggested fix
-from typing import Union
-
@@
-def olmoe_7b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def olmoe_7b_peft_config(
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 union types for peft_scheme.

Line 15 imports Union from typing, but the codebase requires Python 3.10+ (per coding guidelines), which supports PEP 604 union syntax (X | Y). Remove the Union import and update the type annotation on line 307 to use str | PEFT instead of Union[str, PEFT].

🛠️ Suggested fix
-from typing import Union
-
@@
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
src/megatron/bridge/recipes/gemma/gemma2.py-16-17 (1)

16-17: ⚠️ Potential issue | 🟡 Minor

Replace Union[str, PEFT] with str | PEFT for all three PEFT builder functions and remove the Union import.

All three PEFT config functions (gemma2_2b_peft_config, gemma2_9b_peft_config, gemma2_27b_peft_config) use Union[str, PEFT] which should be updated to PEP 604 union syntax per coding guidelines. Remove the from typing import Union import at line 16 once these changes are applied.

🛠️ Suggested fixes
-from typing import Union
-
 import torch
 def gemma2_2b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
 def gemma2_9b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
 def gemma2_27b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/gemma/gemma3.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 unions for peft_scheme.

The Union[str, PEFT] type hint at line 265 should use the PEP 604 syntax str | PEFT. After this change, the Union import at line 16 can be removed since it will no longer be used.

🛠️ Suggested fix
-from typing import Union
-
 import torch
-def gemma3_1b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def gemma3_1b_peft_config(
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/glm/glm45.py-536-538 (1)

536-538: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 syntax for union types instead of Union.

Replace Union[str, PEFT] with str | PEFT at lines 537 and 673 to conform to Python 3.10+ standards and coding guidelines.

🛠️ Suggested fix
 def glm45_355b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
 
 def glm45_air_106b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/moonlight/moonlight_16b.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 union types instead of Union.

Switch peft_scheme to str | PEFT and remove the Union import from line 15.

🛠️ Suggested fix
-from typing import Union
-
@@
-def moonlight_16b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def moonlight_16b_peft_config(
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:

Per coding guidelines: Use 'X | Y' for union types instead of 'Union[X, Y]'.

src/megatron/bridge/recipes/gemma/gemma2.py-283-299 (1)

283-299: ⚠️ Potential issue | 🟡 Minor

Narrow exception handling and add type hints.

Catching Exception is too broad and hides real errors. The coding guidelines require type hints for all function arguments and return types, and limiting except clauses to the smallest set of errors. Per the transformers library, AutoTokenizer.from_pretrained() raises OSError (file/authentication issues) and ValueError (config issues). Restructure the try-except to handle only these specific exceptions and move the vocab size adjustment outside the try block to follow the duck-typing pattern.

🛠️ Suggested fix
+import logging
@@
-def _adjust_gemma2_vocab_size(model_cfg, hf_path: str):
+def _adjust_gemma2_vocab_size(model_cfg: object, hf_path: str) -> None:
@@
-        try:
-            from transformers import AutoTokenizer
-
-            tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
-            if len(tokenizer) > model_cfg.vocab_size:
-                model_cfg.vocab_size = len(tokenizer)
-        except Exception:
-            # Skip vocab size adjustment if tokenizer cannot be loaded
-            # (e.g., due to missing HuggingFace authentication)
-            pass
+        try:
+            from transformers import AutoTokenizer
+
+            tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
+        except (OSError, ValueError) as exc:
+            logging.warning("Skipping Gemma2 vocab resize for %s: %s", hf_path, exc)
+            return
+
+        if len(tokenizer) > model_cfg.vocab_size:
+            model_cfg.vocab_size = len(tokenizer)
🧹 Nitpick comments (7)
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)

1-31: Imports and header look good.

The file structure, copyright header, and import organization follow the project conventions.

One minor note: per coding guidelines for Python 3.10+, Union[str, PEFT] (used in PEFT function signatures) could be written as str | PEFT. This is entirely optional given the "Chill" review mode and the consistency with the broader PR convention.

src/megatron/bridge/recipes/nemotronh/nemotronh.py (1)

15-31: Imports and structure are clean.

Same minor note as the Nano v2 file regarding Union vs str | PEFT for Python 3.10+, applicable to all PEFT function signatures in this file.

src/megatron/bridge/recipes/qwen/qwen2.py (2)

906-909: Dead code: context_parallel_size > 1 guard is always false.

In every SFT and PEFT config function, context_parallel_size is hardcoded to 1 a few lines above the guard. The if cfg.model.context_parallel_size > 1 block (e.g., Lines 908-909) will never execute. This pattern repeats across all ~20 new functions in this file.

If this is a user-documentation hint ("change CP here and the padding adjusts"), consider adding a brief inline comment to clarify intent. Otherwise it's misleading dead code.


876-2863: Extreme boilerplate duplication across SFT/PEFT config functions.

Every SFT function (~10 of them) and every PEFT function (~10 of them) in this file are nearly identical — only the HF model ID, tokenizer path, and parallelism settings differ. The shared body (TE, CUDA graph, kernel selection, memory saving, FP8 comments, optimizer precision, checkpoint, DDP) is copy-pasted verbatim and accounts for ~60-70 lines per function.

Consider extracting a _qwen2_sft_common(cfg) and _qwen2_peft_common(cfg) helper that applies the shared knobs, so each public config function only declares the model-specific delta (~15-20 lines each). This would cut ~1500 lines of duplicated code.
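
A minimal sketch of that extraction (the helper name comes from the suggestion above; its parameters and exactly which knobs it owns are assumptions, and the field names are taken from the snippets quoted in this review):

def _qwen2_sft_common(cfg, tp: int = 1, pp: int = 1):
    """Apply the knobs shared by every Qwen2/2.5 SFT recipe, then the parallelism delta."""
    cfg.model.transformer_impl = "transformer_engine"
    cfg.model.cuda_graph_impl = "none"
    cfg.checkpoint.save_interval = 100
    cfg.checkpoint.ckpt_format = "torch_dist"
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.model.tensor_model_parallel_size = tp
    cfg.model.pipeline_model_parallel_size = pp
    return cfg

Each public recipe function would then shrink to the shared base builder plus its model-specific delta (HF model id, tokenizer path, TP/PP).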

src/megatron/bridge/recipes/gpt_oss/gpt_oss.py (1)

255-768: Significant SFT/PEFT boilerplate duplication, similar to qwen2.py.

The 20B and 120B SFT/PEFT configs share ~90% of their body (TE, CUDA graph, kernel, memory saving, FP8, optimizer precision, checkpoint, DDP, MoE settings). A shared _gpt_oss_finetune_common(cfg) helper applying these shared knobs would reduce duplication and maintenance burden.

src/megatron/bridge/recipes/qwen/qwen3_moe.py (1)

529-532: PEFT LoRA customization (dim=8, alpha=16) is applied conditionally — custom PEFT instances bypass it silently.

When a user passes a custom PEFT instance (not a string), default_peft_config returns it as-is, and the isinstance(peft_scheme, str) guard at Line 529 skips the MoE-specific dim/alpha/target_modules overrides. This may be intentional (user supplies a fully configured PEFT), but the docstring mentions "LoRA/DoRA uses dim=8, alpha=16" without noting that custom instances won't get these overrides.

Consider adding a brief note to the docstring or a log warning if the user's custom PEFT doesn't match the expected MoE LoRA config.

Also applies to: 668-672
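
One possible shape for such a warning (hedged: the dim/alpha attribute names on the PEFT instance are assumptions, hence the defensive getattr calls):

import logging

if not isinstance(peft_scheme, str):
    if getattr(peft_scheme, "dim", None) != 8 or getattr(peft_scheme, "alpha", None) != 16:
        logging.warning(
            "Custom PEFT instance does not match the MoE LoRA defaults (dim=8, alpha=16); "
            "using the user-supplied configuration as-is."
        )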

src/megatron/bridge/recipes/qwen/qwen3.py (1)

502-1578: Consider extracting shared SFT/PEFT boilerplate into a helper.

The 12 new config functions share ~90% identical code (CUDA graph settings, FP8 comments, optimizer precision, kernel selections, DDP config, checkpoint config, memory saving defaults). They differ only in model name, TP/PP, and recompute for 32B.

A small helper like _qwen3_sft(model_id, tp, pp, *, recompute=False, pipeline_dtype=None) could consolidate this, leaving each public function as a 1–3 line wrapper. This would also reduce the risk of per-variant inconsistencies (e.g., the pipeline_dtype omission in the 32B SFT config).

Not blocking — the explicit style is consistent with the pretrain configs and may be intentional for recipe discoverability.

Comment on lines +297 to +304
checkpoint=CheckpointConfig(
save_interval=100,
save=checkpoint_dir,
load=checkpoint_dir,
pretrained_checkpoint=None, # Set to load from pretrained weights
ckpt_format="torch_dist",
fully_parallel_save=True,
),

⚠️ Potential issue | 🟠 Major

PEFT base config fails validation without a pretrained checkpoint.

ConfigContainer.validate() asserts checkpoint.pretrained_checkpoint when peft is set; leaving it None makes the parameterless PEFT configs error unless callers remember to patch it. Consider defaulting it to a placeholder (e.g., checkpoint_dir or checkpoint.load) or explicitly wiring it via the helper.

🛠️ Suggested fix
         checkpoint=CheckpointConfig(
             save_interval=100,
             save=checkpoint_dir,
             load=checkpoint_dir,
-            pretrained_checkpoint=None,  # Set to load from pretrained weights
+            pretrained_checkpoint=checkpoint_dir,  # Override with actual pretrained weights path
             ckpt_format="torch_dist",
             fully_parallel_save=True,
         ),
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/common.py` around lines 297 - 304, The PEFT base
config validation fails because CheckpointConfig.pretrained_checkpoint is set to
None while ConfigContainer.validate() requires it when peft is enabled; update
the CheckpointConfig initialization in the recipes/common.py snippet (the
checkpoint=CheckpointConfig(...) block) to default pretrained_checkpoint to a
non-null placeholder (for example use the same checkpoint_dir or checkpoint.load
value) or ensure the config helper explicitly sets
checkpoint.pretrained_checkpoint when peft is set so ConfigContainer.validate()
passes.

Comment on lines +134 to +145
def _adjust_gemma3_vocab_size(model_cfg, hf_path: str):
"""Adjust vocab size for Gemma3 (model vocab < tokenizer vocab).

Note: This requires HuggingFace authentication for Gemma3 models.
If the tokenizer cannot be loaded, the vocab size adjustment is skipped.
"""
if hasattr(model_cfg, "vocab_size") and hf_path:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
if len(tokenizer) > model_cfg.vocab_size:
model_cfg.vocab_size = len(tokenizer)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the current state of the file
cat -n src/megatron/bridge/recipes/gemma/gemma3.py | head -150

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 6797


🏁 Script executed:

# Check imports at the top of the file
head -30 src/megatron/bridge/recipes/gemma/gemma3.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1360


🏁 Script executed:

# Search for any existing logging usage or logging imports in the file
rg "logging|import logging" src/megatron/bridge/recipes/gemma/gemma3.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if this function is called elsewhere in the codebase
rg "_adjust_gemma3_vocab_size" --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 358


🏁 Script executed:

# Search for similar error handling patterns in the codebase to understand project conventions
rg "AutoTokenizer.from_pretrained" --type py -A 5 -B 2 | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4903


Add error handling for tokenizer load failures and type hints.

The docstring states that vocab adjustment is skipped if the tokenizer cannot be loaded, but the code lacks exception handling and will raise instead. Additionally, the function is missing type hints for model_cfg and the return type, as required by coding guidelines.

🛠️ Suggested fix
+import logging
@@
-def _adjust_gemma3_vocab_size(model_cfg, hf_path: str):
+def _adjust_gemma3_vocab_size(model_cfg: object, hf_path: str) -> None:
@@
-    if hasattr(model_cfg, "vocab_size") and hf_path:
-        from transformers import AutoTokenizer
-
-        tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
-        if len(tokenizer) > model_cfg.vocab_size:
-            model_cfg.vocab_size = len(tokenizer)
+    if hasattr(model_cfg, "vocab_size") and hf_path:
+        try:
+            from transformers import AutoTokenizer
+
+            tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
+        except (OSError, ValueError) as exc:
+            logging.warning("Skipping Gemma3 vocab resize for %s: %s", hf_path, exc)
+            return
+
+        if len(tokenizer) > model_cfg.vocab_size:
+            model_cfg.vocab_size = len(tokenizer)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gemma/gemma3.py` around lines 134 - 145, The
function _adjust_gemma3_vocab_size currently calls AutoTokenizer.from_pretrained
without handling load errors and lacks type hints; update its signature to
include type hints (e.g., model_cfg: Any or a more specific config type,
hf_path: str) and an explicit return type (-> None), wrap the tokenizer load and
length check in a try/except that catches Exception around
AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True) and on failure
simply log or silently skip the vocab adjustment (do not re-raise), then only
adjust model_cfg.vocab_size when tokenizer is successfully loaded and its length
is larger; reference symbols: _adjust_gemma3_vocab_size, model_cfg, hf_path,
AutoTokenizer.from_pretrained.

Comment on lines 331 to 382
# Optimizer config - PEFT uses higher learning rate
cfg.optimizer.adam_beta2 = 0.95
cfg.optimizer.use_precision_aware_optimizer = False
cfg.optimizer.main_grads_dtype = torch.float32
cfg.optimizer.main_params_dtype = torch.float32
cfg.optimizer.exp_avg_dtype = torch.float32
cfg.optimizer.exp_avg_sq_dtype = torch.float32
cfg.optimizer.use_distributed_optimizer = False # PEFT disables distributed optimizer

# Scheduler config - PEFT uses higher learning rate
cfg.scheduler.lr_warmup_iters = 10
cfg.scheduler.lr_decay_iters = 100
cfg.scheduler.max_lr = 1e-4

# TE (Transformer Engine)
cfg.model.transformer_impl = "transformer_engine"

# CUDA Graph
cfg.model.cuda_graph_impl = "none"
cfg.model.cuda_graph_scope = "full"
cfg.model.cuda_graph_warmup_steps = 3

# Kernel selections - PEFT disables cross_entropy_loss_fusion
cfg.model.attention_backend = None
cfg.model.cross_entropy_loss_fusion = False # PEFT disables this
cfg.model.cross_entropy_fusion_impl = "native" # Gemma3 uses native

# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None

# FP8 & MXFP8 (mixed_precision settings)
# Note: mixed_precision="bf16_mixed" is set in _peft_common as default
# These are defaults for FP8, enable them if using FP8 - FP8 is not enabled by default
# cfg.mixed_precision.fp8_recipe = "tensorwise" # default, uncomment to enable
# cfg.mixed_precision.fp8 = None # not enabled by default
# cfg.mixed_precision.fp8_param_gather = False # default
# cfg.mixed_precision.reuse_grad_buf_for_mxfp8_param_ag = False # default

# Checkpoint config
cfg.checkpoint.save_interval = 100
cfg.checkpoint.ckpt_format = "torch_dist"
cfg.checkpoint.fully_parallel_save = True

# DDP config
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
cfg.ddp.check_for_nan_in_grad = True
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.average_in_collective = True

⚠️ Potential issue | 🟠 Major

Align DDP distributed optimizer flag with optimizer config.

PEFT disables the distributed optimizer, but DDP still sets it to True, which can trip validation or re-enable distributed optimizer behavior unintentionally.

🛠️ Suggested fix
-    cfg.ddp.use_distributed_optimizer = True
+    cfg.ddp.use_distributed_optimizer = False
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gemma/gemma3.py` around lines 331 - 382, The DDP
config contradicts the optimizer config: PEFT sets
cfg.optimizer.use_distributed_optimizer = False but
cfg.ddp.use_distributed_optimizer remains True; update the DDP setting to match
the optimizer by setting cfg.ddp.use_distributed_optimizer to the optimizer
value (e.g., cfg.ddp.use_distributed_optimizer =
cfg.optimizer.use_distributed_optimizer) or explicitly to False so distributed
optimizer is disabled consistently (refer to
cfg.optimizer.use_distributed_optimizer and cfg.ddp.use_distributed_optimizer).


import os
from typing import List, Optional, Union
from typing import Union

🛠️ Refactor suggestion | 🟠 Major

Use str | PEFT instead of Union[str, PEFT].

The coding guidelines require X | Y for union types instead of Union[X, Y] (Python 3.10+). The from typing import Union import (Line 15) can then be removed.

Proposed fix
-from typing import Union
-
 import torch
-def gpt_oss_20b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
-) -> ConfigContainer:
+def gpt_oss_20b_peft_config(
+    peft_scheme: str | PEFT = "lora",
+) -> ConfigContainer:
-def gpt_oss_120b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def gpt_oss_120b_peft_config(
+    peft_scheme: str | PEFT = "lora",

As per coding guidelines, "Use 'X | Y' for union types instead of 'Union[X, Y]'".

Also applies to: 509-511, 640-642

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` at line 15, Replace the old
typing Union usage by removing the redundant import "from typing import Union"
and switching all occurrences of Union[str, PEFT] to the new Python 3.10 union
syntax "str | PEFT"; update the import list accordingly (remove Union) and
change the type annotations in gpt_oss.py where found (around the import area
and the annotations near the symbols referenced at lines ~509-511 and ~640-642)
so type hints use "str | PEFT" instead of Union[str, PEFT].

Comment on lines 1011 to +1013
return cfg

cfg.mixed_precision = bf16_mixed()

⚠️ Potential issue | 🔴 Critical

Unreachable dead code: bf16_mixed() assignment after return.

Line 1013 sets cfg.mixed_precision = bf16_mixed() after return cfg on line 1012 inside llama31_8b_pretrain_config. This line will never execute. Either it should be moved before the return, or removed if the default from _pretrain_common() ("bf16_mixed") is sufficient.

Note: Other Llama pretrain configs (8B base, 1B, 3B) don't set mixed_precision explicitly and rely on the _pretrain_common default, while the 70B/405B variants explicitly use bf16_mixed(). If the intent matches the other 8B configs, just delete the dead line.

Proposed fix — remove the dead code
     cfg.ddp.data_parallel_sharding_strategy = "no_shard"
 
     return cfg
-
-    cfg.mixed_precision = bf16_mixed()
 
 def llama31_70b_pretrain_config() -> ConfigContainer:
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/llama/llama3.py` around lines 1011 - 1013, The
assignment cfg.mixed_precision = bf16_mixed() is dead code because it appears
after return in llama31_8b_pretrain_config; remove that line (or if you intended
to override the default, move the assignment before the return). Prefer removing
the line to match other 8B configs that rely on _pretrain_common()'s default,
unless you specifically want to force bf16_mixed().

Comment on lines +1182 to +1189
# Parallelism settings - needs more parallelism for 72B
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 4
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None
cfg.model.virtual_pipeline_model_parallel_size = None
cfg.model.context_parallel_size = 1
cfg.model.sequence_parallel = False

⚠️ Potential issue | 🔴 Critical

Missing pipeline_dtype for pipeline_model_parallel_size > 1 in SFT configs.

qwen2_72b_sft_config sets pipeline_model_parallel_size = 4 but leaves pipeline_dtype = None (Line 1186). The corresponding pretrain config (qwen2_72b_pretrain_config, Line 312) correctly sets pipeline_dtype = torch.bfloat16 with the comment "Required for PP > 1".

The same issue applies to:

  • qwen25_32b_sft_config (Line 1671, PP=2)
  • qwen25_72b_sft_config (Line 1767, PP=4)

This will likely cause a runtime error or silent misconfiguration when pipeline parallelism is active.

Proposed fix for all affected SFT configs
 # qwen2_72b_sft_config (line 1186)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen25_32b_sft_config (line 1671)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen25_72b_sft_config (line 1767)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen2.py` around lines 1182 - 1189, In the
SFT configs (e.g., qwen2_72b_sft_config, qwen25_32b_sft_config,
qwen25_72b_sft_config) you left cfg.model.pipeline_dtype = None while setting
cfg.model.pipeline_model_parallel_size > 1; update those config objects to set
cfg.model.pipeline_dtype = torch.bfloat16 (same as qwen2_72b_pretrain_config)
whenever pipeline_model_parallel_size > 1 to match the pretrain requirement and
avoid runtime misconfiguration.

Comment on lines +257 to +259
cfg.model.pipeline_model_parallel_size = 2
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None

⚠️ Potential issue | 🔴 Critical

Missing pipeline_dtype = torch.bfloat16 when pipeline_model_parallel_size > 1 in SFT/PEFT configs.

Same issue as in qwen2.py. The pretrain configs correctly set pipeline_dtype = torch.bfloat16 when PP > 1, but the new SFT/PEFT configs set pipeline_dtype = None:

Function PP pipeline_dtype Line
qwen3_30b_a3b_sft_config 2 None 259
qwen3_235b_a22b_sft_config 16 None 381
qwen3_235b_a22b_peft_config 4 None 652

The corresponding pretrain configs (qwen3_30b_a3b_pretrain_config Line 46, qwen3_235b_a22b_pretrain_config Line 151) both set pipeline_dtype = torch.bfloat16.

Proposed fix
 # qwen3_30b_a3b_sft_config (line 259)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen3_235b_a22b_sft_config (line 381)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen3_235b_a22b_peft_config (line 652)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

Also applies to: 378-381, 649-652

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen3_moe.py` around lines 257 - 259, In the
SFT/PEFT config functions qwen3_30b_a3b_sft_config, qwen3_235b_a22b_sft_config,
and qwen3_235b_a22b_peft_config ensure that when
cfg.model.pipeline_model_parallel_size > 1 you set cfg.model.pipeline_dtype =
torch.bfloat16 (instead of None); mirror the pretrain configs
(qwen3_30b_a3b_pretrain_config and qwen3_235b_a22b_pretrain_config) which set
pipeline_dtype=torch.bfloat16 for multi-stage PP so the pipeline uses bfloat16
when pipeline parallelism is enabled.

# limitations under the License.

import os
from typing import Union

🛠️ Refactor suggestion | 🟠 Major

Use str | PEFT instead of Union[str, PEFT].

Same guidelines violation as in gpt_oss.py. Remove from typing import Union and use the pipe syntax.

Proposed fix
-from typing import Union
-
 import torch
 def qwen3_next_80b_a3b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:

As per coding guidelines, "Use 'X | Y' for union types instead of 'Union[X, Y]'".

Also applies to: 279-281

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen3_next.py` at line 15, Remove the unused
"from typing import Union" import and update any type annotations that use
Union[str, PEFT] to the modern union syntax str | PEFT; search for occurrences
in this file (e.g., in the function/class signatures that reference PEFT) and
replace Union[str, PEFT] with str | PEFT so the code follows the guideline and
the typing remains correct.

Comment on lines +946 to +950
# Parallelism settings - needs more parallelism for 32B
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 2
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None

⚠️ Potential issue | 🟠 Major

Missing pipeline_dtype for PP=2 — will likely cause a runtime error.

The pretrain config for 32B (line 446) correctly sets pipeline_dtype = torch.bfloat16 with the comment # Required for PP > 1, but this SFT config sets pipeline_dtype = None despite also using pipeline_model_parallel_size = 2.

🐛 Proposed fix
     cfg.model.pipeline_model_parallel_size = 2
     cfg.model.pipeline_model_parallel_layout = None
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1
     cfg.model.virtual_pipeline_model_parallel_size = None
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 946 - 950, The SFT
config sets cfg.model.pipeline_model_parallel_size = 2 but leaves
cfg.model.pipeline_dtype = None which will cause a runtime error for PP>1;
update the SFT config so cfg.model.pipeline_dtype is set to torch.bfloat16 (same
as the pretrain 32B setting) or conditionally set cfg.model.pipeline_dtype =
torch.bfloat16 whenever cfg.model.pipeline_model_parallel_size > 1; adjust any
imports if needed and ensure the assignment occurs near the existing parallelism
settings (references: cfg.model.pipeline_model_parallel_size and
cfg.model.pipeline_dtype).

Signed-off-by: Abhishree <abhishreetm@gmail.com>
@athitten
Contributor Author

athitten commented Feb 7, 2026

/ok to test f90bcb4

Signed-off-by: Abhishree <abhishreetm@gmail.com>
@athitten
Contributor Author

athitten commented Feb 7, 2026

/ok to test df33dbd

