
Add refactored recipes for finetuning #2268

Open

athitten wants to merge 32 commits into main from athitten/sft_recipe_refactor

Conversation

@athitten
Contributor

@athitten athitten commented Feb 6, 2026

What does this PR do ?

  1. Splits the finetuning recipes for all models in src/megatron/bridge/recipes into separate _sft_config and _peft_config functions.
  2. Extends the parameterless API introduced in PR Add refactored recipe files for pretrain configs of LLMs #2067 to the SFT and PEFT recipes (see the usage sketch below).
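
For illustration, usage of the new parameterless entry points looks roughly like this (a hedged sketch: llama32_1b_peft_config is assumed by analogy with llama32_1b_sft_config, which is referenced later in this page, and the checkpoint path is a placeholder):

# Hedged usage sketch of the parameterless recipes; the customization fields
# shown are ones quoted in the review below.
from megatron.bridge.recipes.llama.llama3 import llama32_1b_peft_config, llama32_1b_sft_config

cfg = llama32_1b_sft_config()  # no arguments: returns a fully populated ConfigContainer
cfg.checkpoint.pretrained_checkpoint = "/path/to/pretrained/ckpt"  # placeholder path
cfg.model.tensor_model_parallel_size = 1

peft_cfg = llama32_1b_peft_config()  # defaults to LoRA via default_peft_config("lora")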

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added SFT (Supervised Fine-Tuning) configuration functions for multiple model families across recipes.
    • Introduced PEFT (Parameter-Efficient Fine-Tuning) configuration builders supporting LoRA and other efficient fine-tuning methods.
  • Refactor

    • Unified model configuration interfaces for improved consistency across recipes.
    • Simplified configuration API for easier model setup and customization.

Signed-off-by: Abhishree <abhishreetm@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

This PR refactors recipe configurations across multiple model families by introducing centralized helper functions (_sft_common(), _peft_common()) in common.py and systematically replacing legacy finetune-based config builders with new SFT and PEFT-oriented factory functions.
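
As a rough sketch of that pattern (the helper import path follows the file location listed below; whether _sft_common() takes arguments is an assumption, and the per-model fields are ones quoted elsewhere in this review):

# Illustrative only: shared defaults first, then the model-specific delta.
from megatron.bridge.recipes.common import _sft_common  # assumed signature: no required args

def example_model_sft_config():
    cfg = _sft_common()                        # dataset, optimizer, scheduler, logging, checkpoint defaults
    cfg.model.tensor_model_parallel_size = 1   # per-model parallelism delta
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.sequence_parallel = False
    return cfg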

Changes

  • Common Recipe Helpers (src/megatron/bridge/recipes/common.py): Added _sft_common() and _peft_common() helper functions returning ConfigContainer with base SFT- and PEFT-specific defaults, including dataset, optimizer, scheduler, logging, checkpointing, and LoRA configuration.
  • Gemma Recipe Refactoring (src/megatron/bridge/recipes/gemma/gemma2.py, gemma3.py): Replaced finetune-based configs with new SFT/PEFT factories for model sizes (2B, 9B, 27B for Gemma2; 1B for Gemma3); removed TypedDict-based kwargs; introduced _adjust_gemma_vocab_size() helpers for tokenizer alignment.
  • LLM OSS Recipe Refactoring (src/megatron/bridge/recipes/gpt_oss/gpt_oss.py, src/megatron/bridge/recipes/llama/llama3.py): Systematically replaced finetune configs with SFT/PEFT factories across GPT-OSS (20B, 120B) and Llama variants (Llama 3.2 1B/3B, Llama 3 8B/70B, Llama 3.1 8B/70B/405B); removed legacy finetune TypedDicts; extended the public API with multiple SFT/PEFT entry points.
  • GLM Recipe Updates (src/megatron/bridge/recipes/glm/glm45.py): Refactored GLM-4.5 variants (355B, AIR-106B) to use the _pretrain_common, _sft_common, and _peft_common base builders; replaced TypedDict definitions with direct ConfigContainer construction and customization.
  • Qwen Recipe Consolidation (src/megatron/bridge/recipes/qwen/qwen2.py, qwen3.py, qwen3_moe.py, qwen3_next.py): Introduced extensive SFT/PEFT config factories for Qwen2/2.5 (500M–72B), Qwen3 (600M–32B), Qwen3-MoE (30B-A3B, 235B-A22B), and Qwen3-Next variants; removed finetune-based builders and legacy TypedDict argument wrappers.
  • Nemotron Recipe Modernization (src/megatron/bridge/recipes/nemotron/nemotron_3_nano.py, nemotron_nano_v2.py, nemotronh.py): Refactored all Nemotron variants (3-Nano, Nano-V2 9B/12B, NemotronH 4B–56B) to use the centralized SFT/PEFT helpers; removed finetune-specific config functions and TypedDict definitions; updated __all__ exports to include the new SFT/PEFT entry points.
  • OLMoE & Moonlight Recipe Updates (src/megatron/bridge/recipes/olmoe/olmoe_7b.py, src/megatron/bridge/recipes/moonlight/moonlight_16b.py): Added new SFT/PEFT config factories for OLMoE-7B and Moonlight-16B; consolidated configuration via the base helpers instead of inline TypedDict definitions; maintained model provider and MoE-specific settings in the new builders.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

r0.3.0

Suggested reviewers

  • yaoyu-33
  • ko3n1g
  • ananthsub
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: PR with major refactoring affecting 16+ recipe files lacks the required test results, convergence validation, and performance metrics documentation before marking ready for review. Resolution: run the functional test suite for all affected models and document results confirming successful training and convergence for the SFT and PEFT paths.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: The title clearly and concisely describes the main change: refactoring and adding new recipes for finetuning with SFT and PEFT separation.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/recipes/common.py`:
- Around line 297-304: The PEFT base config validation fails because
CheckpointConfig.pretrained_checkpoint is set to None while
ConfigContainer.validate() requires it when peft is enabled; update the
CheckpointConfig initialization in the recipes/common.py snippet (the
checkpoint=CheckpointConfig(...) block) to default pretrained_checkpoint to a
non-null placeholder (for example use the same checkpoint_dir or checkpoint.load
value) or ensure the config helper explicitly sets
checkpoint.pretrained_checkpoint when peft is set so ConfigContainer.validate()
passes.

In `@src/megatron/bridge/recipes/gemma/gemma3.py`:
- Around line 331-382: The DDP config contradicts the optimizer config: PEFT
sets cfg.optimizer.use_distributed_optimizer = False but
cfg.ddp.use_distributed_optimizer remains True; update the DDP setting to match
the optimizer by setting cfg.ddp.use_distributed_optimizer to the optimizer
value (e.g., cfg.ddp.use_distributed_optimizer =
cfg.optimizer.use_distributed_optimizer) or explicitly to False so distributed
optimizer is disabled consistently (refer to
cfg.optimizer.use_distributed_optimizer and cfg.ddp.use_distributed_optimizer).
- Around line 134-145: The function _adjust_gemma3_vocab_size currently calls
AutoTokenizer.from_pretrained without handling load errors and lacks type hints;
update its signature to include type hints (e.g., model_cfg: Any or a more
specific config type, hf_path: str) and an explicit return type (-> None), wrap
the tokenizer load and length check in a try/except that catches Exception
around AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True) and on
failure simply log or silently skip the vocab adjustment (do not re-raise), then
only adjust model_cfg.vocab_size when tokenizer is successfully loaded and its
length is larger; reference symbols: _adjust_gemma3_vocab_size, model_cfg,
hf_path, AutoTokenizer.from_pretrained.

In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py`:
- Line 15: Replace the old typing Union usage by removing the redundant import
"from typing import Union" and switching all occurrences of Union[str, PEFT] to
the new Python 3.10 union syntax "str | PEFT"; update the import list
accordingly (remove Union) and change the type annotations in gpt_oss.py where
found (around the import area and the annotations near the symbols referenced at
lines ~509-511 and ~640-642) so type hints use "str | PEFT" instead of
Union[str, PEFT].

In `@src/megatron/bridge/recipes/llama/llama3.py`:
- Around line 1011-1013: The assignment cfg.mixed_precision = bf16_mixed() is
dead code because it appears after return in llama31_8b_pretrain_config; remove
that line (or if you intended to override the default, move the assignment
before the return). Prefer removing the line to match other 8B configs that rely
on _pretrain_common()'s default, unless you specifically want to force
bf16_mixed().

In `@src/megatron/bridge/recipes/qwen/qwen2.py`:
- Around line 1182-1189: In the SFT configs (e.g., qwen2_72b_sft_config,
qwen25_32b_sft_config, qwen25_72b_sft_config) you left cfg.model.pipeline_dtype
= None while setting cfg.model.pipeline_model_parallel_size > 1; update those
config objects to set cfg.model.pipeline_dtype = torch.bfloat16 (same as
qwen2_72b_pretrain_config) whenever pipeline_model_parallel_size > 1 to match
the pretrain requirement and avoid runtime misconfiguration.

In `@src/megatron/bridge/recipes/qwen/qwen3_moe.py`:
- Around line 257-259: In the SFT/PEFT config functions
qwen3_30b_a3b_sft_config, qwen3_235b_a22b_sft_config, and
qwen3_235b_a22b_peft_config ensure that when
cfg.model.pipeline_model_parallel_size > 1 you set cfg.model.pipeline_dtype =
torch.bfloat16 (instead of None); mirror the pretrain configs
(qwen3_30b_a3b_pretrain_config and qwen3_235b_a22b_pretrain_config) which set
pipeline_dtype=torch.bfloat16 for multi-stage PP so the pipeline uses bfloat16
when pipeline parallelism is enabled.

In `@src/megatron/bridge/recipes/qwen/qwen3_next.py`:
- Line 15: Remove the unused "from typing import Union" import and update any
type annotations that use Union[str, PEFT] to the modern union syntax str |
PEFT; search for occurrences in this file (e.g., in the function/class
signatures that reference PEFT) and replace Union[str, PEFT] with str | PEFT so
the code follows the guideline and the typing remains correct.

In `@src/megatron/bridge/recipes/qwen/qwen3.py`:
- Around line 946-950: The SFT config sets
cfg.model.pipeline_model_parallel_size = 2 but leaves cfg.model.pipeline_dtype =
None which will cause a runtime error for PP>1; update the SFT config so
cfg.model.pipeline_dtype is set to torch.bfloat16 (same as the pretrain 32B
setting) or conditionally set cfg.model.pipeline_dtype = torch.bfloat16 whenever
cfg.model.pipeline_model_parallel_size > 1; adjust any imports if needed and
ensure the assignment occurs near the existing parallelism settings (references:
cfg.model.pipeline_model_parallel_size and cfg.model.pipeline_dtype).
🟡 Minor comments (16)
src/megatron/bridge/recipes/llama/llama3.py-1222-1237 (1)

1222-1237: ⚠️ Potential issue | 🟡 Minor

Parallelism settings are assigned after the CP-dependent dataset check.

At line 1227, cfg.model.context_parallel_size is read (from whatever AutoBridge defaults to), and then at line 1236, it's explicitly set to 1. If AutoBridge ever defaults CP to something other than 1, pad_seq_to_mult would be incorrectly configured. This ordering appears across all Llama SFT and PEFT configs.

Consider moving the parallelism block (lines 1230–1237) before the packed sequence check (lines 1226–1228) so the CP check reads the intended value.

Proposed reordering (showing llama32_1b_sft_config)
     cfg.dataset.packed_sequence_specs.packed_sequence_size = 4096

-    # Packed sequence settings
-    if cfg.model.context_parallel_size > 1:
-        cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
-
     # Parallelism settings
     cfg.model.pipeline_model_parallel_layout = None
     cfg.model.tensor_model_parallel_size = 1
     cfg.model.pipeline_model_parallel_size = 1
     cfg.model.pipeline_dtype = None
     cfg.model.virtual_pipeline_model_parallel_size = None
     cfg.model.context_parallel_size = 1
     cfg.model.sequence_parallel = False

+    # Packed sequence settings
+    if cfg.model.context_parallel_size > 1:
+        cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
+
src/megatron/bridge/recipes/qwen/qwen3_next.py-22-22 (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Unused import: default_peft_config.

default_peft_config is imported but never used since qwen3_next_80b_a3b_peft_config raises NotImplementedError without calling it. This is also flagged by Flake8 (F401).

Proposed fix
-from megatron.bridge.recipes.utils.finetune_utils import default_peft_config, default_squad_config
+from megatron.bridge.recipes.utils.finetune_utils import default_squad_config
src/megatron/bridge/recipes/gpt_oss/gpt_oss.py-282-293 (1)

282-293: ⚠️ Potential issue | 🟡 Minor

CP guard is checked before context_parallel_size is explicitly assigned.

At Line 282, cfg.model.context_parallel_size is read from the AutoBridge-loaded model provider defaults. The explicit assignment cfg.model.context_parallel_size = 1 doesn't happen until Line 291 — nine lines later. If AutoBridge returns a model with CP > 1, the guard would fire and set pad_seq_to_mult, but then CP gets overridden to 1 anyway, leaving a stale padding config.

The same issue exists in gpt_oss_120b_sft_config (Lines 404 vs 413) and both PEFT configs (Lines 540 vs 549, Lines 671 vs 680).

Move the parallelism assignments before the CP guard to match the pattern used in the Qwen recipes, or simply assign CP first.

Proposed fix (showing 20b SFT)
+    # Parallelism settings (MoE-specific)
+    cfg.model.pipeline_model_parallel_layout = None
+    cfg.model.tensor_model_parallel_size = 1
+    cfg.model.pipeline_model_parallel_size = 1
+    cfg.model.pipeline_dtype = torch.bfloat16
+    cfg.model.virtual_pipeline_model_parallel_size = None
+    cfg.model.context_parallel_size = 1
+    cfg.model.expert_model_parallel_size = 8
+    cfg.model.sequence_parallel = False
+
     # Packed sequence settings
     if cfg.model.context_parallel_size > 1:
         cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
 
-    # Parallelism settings (MoE-specific)
-    cfg.model.pipeline_model_parallel_layout = None
-    cfg.model.tensor_model_parallel_size = 1
-    ...
src/megatron/bridge/recipes/qwen/qwen3.py-1486-1510 (1)

1486-1510: ⚠️ Potential issue | 🟡 Minor

32B PEFT with TP=1, PP=1 may be tight on 80 GB GPUs.

The 32B model requires ~64 GB in bf16 for base weights alone. With TP=1 / PP=1 each GPU must hold the full model replica, leaving very little headroom for activations on 80 GB devices (A100/H100). The pretrain and SFT 32B configs both use TP=8 / PP=2.

Consider documenting a minimum VRAM requirement or providing an alternate config with higher TP for users on 80 GB hardware.
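
For reference, the ~64 GB figure follows directly from bf16 storage at 2 bytes per parameter:

# Back-of-envelope check (illustrative)
params = 32e9
bytes_per_param = 2  # bf16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~64 GB for base weights alone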

src/megatron/bridge/recipes/gemma/gemma2.py-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Remove unused CommOverlapConfig import.

Flake8 flags it as unused.

🛠️ Suggested fix
-from megatron.bridge.training.comm_overlap import CommOverlapConfig
src/megatron/bridge/recipes/moonlight/moonlight_16b.py-519-526 (1)

519-526: ⚠️ Potential issue | 🟡 Minor

Sort __all__ to satisfy RUF022.

🛠️ Suggested fix
-__all__ = [
-    # Pretrain config
-    "moonlight_16b_pretrain_config",
-    # SFT config
-    "moonlight_16b_sft_config",
-    # PEFT config
-    "moonlight_16b_peft_config",
-]
+__all__ = [
+    "moonlight_16b_peft_config",
+    "moonlight_16b_pretrain_config",
+    "moonlight_16b_sft_config",
+]
src/megatron/bridge/recipes/gemma/gemma3.py-25-25 (1)

25-25: ⚠️ Potential issue | 🟡 Minor

Remove unused CommOverlapConfig import.

Flake8 flags it as unused.

🛠️ Suggested fix
-from megatron.bridge.training.comm_overlap import CommOverlapConfig
src/megatron/bridge/recipes/moonlight/moonlight_16b.py-450-455 (1)

450-455: ⚠️ Potential issue | 🟡 Minor

Validate PEFT schemes instead of assigning raw strings.

If a caller passes an unsupported string (e.g., "none"), cfg.peft becomes a str and PEFT application will break. default_peft_config already normalizes/validates string inputs.

🛠️ Suggested fix
-    if isinstance(peft_scheme, str) and peft_scheme.lower() in ["lora", "dora"]:
-        cfg.peft = default_peft_config(peft_scheme)
-    else:
-        cfg.peft = peft_scheme
+    cfg.peft = default_peft_config(peft_scheme)
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py-466-473 (1)

466-473: ⚠️ Potential issue | 🟡 Minor

Sort __all__ to satisfy RUF022.

🛠️ Suggested fix
-__all__ = [
-    # Pretrain config
-    "nemotron_3_nano_pretrain_config",
-    # SFT config
-    "nemotron_3_nano_sft_config",
-    # PEFT config
-    "nemotron_3_nano_peft_config",
-]
+__all__ = [
+    "nemotron_3_nano_peft_config",
+    "nemotron_3_nano_pretrain_config",
+    "nemotron_3_nano_sft_config",
+]
src/megatron/bridge/recipes/olmoe/olmoe_7b.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 union syntax instead of Union[X, Y].

Remove the Union import from line 15 and replace Union[str, PEFT] with str | PEFT in the peft_scheme parameter at line 331.

🛠️ Suggested fix
-from typing import Union
-
@@
-def olmoe_7b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def olmoe_7b_peft_config(
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 union types for peft_scheme.

Line 15 imports Union from typing, but the codebase requires Python 3.10+ (per coding guidelines), which supports PEP 604 union syntax (X | Y). Remove the Union import and update the type annotation on line 307 to use str | PEFT instead of Union[str, PEFT].

🛠️ Suggested fix
-from typing import Union
-
@@
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
src/megatron/bridge/recipes/gemma/gemma2.py-16-17 (1)

16-17: ⚠️ Potential issue | 🟡 Minor

Replace Union[str, PEFT] with str | PEFT for all three PEFT builder functions and remove the Union import.

All three PEFT config functions (gemma2_2b_peft_config, gemma2_9b_peft_config, gemma2_27b_peft_config) use Union[str, PEFT] which should be updated to PEP 604 union syntax per coding guidelines. Remove the from typing import Union import at line 16 once these changes are applied.

🛠️ Suggested fixes
-from typing import Union
-
 import torch
 def gemma2_2b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
 def gemma2_9b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
 def gemma2_27b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/gemma/gemma3.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 unions for peft_scheme.

The Union[str, PEFT] type hint at line 265 should use the PEP 604 syntax str | PEFT. After this change, the Union import at line 16 can be removed since it will no longer be used.

🛠️ Suggested fix
-from typing import Union
-
 import torch
-def gemma3_1b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def gemma3_1b_peft_config(
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/glm/glm45.py-536-538 (1)

536-538: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 syntax for union types instead of Union.

Replace Union[str, PEFT] with str | PEFT at lines 537 and 673 to conform to Python 3.10+ standards and coding guidelines.

🛠️ Suggested fix
 def glm45_355b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
 
 def glm45_air_106b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:
src/megatron/bridge/recipes/moonlight/moonlight_16b.py-15-16 (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 union types instead of Union.

Switch peft_scheme to str | PEFT and remove the Union import from line 15.

🛠️ Suggested fix
-from typing import Union
-
@@
-def moonlight_16b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def moonlight_16b_peft_config(
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:

Per coding guidelines: Use 'X | Y' for union types instead of 'Union[X, Y]'.

src/megatron/bridge/recipes/gemma/gemma2.py-283-299 (1)

283-299: ⚠️ Potential issue | 🟡 Minor

Narrow exception handling and add type hints.

Catching Exception is too broad and hides real errors. The coding guidelines require type hints for all function arguments and return types, and limiting except clauses to the smallest set of errors. Per the transformers library, AutoTokenizer.from_pretrained() raises OSError (file/authentication issues) and ValueError (config issues). Restructure the try-except to handle only these specific exceptions and move the vocab size adjustment outside the try block to follow the duck-typing pattern.

🛠️ Suggested fix
+import logging
@@
-def _adjust_gemma2_vocab_size(model_cfg, hf_path: str):
+def _adjust_gemma2_vocab_size(model_cfg: object, hf_path: str) -> None:
@@
-        try:
-            from transformers import AutoTokenizer
-
-            tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
-            if len(tokenizer) > model_cfg.vocab_size:
-                model_cfg.vocab_size = len(tokenizer)
-        except Exception:
-            # Skip vocab size adjustment if tokenizer cannot be loaded
-            # (e.g., due to missing HuggingFace authentication)
-            pass
+        try:
+            from transformers import AutoTokenizer
+
+            tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
+        except (OSError, ValueError) as exc:
+            logging.warning("Skipping Gemma2 vocab resize for %s: %s", hf_path, exc)
+            return
+
+        if len(tokenizer) > model_cfg.vocab_size:
+            model_cfg.vocab_size = len(tokenizer)
🧹 Nitpick comments (7)
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)

1-31: Imports and header look good.

The file structure, copyright header, and import organization follow the project conventions.

One minor note: per coding guidelines for Python 3.10+, Union[str, PEFT] (used in PEFT function signatures) could be written as str | PEFT. This is entirely optional given the "Chill" review mode and the consistency with the broader PR convention.

src/megatron/bridge/recipes/nemotronh/nemotronh.py (1)

15-31: Imports and structure are clean.

Same minor note as the Nano v2 file regarding Union vs str | PEFT for Python 3.10+, applicable to all PEFT function signatures in this file.

src/megatron/bridge/recipes/qwen/qwen2.py (2)

906-909: Dead code: context_parallel_size > 1 guard is always false.

In every SFT and PEFT config function, context_parallel_size is hardcoded to 1 a few lines above the guard. The if cfg.model.context_parallel_size > 1 block (e.g., Lines 908-909) will never execute. This pattern repeats across all ~20 new functions in this file.

If this is a user-documentation hint ("change CP here and the padding adjusts"), consider adding a brief inline comment to clarify intent. Otherwise it's misleading dead code.


876-2863: Extreme boilerplate duplication across SFT/PEFT config functions.

Every SFT function (~10 of them) and every PEFT function (~10 of them) in this file are nearly identical — only the HF model ID, tokenizer path, and parallelism settings differ. The shared body (TE, CUDA graph, kernel selection, memory saving, FP8 comments, optimizer precision, checkpoint, DDP) is copy-pasted verbatim and accounts for ~60-70 lines per function.

Consider extracting a _qwen2_sft_common(cfg) and _qwen2_peft_common(cfg) helper that applies the shared knobs, so each public config function only declares the model-specific delta (~15-20 lines each). This would cut ~1500 lines of duplicated code.
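
A minimal sketch of that extraction (the helper name comes from the suggestion above; its parameters and exactly which knobs it owns are assumptions, and the field names are taken from the snippets quoted in this review):

def _qwen2_sft_common(cfg, tp: int = 1, pp: int = 1):
    """Apply the knobs shared by every Qwen2/2.5 SFT recipe, then the parallelism delta."""
    cfg.model.transformer_impl = "transformer_engine"
    cfg.model.cuda_graph_impl = "none"
    cfg.checkpoint.save_interval = 100
    cfg.checkpoint.ckpt_format = "torch_dist"
    cfg.ddp.overlap_grad_reduce = True
    cfg.ddp.overlap_param_gather = True
    cfg.model.tensor_model_parallel_size = tp
    cfg.model.pipeline_model_parallel_size = pp
    return cfg

Each public recipe function would then shrink to the shared base builder plus its model-specific delta (HF model id, tokenizer path, TP/PP).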

src/megatron/bridge/recipes/gpt_oss/gpt_oss.py (1)

255-768: Significant SFT/PEFT boilerplate duplication, similar to qwen2.py.

The 20B and 120B SFT/PEFT configs share ~90% of their body (TE, CUDA graph, kernel, memory saving, FP8, optimizer precision, checkpoint, DDP, MoE settings). A shared _gpt_oss_finetune_common(cfg) helper applying these shared knobs would reduce duplication and maintenance burden.

src/megatron/bridge/recipes/qwen/qwen3_moe.py (1)

529-532: PEFT LoRA customization (dim=8, alpha=16) is applied conditionally — custom PEFT instances bypass it silently.

When a user passes a custom PEFT instance (not a string), default_peft_config returns it as-is, and the isinstance(peft_scheme, str) guard at Line 529 skips the MoE-specific dim/alpha/target_modules overrides. This may be intentional (user supplies a fully configured PEFT), but the docstring mentions "LoRA/DoRA uses dim=8, alpha=16" without noting that custom instances won't get these overrides.

Consider adding a brief note to the docstring or a log warning if the user's custom PEFT doesn't match the expected MoE LoRA config.

Also applies to: 668-672
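
One possible shape for such a warning (hedged: the dim/alpha attribute names on the PEFT instance are assumptions, hence the defensive getattr calls):

import logging

if not isinstance(peft_scheme, str):
    if getattr(peft_scheme, "dim", None) != 8 or getattr(peft_scheme, "alpha", None) != 16:
        logging.warning(
            "Custom PEFT instance does not match the MoE LoRA defaults (dim=8, alpha=16); "
            "using the user-supplied configuration as-is."
        )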

src/megatron/bridge/recipes/qwen/qwen3.py (1)

502-1578: Consider extracting shared SFT/PEFT boilerplate into a helper.

The 12 new config functions share ~90% identical code (CUDA graph settings, FP8 comments, optimizer precision, kernel selections, DDP config, checkpoint config, memory saving defaults). They differ only in model name, TP/PP, and recompute for 32B.

A small helper like _qwen3_sft(model_id, tp, pp, *, recompute=False, pipeline_dtype=None) could consolidate this, leaving each public function as a 1–3 line wrapper. This would also reduce the risk of per-variant inconsistencies (e.g., the pipeline_dtype omission in the 32B SFT config).

Not blocking — the explicit style is consistent with the pretrain configs and may be intentional for recipe discoverability.

Comment on lines +297 to +304
checkpoint=CheckpointConfig(
save_interval=100,
save=checkpoint_dir,
load=checkpoint_dir,
pretrained_checkpoint=None, # Set to load from pretrained weights
ckpt_format="torch_dist",
fully_parallel_save=True,
),

⚠️ Potential issue | 🟠 Major

PEFT base config fails validation without a pretrained checkpoint.

ConfigContainer.validate() asserts checkpoint.pretrained_checkpoint when peft is set; leaving it None makes the parameterless PEFT configs error unless callers remember to patch it. Consider defaulting it to a placeholder (e.g., checkpoint_dir or checkpoint.load) or explicitly wiring it via the helper.

🛠️ Suggested fix
         checkpoint=CheckpointConfig(
             save_interval=100,
             save=checkpoint_dir,
             load=checkpoint_dir,
-            pretrained_checkpoint=None,  # Set to load from pretrained weights
+            pretrained_checkpoint=checkpoint_dir,  # Override with actual pretrained weights path
             ckpt_format="torch_dist",
             fully_parallel_save=True,
         ),
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/common.py` around lines 297 - 304, The PEFT base
config validation fails because CheckpointConfig.pretrained_checkpoint is set to
None while ConfigContainer.validate() requires it when peft is enabled; update
the CheckpointConfig initialization in the recipes/common.py snippet (the
checkpoint=CheckpointConfig(...) block) to default pretrained_checkpoint to a
non-null placeholder (for example use the same checkpoint_dir or checkpoint.load
value) or ensure the config helper explicitly sets
checkpoint.pretrained_checkpoint when peft is set so ConfigContainer.validate()
passes.

Comment on lines +134 to +145
def _adjust_gemma3_vocab_size(model_cfg, hf_path: str):
"""Adjust vocab size for Gemma3 (model vocab < tokenizer vocab).

Note: This requires HuggingFace authentication for Gemma3 models.
If the tokenizer cannot be loaded, the vocab size adjustment is skipped.
"""
if hasattr(model_cfg, "vocab_size") and hf_path:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
if len(tokenizer) > model_cfg.vocab_size:
model_cfg.vocab_size = len(tokenizer)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the current state of the file
cat -n src/megatron/bridge/recipes/gemma/gemma3.py | head -150

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 6797


🏁 Script executed:

# Check imports at the top of the file
head -30 src/megatron/bridge/recipes/gemma/gemma3.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1360


🏁 Script executed:

# Search for any existing logging usage or logging imports in the file
rg "logging|import logging" src/megatron/bridge/recipes/gemma/gemma3.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if this function is called elsewhere in the codebase
rg "_adjust_gemma3_vocab_size" --type py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 358


🏁 Script executed:

# Search for similar error handling patterns in the codebase to understand project conventions
rg "AutoTokenizer.from_pretrained" --type py -A 5 -B 2 | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4903


Add error handling for tokenizer load failures and type hints.

The docstring states that vocab adjustment is skipped if the tokenizer cannot be loaded, but the code lacks exception handling and will raise instead. Additionally, the function is missing type hints for model_cfg and the return type, as required by coding guidelines.

🛠️ Suggested fix
+import logging
@@
-def _adjust_gemma3_vocab_size(model_cfg, hf_path: str):
+def _adjust_gemma3_vocab_size(model_cfg: object, hf_path: str) -> None:
@@
-    if hasattr(model_cfg, "vocab_size") and hf_path:
-        from transformers import AutoTokenizer
-
-        tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
-        if len(tokenizer) > model_cfg.vocab_size:
-            model_cfg.vocab_size = len(tokenizer)
+    if hasattr(model_cfg, "vocab_size") and hf_path:
+        try:
+            from transformers import AutoTokenizer
+
+            tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
+        except (OSError, ValueError) as exc:
+            logging.warning("Skipping Gemma3 vocab resize for %s: %s", hf_path, exc)
+            return
+
+        if len(tokenizer) > model_cfg.vocab_size:
+            model_cfg.vocab_size = len(tokenizer)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gemma/gemma3.py` around lines 134 - 145, The
function _adjust_gemma3_vocab_size currently calls AutoTokenizer.from_pretrained
without handling load errors and lacks type hints; update its signature to
include type hints (e.g., model_cfg: Any or a more specific config type,
hf_path: str) and an explicit return type (-> None), wrap the tokenizer load and
length check in a try/except that catches Exception around
AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True) and on failure
simply log or silently skip the vocab adjustment (do not re-raise), then only
adjust model_cfg.vocab_size when tokenizer is successfully loaded and its length
is larger; reference symbols: _adjust_gemma3_vocab_size, model_cfg, hf_path,
AutoTokenizer.from_pretrained.

Comment on lines 331 to 382
# Optimizer config - PEFT uses higher learning rate
cfg.optimizer.adam_beta2 = 0.95
cfg.optimizer.use_precision_aware_optimizer = False
cfg.optimizer.main_grads_dtype = torch.float32
cfg.optimizer.main_params_dtype = torch.float32
cfg.optimizer.exp_avg_dtype = torch.float32
cfg.optimizer.exp_avg_sq_dtype = torch.float32
cfg.optimizer.use_distributed_optimizer = False # PEFT disables distributed optimizer

# Scheduler config - PEFT uses higher learning rate
cfg.scheduler.lr_warmup_iters = 10
cfg.scheduler.lr_decay_iters = 100
cfg.scheduler.max_lr = 1e-4

# TE (Transformer Engine)
cfg.model.transformer_impl = "transformer_engine"

# CUDA Graph
cfg.model.cuda_graph_impl = "none"
cfg.model.cuda_graph_scope = "full"
cfg.model.cuda_graph_warmup_steps = 3

# Kernel selections - PEFT disables cross_entropy_loss_fusion
cfg.model.attention_backend = None
cfg.model.cross_entropy_loss_fusion = False # PEFT disables this
cfg.model.cross_entropy_fusion_impl = "native" # Gemma3 uses native

# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None

# FP8 & MXFP8 (mixed_precision settings)
# Note: mixed_precision="bf16_mixed" is set in _peft_common as default
# These are defaults for FP8, enable them if using FP8 - FP8 is not enabled by default
# cfg.mixed_precision.fp8_recipe = "tensorwise" # default, uncomment to enable
# cfg.mixed_precision.fp8 = None # not enabled by default
# cfg.mixed_precision.fp8_param_gather = False # default
# cfg.mixed_precision.reuse_grad_buf_for_mxfp8_param_ag = False # default

# Checkpoint config
cfg.checkpoint.save_interval = 100
cfg.checkpoint.ckpt_format = "torch_dist"
cfg.checkpoint.fully_parallel_save = True

# DDP config
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
cfg.ddp.check_for_nan_in_grad = True
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.average_in_collective = True

⚠️ Potential issue | 🟠 Major

Align DDP distributed optimizer flag with optimizer config.

PEFT disables the distributed optimizer, but DDP still sets it to True, which can trip validation or re-enable distributed optimizer behavior unintentionally.

🛠️ Suggested fix
-    cfg.ddp.use_distributed_optimizer = True
+    cfg.ddp.use_distributed_optimizer = False
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gemma/gemma3.py` around lines 331 - 382, The DDP
config contradicts the optimizer config: PEFT sets
cfg.optimizer.use_distributed_optimizer = False but
cfg.ddp.use_distributed_optimizer remains True; update the DDP setting to match
the optimizer by setting cfg.ddp.use_distributed_optimizer to the optimizer
value (e.g., cfg.ddp.use_distributed_optimizer =
cfg.optimizer.use_distributed_optimizer) or explicitly to False so distributed
optimizer is disabled consistently (refer to
cfg.optimizer.use_distributed_optimizer and cfg.ddp.use_distributed_optimizer).


import os
from typing import List, Optional, Union
from typing import Union

🛠️ Refactor suggestion | 🟠 Major

Use str | PEFT instead of Union[str, PEFT].

The coding guidelines require X | Y for union types instead of Union[X, Y] (Python 3.10+). The from typing import Union import (Line 15) can then be removed.

Proposed fix
-from typing import Union
-
 import torch
-def gpt_oss_20b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
-) -> ConfigContainer:
+def gpt_oss_20b_peft_config(
+    peft_scheme: str | PEFT = "lora",
+) -> ConfigContainer:
-def gpt_oss_120b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+def gpt_oss_120b_peft_config(
+    peft_scheme: str | PEFT = "lora",

As per coding guidelines, "Use 'X | Y' for union types instead of 'Union[X, Y]'".

Also applies to: 509-511, 640-642

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/gpt_oss/gpt_oss.py` at line 15, Replace the old
typing Union usage by removing the redundant import "from typing import Union"
and switching all occurrences of Union[str, PEFT] to the new Python 3.10 union
syntax "str | PEFT"; update the import list accordingly (remove Union) and
change the type annotations in gpt_oss.py where found (around the import area
and the annotations near the symbols referenced at lines ~509-511 and ~640-642)
so type hints use "str | PEFT" instead of Union[str, PEFT].

Comment on lines 1011 to +1013
return cfg

cfg.mixed_precision = bf16_mixed()

⚠️ Potential issue | 🔴 Critical

Unreachable dead code: bf16_mixed() assignment after return.

Line 1013 sets cfg.mixed_precision = bf16_mixed() after return cfg on line 1012 inside llama31_8b_pretrain_config. This line will never execute. Either it should be moved before the return, or removed if the default from _pretrain_common() ("bf16_mixed") is sufficient.

Note: Other Llama pretrain configs (8B base, 1B, 3B) don't set mixed_precision explicitly and rely on the _pretrain_common default, while the 70B/405B variants explicitly use bf16_mixed(). If the intent matches the other 8B configs, just delete the dead line.

Proposed fix — remove the dead code
     cfg.ddp.data_parallel_sharding_strategy = "no_shard"
 
     return cfg
-
-    cfg.mixed_precision = bf16_mixed()
 
 def llama31_70b_pretrain_config() -> ConfigContainer:
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/llama/llama3.py` around lines 1011 - 1013, The
assignment cfg.mixed_precision = bf16_mixed() is dead code because it appears
after return in llama31_8b_pretrain_config; remove that line (or if you intended
to override the default, move the assignment before the return). Prefer removing
the line to match other 8B configs that rely on _pretrain_common()'s default,
unless you specifically want to force bf16_mixed().

Comment on lines +1182 to +1189
# Parallelism settings - needs more parallelism for 72B
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 4
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None
cfg.model.virtual_pipeline_model_parallel_size = None
cfg.model.context_parallel_size = 1
cfg.model.sequence_parallel = False

⚠️ Potential issue | 🔴 Critical

Missing pipeline_dtype for pipeline_model_parallel_size > 1 in SFT configs.

qwen2_72b_sft_config sets pipeline_model_parallel_size = 4 but leaves pipeline_dtype = None (Line 1186). The corresponding pretrain config (qwen2_72b_pretrain_config, Line 312) correctly sets pipeline_dtype = torch.bfloat16 with the comment "Required for PP > 1".

The same issue applies to:

  • qwen25_32b_sft_config (Line 1671, PP=2)
  • qwen25_72b_sft_config (Line 1767, PP=4)

This will likely cause a runtime error or silent misconfiguration when pipeline parallelism is active.

Proposed fix for all affected SFT configs
 # qwen2_72b_sft_config (line 1186)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen25_32b_sft_config (line 1671)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen25_72b_sft_config (line 1767)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen2.py` around lines 1182 - 1189, In the
SFT configs (e.g., qwen2_72b_sft_config, qwen25_32b_sft_config,
qwen25_72b_sft_config) you left cfg.model.pipeline_dtype = None while setting
cfg.model.pipeline_model_parallel_size > 1; update those config objects to set
cfg.model.pipeline_dtype = torch.bfloat16 (same as qwen2_72b_pretrain_config)
whenever pipeline_model_parallel_size > 1 to match the pretrain requirement and
avoid runtime misconfiguration.

Comment on lines +257 to +259
cfg.model.pipeline_model_parallel_size = 2
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None

⚠️ Potential issue | 🔴 Critical

Missing pipeline_dtype = torch.bfloat16 when pipeline_model_parallel_size > 1 in SFT/PEFT configs.

Same issue as in qwen2.py. The pretrain configs correctly set pipeline_dtype = torch.bfloat16 when PP > 1, but the new SFT/PEFT configs set pipeline_dtype = None:

Function PP pipeline_dtype Line
qwen3_30b_a3b_sft_config 2 None 259
qwen3_235b_a22b_sft_config 16 None 381
qwen3_235b_a22b_peft_config 4 None 652

The corresponding pretrain configs (qwen3_30b_a3b_pretrain_config Line 46, qwen3_235b_a22b_pretrain_config Line 151) both set pipeline_dtype = torch.bfloat16.

Proposed fix
 # qwen3_30b_a3b_sft_config (line 259)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen3_235b_a22b_sft_config (line 381)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

 # qwen3_235b_a22b_peft_config (line 652)
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1

Also applies to: 378-381, 649-652

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen3_moe.py` around lines 257 - 259, In the
SFT/PEFT config functions qwen3_30b_a3b_sft_config, qwen3_235b_a22b_sft_config,
and qwen3_235b_a22b_peft_config ensure that when
cfg.model.pipeline_model_parallel_size > 1 you set cfg.model.pipeline_dtype =
torch.bfloat16 (instead of None); mirror the pretrain configs
(qwen3_30b_a3b_pretrain_config and qwen3_235b_a22b_pretrain_config) which set
pipeline_dtype=torch.bfloat16 for multi-stage PP so the pipeline uses bfloat16
when pipeline parallelism is enabled.

# limitations under the License.

import os
from typing import Union

🛠️ Refactor suggestion | 🟠 Major

Use str | PEFT instead of Union[str, PEFT].

Same guidelines violation as in gpt_oss.py. Remove from typing import Union and use the pipe syntax.

Proposed fix
-from typing import Union
-
 import torch
 def qwen3_next_80b_a3b_peft_config(
-    peft_scheme: Union[str, PEFT] = "lora",
+    peft_scheme: str | PEFT = "lora",
 ) -> ConfigContainer:

As per coding guidelines, "Use 'X | Y' for union types instead of 'Union[X, Y]'".

Also applies to: 279-281

🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen3_next.py` at line 15, Remove the unused
"from typing import Union" import and update any type annotations that use
Union[str, PEFT] to the modern union syntax str | PEFT; search for occurrences
in this file (e.g., in the function/class signatures that reference PEFT) and
replace Union[str, PEFT] with str | PEFT so the code follows the guideline and
the typing remains correct.

Comment on lines +946 to +950
# Parallelism settings - needs more parallelism for 32B
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 2
cfg.model.pipeline_model_parallel_layout = None
cfg.model.pipeline_dtype = None

⚠️ Potential issue | 🟠 Major

Missing pipeline_dtype for PP=2 — will likely cause a runtime error.

The pretrain config for 32B (line 446) correctly sets pipeline_dtype = torch.bfloat16 with the comment # Required for PP > 1, but this SFT config sets pipeline_dtype = None despite also using pipeline_model_parallel_size = 2.

🐛 Proposed fix
     cfg.model.pipeline_model_parallel_size = 2
     cfg.model.pipeline_model_parallel_layout = None
-    cfg.model.pipeline_dtype = None
+    cfg.model.pipeline_dtype = torch.bfloat16  # Required for PP > 1
     cfg.model.virtual_pipeline_model_parallel_size = None
🤖 Prompt for AI Agents
In `@src/megatron/bridge/recipes/qwen/qwen3.py` around lines 946 - 950, The SFT
config sets cfg.model.pipeline_model_parallel_size = 2 but leaves
cfg.model.pipeline_dtype = None which will cause a runtime error for PP>1;
update the SFT config so cfg.model.pipeline_dtype is set to torch.bfloat16 (same
as the pretrain 32B setting) or conditionally set cfg.model.pipeline_dtype =
torch.bfloat16 whenever cfg.model.pipeline_model_parallel_size > 1; adjust any
imports if needed and ensure the assignment occurs near the existing parallelism
settings (references: cfg.model.pipeline_model_parallel_size and
cfg.model.pipeline_dtype).

Signed-off-by: Abhishree <abhishreetm@gmail.com>
@athitten
Contributor Author

athitten commented Feb 7, 2026

/ok to test f90bcb4

Signed-off-by: Abhishree <abhishreetm@gmail.com>
@athitten
Contributor Author

athitten commented Feb 7, 2026

/ok to test df33dbd

