[M4] feat: Add M4 end2end qwen3_vl example by shifangx · Pull Request #2117 · NVIDIA-NeMo/Megatron-Bridge

shifangx · 2026-01-29T03:56:40Z

What does this PR do ?

Add M4 end2end qwen3vl example

Pass pg_collection from Qwen3VLMoEModelProvider(or Qwen3VLModelProvider) to Qwen3VLModel.
pass pg_collection to track_moe_metrics for moe model, in train_utils.py.

How to Run

# 8 GPUs: EP8
uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

copy-pr-bot · 2026-01-29T03:56:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-01-29T04:07:18Z

📝 Walkthrough

Walkthrough

This PR threads a ProcessGroupCollection parameter through the Qwen3VL model initialization hierarchy, from provider level through the main model, language model, and transformer blocks. The parameter is stored as an instance attribute and propagated downward without altering core logic.

Changes

Cohort / File(s)	Summary
Model Hierarchy `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`, `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`, `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`	Added `pg_collection: ProcessGroupCollection` parameter to Qwen3VLModel and Qwen3VLGPTModel constructors; propagated through initialization chain to transformer blocks. Added import for ProcessGroupCollection type.
Model Provider `src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py`	Updated Qwen3VLModel constructor calls in both standard and MoE provider paths to pass `pg_collection=self._pg_collection`.
Training Utilities `src/megatron/bridge/training/utils/train_utils.py`	Passed `pg_collection` argument to `track_moe_metrics()` function call in training logging.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

M4: add pg_collection into setup and wire into train.py #1062: Wires ProcessGroupCollection through global state and training setup, complementing this PR's model-level propagation of the same parameter.
[M4] feat: Add M4 end2end support and qwen3 examples #2011: Introduces and propagates ProcessGroupCollection through Qwen3VL model constructors using matching parameter-threading patterns.

Suggested reviewers

skyw
cuichenx
ananthsub

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (3 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes	⚠️ Warning	PR adds M4 end-to-end Qwen3VL example with hundreds of files, but PR description lacks test results, performance metrics, or regression testing documentation required for major changes.	Update PR description to include test results summary, relevant performance benchmarks for the M4 example, and confirmation of no regressions introduced.
Title check	⚠️ Warning	The PR title '[M4] feat: Add M4 end2end qwen3_vl example' does not match the actual changeset, which focuses on adding ProcessGroupCollection support to Qwen models, not creating an end-to-end example.	Update the title to reflect the actual changes, such as '[M4] feat: Add ProcessGroupCollection support to Qwen3VL models' or similar to accurately describe the infrastructure changes being made.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/megatron/bridge/training/utils/train_utils.py (1)

571-584: Remove pg_collection parameter from track_moe_metrics() call. The function does not accept this parameter and will raise TypeError at runtime. The pg_collection parameter is for other Megatron-Core APIs like TransformerModel and get_megatron_optimizer, not for track_moe_metrics().

🤖 Fix all issues with AI agents

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`:
- Line 32: Remove the unused import ProcessGroupCollection from
megatron.core.process_groups_config in transformer_block.py to fix the F401 lint
error; alternatively, if it was intended for typing, add it to a type-only
import (e.g., from typing import TYPE_CHECKING / inside an if TYPE_CHECKING
block) and reference it in relevant function/class annotations such as any
transformer block constructor or method that would accept process group types.

🧹 Nitpick comments (2)

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
63-64: Use ProcessGroupCollection | None for nullable pg_collection.
♻️ Proposed fix
-        pg_collection: ProcessGroupCollection = None,
+        pg_collection: ProcessGroupCollection | None = None,
As per coding guidelines: Use 'T | None' for nullable types instead of 'Optional[T]'.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)
63-64: Use ProcessGroupCollection | None for nullable pg_collection.
♻️ Proposed fix
-        pg_collection: ProcessGroupCollection = None,
+        pg_collection: ProcessGroupCollection | None = None,
As per coding guidelines: Use 'T | None' for nullable types instead of 'Optional[T]'.

coderabbitai · 2026-01-29T04:07:21Z

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py

 from megatron.core.packed_seq_params import PackedSeqParams
 from megatron.core.transformer.transformer_block import TransformerBlock
 from megatron.core.utils import WrappedTensor, deprecate_inference_params, make_viewless_tensor
+from megatron.core.process_groups_config import ProcessGroupCollection


⚠️ Potential issue | 🟡 Minor

Remove unused ProcessGroupCollection import (F401).
Static analysis flags this as unused; drop it or wire it into type hints.

🧹 Proposed fix

-from megatron.core.process_groups_config import ProcessGroupCollection

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from megatron.core.process_groups_config import ProcessGroupCollection

🧰 Tools

🪛 Flake8 (7.3.0)

[error] 32-32: 'megatron.core.process_groups_config.ProcessGroupCollection' imported but unused

(F401)

🤖 Prompt for AI Agents

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py` at line 32, Remove the unused import ProcessGroupCollection from megatron.core.process_groups_config in transformer_block.py to fix the F401 lint error; alternatively, if it was intended for typing, add it to a type-only import (e.g., from typing import TYPE_CHECKING / inside an if TYPE_CHECKING block) and reference it in relevant function/class annotations such as any transformer block constructor or method that would accept process group types.

shifangx · 2026-01-29T05:56:59Z

/ok to test 603f213

shifangx · 2026-01-29T07:19:54Z

/ok to test 8fdb5a4

shifangx · 2026-01-29T10:31:00Z

/ok to test 49569e7

shifangx · 2026-01-29T11:08:11Z

/ok to test cc9b85f

Signed-off-by: conver334 <conver334@gmail.com>

coderabbitai bot reviewed Jan 29, 2026

View reviewed changes

shifangx changed the title ~~[M4] feat: Add M4 end2end qwen3vl example~~ [M4] feat: Add M4 end2end qwen3_vl example Jan 29, 2026

shifangx force-pushed the shifang/qwen3_vl_perf_m4 branch 3 times, most recently from 2cce048 to ee07d70 Compare January 29, 2026 05:48

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 05:57 Inactive

copy-pr-bot bot temporarily deployed to test January 29, 2026 05:57 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 06:08 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 29, 2026 06:15 Failure

[M4] feat: Add M4 end2end qwen3vl example

8fdb5a4

shifangx force-pushed the shifang/qwen3_vl_perf_m4 branch from 22efc7d to 8fdb5a4 Compare January 29, 2026 07:19

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:20 Inactive

copy-pr-bot bot temporarily deployed to test January 29, 2026 07:20 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 07:28 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 29, 2026 07:36 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 10:31 Inactive

copy-pr-bot bot had a problem deploying to test January 29, 2026 10:31 Error

for ci test

cc9b85f

shifangx force-pushed the shifang/qwen3_vl_perf_m4 branch from 49569e7 to cc9b85f Compare January 29, 2026 11:07

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 11:08 Inactive

copy-pr-bot bot temporarily deployed to test January 29, 2026 11:09 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 12:40 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 12:48 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 29, 2026 12:58 Inactive

yaoyu-33 approved these changes Jan 29, 2026

View reviewed changes

yaoyu-33 merged commit 194ab6c into NVIDIA-NeMo:main Jan 29, 2026
49 checks passed

conver334 pushed a commit to conver334/Megatron-Bridge that referenced this pull request Jan 30, 2026

[M4] feat: Add M4 end2end qwen3_vl example (NVIDIA-NeMo#2117)

5f159a9

Signed-off-by: conver334 <conver334@gmail.com>

This was referenced Feb 8, 2026

[Deprecated] Adding CUDA Graph Support for Vision Encoder #2274

Closed

M4 leftover for QWen3-VL with MCore vision encoder #2370

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[M4] feat: Add M4 end2end qwen3_vl example#2117

[M4] feat: Add M4 end2end qwen3_vl example#2117
yaoyu-33 merged 2 commits intoNVIDIA-NeMo:mainfrom
shifangx:shifang/qwen3_vl_perf_m4

shifangx commented Jan 29, 2026 •

edited

Loading

Uh oh!

copy-pr-bot bot commented Jan 29, 2026

Uh oh!

coderabbitai bot commented Jan 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shifangx commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Add M4 end2end qwen3vl example

How to Run

Changelog

GitHub Actions CI

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot bot commented Jan 29, 2026

Uh oh!

coderabbitai bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

shifangx commented Jan 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shifangx commented Jan 29, 2026 •

edited

Loading

coderabbitai bot commented Jan 29, 2026 •

edited

Loading