
Conversation

@HollowMan6
Contributor

@HollowMan6 HollowMan6 commented Dec 23, 2025

What does this PR do ?

Fix `shared_experts` layers when `moe_shared_expert_overlap` is enabled.

Changelog

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)


`shared_experts` are not actually sharded using ETP (expert tensor parallelism), so
they should be excluded from `is_expert_linear`.

In some modules (notably MoE `shared_experts` when `moe_shared_expert_overlap` is enabled),
Megatron disables TP-related communication on the base linear layer by
setting `parallel_mode=None` (TE) or `explicit_expert_comm=True` (legacy):
https://github.com/NVIDIA/Megatron-LM/blob/5b1ef0703184299fbf71f6131bf2f9a5331e7238/megatron/core/transformer/moe/shared_experts.py#L95-L104

This needs some special handling of `lin_out_gather_output` to keep the shapes matching.
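
For context, here is a minimal sketch of the kind of check this change implies. The attribute names `parallel_mode` and `explicit_expert_comm` come from the Megatron-LM code linked above; the function name and exact logic are illustrative assumptions, not the actual NeMo implementation.

```python
def base_linear_is_parallel(layer) -> bool:
    """Return False when Megatron has turned off TP communication on `layer`,
    as it does for MoE shared_experts when moe_shared_expert_overlap is enabled."""
    # TE path: shared_experts build the TE linear with parallel_mode=None,
    # which disables the usual column/row-parallel communication.
    if hasattr(layer, "parallel_mode") and layer.parallel_mode is None:
        return False
    # Legacy path: shared_experts set explicit_expert_comm=True and handle
    # the communication themselves.
    if getattr(layer, "explicit_expert_comm", False):
        return False
    # Otherwise treat the layer as a normal TP-sharded linear.
    return True
```

With such a flag available, shared-expert linears can be kept out of the expert-linear path, and the adapter can skip the gather on `lin_out_gather_output` for them so the output shapes still match.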

Signed-off-by: Hollow Man <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Dec 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HollowMan6 HollowMan6 mentioned this pull request Dec 23, 2025
adapter_name = local_param_name.removeprefix(local_base_prefix + ".adapter.").split(".")[0]
adapter = adapter[adapter_name]
- input_is_parallel, _, _, _, base_linear_is_parallel = get_adapter_attributes_from_linear(to_wrap)
+ input_is_parallel, _, _, _, _, base_linear_is_parallel = get_adapter_attributes_from_linear(to_wrap)
Contributor

Feels like this should return a dict or dataclass now, since there are many returned values and the number could increase in the future.

I will let this in. If you can, please file another PR; otherwise I will change it next week.
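
For illustration, a minimal sketch of what that suggestion could look like. Only `input_is_parallel` and `base_linear_is_parallel` appear by name in the diff above; the class name and the placeholder field comment below are assumptions, not the actual NeMo API.

```python
from dataclasses import dataclass


@dataclass
class AdapterLinearAttributes:
    input_is_parallel: bool
    base_linear_is_parallel: bool
    # ... the remaining attributes that are currently unpacked positionally as `_`


def get_adapter_attributes_from_linear(to_wrap) -> AdapterLinearAttributes:
    ...


# Call sites would then read named fields instead of counting tuple positions:
# attrs = get_adapter_attributes_from_linear(to_wrap)
# if attrs.input_is_parallel and attrs.base_linear_is_parallel:
#     ...
```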

@yaoyu-33
Contributor

/ok to test d05d42d

@yaoyu-33 yaoyu-33 enabled auto-merge (squash) December 26, 2025 08:18
@yaoyu-33 yaoyu-33 merged commit 54e60ba into NVIDIA-NeMo:main Dec 27, 2025
49 checks passed
@HollowMan6 HollowMan6 deleted the shared_experts branch December 27, 2025 14:07