
Conversation

@HollowMan6
Contributor

@HollowMan6 HollowMan6 commented Dec 23, 2025

What does this PR do ?

Fix `shared_experts` layers when `moe_shared_expert_overlap` is enabled.

Changelog

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)


`shared_experts` are not actually sharded using ETP (expert tensor parallelism), so
they should be excluded from `is_expert_linear`.

In some modules (notably MoE `shared_experts` when `moe_shared_expert_overlap` is enabled),
Megatron disables TP-related communication on the base linear layer by
setting `parallel_mode=None` (TE) or `explicit_expert_comm=True` (legacy):
https://github.com/NVIDIA/Megatron-LM/blob/5b1ef0703184299fbf71f6131bf2f9a5331e7238/megatron/core/transformer/moe/shared_experts.py#L95-L104

This needs some special handling of `lin_out_gather_output` to keep the shapes matching.
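
For context, here is a minimal sketch of the kind of check this change implies. The attribute names `parallel_mode` and `explicit_expert_comm` come from the Megatron-LM code linked above; the function name and exact logic are illustrative assumptions, not the actual NeMo implementation.

```python
def base_linear_is_parallel(layer) -> bool:
    """Return False when Megatron has turned off TP communication on `layer`,
    as it does for MoE shared_experts when moe_shared_expert_overlap is enabled."""
    # TE path: shared_experts build the TE linear with parallel_mode=None,
    # which disables the usual column/row-parallel communication.
    if hasattr(layer, "parallel_mode") and layer.parallel_mode is None:
        return False
    # Legacy path: shared_experts set explicit_expert_comm=True and handle
    # the communication themselves.
    if getattr(layer, "explicit_expert_comm", False):
        return False
    # Otherwise treat the layer as a normal TP-sharded linear.
    return True
```

With such a flag available, shared-expert linears can be kept out of the expert-linear path, and the adapter can skip the gather on `lin_out_gather_output` for them so the output shapes still match.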

Signed-off-by: Hollow Man <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Dec 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HollowMan6 HollowMan6 mentioned this pull request Dec 23, 2025
adapter_name = local_param_name.removeprefix(local_base_prefix + ".adapter.").split(".")[0]
adapter = adapter[adapter_name]
- input_is_parallel, _, _, _, base_linear_is_parallel = get_adapter_attributes_from_linear(to_wrap)
+ input_is_parallel, _, _, _, _, base_linear_is_parallel = get_adapter_attributes_from_linear(to_wrap)
Contributor

Feels like this should return a dict or dataclass now, since there are many returned values and the number could increase in the future.

I will let this in. If you can, please file another PR; otherwise I will change it next week.
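
For illustration, a minimal sketch of what that suggestion could look like. Only `input_is_parallel` and `base_linear_is_parallel` appear by name in the diff above; the class name and the placeholder field comment below are assumptions, not the actual NeMo API.

```python
from dataclasses import dataclass


@dataclass
class AdapterLinearAttributes:
    input_is_parallel: bool
    base_linear_is_parallel: bool
    # ... the remaining attributes that are currently unpacked positionally as `_`


def get_adapter_attributes_from_linear(to_wrap) -> AdapterLinearAttributes:
    ...


# Call sites would then read named fields instead of counting tuple positions:
# attrs = get_adapter_attributes_from_linear(to_wrap)
# if attrs.input_is_parallel and attrs.base_linear_is_parallel:
#     ...
```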

@yaoyu-33
Contributor

/ok to test d05d42d

@yaoyu-33 yaoyu-33 enabled auto-merge (squash) December 26, 2025 08:18
@yaoyu-33 yaoyu-33 merged commit 54e60ba into NVIDIA-NeMo:main Dec 27, 2025
49 checks passed
@HollowMan6 HollowMan6 deleted the shared_experts branch December 27, 2025 14:07