allow expert_parallel wrapper to handle kwargs #1620

rakkit · 2025-08-22T00:02:39Z

Currently, all MoE models rely on the same forward logic (_run_experts_for_loop or _run_experts_grouped_mm), which is hardcoded to use SwiGLU.

This PR allows the expert_parallel wrapper to accept additional arguments (kwargs), giving users the flexibility to define custom expert models. For example, users could specify a different activation function and implement their own forward function while still reusing the upstream expert_parallel logic:

@expert_parallel
def custom_experts_grouped_mm(
    w1: torch.Tensor,
    w2: torch.Tensor,
    w3: torch.Tensor,
    x: torch.Tensor,
    num_tokens_per_expert: torch.Tensor,
    act_fn: Callable | nn.Module,  # extra argument passed through the wrapper
) -> torch.Tensor:
    ...  # user-defined forward using act_fn
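To illustrate the idea, here is a minimal sketch of a decorator that forwards extra positional and keyword arguments to the wrapped function. It is not the actual expert_parallel implementation: `expert_parallel_sketch`, `gelu`, and `custom_experts_for_loop` are hypothetical names, the token reordering done by the real wrapper is omitted, and plain Python floats (one scalar weight per expert) stand in for tensors.

```python
import functools
import math
from typing import Callable


def expert_parallel_sketch(func: Callable) -> Callable:
    """Hypothetical stand-in for expert_parallel; only kwargs forwarding is shown."""

    @functools.wraps(func)
    def wrapper(w1, w2, w3, x, num_tokens_per_expert, *args, **kwargs):
        # The real wrapper would reorder and pad tokens here (omitted).
        out = func(w1, w2, w3, x, num_tokens_per_expert, *args, **kwargs)
        # The real wrapper would undo that reordering here (omitted).
        return out

    return wrapper


def gelu(v: float) -> float:
    """Exact GELU on a scalar, used here as an alternative to SiLU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


@expert_parallel_sketch
def custom_experts_for_loop(w1, w2, w3, x, num_tokens_per_expert, act_fn):
    # Toy model (assumption): one scalar weight per expert, and tokens are
    # scalars already grouped by expert in the order given by the counts.
    out, start = [], 0
    for e, n in enumerate(num_tokens_per_expert):
        for tok in x[start:start + n]:
            # Gated form: w2 * (act_fn(w1 * x) * (w3 * x))
            out.append(w2[e] * (act_fn(w1[e] * tok) * (w3[e] * tok)))
        start += n
    return out


# act_fn reaches the custom forward because the wrapper forwards **kwargs.
result = custom_experts_for_loop(
    [1.0, 2.0], [1.0, 1.0], [1.0, 1.0], [0.0, 1.0, 2.0], [2, 1], act_fn=gelu
)
```

With this shape, the wrapper does not need to know about `act_fn` at all; any extra parameter of the custom forward passes through unchanged.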

tianyu-l

I'm doing a relatively major refactor in #1569.
I'd appreciate it if you could check whether the new indices_permutation_wrapper is still OK for you to extend.

rakkit · 2025-08-22T00:55:06Z

Thanks @tianyu-l! Yes, conceptually I think the new indices_permutation_wrapper also works for this extension.

allow expert_parallel wrapper to handle kwargs

b2549c3

rakkit requested review from tianyu-l, fegin, wwwjn and wconstab as code owners August 22, 2025 00:02

meta-cla bot added the CLA Signed label Aug 22, 2025

tianyu-l reviewed Aug 22, 2025
