Discussion: Fuse gate/up expert projections via sanitize for MoE inference speedup #956

@BurntToastGPT

Description

Following up on #952, where @angeloskath suggested handling gate/up projection fusion at the model layer via sanitize rather than in SwitchGLU.__call__.

The Optimization

SwitchGLU currently calls gather_qmm twice per MoE layer with identical input and expert indices — once for gate_proj, once for up_proj. Concatenating these weights and using a single gather_qmm call eliminates one kernel dispatch per layer per token.
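The algebra behind the fusion can be sketched with plain matmuls. This is a NumPy stand-in for `gather_qmm` (which also handles quantization and expert gathering); the names `gate_w`/`up_w` are illustrative, not mlx-lm identifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 128
x = rng.standard_normal((4, d_model))  # token activations

gate_w = rng.standard_normal((d_model, d_ff))  # gate_proj weights
up_w = rng.standard_normal((d_model, d_ff))    # up_proj weights

# Baseline: two separate projections (two kernel dispatches)
gate = x @ gate_w
up = x @ up_w

# Fused: concatenate weights once, do a single matmul, then split
fused_w = np.concatenate([gate_w, up_w], axis=1)
fused = x @ fused_w
gate_f, up_f = np.split(fused, 2, axis=1)

assert np.allclose(gate, gate_f) and np.allclose(up, up_f)
```

Since the two halves of the fused output are column-wise identical to the separate products, the fusion is exact (up to the usual quantized-kernel considerations), which matches the token-exact results below.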

Measured Results

Benchmarked on Mac Studio M3 Ultra, 512GB, N=10 runs, mean ± std:

| Model | Family | MoE Layers | Baseline tok/s | Fused tok/s | Improvement |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Qwen | 48 | 108.7 ± 0.3 | 118.0 ± 0.3 | +8.6% |
| MiniMax M2.5 (456B) | MiniMax | 62 | 51.6 ± 0.1 | 54.2 ± 0.2 | +5.1% |
| GPT-OSS-120B | GPT-OSS | 36 | 85.4 | 89.6 | +5.0% |
| Qwen3.5-122B-A10B | Qwen | 48 | 48.7 | 51.1 | +5.0% |
| OLMoE-1B-7B | OLMoE | 16 | 357.7 | 371.3 | +3.8% |
| Qwen3.5-397B-A17B | Qwen | 60 | 7.5 | 7.5 | +0.8% |

Token-exact correctness confirmed on all models.

Proposed Approach (per @angeloskath's feedback)

Handle fusion at the model layer through sanitize, similar to how QKV projections are fused:

  1. In each MoE model's sanitize(), keep gate_proj and up_proj weights concatenated instead of splitting them (several models like Qwen3.5 already ship weights as fused gate_up_proj — the current sanitize splits them apart)
  2. Create a FusedSwitchGLU variant (or modify SwitchGLU) that accepts pre-fused weights and does a single gather_qmm + split
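Step 1 could look roughly like the following NumPy sketch of a `sanitize()` fragment. The `gate_proj`/`up_proj`/`gate_up_proj` key names follow the common Hugging Face convention but are assumptions here, not confirmed mlx-lm checkpoint keys; real expert weights are stacked per-expert (and quantized), so the concatenation axis would need adjusting:

```python
import numpy as np

def sanitize(weights):
    """Fuse separate gate/up projection weights into one tensor at load time.

    Assumes 2-D [out_features, in_features] weights for illustration;
    stacked per-expert tensors would concatenate along the
    output-feature axis instead.
    """
    out = dict(weights)
    for key in list(out):
        if key.endswith("gate_proj.weight"):
            up_key = key.replace("gate_proj", "up_proj")
            fused_key = key.replace("gate_proj", "gate_up_proj")
            # Concatenate along the output dimension so a single
            # matmul produces [gate | up], split afterwards.
            out[fused_key] = np.concatenate(
                [out.pop(key), out.pop(up_key)], axis=0
            )
    return out

# Hypothetical usage with toy weights:
w = {
    "model.layers.0.mlp.experts.gate_proj.weight": np.ones((3, 2)),
    "model.layers.0.mlp.experts.up_proj.weight": np.zeros((3, 2)),
}
fused = sanitize(w)
# fused now holds one "gate_up_proj.weight" of shape (6, 2)
```

For models that already ship `gate_up_proj`, the change would be the inverse: delete the splitting logic from the existing `sanitize` rather than add concatenation.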

Questions for maintainers

  • Should this be a new FusedSwitchGLU class, or a mode within the existing SwitchGLU?
  • Should fusion be opt-in per model, or default for all MoE models that use SwitchGLU?
  • For models that ship weights already fused (e.g. Qwen3.5 gate_up_proj): should sanitize simply stop splitting them?
  • For models that ship weights separately: should sanitize concatenate them during loading?
  • Any concerns about backward compatibility with existing checkpoints?

Happy to implement whichever approach the team prefers.

Benchmark scripts and per-layer microbenchmarks available if useful.
