Following up on #952, where @angeloskath suggested handling gate/up projection fusion at the model layer via sanitize rather than in SwitchGLU.__call__.
## The Optimization
SwitchGLU currently calls gather_qmm twice per MoE layer with identical input and expert indices — once for gate_proj, once for up_proj. Concatenating these weights and using a single gather_qmm call eliminates one kernel dispatch per layer per token.
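The equivalence behind the fusion can be illustrated with a plain NumPy sketch (shapes and names are illustrative, not the actual `SwitchGLU` code): concatenating the gate and up weights along the output dimension lets one matmul stand in for two, after which the result is split back apart.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((4, d_model))          # token activations
w_gate = rng.standard_normal((d_model, d_ff))  # gate_proj weight
w_up = rng.standard_normal((d_model, d_ff))    # up_proj weight

# Baseline: two separate projections (two kernel dispatches)
gate = x @ w_gate
up = x @ w_up

# Fused: concatenate the weights once at load time, one matmul at runtime
w_fused = np.concatenate([w_gate, w_up], axis=1)  # (d_model, 2 * d_ff)
fused = x @ w_fused
gate_f, up_f = np.split(fused, 2, axis=1)

assert np.allclose(gate, gate_f) and np.allclose(up, up_f)
```

The same identity holds for `gather_qmm` with quantized expert weights, which is why a single fused call plus a split is bit-for-bit equivalent to the two separate calls.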
## Measured Results
Benchmarked on a Mac Studio (M3 Ultra, 512 GB), N=10 runs, mean ± std where available:
| Model | Family | MoE Layers | Baseline tok/s | Fused tok/s | Improvement |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Qwen | 48 | 108.7 ± 0.3 | 118.0 ± 0.3 | +8.6% |
| MiniMax M2.5 (456B) | MiniMax | 62 | 51.6 ± 0.1 | 54.2 ± 0.2 | +5.1% |
| GPT-OSS-120B | GPT-OSS | 36 | 85.4 | 89.6 | +5.0% |
| Qwen3.5-122B-A10B | Qwen | 48 | 48.7 | 51.1 | +5.0% |
| OLMoE-1B-7B | OLMoE | 16 | 357.7 | 371.3 | +3.8% |
| Qwen3.5-397B-A17B | Qwen | 60 | 7.5 | 7.5 | +0.8% |
Token-exact correctness confirmed on all models.
## Proposed Approach (per @angeloskath's feedback)
Handle fusion at the model layer through sanitize, similar to how QKV projections are fused:
- In each MoE model's `sanitize()`, keep the `gate_proj` and `up_proj` weights concatenated instead of splitting them (several models, such as Qwen3.5, already ship weights as a fused `gate_up_proj`; the current `sanitize` splits them apart)
- Create a `FusedSwitchGLU` variant (or modify `SwitchGLU`) that accepts pre-fused weights and does a single `gather_qmm` + split
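A minimal sketch of what the `sanitize`-level fusion might look like, assuming weights arrive as a flat key-to-array mapping. The key names and dict layout here are illustrative, not the actual mlx-lm conventions, and NumPy stands in for MLX arrays:

```python
import numpy as np

def sanitize(weights):
    """Hypothetical sketch: merge separate gate/up expert weights into one
    fused tensor so a fused SwitchGLU variant can issue a single matmul.

    Key names are illustrative only."""
    out = {}
    for key, value in weights.items():
        if key.endswith("gate_proj.weight"):
            up_key = key.replace("gate_proj", "up_proj")
            fused_key = key.replace("gate_proj", "gate_up_proj")
            # Concatenate once at load time; the axis depends on the
            # actual weight layout (output dimension assumed here).
            out[fused_key] = np.concatenate([value, weights[up_key]], axis=0)
        elif key.endswith("up_proj.weight"):
            continue  # consumed by the gate_proj branch above
        else:
            out[key] = value
    return out

weights = {
    "layers.0.mlp.gate_proj.weight": np.ones((4, 8)),
    "layers.0.mlp.up_proj.weight": np.zeros((4, 8)),
}
fused = sanitize(weights)
# fused now contains a single "layers.0.mlp.gate_up_proj.weight" of shape (8, 8)
```

For models that already ship `gate_up_proj`, the corresponding branch would simply pass the fused tensor through untouched instead of splitting it.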
## Questions for maintainers
- Should this be a new `FusedSwitchGLU` class, or a mode within the existing `SwitchGLU`?
- Should fusion be opt-in per model, or the default for all MoE models that use `SwitchGLU`?
- For models that ship weights already fused (e.g. Qwen3.5's `gate_up_proj`): should `sanitize` simply stop splitting them?
- For models that ship weights separately: should `sanitize` concatenate them during loading?
- Any concerns about backward compatibility with existing checkpoints?
Happy to implement whichever approach the team prefers.
Benchmark scripts and per-layer microbenchmarks available if useful.