Following up on #952, where @angeloskath suggested handling gate/up projection fusion at the model layer via sanitize rather than in SwitchGLU.__call__.
## The Optimization
SwitchGLU currently calls gather_qmm twice per MoE layer with identical input and expert indices — once for gate_proj, once for up_proj. Concatenating these weights and using a single gather_qmm call eliminates one kernel dispatch per layer per token.
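The equivalence behind the fusion can be illustrated with a plain NumPy sketch (shapes and names are illustrative, not the actual `SwitchGLU` code): concatenating the gate and up weights along the output dimension lets one matmul stand in for two, after which the result is split back apart.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((4, d_model))          # token activations
w_gate = rng.standard_normal((d_model, d_ff))  # gate_proj weight
w_up = rng.standard_normal((d_model, d_ff))    # up_proj weight

# Baseline: two separate projections (two kernel dispatches)
gate = x @ w_gate
up = x @ w_up

# Fused: concatenate the weights once at load time, one matmul at runtime
w_fused = np.concatenate([w_gate, w_up], axis=1)  # (d_model, 2 * d_ff)
fused = x @ w_fused
gate_f, up_f = np.split(fused, 2, axis=1)

assert np.allclose(gate, gate_f) and np.allclose(up, up_f)
```

The same identity holds for `gather_qmm` with quantized expert weights, which is why a single fused call plus a split is bit-for-bit equivalent to the two separate calls.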
## Measured Results
Benchmarked on a Mac Studio (M3 Ultra, 512 GB), N=10 runs, mean ± std where available:
| Model | Family | MoE Layers | Baseline tok/s | Fused tok/s | Improvement |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | Qwen | 48 | 108.7 ± 0.3 | 118.0 ± 0.3 | +8.6% |
| MiniMax M2.5 (456B) | MiniMax | 62 | 51.6 ± 0.1 | 54.2 ± 0.2 | +5.1% |
| GPT-OSS-120B | GPT-OSS | 36 | 85.4 | 89.6 | +5.0% |
| Qwen3.5-122B-A10B | Qwen | 48 | 48.7 | 51.1 | +5.0% |
| OLMoE-1B-7B | OLMoE | 16 | 357.7 | 371.3 | +3.8% |
| Qwen3.5-397B-A17B | Qwen | 60 | 7.5 | 7.5 | +0.8% |
Token-exact correctness confirmed on all models.
## Proposed Approach (per @angeloskath's feedback)
Handle fusion at the model layer through sanitize, similar to how QKV projections are fused:
- In each MoE model's `sanitize()`, keep the `gate_proj` and `up_proj` weights concatenated instead of splitting them (several models, such as Qwen3.5, already ship weights as a fused `gate_up_proj`; the current `sanitize` splits them apart)
- Create a `FusedSwitchGLU` variant (or modify `SwitchGLU`) that accepts pre-fused weights and does a single `gather_qmm` + split
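A minimal sketch of what the `sanitize`-level fusion might look like, assuming weights arrive as a flat key-to-array mapping. The key names and dict layout here are illustrative, not the actual mlx-lm conventions, and NumPy stands in for MLX arrays:

```python
import numpy as np

def sanitize(weights):
    """Hypothetical sketch: merge separate gate/up expert weights into one
    fused tensor so a fused SwitchGLU variant can issue a single matmul.

    Key names are illustrative only."""
    out = {}
    for key, value in weights.items():
        if key.endswith("gate_proj.weight"):
            up_key = key.replace("gate_proj", "up_proj")
            fused_key = key.replace("gate_proj", "gate_up_proj")
            # Concatenate once at load time; the axis depends on the
            # actual weight layout (output dimension assumed here).
            out[fused_key] = np.concatenate([value, weights[up_key]], axis=0)
        elif key.endswith("up_proj.weight"):
            continue  # consumed by the gate_proj branch above
        else:
            out[key] = value
    return out

weights = {
    "layers.0.mlp.gate_proj.weight": np.ones((4, 8)),
    "layers.0.mlp.up_proj.weight": np.zeros((4, 8)),
}
fused = sanitize(weights)
# fused now contains a single "layers.0.mlp.gate_up_proj.weight" of shape (8, 8)
```

For models that already ship `gate_up_proj`, the corresponding branch would simply pass the fused tensor through untouched instead of splitting it.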
## Questions for maintainers
- Should this be a new `FusedSwitchGLU` class, or a mode within the existing `SwitchGLU`?
- Should fusion be opt-in per model, or the default for all MoE models that use `SwitchGLU`?
- For models that ship weights already fused (e.g. Qwen3.5's `gate_up_proj`): should `sanitize` simply stop splitting them?
- For models that ship weights separately: should `sanitize` concatenate them during loading?
- Any concerns about backward compatibility with existing checkpoints?
Happy to implement whichever approach the team prefers.
Benchmark scripts and per-layer microbenchmarks available if useful.