
Commit 4a5ef84

Authored by: Jatin Gangani

[None] [doc] Document perfect MoE router feature for perf analysis (#10303)

Signed-off-by: Jatin Gangani <[email protected]>
Co-authored-by: Jatin Gangani <[email protected]>

1 parent 14554ab commit 4a5ef84

File tree

1 file changed: +77 -0 lines changed


docs/source/developer-guide/perf-analysis.md

Lines changed: 77 additions & 0 deletions
@@ -98,3 +98,80 @@ TLLM_PROFILE_START_STOP=100-150 nsys profile \
The Nsight Systems reports will be saved to `trace.nsys-rep`. Use the NVIDIA Nsight Systems application to open it.

The PyTorch profiler results will be saved to `trace.json`. Use [chrome://tracing/](chrome://tracing/) to inspect the saved profile.

## MoE Expert Load Balance Analysis (Perfect Router)

For Mixture-of-Experts (MoE) models, performance can vary significantly based on how tokens are routed to experts. Uneven expert load distribution can cause some GPUs to be overloaded while others are underutilized, leading to suboptimal throughput.

TensorRT-LLM provides the `ENABLE_PERFECT_ROUTER` environment variable to help analyze and isolate expert load balancing issues from kernel performance.

### What It Does

When enabled, this feature **bypasses the learned router** and replaces it with pre-computed, perfectly load-balanced routing logits. This creates an idealized scenario where tokens are distributed evenly across all experts and GPUs.

Key behaviors:

- The learned gate/router is still computed (to maintain realistic timing)
- The gate output is **discarded** and replaced with ideal balanced logits
- Logits are pre-computed and cached for common batch sizes to minimize overhead
- Works with all MoE backends (CUTLASS, TRTLLM, TRITON)
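The behaviors above can be illustrated with a minimal sketch of what "perfectly balanced logits" means. This is not TensorRT-LLM's implementation; the function name and the round-robin scheme are illustrative assumptions that simply guarantee every expert receives the same number of token assignments.

```python
import numpy as np

def make_balanced_routing_logits(num_tokens: int, num_experts: int, top_k: int,
                                 high: float = 1.0, low: float = -1e9) -> np.ndarray:
    """Hypothetical sketch: build routing logits that force a round-robin,
    perfectly even token-to-expert assignment instead of the learned gate."""
    logits = np.full((num_tokens, num_experts), low, dtype=np.float32)
    next_expert = 0
    for t in range(num_tokens):
        for _ in range(top_k):
            # Give this expert a high logit so it lands in the token's top-k.
            logits[t, next_expert] = high
            next_expert = (next_expert + 1) % num_experts
    return logits

# With 8 tokens, 4 experts, top_k=2, each expert receives exactly
# 8 * 2 / 4 = 4 token assignments.
logits = make_balanced_routing_logits(num_tokens=8, num_experts=4, top_k=2)
loads = (logits > 0).sum(axis=0)  # per-expert token count
```

In the real feature these logits would be pre-computed and cached per batch size, so the only runtime cost is swapping them in for the (discarded) gate output.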

```{warning}
This feature is for **performance analysis only**. It produces **incorrect model outputs** because the learned router decisions are discarded. Never use this in production inference.
```

### When to Use It

Use `ENABLE_PERFECT_ROUTER` when you want to:

1. **Establish performance upper bounds**: Measure the theoretical best-case MoE throughput when expert loads are perfectly balanced.

2. **Isolate routing bottlenecks**: Compare performance with vs. without perfect routing to determine if the learned router is causing load imbalance issues.

3. **Test different load balancing strategies**: Validate that MoE kernels and communication patterns behave correctly with balanced loads before implementing custom routing logic.

4. **Benchmark kernel efficiency**: Remove routing variability to get consistent, reproducible kernel performance measurements.

### How to Enable

Set the environment variable before running your workload. This works with both `trtllm-bench` and `trtllm-serve`:

```bash
export ENABLE_PERFECT_ROUTER=1
```

### Example Workflow

```bash
# Step 1: Benchmark with normal (learned) routing
trtllm-bench ...
# or
trtllm-serve ...

# Step 2: Benchmark with perfect routing (upper bound)
ENABLE_PERFECT_ROUTER=1 trtllm-bench ...
# or
ENABLE_PERFECT_ROUTER=1 trtllm-serve ...

# Step 3: Compare the throughput numbers
# If perfect router shows >10% improvement, routing imbalance is significant
```
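Step 3 of the workflow reduces to a one-line relative-improvement calculation. A small sketch with made-up throughput numbers (the values below are illustrative, not measured):

```python
# Hypothetical throughput numbers (tokens/s) from the two benchmark runs above.
baseline_tps = 4200.0   # learned router
perfect_tps = 4950.0    # run with ENABLE_PERFECT_ROUTER=1

# Relative improvement of the perfect-router run over the baseline.
improvement = (perfect_tps - baseline_tps) / baseline_tps

# Rule of thumb from the workflow above: >10% suggests significant imbalance.
routing_imbalance_significant = improvement > 0.10
```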

### Interpreting Results

| Scenario | Interpretation |
|----------|----------------|
| Similar performance with/without perfect router | Router load balancing is not a bottleneck; focus optimization efforts elsewhere |
| Significant improvement with perfect router | The learned router is causing load imbalance; consider router optimization or load balancing strategies |

### Supported Models

```{note}
This feature currently requires model-specific integration. The plumbing to support perfect routing must be added to each MoE model implementation. If you need this feature for a model that doesn't yet support it, you will need to add the integration following the pattern used in existing implementations.
```

```{note}
The perfect router logits are specifically designed for `RenormalizeMoeRoutingMethod` (TopK first, then Softmax). Models using other routing methods such as `DefaultMoeRoutingMethod` or `DeepSeekV3MoeRoutingMethod` would require adapting the logit generation logic to match their routing behavior.
```
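The distinction in the note above, "TopK first, then Softmax" versus the reverse order, can be sketched in a few lines of numpy. This is an illustrative sketch of the two orderings, not TensorRT-LLM's routing code; the function names are made up:

```python
import numpy as np

def topk_then_softmax(logits: np.ndarray, k: int):
    """Renormalize-style routing: select top-k logits first, then softmax
    over only those k values, so the kept weights sum to 1."""
    idx = np.argsort(logits)[::-1][:k]
    sel = logits[idx]
    w = np.exp(sel - sel.max())
    return idx, w / w.sum()

def softmax_then_topk(logits: np.ndarray, k: int):
    """Default-style routing: softmax over all experts first, then keep the
    top-k probabilities; mass on dropped experts is lost, so weights sum to < 1."""
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    idx = np.argsort(probs)[::-1][:k]
    return idx, probs[idx]

logits = np.array([2.0, 1.0, 0.5, -1.0])
idx_r, w_r = topk_then_softmax(logits, k=2)   # weights renormalized to 1
idx_d, w_d = softmax_then_topk(logits, k=2)   # weights sum to less than 1
```

Because the two orderings produce different expert weights, balanced logits generated for one routing method are not automatically valid for the other, which is why the adaptation mentioned in the note is needed.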

Currently supported:

- GPT-OSS (uses `RenormalizeMoeRoutingMethod`)
