Bug Description
ModelOptNvFp4FusedMoEMethod.process_weights_after_loading() in modelopt_quant.py crashes when EP > 1 and execution falls into the else branch (i.e., none of enable_flashinfer_cutlass_moe, enable_flashinfer_trtllm_moe, or enable_flashinfer_cutedsl_moe is active).
w13_input_scale and w2_input_scale are allocated globally (num_experts entries) but multiplied against the EP-local w13_weight_scale_2 (num_local_experts entries), causing a shape mismatch.
The cutedsl branch handles this correctly via _slice_scale(), but that helper is defined inside the elif block and is not reachable from the else branch.
Reproduction
- Model: nvidia/MiniMax-M2.5-NVFP4 (256 experts)
- Config: TP=2, EP=2, no explicit MoE runner backend (hits the else branch)
- SGLang version: 0.5.9-dev2 (commit acab24a), also reproducible on current main
Error
File ".../sglang/srt/layers/quantization/modelopt_quant.py", line 1560, in process_weights_after_loading
(w13_input_scale * w13_weight_scale_2).to(torch.float32),
RuntimeError: The size of tensor a (256) must match the size of tensor b (128) at non-singleton dimension 0
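The mismatch can be reproduced outside SGLang with two stand-in arrays of the shapes from the traceback. This is a minimal sketch using NumPy in place of torch, purely to illustrate the broadcast failure; the array names mirror the tensors in the failing line:

```python
import numpy as np

num_experts = 256        # global expert count (MiniMax-M2.5-NVFP4)
num_local_experts = 128  # per-rank count with EP=2

# Stand-ins for the tensors in the failing multiplication:
w13_input_scale = np.ones(num_experts, dtype=np.float32)           # global shape (256,)
w13_weight_scale_2 = np.ones(num_local_experts, dtype=np.float32)  # EP-local shape (128,)

try:
    _ = w13_input_scale * w13_weight_scale_2  # (256,) vs (128,): not broadcastable
except ValueError as exc:
    print("broadcast failure:", exc)
```

NumPy raises ValueError where torch raises RuntimeError, but the rule is the same: a non-singleton leading dimension of 256 cannot broadcast against 128.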
Suggested Fix
Add EP-aware slicing in the else branch, same logic as _slice_scale():
else:
    w13_input_scale = layer.w13_input_scale.max(dim=-1).values.to(torch.float32)
    w2_input_scale = layer.w2_input_scale
    # EP-aware slicing (no-op when ep_size=1)
    if layer.moe_ep_size > 1:
        _ep_start = layer.moe_ep_rank * layer.num_local_experts
        _ep_end = _ep_start + layer.num_local_experts
        w13_input_scale = w13_input_scale[_ep_start:_ep_end]
        w2_input_scale = w2_input_scale[_ep_start:_ep_end]
Note
PR #20963 (Nvidia modelopt refactoring) is currently migrating this code as-is into modelopt/schemes/modelopt_fp4.py — the bug will carry over unless fixed.
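For concreteness, the slice arithmetic from the suggested fix, applied to the reproduction config (256 experts, EP=2), gives each rank a disjoint half of the global scale tensors. This is a plain-Python sketch with a hypothetical helper name, not code from the repository:

```python
def ep_slice(ep_rank: int, num_local_experts: int) -> tuple[int, int]:
    """Return the [start, end) range of global expert indices owned by ep_rank,
    matching the _ep_start/_ep_end computation in the proposed fix."""
    start = ep_rank * num_local_experts
    return start, start + num_local_experts

num_experts, ep_size = 256, 2
num_local_experts = num_experts // ep_size

print(ep_slice(0, num_local_experts))  # rank 0 -> (0, 128)
print(ep_slice(1, num_local_experts))  # rank 1 -> (128, 256)
```

After slicing, both input scales have num_local_experts entries and multiply cleanly against the EP-local w13_weight_scale_2; with ep_size=1 the slice covers the full tensor, so the added branch is a no-op there.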