[Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on H100 (FP8/BF16) #26268
Purpose
This PR adds tuned Fused-MoE Triton configs for Qwen3-30B A3 (E=64) and A3B (E=128) on NVIDIA H100 80GB, covering both BF16 and FP8 (`fp8_w8a8`). It is part of #22294.

New files (under `vllm/model_executor/layers/fused_moe/configs/`; the table layout is sketched below):

- `E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json`
- `E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json`
- `E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json`
- `E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json`
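For context (not part of the diff): these files follow vLLM's usual fused_moe config layout, a JSON object keyed by token count M, where each entry carries the Triton launch parameters used at that batch size. A minimal inspection sketch, assuming you run it from a vLLM source checkout (adjust the path otherwise):

```python
import json
from pathlib import Path

# Illustrative only: load one of the tuned tables added by this PR.
cfg_path = Path(
    "vllm/model_executor/layers/fused_moe/configs/"
    "E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json"
)
cfg = json.loads(cfg_path.read_text())

# Keys are token counts M; each entry holds the usual fields
# BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages.
print(sorted(int(m) for m in cfg))   # the retained "turning-point" batch sizes
print(cfg[min(cfg, key=int)])        # launch params for the smallest batch size
```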
Each table is tuned per (E, dtype) and keeps only the "turning-point" batch sizes, to avoid bloat. For the FP8 tuning I used zero-initialized weights with unit scales (PyTorch cannot `randn` float8 tensors yet); this does not affect which kernel gets picked, since selection depends only on shapes and launch parameters.
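A minimal sketch of that FP8 input setup (the shapes below are scaled-down placeholders, not the actual Qwen3-30B shard shapes):

```python
import torch

# torch.randn cannot emit float8 tensors, so build the expert weights in a
# higher precision and cast down; zero-initialized here, with unit scales.
E, N, K = 64, 896, 512  # illustrative: experts, shard intermediate size, hidden size
w13 = torch.zeros(E, 2 * N, K, dtype=torch.float16).to(torch.float8_e4m3fn)  # gate/up proj
w2 = torch.zeros(E, K, N, dtype=torch.float16).to(torch.float8_e4m3fn)       # down proj
w13_scale = torch.ones(E, dtype=torch.float32)  # unit dequant scales
w2_scale = torch.ones(E, dtype=torch.float32)

# Config/kernel selection depends only on the shapes and launch parameters,
# so zero weights do not change which entry of the tuned table is chosen.
```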
Test Plan (short)

Env: H100 80GB (SXM) · CUDA 12.8 · PyTorch 2.8.0 · vLLM wheel (with Triton)
What I verified
Repro
Loader logs (proof of pickup)
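As an aside (hedged: the helper below has lived in `vllm/model_executor/layers/fused_moe/fused_moe.py`, but its exact signature varies across vLLM versions, e.g. an extra block-shape argument), one way to confirm which file the loader will look for on a given GPU is to ask for the config file name directly:

```python
# Assumed helper; check your vLLM version for the exact import and signature.
from vllm.model_executor.layers.fused_moe.fused_moe import get_config_file_name

# E=64 experts, N=8960, FP8 weights/activations; run on the target H100 so the
# device_name component matches.
print(get_config_file_name(64, 8960, "fp8_w8a8"))
# Expected on this machine:
# E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json
```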