Conversation

@shivampr commented Oct 6, 2025

Purpose

This PR adds tuned Fused-MoE Triton configs for Qwen3-30B A3 (E=64) and A3B (E=128) on NVIDIA H100 80GB, covering both BF16 and FP8 (fp8_w8a8). It’s part of #22294.

New files (under vllm/model_executor/layers/fused_moe/configs/):

  • E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
  • E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json
  • E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
  • E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json

Each table is tuned per (E, dtype) and keeps only the "turning-point" batch sizes to avoid bloat. For FP8, the weights were zero-initialized with unit scales (torch.randn does not support float8 dtypes yet); this does not affect which kernel config gets picked, since selection depends only on the problem shapes and launch parameters, not on the weight values.
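
For context, a minimal sketch of why a sparse table is enough: roughly speaking, the fused-MoE config loader picks the entry whose batch-size key is closest to the actual batch size, so only the batches where the best kernel parameters change need to be listed. The lookup below is illustrative (not vLLM code), and the kernel parameters are placeholders, not the tuned values from this PR.

# Illustrative nearest-key lookup over a sparse config table (not vLLM code).
# Keys mirror the JSON layout: batch size -> Triton launch parameters (placeholders).
example_table = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8,  "num_warps": 8, "num_stages": 4},
}

def pick_config(table, M):
    # The entry tuned for the closest batch size wins.
    best = min((int(k) for k in table), key=lambda k: abs(k - M))
    return table[str(best)]

print(pick_config(example_table, 48))  # snaps to the "64" entry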

Test Plan (short)

Env: H100 80GB (SXM) · CUDA 12.8 · PyTorch 2.8.0 · vLLM wheel (with Triton)

What I verified

  • The loader picks up the tuned JSONs for each (E, dtype).
  • If a file is hidden, vLLM falls back to its default config (see the hide-and-rerun sketch after the repro).
  • The four JSONs are not identical (quick hash check below).
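
Quick hash check (a sketch; run from the repo root, since the installed location of the configs directory varies by environment):

import glob, hashlib

# Four distinct digests confirm the tables are not copy-pasted duplicates.
pattern = ("vllm/model_executor/layers/fused_moe/configs/"
           "E=*,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=*.json")
for path in sorted(glob.glob(pattern)):
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()[:12]
    print(digest, path)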

Repro

python - <<'PY'
import logging, torch
from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk, fused_experts
logging.getLogger("vllm.model_executor.layers.fused_moe.fused_moe").setLevel(logging.INFO)

H, N, top_k, B = 7168, 8960, 8, 32
def run(E, dtype):
    # Activations in bf16 for the bf16 table, fp16 for the fp8_w8a8 table.
    xdtype = torch.bfloat16 if dtype == "bf16" else torch.float16
    # Weights are zero-init (float8_e4m3fn for FP8); values don't affect config selection.
    wdt = xdtype if dtype == "bf16" else torch.float8_e4m3fn
    x = torch.randn(B, H, dtype=xdtype, device="cuda")
    w1 = torch.zeros(E, N, H, dtype=wdt, device="cuda")       # fused gate+up projection
    w2 = torch.zeros(E, H, N // 2, dtype=wdt, device="cuda")  # down projection
    gate = torch.randn(B, E, dtype=torch.float32, device="cuda")
    tw, ti, _ = fused_topk(x, gate, top_k, renormalize=True)  # top-k routing weights/ids
    fused_experts(x, w1, w2, tw, ti, inplace=True, quant_config=None)

for E in (64, 128):
  for dt in ("bf16", "fp8_w8a8"):
    print(f"\n=== E={E} {dt} ===")
    run(E, dt)
PY
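
To exercise the fallback path noted above, one option is to temporarily move a tuned JSON aside, rerun the script, and confirm the loader drops to its default config (a sketch; the configs directory is resolved from the installed vllm package and depends on the environment):

import importlib.util, pathlib, shutil

# Locate the installed configs directory (path resolution is environment-dependent).
vllm_root = pathlib.Path(importlib.util.find_spec("vllm").origin).parent
cfg = (vllm_root / "model_executor" / "layers" / "fused_moe" / "configs" /
       "E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json")
hidden = cfg.with_name(cfg.name + ".bak")

shutil.move(cfg, hidden)     # hide the tuned table
# ... rerun the repro above; the loader should fall back to a default config ...
shutil.move(hidden, cfg)     # restore it afterwards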

Loader logs (proof of pickup)

INFO ... Using configuration from .../E=64,N=8960,...,dtype=bf16.json ...
INFO ... Using configuration from .../E=64,N=8960,...,dtype=fp8_w8a8.json ...
INFO ... Using configuration from .../E=128,N=8960,...,dtype=bf16.json ...
INFO ... Using configuration from .../E=128,N=8960,...,dtype=fp8_w8a8.json ...

@shivampr requested a review from mgoin as a code owner · October 6, 2025 01:18

github-actions bot commented Oct 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify bot added the qwen (Related to Qwen models) label · Oct 6, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces tuned Triton configurations for Fused-MoE kernels to optimize performance for Qwen3-30B models on H100 GPUs. While the intent is to provide specific configs for FP8 and BF16 data types with expert counts of 64 and 128, a critical issue has been identified: all four newly added JSON configuration files are identical. This suggests a copy-paste error, which would lead to using incorrect and non-optimal kernel parameters for at least three of the four scenarios, potentially degrading performance instead of improving it. It is crucial to replace the duplicated content with the correctly tuned configurations for each specific file.

@shivampr force-pushed the feat/qwen3-h100-moe-configs branch from b0146ac to 086ce44 · October 6, 2025 01:22
- Add JSONs for E=64 and E=128, N=8960 with device_name=NVIDIA_H100_80GB_HBM3
- Verified loader pickup on H100 80GB (logs show “Using configuration from …” for fp8_w8a8 and bf16)

Refs: vllm-project#22294
Signed-off-by: Shivam <[email protected]>
@shivampr force-pushed the feat/qwen3-h100-moe-configs branch from 086ce44 to 14699ba · October 6, 2025 05:21
…BF16 & FP8); per-(E,dtype) distinct tables

Signed-off-by: Shivam <[email protected]>
@shivampr force-pushed the feat/qwen3-h100-moe-configs branch from 7c2e05c to fde213e · October 6, 2025 06:30