
FlintyLemming commented Nov 21, 2025

Purpose

This PR adds a tuned fused MoE kernel configuration for the GLM-4 MoE architecture on NVIDIA H200 GPUs using FP8 quantization.

Specifically, it targets the configuration:

  • Experts (E): 160
  • Intermediate Size: 1536 (Sharded size N=192 for TP=8)
  • Device: NVIDIA H200
  • Dtype: fp8_w8a8

Previously, vLLM lacked a static configuration for this specific shape (E=160, N=192) on H200, causing it to fall back to heuristics or require JIT tuning during startup. This config improves startup time and ensures tuned kernel parameters are used for GLM-4 variants when running with tensor_parallel_size=8.
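
For reference, the sharded size in the target shape follows directly from the tensor-parallel split, and the config filename follows the naming convention of the existing JSON files in the configs directory. A minimal sketch of that arithmetic is below; the helper `expected_config_filename` is purely illustrative, not a vLLM API.

```python
# Illustrative sketch: derive the per-rank shard size and the expected config
# filename for this PR's shape. This helper is NOT a vLLM API; it only mirrors
# the naming convention of the existing fused-MoE config JSON files.
def expected_config_filename(num_experts: int, intermediate_size: int,
                             tp_size: int, device_name: str, dtype: str) -> str:
    # Each tensor-parallel rank holds a 1/tp_size slice of the intermediate dim.
    shard_n = intermediate_size // tp_size      # 1536 // 8 == 192
    device = device_name.replace(" ", "_")      # "NVIDIA H200" -> "NVIDIA_H200"
    return f"E={num_experts},N={shard_n},device_name={device},dtype={dtype}.json"

print(expected_config_filename(160, 1536, 8, "NVIDIA H200", "fp8_w8a8"))
# -> E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json
```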

Test Plan

Generation:
The configuration was generated using the official benchmark script on an 8x H200 node:

python benchmarks/kernels/benchmark_moe.py \
  --model /path/to/ZhipuAI/GLM-4.6-FP8 \
  --dtype fp8_w8a8 \
  --tp-size 8 \
  --tune \
  --trust-remote-code \
  --save-dir ./configs
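
Once tuning finishes, the JSON written to `--save-dir` maps benchmarked batch sizes to Triton kernel launch parameters. The sketch below is one way to inspect it; the exact parameter field names depend on the vLLM version that generated the file, so the comment lists typical keys rather than a guaranteed schema.

```python
# Quick inspection of the tuned config produced by the command above.
# The path assumes the --save-dir used in the tuning run; adjust as needed.
import json
from pathlib import Path

cfg_path = Path("./configs/E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json")
with cfg_path.open() as f:
    config = json.load(f)

# Keys are benchmarked batch sizes (token counts); values are the Triton launch
# parameters chosen for that size (typically BLOCK_SIZE_M/N/K, GROUP_SIZE_M,
# num_warps, num_stages -- the exact fields depend on the vLLM version).
for batch_size, params in sorted(config.items(), key=lambda kv: int(kv[0])):
    print(batch_size, params)
```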

Test Result

[screenshot: iShot_2025-11-21_21 04 35]

The optimal configuration file E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json was generated.

Performance:
Confirmed that vllm serve initializes the engine significantly faster (skipping the tuning/benchmarking phase).
Inference runs stably on 8x H200.
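
As an extra sanity check (not part of the original test plan), one can verify that the file is present inside the installed vLLM package where the fused-MoE static configs live. The directory path below is an assumption based on the repository layout; confirm it against your vLLM version.

```python
# Assumed location of the static fused-MoE configs inside an installed vLLM
# package; verify against your vLLM version before relying on this path.
from pathlib import Path
import vllm

configs_dir = (Path(vllm.__file__).parent / "model_executor" / "layers"
               / "fused_moe" / "configs")
target = configs_dir / "E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json"
print(f"{target} exists: {target.exists()}")
```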


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist bot left a comment

Code Review

This pull request introduces a new fused MoE kernel configuration for NVIDIA H200 GPUs with FP8 precision, specifically tuned for the GLM-4 MoE architecture with E=160 experts and a sharded intermediate size of N=192. This is a valuable addition as it provides a pre-tuned static configuration, which will improve startup times by avoiding runtime heuristics or JIT tuning. The change consists of a single JSON configuration file, which appears to be correctly generated and follows the established conventions in the repository. The provided test plan and results confirm the file generation and performance benefits. The change is straightforward and looks good to merge.
