
FlintyLemming commented Nov 21, 2025

Purpose

This PR adds a tuned fused MoE kernel configuration for the GLM-4 MoE architecture on NVIDIA H200 GPUs using FP8 quantization.

Specifically, it targets the configuration:

  • Experts (E): 160
  • Intermediate Size: 1536 (Sharded size N=192 for TP=8)
  • Device: NVIDIA H200
  • Dtype: fp8_w8a8

Previously, vLLM lacked a static configuration for this specific shape (E=160, N=192) on H200, causing it to fall back to heuristics or require JIT tuning during startup. This config improves startup time and ensures tuned kernel parameters are used for GLM-4 variants when running with tensor_parallel_size=8.
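
For reference, the sharded size in the target shape follows directly from the tensor-parallel split, and the config filename follows the naming convention of the existing JSON files in the configs directory. A minimal sketch of that arithmetic is below; the helper `expected_config_filename` is purely illustrative, not a vLLM API.

```python
# Illustrative sketch: derive the per-rank shard size and the expected config
# filename for this PR's shape. This helper is NOT a vLLM API; it only mirrors
# the naming convention of the existing fused-MoE config JSON files.
def expected_config_filename(num_experts: int, intermediate_size: int,
                             tp_size: int, device_name: str, dtype: str) -> str:
    # Each tensor-parallel rank holds a 1/tp_size slice of the intermediate dim.
    shard_n = intermediate_size // tp_size      # 1536 // 8 == 192
    device = device_name.replace(" ", "_")      # "NVIDIA H200" -> "NVIDIA_H200"
    return f"E={num_experts},N={shard_n},device_name={device},dtype={dtype}.json"

print(expected_config_filename(160, 1536, 8, "NVIDIA H200", "fp8_w8a8"))
# -> E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json
```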

Test Plan

Generation:
The configuration was generated using the official benchmark script on an 8x H200 node:

python benchmarks/kernels/benchmark_moe.py \
  --model /path/to/ZhipuAI/GLM-4.6-FP8 \
  --dtype fp8_w8a8 \
  --tp-size 8 \
  --tune \
  --trust-remote-code \
  --save-dir ./configs
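
Once tuning finishes, the JSON written to `--save-dir` maps benchmarked batch sizes to Triton kernel launch parameters. The sketch below is one way to inspect it; the exact parameter field names depend on the vLLM version that generated the file, so the comment lists typical keys rather than a guaranteed schema.

```python
# Quick inspection of the tuned config produced by the command above.
# The path assumes the --save-dir used in the tuning run; adjust as needed.
import json
from pathlib import Path

cfg_path = Path("./configs/E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json")
with cfg_path.open() as f:
    config = json.load(f)

# Keys are benchmarked batch sizes (token counts); values are the Triton launch
# parameters chosen for that size (typically BLOCK_SIZE_M/N/K, GROUP_SIZE_M,
# num_warps, num_stages -- the exact fields depend on the vLLM version).
for batch_size, params in sorted(config.items(), key=lambda kv: int(kv[0])):
    print(batch_size, params)
```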

Test Result

[screenshot: iShot_2025-11-21_21 04 35]

The optimal configuration file E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json was generated.

Performance:
Confirmed that vllm serve initializes the engine significantly faster (skipping the tuning/benchmarking phase).
Inference runs stably on 8x H200.
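
As an extra sanity check (not part of the original test plan), one can verify that the file is present inside the installed vLLM package where the fused-MoE static configs live. The directory path below is an assumption based on the repository layout; confirm it against your vLLM version.

```python
# Assumed location of the static fused-MoE configs inside an installed vLLM
# package; verify against your vLLM version before relying on this path.
from pathlib import Path
import vllm

configs_dir = (Path(vllm.__file__).parent / "model_executor" / "layers"
               / "fused_moe" / "configs")
target = configs_dir / "E=160,N=192,device_name=NVIDIA_H200,dtype=fp8_w8a8.json"
print(f"{target} exists: {target.exists()}")
```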


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist bot left a comment

Code Review

This pull request introduces a new fused MoE kernel configuration for NVIDIA H200 GPUs with FP8 precision, specifically tuned for the GLM-4 MoE architecture with E=160 experts and a sharded intermediate size of N=192. This is a valuable addition as it provides a pre-tuned static configuration, which will improve startup times by avoiding runtime heuristics or JIT tuning. The change consists of a single JSON configuration file, which appears to be correctly generated and follows the established conventions in the repository. The provided test plan and results confirm the file generation and performance benefits. The change is straightforward and looks good to merge.
