[Do Not Merge] Fix after flashinfer fp4 autotuner PR #23209
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
vllm/v1/worker/gpu_worker.py
Outdated
# Warmup kernels used during model execution
kernel_warmup(self)
kernel_warmup(self, do_autotune=False)
Why do we need this flag and to run twice? I think we can just move this before cuda graphs, like your original commit.
Because I was thinking that auto-tuning and warm-up serve two different purposes here. Auto-tuning is meant to store the best kernel function index, so I placed it before cuda graph capture to make sure the cuda graph sees the correct kernel. Warm-up is a dry run before the actual job starts, so I added it right before the real execution, just like the original code. Based on your earlier comment, I thought you were suggesting that warm-up is necessary (maybe DeepGEMM requires it?). Please correct me if I've misunderstood.
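To make the intended ordering concrete, here is a minimal runnable sketch; kernel_warmup, Worker, and compile_or_warm_up_model below are simplified stand-ins for the vLLM worker code, not the PR's actual diff:

```python
# Sketch only: illustrates autotuning before CUDA graph capture and a plain
# warm-up right before real execution, as described in the comment above.

def kernel_warmup(worker, do_autotune: bool = True) -> None:
    # Placeholder: run warm-up kernels, optionally autotuning them first.
    print(f"kernel_warmup(do_autotune={do_autotune})")

class Worker:
    enforce_eager = False

    def capture_model(self) -> None:
        # Placeholder for CUDA graph capture.
        print("capturing CUDA graphs")

    def compile_or_warm_up_model(self) -> None:
        # 1) Autotune first so the best kernel indices are already cached
        #    when CUDA graphs are captured (the graphs bake in those kernels).
        kernel_warmup(self)
        # 2) Capture CUDA graphs with the tuned kernels.
        if not self.enforce_eager:
            self.capture_model()
        # 3) Plain dry run right before real execution; no re-tuning needed.
        kernel_warmup(self, do_autotune=False)

Worker().compile_or_warm_up_model()
```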
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Siyuan Fu <[email protected]>
Signed-off-by: siyuanf <[email protected]>
Force-pushed from c0b3809 to 7e1fb28 (Compare)
Signed-off-by: Siyuan Fu <[email protected]>
Looks good, could you also add an E2E accuracy test using lm-eval?
Hi @yewentao256. I have experimented with
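For reference, a minimal sketch of the kind of lm-eval accuracy check being asked for here; the model name, task, and environment variable are placeholders, not the exact setup used for this PR:

```python
import os

# Placeholder: enable the FlashInfer MXFP4/MXFP8 MoE path under test.
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8"] = "1"

import lm_eval

# Evaluate a (placeholder) model through lm-eval's vLLM backend on GSM8K.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<model-under-test>,tensor_parallel_size=1",
    tasks=["gsm8k"],
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```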
Signed-off-by: Siyuan Fu <[email protected]>
Looks good, except please address the TODO.
Signed-off-by: Siyuan Fu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM, thanks for the work!
Closed after #23537
Can be merged after flashinfer addresses the AOT installation.
Purpose
The flashinfer fp4 autotuner is merged. Need to update the API call in the mxfp4 moe.
Update the x_scale shape and use a hardcoded max tuning number of tokens in the mxfp4 moe.
Move kernel_warmup above the self.model_runner.capture_model().
Bump the flashinfer tag to 0.2.13.
Test Plan
Test Result
On B200, with VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1:
without autotuner
with autotuner
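For context, a minimal sketch of how the flag above might be exercised; the model name and prompt are placeholders, not the exact benchmark configuration:

```python
import os

# Flag from the test setup above: route MoE through FlashInfer MXFP4/MXFP8.
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model; any MXFP4 MoE checkpoint supported by this path would do.
llm = LLM(model="<mxfp4-moe-model>")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```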
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.