[FEAT] Add support for AITER bpreshuffle block scale gemm #717

tjtanaavllm · 2025-09-27T03:20:49Z

Purpose

This PR adds the use of aiter.gemm_a8w8_blockscale_bpreshuffle if shuffle is enabled.

It also removes the direct_register_custom_op overhead from aiter.gemm_a8w8_blockscale.

aiter.gemm_a8w8_blockscale_bpreshuffle is used if and only if use_swizzle is True.
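The selection rule above can be sketched as follows. This is only an illustration of the "if and only if use_swizzle is True" condition, not the code in this PR: `select_blockscale_gemm` and the two stand-in kernels are hypothetical names, and the real AITER kernels take tensor arguments that are not shown here.

```python
from typing import Any, Callable

GemmFn = Callable[..., Any]


def select_blockscale_gemm(
    use_swizzle: bool,
    blockscale_gemm: GemmFn,
    bpreshuffle_gemm: GemmFn,
) -> GemmFn:
    """Return the bpreshuffle kernel if and only if use_swizzle is True.

    Hypothetical helper: the callables stand in for
    aiter.gemm_a8w8_blockscale and aiter.gemm_a8w8_blockscale_bpreshuffle.
    """
    return bpreshuffle_gemm if use_swizzle else blockscale_gemm


# Stand-in kernels (assumptions for illustration only); they just
# report which path was taken instead of running a GEMM.
def fake_blockscale(*args: Any, **kwargs: Any) -> str:
    return "blockscale"


def fake_bpreshuffle(*args: Any, **kwargs: Any) -> str:
    return "bpreshuffle"


if __name__ == "__main__":
    for flag in (False, True):
        kernel = select_blockscale_gemm(flag, fake_blockscale, fake_bpreshuffle)
        print(f"use_swizzle={flag}: {kernel()}")
```

Resolving the kernel once and then calling it directly is one way to keep the hot path free of an extra wrapper layer, which is in the spirit of the overhead removal described above.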

How to Tune

Alternative guide: https://github.com/EmbeddedLLM/vllmtests/tree/main/kernels/blockscalegemm

Test Plan

Evaluate the lm_eval score of Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 before and after.

Evaluate the benchmark performance.

Test Result

lm_eval score

Baseline

local-completions (model=Qwen/Qwen3-235B-A22B-Instruct-2507-FP8,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8992|±  |0.0083|
|     |       |strict-match    |     5|exact_match|↑  |0.8946|±  |0.0085|

After

local-completions (model=Qwen/Qwen3-235B-A22B-Instruct-2507-FP8,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
|     |       |strict-match    |     5|exact_match|↑  |0.8976|±  |0.0083|
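To judge whether the before/after gsm8k scores differ meaningfully, one rough check is to compare the difference against the combined standard error. The sketch below assumes the two runs are independent and uses a normal approximation; the helper name `z_score` is introduced here for illustration and is not part of lm_eval.

```python
import math


def z_score(before: float, se_before: float, after: float, se_after: float) -> float:
    """Difference between two scores in units of their combined standard error.

    Assumes independent runs (an assumption, not something the PR states).
    """
    return (after - before) / math.sqrt(se_before**2 + se_after**2)


# Values copied from the Baseline and After tables above.
results = {
    "flexible-extract": (0.8992, 0.0083, 0.9075, 0.0080),
    "strict-match": (0.8946, 0.0085, 0.8976, 0.0083),
}

if __name__ == "__main__":
    for name, (before, se_b, after, se_a) in results.items():
        z = z_score(before, se_b, after, se_a)
        # |z| < 1.96 means the change is within noise at roughly the 95% level.
        verdict = "within noise" if abs(z) < 1.96 else "significant"
        print(f"{name}: z = {z:.2f} ({verdict})")
```

Under these assumptions both filters stay well below the 1.96 threshold, so the results are consistent with the change leaving accuracy unaffected.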

Performance

Four configurations were evaluated, listed in the same order as the table columns below:

1. Before removing vLLM’s direct_register_custom_op overhead ("Before Optimization")
2. Before removing vLLM’s direct_register_custom_op overhead, with tuned block scaled gemm ("Before (Tuned Block)")
3. After removing vLLM’s direct_register_custom_op overhead, with tuned block scaled gemm ("After (Tuned Block)")
4. Using tuned bpreshuffle block scaled gemm instead of tuned block scaled gemm ("Tuned BPreshuffle")

|Metric|Before Optimization|Before (Tuned Block)|After (Tuned Block)|Tuned BPreshuffle|Best Improvement|
|------|------:|------:|------:|------:|------:|
|**Throughput Metrics**| | | | | |
|Request throughput (req/s)|1.01|1.04|1.05|1.07|+5.9%|
|Output token throughput (tok/s)|979.19|1006.80|1017.31|1032.85|+5.5%|
|Total token throughput (tok/s)|4008.30|4125.60|4170.56|4232.04|+5.6%|
|**Latency Metrics**| | | | | |
|Benchmark duration (s)|316.54|307.44|304.08|299.71|-5.3%|
|Mean TTFT (ms)|2342.34|2071.60|2258.43|2168.61|-11.6%|
|Median TTFT (ms)|2567.93|1733.94|2389.88|2310.13|-32.5%|
|P99 TTFT (ms)|7618.99|7087.36|7044.93|7102.27|-7.5%|
|Mean TPOT (ms)|32.30|33.76|30.98|32.20|-4.1%|
|Median TPOT (ms)|29.74|29.26|28.48|28.13|-5.4%|
|P99 TPOT (ms)|32.16|31.51|31.03|32.56|-3.5%|
|Mean ITL (ms)|29.57|28.93|28.42|28.06|-5.1%|
|Median ITL (ms)|21.53|21.31|21.10|20.48|-4.9%|
|P99 ITL (ms)|392.15|395.10|389.15|368.55|-6.0%|
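The "Best Improvement" column appears to be the best value among the three optimized configurations, expressed relative to "Before Optimization". A minimal sketch of that arithmetic, using a few rows from the table; the helper name `best_improvement` and the higher-is-better flag are introduced here for illustration:

```python
def best_improvement(values: list[float], higher_is_better: bool) -> float:
    """Percent change of the best optimized value relative to the baseline.

    values[0] is the "Before Optimization" baseline; the remaining entries
    are the three optimized configurations, in table order.
    """
    baseline, *candidates = values
    best = max(candidates) if higher_is_better else min(candidates)
    return (best - baseline) / baseline * 100.0


# (values, higher_is_better) copied from the table above.
rows = {
    "Request throughput (req/s)": ([1.01, 1.04, 1.05, 1.07], True),
    "Median TTFT (ms)": ([2567.93, 1733.94, 2389.88, 2310.13], False),
    "Mean TPOT (ms)": ([32.30, 33.76, 30.98, 32.20], False),
}

if __name__ == "__main__":
    for metric, (values, higher) in rows.items():
        print(f"{metric}: {best_improvement(values, higher):+.1f}%")
```

Note that the best value does not always come from the bpreshuffle column: for Median TTFT, for example, the minimum is in "Before (Tuned Block)".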

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: tjtanaavllm <[email protected]>

tjtanaavllm added 2 commits September 26, 2025 16:55

add aiter bpreshuffle block scaled gemm support

986918f

Signed-off-by: tjtanaavllm <[email protected]>

use quant_hip to preshufflescale

f71e80b

Signed-off-by: tjtanaavllm <[email protected]>

tjtanaavllm requested review from wuhuikx, zejunchen-zejun and kliuae-amd September 27, 2025 03:20

tjtanaavllm requested review from charlifu, mawong-amd, shajrawi, gshtras, maleksan85, sunway513 and hongxiayang as code owners September 27, 2025 03:20

clean code

4bbea11

Signed-off-by: tjtanaavllm <[email protected]>

tjtanaavllm changed the title ~~[FEAT] Add support for AITER bpreshuffle block scale~~ [FEAT] Add support for AITER bpreshuffle block scale gemm Sep 27, 2025

tjtanaa mentioned this pull request Sep 28, 2025

[Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter vllm-project/vllm#25693

Open
