@tjtanaavllm tjtanaavllm commented Sep 27, 2025

Purpose

This PR adds the use of aiter.gemm_a8w8_blockscale_bpreshuffle when shuffle is enabled.

It also removes the direct_register_custom_op overhead from aiter.gemm_a8w8_blockscale.

aiter.gemm_a8w8_blockscale_bpreshuffle is used if and only if use_swizzle is True.
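
The selection rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: the two `*_stub` functions and `select_blockscale_gemm` are hypothetical stand-ins, and the real AITER kernels take tensors and scales whose signatures are not shown here.

```python
from typing import Callable

# Hypothetical stand-ins for the two AITER kernels. They only return the
# name of the kernel they represent so the dispatch can be observed.
def gemm_blockscale_stub(*args) -> str:
    return "gemm_a8w8_blockscale"

def gemm_blockscale_bpreshuffle_stub(*args) -> str:
    return "gemm_a8w8_blockscale_bpreshuffle"

def select_blockscale_gemm(use_swizzle: bool) -> Callable[..., str]:
    """Return the bpreshuffle kernel if and only if use_swizzle is True."""
    if use_swizzle:
        return gemm_blockscale_bpreshuffle_stub
    return gemm_blockscale_stub

# Resolve the kernel once, then call it directly on the hot path.
kernel = select_blockscale_gemm(use_swizzle=True)
print(kernel())
```

Resolving the callable once and invoking it directly is one way to avoid an extra dispatch layer per call; whether the PR does exactly this is an assumption.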

How to Tune

Alternative guide: https://github.com/EmbeddedLLM/vllmtests/tree/main/kernels/blockscalegemm

Test Plan

Evaluate the lm_eval gsm8k score of Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 before and after this change.

Evaluate the benchmark performance before and after this change.

Test Result

lm_eval score

  • Baseline
local-completions (model=Qwen/Qwen3-235B-A22B-Instruct-2507-FP8,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8992|±  |0.0083|
|     |       |strict-match    |     5|exact_match|↑  |0.8946|±  |0.0085|
  • After
local-completions (model=Qwen/Qwen3-235B-A22B-Instruct-2507-FP8,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
|     |       |strict-match    |     5|exact_match|↑  |0.8976|±  |0.0083|
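
To check that the two runs agree within noise, the reported values and standard errors can be compared with a simple z-score. This is a rough sketch that treats the two runs as independent, which is an assumption (both use the same gsm8k test set); the helper name `z_score` is illustrative.

```python
import math

def z_score(base: float, se_base: float, after: float, se_after: float) -> float:
    """Difference between two scores in units of their combined standard error."""
    return (after - base) / math.sqrt(se_base**2 + se_after**2)

# (value, stderr) pairs copied from the two tables above: (baseline, after).
results = {
    "flexible-extract": ((0.8992, 0.0083), (0.9075, 0.0080)),
    "strict-match": ((0.8946, 0.0085), (0.8976, 0.0083)),
}

for name, ((base, se_base), (after, se_after)) in results.items():
    z = z_score(base, se_base, after, se_after)
    print(f"{name}: delta={after - base:+.4f}, z={z:.2f}")
```

Both differences come out below one combined standard error, so under this assumption the accuracy change is consistent with run-to-run noise rather than a regression.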

Performance

Four configurations were evaluated, listed in the same order as the table columns:

  • Before removing vLLM’s direct_register_custom_op overhead
  • Before removing vLLM’s direct_register_custom_op overhead (Tuned block scaled gemm)
  • After removing vLLM’s direct_register_custom_op overhead (Tuned block scaled gemm)
  • Using tuned bpreshuffle block scaled gemm instead of tuned block scaled gemm

|Metric|Before Optimization|Before (Tuned Block)|After (Tuned Block)|Tuned BPreshuffle|Best Improvement|
|------|------:|------:|------:|------:|------:|
|**Throughput Metrics**| | | | | |
|Request throughput (req/s)|1.01|1.04|1.05|1.07|+5.9%|
|Output token throughput (tok/s)|979.19|1006.80|1017.31|1032.85|+5.5%|
|Total token throughput (tok/s)|4008.30|4125.60|4170.56|4232.04|+5.6%|
|**Latency Metrics**| | | | | |
|Benchmark duration (s)|316.54|307.44|304.08|299.71|-5.3%|
|Mean TTFT (ms)|2342.34|2071.60|2258.43|2168.61|-11.6%|
|Median TTFT (ms)|2567.93|1733.94|2389.88|2310.13|-32.5%|
|P99 TTFT (ms)|7618.99|7087.36|7044.93|7102.27|-7.5%|
|Mean TPOT (ms)|32.30|33.76|30.98|32.20|-4.1%|
|Median TPOT (ms)|29.74|29.26|28.48|28.13|-5.4%|
|P99 TPOT (ms)|32.16|31.51|31.03|32.56|-3.5%|
|Mean ITL (ms)|29.57|28.93|28.42|28.06|-5.1%|
|Median ITL (ms)|21.53|21.31|21.10|20.48|-4.9%|
|P99 ITL (ms)|392.15|395.10|389.15|368.55|-6.0%|
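
The Best Improvement column appears to be the best value among the three optimized configurations relative to the Before Optimization column, not always the last column (for example, the best Median TTFT comes from Before (Tuned Block)). The sketch below reproduces that reading for a few rows; the helper name `best_improvement` is illustrative.

```python
def best_improvement(values: list[float], higher_is_better: bool) -> float:
    """Percent change from values[0] (baseline) to the best later value."""
    baseline, candidates = values[0], values[1:]
    best = max(candidates) if higher_is_better else min(candidates)
    return (best - baseline) / baseline * 100.0

# Rows copied from the table: baseline first, then the three other columns.
rows = {
    "Request throughput (req/s)": ([1.01, 1.04, 1.05, 1.07], True),
    "Mean TTFT (ms)": ([2342.34, 2071.60, 2258.43, 2168.61], False),
    "Median TTFT (ms)": ([2567.93, 1733.94, 2389.88, 2310.13], False),
    "P99 ITL (ms)": ([392.15, 395.10, 389.15, 368.55], False),
}

for metric, (values, higher_is_better) in rows.items():
    print(f"{metric}: {best_improvement(values, higher_is_better):+.1f}%")
```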

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: tjtanaavllm <[email protected]>
@tjtanaavllm tjtanaavllm changed the title [FEAT] Add support for AITER bpreshuffle block scale [FEAT] Add support for AITER bpreshuffle block scale gemm Sep 27, 2025