[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config #34130
Conversation
Code Review
This pull request introduces support for int4_w4a16 quantization in the Fused MoE benchmark and tuning script, which is a great addition for performance evaluation on newer hardware. The changes are well-structured, including proper weight generation, scale handling, and a new utility to extract group_size from model configurations. The added tuning configuration for AMD Radeon 8060S Graphics is also valuable. I have one suggestion to improve the robustness of the new get_quantization_group_size function to prevent potential crashes from malformed configuration files.
Force-pushed from cc32e48 to 7469b13
Use the general FusedMoEQuantConfig.make() builder with weight_dtype="int4" instead of the dedicated int4_w4a16_moe_quant_config() helper, simplifying the code and removing the special-cased branch. Signed-off-by: Matthias Gehre <[email protected]>
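A rough before/after sketch of what this commit describes. The only keyword taken from the commit message is weight_dtype="int4"; the import path, the remaining arguments, and the placeholders w1_scale, w2_scale, and group_size are assumptions, since the actual FusedMoEQuantConfig.make() signature and call site are not shown in this thread.

```python
# Sketch only: argument names besides weight_dtype are assumed, and the
# dedicated helper call is reconstructed from the commit message.
from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig  # path may vary by version

# Before (dedicated helper, per the commit message):
# quant_config = int4_w4a16_moe_quant_config(
#     w1_scale=w1_scale, w2_scale=w2_scale, block_shape=[0, group_size])

# After (general builder):
quant_config = FusedMoEQuantConfig.make(
    weight_dtype="int4",          # from the commit message
    w1_scale=w1_scale,            # assumed: group-wise weight scales
    w2_scale=w2_scale,
    block_shape=[0, group_size],  # assumed: carries the group size
)
```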
Hi @mgehre-amd, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
The block_quant_shape filtering in get_configs_compute_bound is designed for fp8 block quantization, where BLOCK_SIZE_K must be a multiple of the block size. For int4_w4a16, block_quant_shape is [0, group_size], but the gptq_awq kernel handles arbitrary BLOCK_SIZE_K regardless of group_size. Applying this filter incorrectly eliminates valid configs (e.g. BLOCK_SIZE_K=64 with group_size=128) and also causes a ZeroDivisionError due to block_n=0. Skip block_quant_shape filtering entirely for int4_w4a16 to keep the full search space during tuning. Signed-off-by: Matthias Gehre <[email protected]>
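A schematic of the guard this commit describes, using illustrative names rather than the exact code of get_configs_compute_bound in benchmarks/kernels/benchmark_moe.py:

```python
# Illustrative only: the real filtering lives in get_configs_compute_bound;
# names and the exact fp8 divisibility check are simplified here.
def prune_configs(configs: list[dict], block_quant_shape, dtype: str) -> list[dict]:
    # For int4_w4a16, block_quant_shape is [0, group_size] and only carries
    # the group size; the gptq_awq kernel accepts any BLOCK_SIZE_K, so skip
    # filtering (it would drop valid configs and divide by block_n == 0).
    if dtype == "int4_w4a16" or not block_quant_shape:
        return configs

    block_n, block_k = block_quant_shape
    # fp8 block quantization: tile sizes must be multiples of the scale blocks.
    return [
        cfg for cfg in configs
        if cfg["BLOCK_SIZE_K"] % block_k == 0 and cfg["BLOCK_SIZE_N"] % block_n == 0
    ]
```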
tjtanaa left a comment
LGTM
Add int4_w4a16 quantization dtype to the MoE benchmark/tuning script with proper uint8-packed weight generation and group-wise scales.
The group_size is automatically extracted from the model config, supporting AWQ/GPTQ style (direct 'group_size' key) and compressed-tensors style (nested in 'config_groups').
Add tuned Triton MoE kernel configurations for int4_w4a16 quantization on the AMD Radeon 8060S Graphics (Strix Halo, gfx1151) for batch sizes 1-32. Model shape: E=128, N=768 (e.g. Qwen3-30B-A3B MoE layers).
Benchmarked on AMD Strix Halo (gfx1151, LPDDR5X-8000 128 GB):
Model: RedHatAI/Qwen3-30B-A3B-Instruct-2507.w4a16
input-len=128, output-len=128, num-prompts=5:
Median TTFT: 182.17 ms -> 141.69 ms (22% faster)
Median TPOT: 41.88 ms -> 39.73 ms (5% faster)
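To make "uint8-packed weight generation and group-wise scales" concrete, a standalone sketch follows. Shapes, nibble order, and tensor names are illustrative assumptions; the benchmark script's actual layout follows whatever the gptq_awq kernel expects.

```python
# Standalone sketch: random int4 MoE weights packed two nibbles per uint8,
# plus one scale per group of input channels. Names and shapes are
# illustrative, not copied from benchmarks/kernels/benchmark_moe.py.
import torch

E, N, K, group_size = 128, 768, 2048, 128  # experts, intermediate, hidden, group

# Unpacked 4-bit values in [0, 15].
w_unpacked = torch.randint(0, 16, (E, N, K), dtype=torch.uint8)

# Pack along K: two 4-bit values per byte (low nibble first here; the real
# nibble order depends on the kernel).
w_packed = (w_unpacked[..., ::2] | (w_unpacked[..., 1::2] << 4)).contiguous()
assert w_packed.shape == (E, N, K // 2)

# Group-wise scales: one fp16 scale per group_size input channels.
w_scale = torch.rand(E, N, K // group_size, dtype=torch.float16)

print(w_packed.shape, w_scale.shape)  # (128, 768, 1024), (128, 768, 16)
```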