Conversation

@mgehre-amd mgehre-amd commented Feb 9, 2026

Add int4_w4a16 quantization dtype to the MoE benchmark/tuning script with proper uint8-packed weight generation and group-wise scales.
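For readers unfamiliar with the w4a16 layout, here is a minimal sketch of the kind of synthetic data this implies. The shapes, the packing axis, and the fp16 scale dtype are assumptions for illustration, not the exact tensors the benchmark script builds.

    import torch

    def make_int4_w4a16_weight(num_experts: int, n: int, k: int, group_size: int):
        """Hedged sketch: random int4 weights packed two per uint8, plus group-wise scales."""
        # Two int4 nibbles per uint8, packed along the reduction dimension K.
        qweight = torch.randint(0, 256, (num_experts, n, k // 2), dtype=torch.uint8)
        # One fp16 scale per group_size elements of K (group-wise quantization).
        scales = torch.rand(num_experts, n, k // group_size).half()
        return qweight, scales

    # Example shapes only: 128 experts, N=768, K=2048, group_size=128.
    qw, sc = make_int4_w4a16_weight(128, 768, 2048, 128)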

The group_size is extracted automatically from the model config, supporting both the AWQ/GPTQ style (a direct 'group_size' key) and the compressed-tensors style (nested under 'config_groups'), as sketched below.
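A minimal sketch of that lookup, assuming the quantization metadata lives under the usual 'quantization_config' key of the Hugging Face config. The helper added by the PR is named get_quantization_group_size (see the review below), but this body, its signature, the default of 128, and the defensive dict.get lookups (in the spirit of the robustness suggestion in that review) are illustrative rather than the PR's exact code.

    def get_quantization_group_size(hf_config: dict, default: int = 128) -> int:
        """Hedged sketch of the group_size lookup described above; not the PR's exact code."""
        quant_cfg = hf_config.get("quantization_config", {})
        # AWQ/GPTQ style: group_size sits directly in the quantization config.
        if "group_size" in quant_cfg:
            return int(quant_cfg["group_size"])
        # compressed-tensors style: group_size is nested inside config_groups -> weights.
        for group in quant_cfg.get("config_groups", {}).values():
            weights = group.get("weights") or {}
            if "group_size" in weights:
                return int(weights["group_size"])
        return default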

Add tuned Triton MoE kernel configurations for int4_w4a16 quantization on the AMD Radeon 8060S Graphics (Strix Halo, gfx1151) for batch sizes 1-32. Model shape: E=128, N=768 (e.g. Qwen3-30B-A3B MoE layers).
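The tuned configurations land in vLLM's usual per-shape JSON files. The sketch below shows the general shape of one entry as a Python dict; the file-name pattern and every numeric value are placeholders, not the values tuned in this PR.

    # Hedged illustration of the tuned-config format used by vLLM's fused MoE kernels.
    # Assumed file-name pattern:
    #   E=128,N=768,device_name=AMD_Radeon_8060S_Graphics,dtype=int4_w4a16.json
    example_entry = {
        "1": {                 # key: batch size (M)
            "BLOCK_SIZE_M": 16,
            "BLOCK_SIZE_N": 64,
            "BLOCK_SIZE_K": 64,
            "GROUP_SIZE_M": 1,
            "num_warps": 4,
            "num_stages": 2,
        },
    }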

Benchmarked on AMD Strix Halo (gfx1151, LPDDR5X-8000 128 GB):

Model: RedHatAI/Qwen3-30B-A3B-Instruct-2507.w4a16
input-len=128, output-len=128, num-prompts=5:
Median TTFT: 182.17 ms -> 141.69 ms (22% faster)
Median TPOT: 41.88 ms -> 39.73 ms (5% faster)

@mergify mergify bot added the "performance" (Performance-related issues) label Feb 9, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for int4_w4a16 quantization in the Fused MoE benchmark and tuning script, which is a great addition for performance evaluation on newer hardware. The changes are well-structured, including proper weight generation, scale handling, and a new utility to extract group_size from model configurations. The added tuning configuration for AMD Radeon 8060S Graphics is also valuable. I have one suggestion to improve the robustness of the new get_quantization_group_size function to prevent potential crashes from malformed configuration files.

@mgehre-amd mgehre-amd force-pushed the matthias.moe_int4_benchmark branch from cc32e48 to 7469b13 on February 9, 2026 10:26
Use the general FusedMoEQuantConfig.make() builder with
weight_dtype="int4" instead of the dedicated
int4_w4a16_moe_quant_config() helper, simplifying the code
and removing the special-cased branch.

Signed-off-by: Matthias Gehre <[email protected]>
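A minimal sketch of the shape of that change: the import path is an assumption, and only the weight_dtype="int4" keyword comes from the commit message; any further arguments the builder needs (e.g. the group-wise scale tensors) are omitted here.

    # Hedged sketch of the refactor described above; not the benchmark script's exact call.
    from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig  # assumed path

    # Before: quant_config = int4_w4a16_moe_quant_config(...)   # dedicated helper
    # After:  the general builder selects the int4_w4a16 path from the weight dtype.
    quant_config = FusedMoEQuantConfig.make(weight_dtype="int4")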
@mgehre-amd mgehre-amd requested a review from tjtanaa February 9, 2026 16:39

mergify bot commented Feb 9, 2026

Hi @mgehre-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

The block_quant_shape filtering in get_configs_compute_bound is designed
for fp8 block quantization, where BLOCK_SIZE_K must be a multiple of the
block size. For int4_w4a16, block_quant_shape is [0, group_size], but
the gptq_awq kernel handles arbitrary BLOCK_SIZE_K regardless of
group_size. Applying this filter incorrectly eliminates valid configs
(e.g. BLOCK_SIZE_K=64 with group_size=128) and also causes a
ZeroDivisionError due to block_n=0.

Skip block_quant_shape filtering entirely for int4_w4a16 to keep the
full search space during tuning.

Signed-off-by: Matthias Gehre <[email protected]>
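For context, a minimal sketch of the filtering behaviour this commit describes; get_configs_compute_bound's real code differs, and the function and variable names here are illustrative only.

    def keep_config(cfg: dict, block_quant_shape, dtype: str) -> bool:
        """Hedged sketch of the tuning-time config filter discussed above."""
        if dtype == "int4_w4a16" or block_quant_shape is None:
            # The gptq_awq MoE kernel accepts any BLOCK_SIZE_K, so the whole
            # search space stays available during tuning.
            return True
        block_n, block_k = block_quant_shape
        # fp8 block quantization: tile sizes must align with the quantization block.
        # With block_quant_shape == [0, group_size], the block_n modulo would raise
        # ZeroDivisionError, which is the crash the change above avoids.
        return cfg["BLOCK_SIZE_K"] % block_k == 0 and cfg["BLOCK_SIZE_N"] % block_n == 0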

@tjtanaa tjtanaa left a comment

LGTM

@tjtanaa tjtanaa enabled auto-merge (squash) February 12, 2026 07:24
@github-actions github-actions bot added the "ready" (ONLY add when PR is ready to merge/full CI is needed) label Feb 12, 2026