[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config #34130
Conversation
Code Review
This pull request introduces support for int4_w4a16 quantization in the Fused MoE benchmark and tuning script, which is a great addition for performance evaluation on newer hardware. The changes are well-structured, including proper weight generation, scale handling, and a new utility to extract group_size from model configurations. The added tuning configuration for AMD Radeon 8060S Graphics is also valuable. I have one suggestion to improve the robustness of the new get_quantization_group_size function to prevent potential crashes from malformed configuration files.
Force-pushed from cc32e48 to 7469b13
Use the general FusedMoEQuantConfig.make() builder with weight_dtype="int4" instead of the dedicated int4_w4a16_moe_quant_config() helper, simplifying the code and removing the special-cased branch. Signed-off-by: Matthias Gehre <[email protected]>
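A rough before/after sketch of what this commit describes. The only keyword taken from the commit message is weight_dtype="int4"; the import path, the remaining arguments, and the placeholders w1_scale, w2_scale, and group_size are assumptions, since the actual FusedMoEQuantConfig.make() signature and call site are not shown in this thread.

```python
# Sketch only: argument names besides weight_dtype are assumed, and the
# dedicated helper call is reconstructed from the commit message.
from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig  # path may vary by version

# Before (dedicated helper, per the commit message):
# quant_config = int4_w4a16_moe_quant_config(
#     w1_scale=w1_scale, w2_scale=w2_scale, block_shape=[0, group_size])

# After (general builder):
quant_config = FusedMoEQuantConfig.make(
    weight_dtype="int4",          # from the commit message
    w1_scale=w1_scale,            # assumed: group-wise weight scales
    w2_scale=w2_scale,
    block_shape=[0, group_size],  # assumed: carries the group size
)
```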
Hi @mgehre-amd, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
The block_quant_shape filtering in get_configs_compute_bound is designed for fp8 block quantization, where BLOCK_SIZE_K must be a multiple of the block size. For int4_w4a16, block_quant_shape is [0, group_size], but the gptq_awq kernel handles arbitrary BLOCK_SIZE_K regardless of group_size. Applying this filter incorrectly eliminates valid configs (e.g. BLOCK_SIZE_K=64 with group_size=128) and also causes a ZeroDivisionError due to block_n=0. Skip block_quant_shape filtering entirely for int4_w4a16 to keep the full search space during tuning. Signed-off-by: Matthias Gehre <[email protected]>
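A schematic of the guard this commit describes, using illustrative names rather than the exact code of get_configs_compute_bound in benchmarks/kernels/benchmark_moe.py:

```python
# Illustrative only: the real filtering lives in get_configs_compute_bound;
# names and the exact fp8 divisibility check are simplified here.
def prune_configs(configs: list[dict], block_quant_shape, dtype: str) -> list[dict]:
    # For int4_w4a16, block_quant_shape is [0, group_size] and only carries
    # the group size; the gptq_awq kernel accepts any BLOCK_SIZE_K, so skip
    # filtering (it would drop valid configs and divide by block_n == 0).
    if dtype == "int4_w4a16" or not block_quant_shape:
        return configs

    block_n, block_k = block_quant_shape
    # fp8 block quantization: tile sizes must be multiples of the scale blocks.
    return [
        cfg for cfg in configs
        if cfg["BLOCK_SIZE_K"] % block_k == 0 and cfg["BLOCK_SIZE_N"] % block_n == 0
    ]
```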
tjtanaa left a comment
LGTM
Add int4_w4a16 quantization dtype to the MoE benchmark/tuning script with proper uint8-packed weight generation and group-wise scales.
The group_size is automatically extracted from the model config, supporting AWQ/GPTQ style (direct 'group_size' key) and compressed-tensors style (nested in 'config_groups').
Add tuned Triton MoE kernel configurations for int4_w4a16 quantization on the AMD Radeon 8060S Graphics (Strix Halo, gfx1151) for batch sizes 1-32. Model shape: E=128, N=768 (e.g. Qwen3-30B-A3B MoE layers).
Benchmarked on AMD Strix Halo (gfx1151, LPDDR5X-8000 128 GB):
Model: RedHatAI/Qwen3-30B-A3B-Instruct-2507.w4a16
input-len=128, output-len=128, num-prompts=5:
Median TTFT: 182.17 ms -> 141.69 ms (22% faster)
Median TPOT: 41.88 ms -> 39.73 ms (5% faster)
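To make "uint8-packed weight generation and group-wise scales" concrete, a standalone sketch follows. Shapes, nibble order, and tensor names are illustrative assumptions; the benchmark script's actual layout follows whatever the gptq_awq kernel expects.

```python
# Standalone sketch: random int4 MoE weights packed two nibbles per uint8,
# plus one scale per group of input channels. Names and shapes are
# illustrative, not copied from benchmarks/kernels/benchmark_moe.py.
import torch

E, N, K, group_size = 128, 768, 2048, 128  # experts, intermediate, hidden, group

# Unpacked 4-bit values in [0, 15].
w_unpacked = torch.randint(0, 16, (E, N, K), dtype=torch.uint8)

# Pack along K: two 4-bit values per byte (low nibble first here; the real
# nibble order depends on the kernel).
w_packed = (w_unpacked[..., ::2] | (w_unpacked[..., 1::2] << 4)).contiguous()
assert w_packed.shape == (E, N, K // 2)

# Group-wise scales: one fp16 scale per group_size input channels.
w_scale = torch.rand(E, N, K // group_size, dtype=torch.float16)

print(w_packed.shape, w_scale.shape)  # (128, 768, 1024), (128, 768, 16)
```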