[Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter #25693

charlifu · 2025-09-25T16:51:53Z

This PR adds a few fusion passes for Aiter to fuse layernorm + fp8 block quant and silu + fp8 block quant.

Signed-off-by: charlifu <[email protected]>

gemini-code-assist

Code Review

This pull request introduces new fusion passes for ROCm AITer, specifically for layernorm + fp8 block quant and silu + fp8 block quant. This is achieved by adding a new pattern AiterSiluMulFp8BlockQuantPattern and registering a new custom operator. Additionally, the changes in fp8_utils.py extend AITer support to non-MI300 series GPUs by providing a Triton-based fallback, which is a great enhancement.
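For readers unfamiliar with the fused operation, the following is a minimal unfused reference of what "silu + fp8 block quant" computes, written in pure Python. It is only a sketch: the function names, the group size of 128, and the choice of `fp8_max` are assumptions made for illustration, not values taken from this PR, and the final rounding to an fp8 dtype is omitted.

```python
import math

FP8_E4M3FN_MAX = 448.0    # largest finite value of float8 e4m3fn
FP8_E4M3FNUZ_MAX = 240.0  # largest finite value of float8 e4m3fnuz


def silu(x: float) -> float:
    """SiLU activation x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def silu_and_mul(row: list[float]) -> list[float]:
    """Split the row in half and return silu(first half) * second half."""
    half = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:half], row[half:])]


def group_quant(row: list[float], group_size: int = 128,
                fp8_max: float = FP8_E4M3FN_MAX):
    """Per-group dynamic quantization with one scale per group.

    Returns (scaled_values, scales). The scaled values are clamped to the
    fp8 range but remain Python floats; rounding to fp8 is not modeled.
    """
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    scaled, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = max(amax, 1e-10) / fp8_max  # guard against all-zero groups
        scales.append(scale)
        scaled.extend(max(-fp8_max, min(fp8_max, v / scale)) for v in group)
    return scaled, scales


def silu_mul_group_quant(row: list[float], group_size: int = 128,
                         fp8_max: float = FP8_E4M3FN_MAX):
    """Unfused reference: activation first, then per-group quantization."""
    return group_quant(silu_and_mul(row), group_size, fp8_max)
```

A fused kernel would produce the same two outputs (quantized values and per-group scales) without writing the intermediate activation back to memory.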

My main feedback is on a performance concern in fp8_utils.py where an import is performed inside a performance-critical function. I've suggested a refactoring to move the import to the module level to avoid repeated overhead.

gemini-code-assist · 2025-09-25T16:53:42Z

vllm/model_executor/layers/quantization/utils/fp8_utils.py

+    # MI300's fp8nuz should be enough to detect if we call ck vs triton
+    if current_platform.is_fp8_fnuz():
+        from aiter import gemm_a8w8_blockscale
+    else:
+        from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
+    return gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)


Importing inside a function that is on a hot path, like this custom op implementation, can introduce performance overhead. It's best practice to move imports to the module level to ensure they are only executed once.

I'd recommend defining a module-level variable that holds the correct gemm_a8w8_blockscale function based on the platform, and then using that variable within this function. This avoids repeated import lookups.

For example, you could add the following logic at the module level (e.g., near the top of the file):

    _gemm_a8w8_blockscale = None
    if current_platform.is_rocm():
        try:
            # MI300's fp8nuz should be enough to detect if we call ck vs triton
            if current_platform.is_fp8_fnuz():
                from aiter import gemm_a8w8_blockscale
            else:
                from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
            _gemm_a8w8_blockscale = gemm_a8w8_blockscale
        except ImportError:
            # aiter is not installed, which is fine.
            # The error will be raised when the op is actually used.
            pass

And then this function's body can be simplified as suggested.

Suggested change

-    # MI300's fp8nuz should be enough to detect if we call ck vs triton
-    if current_platform.is_fp8_fnuz():
-        from aiter import gemm_a8w8_blockscale
-    else:
-        from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
-    return gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
+    if _gemm_a8w8_blockscale is None:
+        raise ImportError(
+            "Aiter backend for gemm_a8w8_blockscale not available. "
+            "Please install aiter.")
+    return _gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
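The pattern suggested in this review (resolve the backend once at module import, then fail lazily at call time if nothing was found) can be tried in isolation with the standard library only. The sketch below is an assumption-laden stand-in: `resolve_first_available`, `hot_path`, and the module name `hypothetical_fast_backend` are made up for illustration, and `math.sqrt` plays the role of the fallback kernel.

```python
import importlib
from typing import Callable, Optional, Sequence


def resolve_first_available(
    candidates: Sequence[tuple[str, str]],
) -> Optional[Callable]:
    """Return the first importable (module, attribute) pair, or None."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this backend is not installed, try the next one
        func = getattr(module, attr_name, None)
        if func is not None:
            return func
    return None


# Resolved once, when this module is imported. The first candidate is a
# hypothetical module that does not exist, so the fallback is selected.
_BACKEND = resolve_first_available([
    ("hypothetical_fast_backend", "sqrt"),
    ("math", "sqrt"),
])


def hot_path(x: float) -> float:
    """Call the pre-resolved backend; raise only when it is actually used."""
    if _BACKEND is None:
        raise ImportError("No backend available for hot_path.")
    return _BACKEND(x)
```

Raising at call time rather than at import time keeps the module importable on machines where no backend is installed, which matches the intent of the `except ImportError: pass` branch in the suggestion.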

charlifu · 2025-09-25T16:56:08Z

#25688 (comment)

Signed-off-by: Micah Williamson <[email protected]>

ProExpertProg · 2025-09-25T19:38:58Z

I'm currently overhauling custom op matching in #24604. We also recently added a torch implementation of group quant; could you compare its performance with AITER? Also, could you compare the perf of the fused AITER kernel to the fused torch.compile kernel for rmsnorm+quant? Happy to help out with instructions, but overall:

- you'll need [Performance] Move apply_w8a8_block_fp8_linear to an op class #24666 reapplied (it was recently reverted) - now in [Perf] Fix and reapply move apply w8a8 block fp8 linear to class #25696
- you'll need to disable quant_fp8 using -O.custom_ops+=-quant_fp8
- you'll have to replace the AITER block quant with QuantFP8
  - we should refactor this after [Performance] Move apply_w8a8_block_fp8_linear to an op class #24666 is re-merged so that the aiter op is under QuantFP8 as well
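For the requested comparison, a small timing harness along these lines may be enough as a starting point. It is a generic sketch, not a vLLM benchmark script: `bench` and `compare` are hypothetical helpers, and the callables passed in would have to wrap the AITER and torch implementations on identical inputs.

```python
import statistics
import time
from typing import Callable, Optional


def bench(fn: Callable[[], object], warmup: int = 3, iters: int = 20,
          sync: Optional[Callable[[], None]] = None) -> float:
    """Return the median wall-clock seconds per call of `fn`.

    `sync` runs before each clock read. For GPU kernels, pass a device
    synchronization function (for example torch.cuda.synchronize) so that
    asynchronous kernel launches are fully included in the measurement.
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):  # warm-up runs absorb compilation and caching
        fn()
    samples = []
    for _ in range(iters):
        sync()
        start = time.perf_counter()
        fn()
        sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(candidates: dict[str, Callable[[], object]],
            **bench_kwargs) -> dict[str, float]:
    """Benchmark each named candidate with the same settings."""
    return {name: bench(fn, **bench_kwargs) for name, fn in candidates.items()}
```

The median is used instead of the mean so that a few slow outlier iterations do not dominate the result.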

gshtras · 2025-09-26T14:52:53Z

vllm/compilation/activation_quant_fusion.py

                                            SiluMulFp8StaticQuantPattern,
-                                            SiluMulNvfp4QuantPattern)
+                                            SiluMulNvfp4QuantPattern,
+                                            AiterSiluMulFp8BlockQuantPattern)


The definition of this symbol is conditional on is_rocm_aiter_linear_enabled(), so any run fails at this import when it is not enabled.

Should be fixed now in cd059b9
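The failure mode described here, and one way to guard against it, can be reproduced without vLLM. The sketch below is not the fix applied in cd059b9; `make_module` and `collect_patterns` are hypothetical helpers that simulate a module whose AITER pattern class only exists when a flag is set.

```python
import types


def make_module(aiter_enabled: bool) -> types.ModuleType:
    """Build a stand-in module that defines the AITER pattern conditionally."""
    mod = types.ModuleType("fake_fusion_patterns")
    mod.SiluMulFp8StaticQuantPattern = type("SiluMulFp8StaticQuantPattern", (), {})
    if aiter_enabled:
        mod.AiterSiluMulFp8BlockQuantPattern = type(
            "AiterSiluMulFp8BlockQuantPattern", (), {})
    return mod


def collect_patterns(mod: types.ModuleType) -> list[type]:
    """Collect pattern classes, skipping the optional one when it is absent."""
    patterns = [mod.SiluMulFp8StaticQuantPattern]
    # getattr with a default avoids the error that an unconditional
    # attribute access (or `from mod import ...`) would raise.
    aiter_pattern = getattr(mod, "AiterSiluMulFp8BlockQuantPattern", None)
    if aiter_pattern is not None:
        patterns.append(aiter_pattern)
    return patterns
```

An alternative is to always define the symbol and gate only its registration, which keeps unconditional imports valid on every platform.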

tjtanaa · 2025-09-28T14:19:59Z

vllm/compilation/activation_quant_fusion.py

+        return x_fp8, out_bs
+
+    direct_register_custom_op(
+        op_name="rocm_aiter_act_mul_and_fp8_group_quant",


Can you check whether the latest aiter lets you skip direct_register_custom_op? As I recall, most ops should now work without calling direct_register_custom_op on the vLLM side, because the registration is done in the AITER repository. Removing the direct_register_custom_op wrappers also avoids additional CPU overhead, which can be costly.

Please take a look at the benchmarking results in PR ROCm#717 (the second and third cases), which show that removing direct_register_custom_op on the vLLM side improves performance.

Hey, thanks for the feedback. Is there a version of aiter which has aiter.ops.triton.fused_fp8_quant and also has these direct_register_custom_ops that you mentioned? I wasn't able to figure out how to call act_mul_and_fp8_group_quant without calling direct_register_custom_op first. Would be happy to investigate further if you can point me in the right direction, otherwise I think we can always come back and get rid of these direct_register_custom_op calls if needed.

Signed-off-by: Micah Williamson <[email protected]>

mergify · 2025-10-07T19:33:28Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: charlifu <[email protected]>

charlifu added 6 commits September 25, 2025 16:41

add aiter silu fused kernel

2f538fa

Signed-off-by: charlifu <[email protected]>

add silu fusion pass

9d6507b

Signed-off-by: charlifu <[email protected]>

fix pass

b901f27

Signed-off-by: charlifu <[email protected]>

workable silu_mul fusion pass

1d11425

Signed-off-by: charlifu <[email protected]>

fix aiter fp8 linear support

b48f84d

Signed-off-by: charlifu <[email protected]>

add is rocm aiter linear enabled

41e7e2f

Signed-off-by: charlifu <[email protected]>

charlifu requested review from zou3519, youkaichao, ProExpertProg, mgoin, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners September 25, 2025 16:51

mergify bot added the rocm Related to AMD ROCm label Sep 25, 2025

charlifu mentioned this pull request Sep 25, 2025

New PR number #25693 #25688

Closed

gemini-code-assist bot reviewed Sep 25, 2025

fusion for AITER group quant RMSNorm and AITER w8a8 gemm

9940a40

Signed-off-by: Micah Williamson <[email protected]>

gshtras reviewed Sep 26, 2025

tjtanaa reviewed Sep 28, 2025

micah-wil added 2 commits October 3, 2025 23:39

fix undefined symbol conditional on is_rocm_aiter_linear_enabled

cd059b9

Signed-off-by: Micah Williamson <[email protected]>

only add aiter rmsnorm fusion patterns if aiter is enabled

6cf02a9

Signed-off-by: Micah Williamson <[email protected]>

mergify bot added the needs-rebase label Oct 7, 2025

charlifu added 2 commits October 8, 2025 16:30

Merge branch 'main' into amd/aiter_fusion_pass

f2cd510

Signed-off-by: charlifu <[email protected]>

fix silu + fp8 block quant pass

7298b55

Signed-off-by: charlifu <[email protected]>

mergify bot removed the needs-rebase label Oct 8, 2025
