[FEAT][ROCm]: Support AITER MLA on V1 Engine #17523
DarkLight1337 merged 48 commits into vllm-project:main
Conversation
Co-authored-by: qli88 <qiang.li2@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: ArthurAMD <yajhuang@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
… if/else statements Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
…tention selector backend Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
```diff
  POLLING_TIMEOUT_S = POLLING_TIMEOUT_MS // 1000

- EXECUTE_MODEL_TIMEOUT_S = 40
+ EXECUTE_MODEL_TIMEOUT_S = (envs.VLLM_ROCM_EXECUTE_MODEL_TIMEOUT
```
wondering why rocm needs a much larger timeout here?
On the first run, when the graph is being created, it might take between 100 and 250 seconds depending on how many AITER kernels are enabled. Thus we kept the default timeout at 250 s.
I'm not crazy about requiring another environment variable when running AITER. Can you just set the timeout to 250 here instead of asking the user to increase the timeout? Feel free to give a "safe" timeout.
Will these very long runs only happen during the profile and graph capture runs? Or can they happen while processing real requests?
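A minimal sketch of the tradeoff discussed in this thread, with hypothetical names (not vLLM's actual worker code): the long timeout is only needed for the first `execute_model` call, when AITER kernels are JIT-compiled and the graph is captured; steady-state steps can keep the short default.

```python
import os

# Hypothetical constants for illustration; not vLLM's actual worker code.
DEFAULT_TIMEOUT_S = 40
# First-run graph capture with AITER can take 100-250 s, so allow an override.
FIRST_RUN_TIMEOUT_S = int(os.getenv("VLLM_ROCM_EXECUTE_MODEL_TIMEOUT", "250"))


def execute_model_timeout(step: int) -> int:
    """Use the long timeout only for the first (warmup/capture) step."""
    return FIRST_RUN_TIMEOUT_S if step == 0 else DEFAULT_TIMEOUT_S
```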
Could you check the failed tests?
Approve with comment.
To make Deepseek on V1 performant, additional work is needed.
Based on my test, it can improve TTFT if additional AITER environment variables are used; otherwise the TTFT is not as good compared to V0. Throughput is also not yet on par with V0.
```diff
  compute_slot_mapping_start_idx,
  is_block_tables_empty)
- from vllm.attention.ops.rocm_aiter_mla import (aiter_mla_decode_fwd,
+ from vllm.attention.ops.rocm_aiter_mla import (aiter_mla_decode_forward,
```
nit: this rename from fwd to forward seems unnecessary; reverting it would minimize the number of files changed in this PR.
It has been addressed in the latest commit.
vllm/attention/ops/rocm_aiter_mla.py (outdated diff)
```diff
- def aiter_mla_decode_fwd(
+ def aiter_mla_decode_forward(
```
We can keep the name as _fwd (see below line 37 decode_fwd)
@houseroad I found that using the environment variables below can fix the huge TTFT issue described in the PR. My command is below: For input-len/output-len/concurrency/prompts 1000/1000/1/2, the TTFT went from 57419.91 to 154.8.
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
SageMoore
left a comment
Looks reasonable. Just a few nits and questions
```python
from aiter import flash_attn_varlen_func
self.flash_attn_varlen_func = flash_attn_varlen_func

def _flash_attn_varlen_diff_headdims(self,
```
@SageMoore you may want to check common.py; the method _flash_attn_varlen_diff_headdims is defined there and overridden in this class.
I see now. I must have mistyped the string when I searched for it :).
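For readers following the thread, the pattern in question can be sketched as a plain subclass override; the class and method bodies here are illustrative, not vLLM's actual code.

```python
class MLACommonImpl:
    """Base MLA implementation; defines the default varlen attention path."""

    def _flash_attn_varlen_diff_headdims(self, q, k, v):
        return "default-flash-attn"


class AiterMLAImpl(MLACommonImpl):
    """ROCm AITER backend; overrides the method to call the AITER kernel."""

    def _flash_attn_varlen_diff_headdims(self, q, k, v):
        return "aiter-flash-attn"
```

Searching for the method name in the subclass alone would miss the base definition in common.py, which is why it looked absent at first glance.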
```python
assert max_model_len == 32768, \
    "AITER MLA requires max_model_len=32768"
assert self.runner.block_size == 1, "AITER MLA" \
    "requires only block size 1."
```
Nit: "only supports block size 1."
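As a side note on the diff above: the two adjacent string literals in the second assert concatenate with no separator, so the failure message would render as "AITER MLArequires only block size 1.". A small sketch of the pitfall:

```python
# Adjacent string literals are concatenated with no separator inserted.
msg = "AITER MLA" \
      "requires only block size 1."

# Putting the space (and the suggested wording) inside a literal fixes it:
fixed = ("AITER MLA "
         "only supports block size 1.")
```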
vllm/envs.py (outdated diff)
```python
lambda: int(os.getenv("VLLM_RPC_TIMEOUT", "10000")),

# Time in seconds for the model execution in ROCm platforms.
"VLLM_ROCM_EXECUTE_MODEL_TIMEOUT":
```
Let's remove this. See below comment.
@SageMoore At this moment we can't find a "safe" timeout: it depends on how many AITER kernels are enabled, and tracing the AITER JIT files can be time-consuming during execution. It may also change as AITER ops evolve across future versions. So rather than a hardcoded timeout that users would have to hunt down in the code, the environment variable lets them control this value directly.
@SageMoore The environment variable has been removed.
@SageMoore : the env change is removed as we discussed. Please merge this asap if there are no other blockers.
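For context, the (since-removed) variable followed the usual vllm/envs.py shape, which can be sketched as a dict of lazily evaluated lambdas. The 250-second default below mirrors the value discussed in this thread and is an assumption for illustration, not the merged code.

```python
import os

# Sketch of the envs.py pattern: each entry maps a variable name to a lambda,
# so the environment is read lazily at access time rather than at import time.
environment_variables = {
    "VLLM_RPC_TIMEOUT":
    lambda: int(os.getenv("VLLM_RPC_TIMEOUT", "10000")),
    # Time in seconds for model execution on ROCm platforms (assumed default).
    "VLLM_ROCM_EXECUTE_MODEL_TIMEOUT":
    lambda: int(os.getenv("VLLM_ROCM_EXECUTE_MODEL_TIMEOUT", "250")),
}
```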
```python
# `context_chunk_starts` that are not aligned to page_size
max_context_chunk = round_down(max_context_chunk,
                               self.page_size)
if self.aot_schedule:
```
Could you explain this a bit? Why was this change necessary?
@SageMoore self.page_size is only defined in __init__ under the self.aot_schedule condition; on ROCm that condition is false, so self.page_size is never defined and this code raises an error that self.page_size does not exist.
vllm/vllm/v1/attention/backends/mla/common.py
Lines 355 to 359 in ba7703e
You may want to ask the author about this, as these line changes were added in this PR.
In any case, defining self.page_size without the self.aot_schedule condition has no effect on ROCm, at least for AITER MLA, which is currently the only MLA backend in V1.
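The failure mode described here reduces to a small sketch (class names hypothetical, attributes mirroring the ones discussed): when page_size is only assigned under the aot_schedule condition, any later unconditional use raises AttributeError on ROCm, where aot_schedule is false; assigning it unconditionally is harmless.

```python
class MetadataBuilderBefore:
    """Only sets page_size when aot_schedule is true (breaks on ROCm)."""

    def __init__(self, aot_schedule: bool, block_size: int):
        self.aot_schedule = aot_schedule
        if self.aot_schedule:
            self.page_size = block_size  # never set when aot_schedule=False


class MetadataBuilderAfter:
    """Always sets page_size; the attribute is simply unused when not needed."""

    def __init__(self, aot_schedule: bool, block_size: int):
        self.aot_schedule = aot_schedule
        self.page_size = block_size
```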
This pull request has merge conflicts that must be resolved before it can be merged.
SageMoore
left a comment
Looks reasonable. Thanks for taking out the timeout changes!
It seems that this PR introduced a static check error: https://github.com/vllm-project/vllm/actions/runs/14923384538/job/41922811885?pr=17845
fixed by #17880
AITER MLA Support for V1 Engine
This PR implements AITER MLA attention backend support for the V1 engine. The implementation mirrors the V0 engine's established approach from PR #15893.
This PR also introduces a new environment variable, VLLM_ROCM_EXECUTE_MODEL_TIMEOUT, which specifies the model execution timeout in seconds. This allows flexible adjustment of the timeout, which is helpful because a timeout error was encountered during graph building when enabling AITER MLA ops on the V1 engine.

Accuracy Validation
using the command below:

```shell
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,trust_remote_code=True,max_model_len=32768,block_size=1,enforce_eager=False \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
```

Results:
Performance:
The results of benchmarks/benchmark_serving.py using the commands below:

v0 engine:
```shell
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=0 python benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-V3 --trust-remote-code --dataset-name random
```

v1 engine:
```shell
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=1 python benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-V3 --trust-remote-code --dataset-name random
```