[FEAT][ROCm]: Support AITER MLA on V1 Engine #17523
DarkLight1337 merged 48 commits into vllm-project:main
Conversation
Co-authored-by: qli88 <qiang.li2@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: ArthurAMD <yajhuang@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
… if/else statements Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
…tention selector backend Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
```diff
  POLLING_TIMEOUT_S = POLLING_TIMEOUT_MS // 1000

- EXECUTE_MODEL_TIMEOUT_S = 40
+ EXECUTE_MODEL_TIMEOUT_S = (envs.VLLM_ROCM_EXECUTE_MODEL_TIMEOUT
```
wondering why rocm needs a much larger timeout here?
On the first run, when the graph is being created, it might take between 100 and 250 seconds depending on how many AITER kernels are enabled. Thus we kept the default timeout at 250 s.
I'm not crazy about requiring another environment variable when running AITER. Can you just set the timeout to 250 here instead of asking the user to increase the timeout? Feel free to give a "safe" timeout.
Will these very long runs only happen during the profile and graph capture runs? Or can they happen while processing real requests?
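A minimal sketch of the tradeoff discussed in this thread, with hypothetical names (not vLLM's actual worker code): the long timeout is only needed for the first `execute_model` call, when AITER kernels are JIT-compiled and the graph is captured; steady-state steps can keep the short default.

```python
import os

# Hypothetical constants for illustration; not vLLM's actual worker code.
DEFAULT_TIMEOUT_S = 40
# First-run graph capture with AITER can take 100-250 s, so allow an override.
FIRST_RUN_TIMEOUT_S = int(os.getenv("VLLM_ROCM_EXECUTE_MODEL_TIMEOUT", "250"))


def execute_model_timeout(step: int) -> int:
    """Use the long timeout only for the first (warmup/capture) step."""
    return FIRST_RUN_TIMEOUT_S if step == 0 else DEFAULT_TIMEOUT_S
```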
Could you check the failed tests?
Approve with comment.
To make Deepseek on V1 performant, additional work is needed.
Based on my test, it can improve TTFT if additional AITER environment variables are used; otherwise the TTFT is not as good compared to V0. Throughput is also not yet on par with V0.
```diff
  compute_slot_mapping_start_idx,
  is_block_tables_empty)
- from vllm.attention.ops.rocm_aiter_mla import (aiter_mla_decode_fwd,
+ from vllm.attention.ops.rocm_aiter_mla import (aiter_mla_decode_forward,
```
nit: this rename from fwd to forward seems unnecessary; reverting it would minimize the number of files changed in this PR.
It has been addressed in the latest commit.
vllm/attention/ops/rocm_aiter_mla.py (outdated diff)
```diff
- def aiter_mla_decode_fwd(
+ def aiter_mla_decode_forward(
```
We can keep the name as _fwd (see below line 37 decode_fwd)
@houseroad I found that using the environment variables below can fix the huge TTFT issue described in the PR. My command is below: For input-len/output-len/concurrency/prompts 1000/1000/1/2, the TTFT went from 57419.91 to 154.8.
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
SageMoore
left a comment
Looks reasonable. Just a few nits and questions
```python
from aiter import flash_attn_varlen_func
self.flash_attn_varlen_func = flash_attn_varlen_func

def _flash_attn_varlen_diff_headdims(self,
```
@SageMoore you may want to check common.py; the method _flash_attn_varlen_diff_headdims is defined there and overridden in this class.
I see now. I must have mistyped the string when I searched for it :).
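For readers following the thread, the pattern in question can be sketched as a plain subclass override; the class and method bodies here are illustrative, not vLLM's actual code.

```python
class MLACommonImpl:
    """Base MLA implementation; defines the default varlen attention path."""

    def _flash_attn_varlen_diff_headdims(self, q, k, v):
        return "default-flash-attn"


class AiterMLAImpl(MLACommonImpl):
    """ROCm AITER backend; overrides the method to call the AITER kernel."""

    def _flash_attn_varlen_diff_headdims(self, q, k, v):
        return "aiter-flash-attn"
```

Searching for the method name in the subclass alone would miss the base definition in common.py, which is why it looked absent at first glance.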
```python
assert max_model_len == 32768, \
    "AITER MLA requires max_model_len=32768"
assert self.runner.block_size == 1, "AITER MLA" \
    "requires only block size 1."
```
Nit: "only supports block size 1."
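As a side note on the diff above: the two adjacent string literals in the second assert concatenate with no separator, so the failure message would render as "AITER MLArequires only block size 1.". A small sketch of the pitfall:

```python
# Adjacent string literals are concatenated with no separator inserted.
msg = "AITER MLA" \
      "requires only block size 1."

# Putting the space (and the suggested wording) inside a literal fixes it:
fixed = ("AITER MLA "
         "only supports block size 1.")
```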
vllm/envs.py (outdated diff)
```python
lambda: int(os.getenv("VLLM_RPC_TIMEOUT", "10000")),

# Time in seconds for the model execution in ROCm platforms.
"VLLM_ROCM_EXECUTE_MODEL_TIMEOUT":
```
Let's remove this. See below comment.
@SageMoore At this moment we can't find a "safe" timeout: it depends on how many AITER kernels are enabled, and tracing the AITER JIT files can be time-consuming during execution. It may also change as AITER ops evolve across future versions. So rather than a hardcoded timeout that users would have to hunt down in the code, the environment variable lets them control this value directly.
@SageMoore The environment variable has been removed.
@SageMoore : the env change is removed as we discussed. Please merge this asap if there are no other blockers.
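For context, the (since-removed) variable followed the usual vllm/envs.py shape, which can be sketched as a dict of lazily evaluated lambdas. The 250-second default below mirrors the value discussed in this thread and is an assumption for illustration, not the merged code.

```python
import os

# Sketch of the envs.py pattern: each entry maps a variable name to a lambda,
# so the environment is read lazily at access time rather than at import time.
environment_variables = {
    "VLLM_RPC_TIMEOUT":
    lambda: int(os.getenv("VLLM_RPC_TIMEOUT", "10000")),
    # Time in seconds for model execution on ROCm platforms (assumed default).
    "VLLM_ROCM_EXECUTE_MODEL_TIMEOUT":
    lambda: int(os.getenv("VLLM_ROCM_EXECUTE_MODEL_TIMEOUT", "250")),
}
```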
```python
# `context_chunk_starts` that are not aligned to page_size
max_context_chunk = round_down(max_context_chunk,
                               self.page_size)
if self.aot_schedule:
```
Could you explain this a bit? Why was this change necessary?
@SageMoore self.page_size is only defined in __init__ under the self.aot_schedule condition; on ROCm that condition is false, so self.page_size is never defined and this code raises an error that self.page_size does not exist.
vllm/vllm/v1/attention/backends/mla/common.py
Lines 355 to 359 in ba7703e
You may want to ask the author about this, as these line changes were added in this PR.
In any case, defining self.page_size without the self.aot_schedule condition has no effect on ROCm, at least for AITER MLA, which is currently the only MLA backend in V1.
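The failure mode described here reduces to a small sketch (class names hypothetical, attributes mirroring the ones discussed): when page_size is only assigned under the aot_schedule condition, any later unconditional use raises AttributeError on ROCm, where aot_schedule is false; assigning it unconditionally is harmless.

```python
class MetadataBuilderBefore:
    """Only sets page_size when aot_schedule is true (breaks on ROCm)."""

    def __init__(self, aot_schedule: bool, block_size: int):
        self.aot_schedule = aot_schedule
        if self.aot_schedule:
            self.page_size = block_size  # never set when aot_schedule=False


class MetadataBuilderAfter:
    """Always sets page_size; the attribute is simply unused when not needed."""

    def __init__(self, aot_schedule: bool, block_size: int):
        self.aot_schedule = aot_schedule
        self.page_size = block_size
```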
This pull request has merge conflicts that must be resolved before it can be merged.
SageMoore
left a comment
Looks reasonable. Thanks for taking out the timeout changes!
It seems that this PR introduced a static check error: https://github.com/vllm-project/vllm/actions/runs/14923384538/job/41922811885?pr=17845
fixed by #17880
AITER MLA Support for V1 Engine
This PR implements AITER MLA attention backend support for the V1 engine. The implementation mirrors the V0 engine's established approach from PR #15893.
This PR also introduces a new environment variable, VLLM_ROCM_EXECUTE_MODEL_TIMEOUT, which specifies the model execution timeout in seconds. This allows flexible adjustment of the timeout, which is helpful because a timeout error was encountered during graph building when enabling AITER MLA ops on the V1 engine.

Accuracy Validation
using the command below:

```shell
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,trust_remote_code=True,max_model_len=32768,block_size=1,enforce_eager=False \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
```

Results:
Performance:
The results of benchmarks/benchmark_serving.py using the commands below:

v0 engine:
```shell
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=0 python benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-V3 --trust-remote-code --dataset-name random
```

v1 engine:
```shell
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=1 python benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-V3 --trust-remote-code --dataset-name random
```