
Commit 28a1529

[cherry-pick][v0.11.0-dev][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4099)
### What this PR does / why we need it?

This is a cherry-pick of #4097. Currently, `seq_lens` in the dummy attn_metadata is set to `max_model_len` so that the maximum attention workspace is reserved during capturing. However, keeping it at `max_model_len` causes dummy_run to execute a long attention pass during actual inference. For example, with a single request whose `seq_lens` is [8] but a `max_model_len` of 131072, the whole process is slowed down by dummy_run because it executes a fake long-sequence attention. We therefore set it to `max_query_len` instead, which is also consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

---------
Signed-off-by: Angazenn <[email protected]>
1 parent 7732a89 commit 28a1529

File tree

1 file changed: +1 −1 lines changed


vllm_ascend/worker/model_runner_v1.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -2258,7 +2258,7 @@ def _build_dummy_attn_metadata(

         attn_metadata = {}

-        seq_lens = self.model_config.max_model_len
+        seq_lens = max_query_len
         self.seq_lens_np[:num_reqs] = seq_lens
         self.seq_lens_np[num_reqs:] = 0
```
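Below is a minimal sketch (not the actual vllm-ascend implementation) of how this fill behaves before and after the change. The names `seq_lens_np`, `num_reqs`, and `max_query_len` mirror the diff above; the concrete sizes and the standalone array are illustrative assumptions.

```python
import numpy as np

# Assumed example values: one real request, a padded dummy batch of 4,
# a 131072-token context window, and a longest query of 8 tokens.
max_model_len = 131072
max_query_len = 8
num_reqs = 1
max_num_reqs = 4

seq_lens_np = np.zeros(max_num_reqs, dtype=np.int32)

# Before: every dummy request pretends to span max_model_len tokens,
# so dummy_run executes attention over a 131072-token sequence.
seq_lens_np[:num_reqs] = max_model_len

# After: dummy requests are only as long as the longest real query,
# matching the vLLM GPU model runner and keeping dummy_run cheap.
seq_lens_np[:num_reqs] = max_query_len
seq_lens_np[num_reqs:] = 0
print(seq_lens_np)  # -> [8 0 0 0]
```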
