
Commit 0f7492d

[Bugfix] Fix OOM during chunked prefill with long contexts such as 64k (#2319)

The attention mask was declared in mla.py; the splitfuse mask is not needed for MLA chunked prefill, and building it causes out-of-memory failures with long contexts such as 64k or 128k.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@14a5d90

Signed-off-by: haojiangzheng <[email protected]>
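To make the memory concern concrete, here is a back-of-the-envelope sizing sketch. It assumes (an assumption for illustration, not something stated in the patch) that the splitfuse mask is materialized as a dense [seq_len, seq_len] tensor in a 2-byte dtype; under that assumption the mask alone reaches roughly 8 GiB at 64k tokens and 32 GiB at 128k.

# Hedged sizing sketch: assumes a dense [seq_len, seq_len] mask in a 2-byte
# dtype (e.g. fp16/bf16). The real vllm_ascend mask layout may differ; this
# only illustrates why the footprint explodes at long context lengths.
def dense_mask_gib(seq_len: int, bytes_per_elem: int = 2) -> float:
    """Return the size in GiB of a dense seq_len x seq_len mask."""
    return seq_len * seq_len * bytes_per_elem / 1024**3


if __name__ == "__main__":
    for ctx in (64 * 1024, 128 * 1024):
        print(f"{ctx} tokens -> ~{dense_mask_gib(ctx):.1f} GiB")
    # 65536 tokens -> ~8.0 GiB
    # 131072 tokens -> ~32.0 GiB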
1 parent 8bfd16a commit 0f7492d

File tree

1 file changed: +1 addition, −1 deletion


vllm_ascend/worker/model_runner_v1.py

Lines changed: 1 addition & 1 deletion
@@ -842,7 +842,7 @@ def get_supported_tasks(self) -> "tuple[SupportedTask, ...]":
     def _make_attention_mask(self, seq_lens, query_lens, position,
                              attn_state) -> torch.Tensor:
         # Chunk Prefill situation.
-        if attn_state == AscendAttentionState.ChunkedPrefill:
+        if attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla:
             return self.attn_mask_builder.get_splitfuse_attn_mask(
                 seq_lens, query_lens, position, self.dtype, self.device)
         # Prefill without cache situation.
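For illustration only, the sketch below mirrors the branch condition added in this commit using mocked-up names (the enum and the use_mla flag are simplified stand-ins, not the real vllm_ascend objects): with the fix, chunked prefill on an MLA model no longer selects the splitfuse mask path.

# Minimal, self-contained sketch of the new branch condition. Only the
# control flow matches the patched _make_attention_mask; everything else
# here is a stand-in for the real types.
from enum import Enum, auto


class AscendAttentionState(Enum):
    ChunkedPrefill = auto()
    PrefillNoCache = auto()


def mask_strategy(attn_state: AscendAttentionState, use_mla: bool) -> str:
    # After the fix: the splitfuse mask is built only for non-MLA chunked
    # prefill; MLA models fall through to the other mask-building branches.
    if attn_state == AscendAttentionState.ChunkedPrefill and not use_mla:
        return "splitfuse_attn_mask"
    return "other_branches"


if __name__ == "__main__":
    print(mask_strategy(AscendAttentionState.ChunkedPrefill, use_mla=True))   # other_branches
    print(mask_strategy(AscendAttentionState.ChunkedPrefill, use_mla=False))  # splitfuse_attn_mask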
