
Conversation

@Semmer2 (Contributor) commented Nov 24, 2025

Fix the hang that occurs when a model runs _npu_flash_attention in _forward_prefill_no_cache; it was caused by a wrong attention mask dtype.

What this PR does / why we need it?

Converts the chunked-prefill attention mask to torch.bool before it is passed to _npu_flash_attention, which fixes the prefill hang described above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested on Qwen2.5-VL and Qwen2.5-Omni.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request effectively addresses a reported hang issue in _npu_flash_attention by refining the handling of attention mask dtypes. The key change involves refactoring the logic for retrieving the chunked_prefill_attn_mask into a dedicated method, get_chunked_prefill_attn_mask. This new method explicitly ensures the mask is converted to torch.bool, which is crucial for the correct operation of the attention mechanism. This refactoring not only fixes the bug but also enhances code clarity and maintainability by centralizing the dtype conversion for this specific mask.

Comment on lines 68 to 71

```python
def get_chunked_prefill_attn_mask(self):
    return self.chunked_prefill_attn_mask.to(torch.bool)

def get_attn_mask(self, max_seq_len: int, dtype: torch.dtype,
```

Severity: high

Introducing get_chunked_prefill_attn_mask and removing the conditional logic from get_attn_mask is a significant improvement. The refactoring cleanly separates the responsibility of providing the chunked prefill attention mask and explicitly enforces its torch.bool dtype, which directly addresses the reported "wrong attention mask dtype" issue that was causing hangs in _npu_flash_attention. It also leaves get_attn_mask focused on its general purpose.
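
To make the dtype point concrete, here is a minimal standalone sketch; the mask shape and the float16 starting dtype are assumptions for illustration, and only the .to(torch.bool) cast mirrors the PR:

```python
import torch

# Stand-in for a pre-built chunked-prefill mask kept in the model's
# compute dtype (assumed float16 here for illustration).
chunked_prefill_attn_mask = torch.triu(
    torch.ones(8, 8, dtype=torch.float16), diagonal=1)
print(chunked_prefill_attn_mask.dtype)  # torch.float16

# The dedicated getter now performs this cast, so the NPU flash-attention
# kernel always receives a boolean mask.
bool_mask = chunked_prefill_attn_mask.to(torch.bool)
print(bool_mask.dtype)  # torch.bool
```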

```python
elif attn_state == AscendAttentionState.PrefillCacheHit:
    return self.attn_mask_builder.get_attn_mask(
        2048, self.dtype, self.device)
return self.attn_mask_builder.get_chunked_prefill_attn_mask()
```

Severity: high

Updating the call to use the new get_chunked_prefill_attn_mask() method is a correct and consistent application of the refactored logic. This change ensures that the attention mask returned on the chunked-prefill path consistently has the torch.bool dtype, which is essential for preventing the _npu_flash_attention hang described in the PR.
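
For readers without the full file at hand, here is a self-contained sketch of this dispatch. The enum and builder below are simplified stand-ins, not the project's actual implementations; only the method names and the boolean-cast behaviour follow the quoted hunks.

```python
import enum

import torch


class AscendAttentionState(enum.Enum):  # stand-in for the real enum
    PrefillNoCache = enum.auto()
    PrefillCacheHit = enum.auto()
    ChunkedPrefill = enum.auto()


class AttentionMaskBuilder:  # minimal stand-in, structure assumed
    def __init__(self, max_seq_len: int = 2048,
                 dtype: torch.dtype = torch.float16):
        # Upper-triangular causal mask; a float copy stands in for the
        # pre-built chunked_prefill_attn_mask in the compute dtype.
        self._mask = torch.triu(
            torch.ones(max_seq_len, max_seq_len, dtype=torch.bool),
            diagonal=1)
        self.chunked_prefill_attn_mask = self._mask.to(dtype)

    def get_attn_mask(self, max_seq_len, dtype, device):
        # Generic path used by the PrefillCacheHit branch in the hunk.
        return self._mask[:max_seq_len, :max_seq_len].to(dtype).to(device)

    def get_chunked_prefill_attn_mask(self):
        # New dedicated accessor: always return a torch.bool mask.
        return self.chunked_prefill_attn_mask.to(torch.bool)


def select_mask(builder, attn_state, dtype, device):
    # Mirrors the dispatch in the quoted hunk.
    if attn_state == AscendAttentionState.PrefillCacheHit:
        return builder.get_attn_mask(2048, dtype, device)
    return builder.get_chunked_prefill_attn_mask()


builder = AttentionMaskBuilder()
mask = select_mask(builder, AscendAttentionState.PrefillNoCache,
                   torch.float16, torch.device("cpu"))
assert mask.dtype == torch.bool  # the property the fix guarantees
```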

@Semmer2 force-pushed the FixFAHangIssue branch 3 times, most recently from f5f2eac to 36d1b5e on November 25, 2025 at 07:30
Fix the hang when a model runs _npu_flash_attention in _forward_prefill_no_cache; it was caused by a wrong attention mask dtype.

Signed-off-by: Ting FU <[email protected]>