
Commit d100776

mgoin and gemini-code-assist[bot] authored and committed
[Bugfix] Disable cascade attention with FlashInfer (#26130)
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Michael Goin <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: simon-mo <[email protected]>
1 parent c75c2e7 commit d100776

File tree

1 file changed: +3, -2 lines

vllm/v1/attention/backends/flashinfer.py

Lines changed: 3 additions & 2 deletions
@@ -29,7 +29,6 @@
     flashinfer_disable_q_quantization,
     supports_trtllm_attention,
     use_trtllm_attention)
-from vllm.v1.attention.backends.flash_attn import use_cascade_attention
 # yapf conflicts with isort for this block
 # yapf: disable
 from vllm.v1.attention.backends.utils import (AttentionCGSupport,
@@ -677,7 +676,9 @@ def use_cascade_attention(self, *args, **kwargs) -> bool:
             # TODO: The cascade wrapper currently does not support setting
             # kv cache dtype to something different from query dtype.
             return False
-        return use_cascade_attention(*args, **kwargs)
+        # TODO: Cascade attention doesn't work, disable it for now
+        # return use_cascade_attention(*args, **kwargs)
+        return False


 class FlashInferImpl(AttentionImpl):
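
The change keeps cascade-attention call sites intact and simply has the FlashInfer backend report that it cannot use the feature, so callers fall back to the regular attention path. Below is a minimal, self-contained sketch of that capability-hook pattern under stated assumptions: the class and function names (AttentionBackendBase, FlashInferLikeBackend, plan_attention) are illustrative only and are not vLLM's actual API.

```python
# Illustrative sketch of a capability hook for cascade (shared-prefix)
# attention. All names here are hypothetical; only the "always return
# False" behavior mirrors the patch above.

class AttentionBackendBase:
    def use_cascade_attention(self, *args, **kwargs) -> bool:
        # Default: a backend that supports cascade attention may opt in
        # when the batch shares a long common prefix.
        return True


class FlashInferLikeBackend(AttentionBackendBase):
    def use_cascade_attention(self, *args, **kwargs) -> bool:
        # Mirrors the patch: cascade attention is disabled for this
        # backend until its cascade wrapper works, so always opt out.
        return False


def plan_attention(backend: AttentionBackendBase, common_prefix_len: int) -> str:
    # The caller consults the hook and falls back to the regular
    # attention path whenever the backend opts out.
    if common_prefix_len > 0 and backend.use_cascade_attention(
            common_prefix_len=common_prefix_len):
        return "cascade"
    return "regular"


if __name__ == "__main__":
    # Prints "regular": batches routed to the FlashInfer-like backend
    # never take the cascade path after this change.
    print(plan_attention(FlashInferLikeBackend(), common_prefix_len=256))
```

The design choice here is that disabling the feature at the hook keeps the rest of the scheduling and metadata-building code untouched, which makes it easy to re-enable later by restoring the single commented-out call.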
