[CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests #26663
Conversation
Could you explain exactly which test is failing? Could it just be that the test is using the "wrong" layout?
Thanks @tdoublep! The failing tests are from https://github.com/vllm-project/vllm/blob/a6049be73cb965bad04f6657de6c4d98261a5237/tests/v1/attention/test_attention_backends.py where all 15 tests fail on H100 with the same unpack error.
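For reference, a minimal sketch (with hypothetical dimensions, not the test's actual fixtures) of how the unpack error arises when the size-2 K/V axis is not where the backend's `unbind` call expects it:

```python
import torch

# Hypothetical sizes for illustration only.
num_blocks, block_size, num_kv_heads, head_dim = 4, 16, 2, 8

# Layout the Triton backend unbinds on dim 1: (num_blocks, 2, ...).
kv_cache = torch.zeros(num_blocks, 2, block_size, num_kv_heads, head_dim)
key_cache, value_cache = kv_cache.unbind(1)  # works: dim 1 has size 2

# If the K/V axis sits elsewhere, unbind(1) yields num_blocks tensors and
# unpacking them into two names raises the error seen in the failing tests.
kv_cache_bad = torch.zeros(2, num_blocks, block_size, num_kv_heads, head_dim)
try:
    key_cache, value_cache = kv_cache_bad.unbind(1)
except ValueError as err:
    print(err)  # too many values to unpack (expected 2)
```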
OK. I think the test should be modified to provide the KV cache with the correct layout; e.g., we can look at how it works for FlashInfer, which has the same (num_blocks, 2, ...) layout.
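A rough sketch of what the test-side fix could look like, assuming the test previously built the cache with the K/V axis leading (the real test constructs the cache through its own helpers, and the trailing shape order may differ):

```python
import torch

# Hypothetical dimensions; the actual test uses its own fixtures.
num_blocks, block_size, num_kv_heads, head_dim = 4, 16, 2, 8

# Cache built with the K/V axis leading: (2, num_blocks, ...).
kv_cache = torch.randn(2, num_blocks, block_size, num_kv_heads, head_dim)

# Reorder to the FlashInfer-style (num_blocks, 2, ...) layout before handing it
# to the Triton backend, so its unbind on dim 1 sees the size-2 axis.
kv_cache_nb2 = kv_cache.transpose(0, 1).contiguous()
assert kv_cache_nb2.shape[:2] == (num_blocks, 2)
```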
Force-pushed from 7798745 to 81e81de. Signed-off-by: Huamin Li <[email protected]>
Thanks @tdoublep for the suggestions! I updated my PR to only change the test. PTAL
Thank you. Curious now why this test failure wasn't caught as part of CI. Are we failing to trigger this test when we change the attention backend code?
I don't think this test is currently running in CI. We are trying to enable it in #26649, which is how I found the failure.
This seems to be covered already by #26597.
LGTM
How come we can't run the attention tests on L4 where the other tests run?
LGTM (needs follow-up to enable the test to run in CI)
Hmm, the CI failure suggests we are trying to run this test on CPU now?? Looking at the test job definition, I don't understand why the attention test is running.
The CPU tests pick up the attention tests after this PR :(
@yeqcharlotte Do you understand how that can be happening? I'm a bit baffled tbh
It's like the CI job is trying to execute commands that are different from what is checked into the branch. It's really weird. I tried to create a clean branch (added you as co-author @hl475) with this change and triggered the failing job in CI to see if it is reproducible.
So after some investigation, it looks like we are now generating the test pipeline automatically based on the files that have changed (vllm-project/ci-infra#184). This PR changes a single test that should be run on GPU, and the pipeline is trying to run it in the CPU jobs.
I have been investigating this. This isn't really a test filtering issue. The main problem is that these tests are orphaned and not being run anywhere in the first place. Test filtering is doing a "best guess" but ends up putting them in the wrong test group.
The CI issue above is now resolved in vllm-project/ci-infra#194.
LGTM
@rzabarazesh Thank you for the investigation and fix! @yeqcharlotte It looks like we can't merge until you approve, since you requested changes.
Purpose
This PR makes K/V cache unbinding robust across cache layouts by detecting the axis of size 2 at runtime instead of assuming it sits at `dim=1`. This fixes unpacking errors seen when `kv_cache` is shaped with the K/V dimension elsewhere (e.g., `dim=0`).

When running `tests/v1/attention/test_attention_backends.py` on H100, the K/V cache unbind failed with `ValueError: too many values to unpack (expected 2)`.

This change avoids relying on a specific layout and works with both older and newer cache shapes. Per the review discussion, we updated the test to provide Triton the same (num_blocks, 2, …) KV cache layout that FlashInfer uses.
Test Plan
Run `tests/v1/attention/test_attention_backends.py` on H100.
Test Result