Conversation

@hl475 (Contributor) commented Oct 12, 2025

Purpose

This PR makes K/V cache unbinding robust across cache layouts by detecting the axis of size 2 at runtime instead of assuming it sits at dim=1. This fixes unpacking errors seen when kv_cache is shaped with the K/V dimension elsewhere (e.g., dim=0).

When running tests/v1/attention/test_attention_backends.py on H100, the following line

key_cache, value_cache = kv_cache.unbind(1)

failed with ValueError: too many values to unpack (expected 2)
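
A quick way to see the failure mode (shapes below are made up for illustration): if the stacked K/V axis sits at dim=0, unbind(1) splits along num_blocks and returns one tensor per block, so unpacking into two names fails.

```python
import torch

# Hypothetical layout with the stacked K/V axis at dim=0:
# (2, num_blocks, block_size, num_kv_heads, head_size)
kv_cache = torch.zeros(2, 4, 16, 8, 64)

try:
    # unbind(1) splits along num_blocks here, yielding 4 tensors.
    key_cache, value_cache = kv_cache.unbind(1)
except ValueError as err:
    print(err)  # too many values to unpack (expected 2)
```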

This change avoids relying on a specific layout and works with both older and newer cache shapes.
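
A minimal sketch of the runtime detection (illustrative only, not the exact vLLM code; unbind_kv is a made-up name, and the sketch assumes no other dimension happens to have size 2):

```python
import torch

def unbind_kv(kv_cache: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Locate the stacked K/V axis (size 2) instead of hardcoding dim=1,
    # so both (2, num_blocks, ...) and (num_blocks, 2, ...) layouts work.
    kv_dim = next(i for i, size in enumerate(kv_cache.shape) if size == 2)
    key_cache, value_cache = kv_cache.unbind(kv_dim)
    return key_cache, value_cache
```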

Per discussion below, we updated the test to provide Triton the same (num_blocks, 2, …) KV cache layout that FlashInfer uses.
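
A rough illustration of that layout with made-up sizes (the actual test constructs the cache through the backend's own utilities):

```python
import torch

# Made-up sizes for illustration only.
num_blocks, block_size, num_kv_heads, head_size = 128, 16, 8, 64

# FlashInfer-style stacked layout:
# (num_blocks, 2, block_size, num_kv_heads, head_size)
kv_cache = torch.zeros(num_blocks, 2, block_size, num_kv_heads, head_size)

# With this layout, the backend's hardcoded unbind works as intended.
key_cache, value_cache = kv_cache.unbind(1)
assert key_cache.shape == (num_blocks, block_size, num_kv_heads, head_size)
```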

Test Plan

pytest -v -s tests/v1/attention/test_attention_backends.py

on H100

Test Result

============================================================= 15 passed, 1 warning in 48.49s ==============================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Oct 12, 2025
@hl475 hl475 marked this pull request as ready for review October 12, 2025 18:23
@hl475 hl475 requested a review from tdoublep as a code owner October 12, 2025 18:23
@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@tdoublep (Member)

Could you explain exactly which test is failing? Could it just be that the test is using the "wrong" layout?

@hl475 (Contributor, Author) commented Oct 12, 2025

> Could you explain exactly which test is failing? Could it just be that the test is using the "wrong" layout?

Thanks @tdoublep! The failing tests are in https://github.com/vllm-project/vllm/blob/a6049be73cb965bad04f6657de6c4d98261a5237/tests/v1/attention/test_attention_backends.py, where all 15 tests fail on H100 with the same unpack error.

@tdoublep (Member) commented Oct 12, 2025

OK. I think the test should be modified to provide the KV cache with the correct layout. For example, we can look at how it works for FlashInfer, which has the same (num_blocks, 2, ...) layout.

Signed-off-by: Huamin Li <[email protected]>
@hl475 hl475 force-pushed the fix_kv_cache_unbind branch from 7798745 to 81e81de on October 12, 2025 18:57
@hl475 hl475 changed the title Triton attention: detect K/V axis when unbinding kv_cache (avoid hardcoded dim=1) tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests Oct 12, 2025
@hl475 (Contributor, Author) commented Oct 12, 2025

Thanks @tdoublep for the suggestions! I updated the PR to change only the test. PTAL.

@tdoublep (Member)

Thank you. I'm now curious why this test failure wasn't caught in CI. Are we failing to trigger this test when we change the attention backend code?

@hl475 (Contributor, Author) commented Oct 12, 2025

> Thank you. I'm now curious why this test failure wasn't caught in CI. Are we failing to trigger this test when we change the attention backend code?

I don't think this test currently runs in CI. We are trying to enable it in #26649, which is how I found the failure.

@yeqcharlotte yeqcharlotte changed the title tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests [CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests Oct 13, 2025
@yeqcharlotte (Collaborator)

@tdoublep there's no H100 in the test queue, so most of the relevant attention tests have not been running. @hl475 and @simon-mo just got a new one set up this week.

@seindum commented Oct 13, 2025

This seems to be covered already by #26597.

@yeqcharlotte (Collaborator) left a comment

LGTM

@yeqcharlotte yeqcharlotte enabled auto-merge (squash) October 14, 2025 15:37
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 14, 2025
@tdoublep (Member)

> there's no H100 in the test queue, so most of the relevant attention tests have not been running.

How come we can't run the attention tests on L4, where the other tests run?

@tdoublep (Member) left a comment

LGTM (needs follow-up to enable test to run in CI)

@tdoublep (Member) commented Oct 14, 2025

Hmm, the CI failure suggests we are now trying to run this test on CPU? Looking at the test job definition, I don't understand why the attention test is running.

@yeqcharlotte (Collaborator) left a comment

CPU tests pick up the attention tests after this PR :(

@tdoublep (Member)

@yeqcharlotte Do you understand how that can be happening? I'm a bit baffled tbh

@tdoublep (Member)

It's like the CI job is trying to execute commands that are different from what is checked into the branch. It's really weird. I tried creating a clean branch (adding you as co-author, @hl475) with this change and triggered the failing job in CI to see if it is reproducible.

@tdoublep (Member) commented Oct 15, 2025

So after some investigation, it looks like we are now generating the test pipeline automatically based on the files that have changed (vllm-project/ci-infra#184).

This PR changes a single test that should run on GPU, yet CI is trying to run it in the CPU jobs.

@rzabarazesh (Collaborator)

I have been investigating this. It isn't really a test filtering issue; the main problem is that these tests are orphaned and not run anywhere in the first place. Test filtering makes a "best guess" but ends up putting them in the wrong test group.

@rzabarazesh (Collaborator)

The CI issue above is now resolved in vllm-project/ci-infra#194. As far as the signals are concerned, however, this test is still orphaned.

@tdoublep (Member) left a comment

LGTM

@tdoublep (Member)

@rzabarazesh Thank you for the investigation and fix!

@yeqcharlotte It looks like we can't merge until you approve since you requested changes.

@yeqcharlotte yeqcharlotte merged commit c312320 into vllm-project:main Oct 18, 2025
21 checks passed