[feat] Enable mm caching for transformers backend #21358
Conversation
Signed-off-by: raushan <[email protected]>
Code Review
This pull request correctly decouples multi-modal (MM) caching from prefix caching. The changes are logical and well-targeted. By switching the condition in `need_extra_keys` from `request.mm_positions` to `request.mm_hashes`, prefix caching for MM inputs is now correctly triggered only when MM hashes are available. The removal of the error-raising check in the `transformers` backend complements this by allowing it to gracefully opt out of MM caching. The documentation update is also consistent with these changes. The fix appears solid and addresses the described issue effectively.
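For context, here is a minimal sketch of the kind of condition change described above, assuming a request object that exposes `mm_positions` and `mm_hashes`; the helper shape below is illustrative, not the exact vLLM source:

```python
# Illustrative sketch only; the real need_extra_keys in vLLM may check
# additional conditions and differ in structure.
def need_extra_keys(request) -> bool:
    """Decide whether extra keys must be mixed into prefix-cache block hashes."""
    # Before: any request with MM inputs required extra keys, which broke
    # backends that never compute MM hashes.
    # return bool(request.mm_positions)

    # After: extra keys are only needed when MM hashes actually exist, so a
    # backend that skips hashing simply falls back to plain prefix caching.
    return bool(request.mm_hashes)
```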
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Could you elaborate a bit? Do you mean that because you return …
Exactly! In the current version there is a dependency: if …
Hmm, then what happens if the user explicitly sets …
Btw, why do we have to skip computing the hashes in the first place? Even if they are not used in the multimodal processor, we can still use them for caching multimodal embeddings, which is used for chunked prefill.
We can't cache during processing the same way vLLM processors do, because vLLM applies text-only/MM-only processing and then adds placeholders manually. Transformers doesn't support this per model and it would be too much work to implement, so yesterday we decided with @hmellor that processor caching will not be an option for the Transformers backend.
Do we still need to compute caches from the processor code for that? Hmm, I assumed prefix caching and chunked prefill recompute the cache. In that case, how is MM caching linked to other parts of vLLM?
It won't change anything for users, …
IIRC we only compute the cache once in the multimodal processor in P0 and pass the hash directly to P1 where the model is run. So it is still helpful to compute the hashes.
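To illustrate the P0/P1 split being described (hash computed once on the processor side, then reused as a cache key on the engine side), here is a hedged sketch; the names `mm_hash_for`, `MMInput`, and `get_or_compute_embedding` are hypothetical, not vLLM's actual API:

```python
import hashlib

# Hypothetical names for illustration; not vLLM's real classes or functions.
def mm_hash_for(image_bytes: bytes) -> str:
    """P0 (processor side): compute a stable content hash once per MM input."""
    return hashlib.sha256(image_bytes).hexdigest()

class MMInput:
    """Carries the raw data plus its precomputed hash over to P1 (engine side)."""
    def __init__(self, data: bytes):
        self.data = data
        self.mm_hash = mm_hash_for(data)  # computed once, never re-hashed in P1

# P1: the same hash can key an MM-embedding cache and serve as the extra key
# mixed into prefix-cache block hashes, so caching works without reprocessing.
_embedding_cache: dict[str, object] = {}

def get_or_compute_embedding(mm_input: MMInput, encode):
    if mm_input.mm_hash not in _embedding_cache:
        _embedding_cache[mm_input.mm_hash] = encode(mm_input.data)
    return _embedding_cache[mm_input.mm_hash]
```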
@ywang96 can help verify in case something changed. It has been a while since I last looked at that code and I'm outside rn.
Ah okay, lemme see. It would be great to add a test with prefix caching as well.
Signed-off-by: raushan <[email protected]>
Inference works fine, though I don't see any special prefix-caching tests for multimodal models. Should I add a new test case, or are the existing ones enough?
LGTM!
It's enabled by default, so no need to add tests.
We have a test case that batches the same image inputs to cover prefix caching, so the existing ones are enough.
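The specific test referenced above isn't reproduced here; purely as an illustration of that kind of coverage (the model id, prompt template, and image path below are placeholders, not taken from the PR), a prefix-caching smoke test could look roughly like this:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Placeholder model id and prompt format; adjust to a real vision-language model.
llm = LLM(
    model="some/vision-language-model",
    model_impl="transformers",      # force the Transformers backend
    enable_prefix_caching=True,
)

image = Image.open("example.jpg")
request = {
    "prompt": "<image> Describe the picture.",       # placeholder template
    "multi_modal_data": {"image": image},
}

# Two identical requests: the second should reuse prefix-cache blocks.
outputs = llm.generate([request, request], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```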
The Transformers model tests passed, merging early.
As per the title, I didn't find a reason for the two caches to be tied together. After this change, we can also stop asking users to explicitly set `disable_mm_caching=True` in the Transformers backend, and always use the no-cache code path when processing.

After it is merged, this blog post has to be updated as well: vllm-project/vllm-project.github.io#61
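For reference, a hedged before/after usage sketch of what this means for users of the Transformers backend (the `disable_mm_caching` spelling follows the PR description above; check your vLLM version for the exact argument name, and the model id is a placeholder):

```python
from vllm import LLM

# Before this change (as described above), users had to opt out explicitly:
# llm = LLM(model="some/vision-language-model",
#           model_impl="transformers",
#           disable_mm_caching=True)   # flag name as written in this PR

# After this change, no extra flag is needed; the Transformers backend uses the
# no-cache processing path on its own, while prefix caching keeps working.
llm = LLM(model="some/vision-language-model", model_impl="transformers")
```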
cc @hmellor