[feat] Enable mm caching for transformers backend #21358

Merged
2 commits merged into vllm-project:main on Jul 22, 2025

Conversation

@zucchini-nlp (Contributor) commented Jul 22, 2025

As per the title, I didn't find a reason for the two caching mechanisms to be tied together. After this change, we can also stop asking users to explicitly set disable_mm_caching=True in the Transformers backend, and always use the no-cache code path when processing.

After it is merged, this blogpost has to be updated as well: vllm-project/vllm-project.github.io#61

cc @hmellor

@mergify bot added the documentation (Improvements or additions to documentation) and v1 labels Jul 22, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly decouples multi-modal (MM) caching from prefix caching. The changes are logical and well-targeted. By switching the condition in need_extra_keys from request.mm_positions to request.mm_hashes, the prefix caching for MM inputs is now correctly triggered only when MM hashes are available. The removal of the error-raising check in the transformers backend complements this by allowing it to gracefully opt-out of MM caching. The documentation update is also consistent with these changes. The fix appears solid and addresses the described issue effectively.
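A minimal sketch of the condition change described here, assuming a simplified form of the v1 KV-cache helper (the real `need_extra_keys` may include additional conditions such as LoRA or cache-salt checks):

```python
# Illustrative sketch only, not the literal vLLM code.
def need_extra_keys_before(request) -> bool:
    # Old behaviour: MM-aware block hashing was triggered whenever the request
    # carried multimodal positions, even if no MM hashes had been computed.
    return bool(request.mm_positions)

def need_extra_keys_after(request) -> bool:
    # New behaviour: extra keys are only needed when MM hashes exist, so a
    # backend that returns mm_hashes=None falls back to plain prefix caching.
    return bool(request.mm_hashes)
```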

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 (Member)

Could you elaborate a bit? Do you mean that because you return mm_hashes=None, the multi-modal cache is automatically disabled so there is no need for users to set the flag?

@DarkLight1337 DarkLight1337 requested a review from Isotr0py July 22, 2025 09:57
@zucchini-nlp (Contributor, Author)

Do you mean that because you return mm_hashes=None, the multi-modal cache is automatically disabled so there is no need for users to set the flag?

Exactly! In the current version there is a dependency: if disable_mm_cache=True, then we must also set enable_prefix_cache=False, otherwise transformers models fail. That is because, in the lines updated here, we assumed the mm cache is present whenever mm positions are in the request. The bug wasn't seen before because I had disabled prefix caching, and without it the cache logic is skipped entirely.
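To illustrate the failure mode described above, a rough, hypothetical sketch of the old code path (not the literal vLLM implementation): extra keys were generated from mm_positions, and the hash lookup assumed every multimodal item had a hash.

```python
# Hypothetical illustration of the old coupling described above.
def generate_mm_extra_keys(request):
    extra_keys = []
    for i, _pos in enumerate(request.mm_positions):
        mm_hash = request.mm_hashes[i] if request.mm_hashes else None
        # The Transformers backend returned no hashes, so with prefix caching
        # enabled this assumption was violated and the request failed.
        assert mm_hash is not None, "mm items must be hashed for prefix caching"
        extra_keys.append(mm_hash)
    return extra_keys
```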

@DarkLight1337 (Member)

Hmm, then what happens if the user explicitly sets disable_mm_cache after this change? Won't they still get this error?

@DarkLight1337 (Member)

Btw, why do we have to skip computing the hashes in the first place? Even if it is not used in the multimodal processor, we can still use it for caching multimodal embeddings which is used for chunked prefill.

@zucchini-nlp (Contributor, Author)

Btw, why do we have to skip computing the hashes in the first place?

We can't cache during processing the same way vLLM processors do, because vLLM applies text-only/mm-only processing and then adds placeholders manually. Transformers doesn't support that per model and it would be too much work to implement, so yesterday we decided with @hmellor that processor caching will not be an option for the Transformers backend.

Even if it is not used in the multimodal processor, we can still use it for caching multimodal embeddings which is used for chunked prefill.

Do we still need to compute the hashes in processor code for that? Hmm, I assumed prefix caching and chunked prefill recompute the cache anyway. In that case, how is mm caching linked to other parts of vLLM?

Hmm, then what happens if the user explicitly sets disable_mm_cache after this change? Won't they still get this error?

It won't change anything for users; disable_mm_cache will become a no-op. It is effectively a no-op now as well, but we currently force users to pass the flag every time.
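A hedged sketch of the after-state: a multimodal model on the Transformers backend can run with prefix caching on and no MM-cache flag at all. The model name is only an example, and the parameter spellings (`model_impl`, `enable_prefix_caching`) may differ across vLLM versions.

```python
from vllm import LLM

# After this PR, no disable_mm_cache-style flag needs to be passed; the MM
# cache is simply skipped because the backend returns mm_hashes=None.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",   # example multimodal model
    model_impl="transformers",          # force the Transformers backend
    enable_prefix_caching=True,         # works without any extra MM flags
)
```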

@DarkLight1337 (Member)

IIRC we only compute the hash once in the multimodal processor in P0 and pass it directly to P1 where the model is run. So it is still helpful to compute the hashes.
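A conceptual sketch of the P0/P1 split mentioned here, with hypothetical names: the hash is computed once in the P0 multimodal processor and shipped with the request, so P1 can key its embedding cache without rehashing.

```python
import hashlib

def compute_mm_hash(image_bytes: bytes) -> str:
    # Hypothetical stand-in for vLLM's multimodal hasher.
    return hashlib.sha256(image_bytes).hexdigest()

def process_on_p0(prompt: str, image_bytes: bytes) -> dict:
    # P0 (frontend / multimodal processor): hash each MM item exactly once.
    return {"prompt": prompt,
            "mm_hashes": [compute_mm_hash(image_bytes)],
            "image": image_bytes}

def run_on_p1(request: dict, embed_cache: dict, vision_encoder):
    # P1 (engine): key the embedding cache by the precomputed hash.
    mm_hash = request["mm_hashes"][0]
    if mm_hash not in embed_cache:
        embed_cache[mm_hash] = vision_encoder(request["image"])  # cache miss
    return embed_cache[mm_hash]
```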

@DarkLight1337 (Member) commented Jul 22, 2025

@ywang96 can help verify in case something changed. It has been a while since I last looked at that code and I'm outside rn

@zucchini-nlp (Contributor, Author)

Ah okay, lemme see; it would be great to add a test with prefix caching as well.

Signed-off-by: raushan <[email protected]>
@mergify bot added the multi-modality (Related to multi-modality, #4194) label Jul 22, 2025
@zucchini-nlp (Contributor, Author)

Inference works fine, though I don't see any special prefix caching tests for multimodal models. Should I add a new test case, or are the existing ones enough?

@Isotr0py (Member) left a comment

LGTM!

@Isotr0py Isotr0py enabled auto-merge (squash) July 22, 2025 13:13
@DarkLight1337 (Member)

Inference works fine, though I don't see any special prefix caching tests for multimodal models. Should I add a new test case, or are the existing ones enough?

It's enabled by default, so there's no need to add tests.

@DarkLight1337 added this to the v0.10.0 milestone Jul 22, 2025
@DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 22, 2025
@Isotr0py (Member)

Inference works fine, though I don't see any special prefix caching tests for multimodal models. Should I add a new test case, or are the existing ones enough?

We have a test case that batches the same image inputs to cover prefix caching, so the existing ones are enough:

IMAGE_SIZE_FACTORS = [(), (1.0, ), (1.0, 1.0, 1.0), (0.25, 0.5, 1.0)]
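For context, a hedged sketch of why the repeated size factors exercise prefix caching: the (1.0, 1.0, 1.0) case sends the same image several times in one batch, so the identical multimodal prefixes should hit cached KV blocks. The prompt format and placeholder image are illustrative, and `llm` is assumed to be the engine from the earlier sketch.

```python
from PIL import Image

# Illustrative only; assumes the `llm` engine from the earlier sketch.
image = Image.new("RGB", (336, 336))   # placeholder image for illustration
prompts = [
    {
        "prompt": "USER: <image>\nDescribe the image. ASSISTANT:",
        "multi_modal_data": {"image": image},
    }
    for _ in range(3)  # same image three times, as in the (1.0, 1.0, 1.0) case
]
outputs = llm.generate(prompts)  # identical prefixes reuse cached KV blocks
```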

@zucchini-nlp changed the title from "[Bugfix] mm caching isn't tied to prefix caching" to "[feat] Enable mm caching for transformers backend" on Jul 22, 2025
@vllm-bot merged commit f38ee34 into vllm-project:main on Jul 22, 2025 (65 of 75 checks passed)
@DarkLight1337 (Member)

The Transformers model tests passed, merging early.

yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jul 23, 2025
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
wenscarl pushed a commit to wenscarl/vllm that referenced this pull request Aug 4, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Labels
documentation (Improvements or additions to documentation) · multi-modality (Related to multi-modality, #4194) · ready (ONLY add when PR is ready to merge/full CI is needed) · v1
4 participants