[feat] Enable mm caching for transformers backend #21358
Conversation
Signed-off-by: raushan <[email protected]>
Code Review
This pull request correctly decouples multi-modal (MM) caching from prefix caching. The changes are logical and well-targeted. By switching the condition in `need_extra_keys` from `request.mm_positions` to `request.mm_hashes`, prefix caching for MM inputs is now correctly triggered only when MM hashes are available. The removal of the error-raising check in the `transformers` backend complements this by allowing it to gracefully opt out of MM caching. The documentation update is also consistent with these changes. The fix appears solid and addresses the described issue effectively.
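For context, here is a minimal sketch of the kind of condition change described above, assuming a request object that exposes `mm_positions` and `mm_hashes`; the helper shape below is illustrative, not the exact vLLM source:

```python
# Illustrative sketch only; the real need_extra_keys in vLLM may check
# additional conditions and differ in structure.
def need_extra_keys(request) -> bool:
    """Decide whether extra keys must be mixed into prefix-cache block hashes."""
    # Before: any request with MM inputs required extra keys, which broke
    # backends that never compute MM hashes.
    # return bool(request.mm_positions)

    # After: extra keys are only needed when MM hashes actually exist, so a
    # backend that skips hashing simply falls back to plain prefix caching.
    return bool(request.mm_hashes)
```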
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Could you elaborate a bit? Do you mean that because you return …
Exactly! In the current version there is a dependency: if …
Hmm, then what happens if the user explicitly sets …
Btw, why do we have to skip computing the hashes in the first place? Even if they are not used in the multimodal processor, we can still use them for caching multimodal embeddings, which is used for chunked prefill.
We can't cache during processing the same way vLLM processors do, because vLLM applies text-only/MM-only processing and then adds placeholders manually. Transformers doesn't support this per model and it would be too much work to implement, so yesterday we decided with @hmellor that processor caching will not be an option for the Transformers backend.
Do we still need to compute caches from the processor code for that? Hmm, I assumed prefix caching and chunked prefill recompute the cache. In that case, how is MM caching linked to other parts of vLLM?
It won't change anything for users, …
IIRC we only compute the cache once in the multimodal processor in P0 and pass the hash directly to P1 where the model is run. So it is still helpful to compute the hashes.
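To illustrate the P0/P1 split being described (hash computed once on the processor side, then reused as a cache key on the engine side), here is a hedged sketch; the names `mm_hash_for`, `MMInput`, and `get_or_compute_embedding` are hypothetical, not vLLM's actual API:

```python
import hashlib

# Hypothetical names for illustration; not vLLM's real classes or functions.
def mm_hash_for(image_bytes: bytes) -> str:
    """P0 (processor side): compute a stable content hash once per MM input."""
    return hashlib.sha256(image_bytes).hexdigest()

class MMInput:
    """Carries the raw data plus its precomputed hash over to P1 (engine side)."""
    def __init__(self, data: bytes):
        self.data = data
        self.mm_hash = mm_hash_for(data)  # computed once, never re-hashed in P1

# P1: the same hash can key an MM-embedding cache and serve as the extra key
# mixed into prefix-cache block hashes, so caching works without reprocessing.
_embedding_cache: dict[str, object] = {}

def get_or_compute_embedding(mm_input: MMInput, encode):
    if mm_input.mm_hash not in _embedding_cache:
        _embedding_cache[mm_input.mm_hash] = encode(mm_input.data)
    return _embedding_cache[mm_input.mm_hash]
```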
@ywang96 can help verify in case something changed. It has been a while since I last looked at that code and I'm outside rn.
Ah okay, lemme see. It would be great to add a test with prefix caching as well.
Signed-off-by: raushan <[email protected]>
Inference works fine, though I don't see any special prefix-caching tests for multimodal models. Should I add a new test case, or are the existing ones enough?
LGTM!
It's enabled by default, so no need to add tests.
We have a test case that batches the same image inputs to cover prefix caching, so the existing ones are enough.
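The specific test referenced above isn't reproduced here; purely as an illustration of that kind of coverage (the model id, prompt template, and image path below are placeholders, not taken from the PR), a prefix-caching smoke test could look roughly like this:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Placeholder model id and prompt format; adjust to a real vision-language model.
llm = LLM(
    model="some/vision-language-model",
    model_impl="transformers",      # force the Transformers backend
    enable_prefix_caching=True,
)

image = Image.open("example.jpg")
request = {
    "prompt": "<image> Describe the picture.",       # placeholder template
    "multi_modal_data": {"image": image},
}

# Two identical requests: the second should reuse prefix-cache blocks.
outputs = llm.generate([request, request], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```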
The Transformers model tests passed, merging early.
As per the title, I didn't find a reason for the two caches to be tied together. After this change, we can also stop asking users to explicitly set `disable_mm_caching=True` in the Transformers backend, and always use the no-cache code path when processing.

After it is merged, this blog post has to be updated as well: vllm-project/vllm-project.github.io#61
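For reference, a hedged before/after usage sketch of what this means for users of the Transformers backend (the `disable_mm_caching` spelling follows the PR description above; check your vLLM version for the exact argument name, and the model id is a placeholder):

```python
from vllm import LLM

# Before this change (as described above), users had to opt out explicitly:
# llm = LLM(model="some/vision-language-model",
#           model_impl="transformers",
#           disable_mm_caching=True)   # flag name as written in this PR

# After this change, no extra flag is needed; the Transformers backend uses the
# no-cache processing path on its own, while prefix caching keeps working.
llm = LLM(model="some/vision-language-model", model_impl="transformers")
```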
cc @hmellor