UUID based caching optimization#2

Closed
vincentzed wants to merge 1 commit into main from uuid-based-caching-optimization

Conversation

@vincentzed
Owner

@vincentzed vincentzed commented Aug 2, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

@vincentzed
Owner Author

/gemini review

@github-actions

github-actions bot commented Aug 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@vincentzed
Owner Author

/gemini review


@cubic-dev-ai cubic-dev-ai bot left a comment


cubic analysis

6 issues found across 5 files • Review in cubic

React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai to give feedback, ask questions, or re-run the review.


Ensure every cache key is converted to str before appending to maintain the declared return type and avoid runtime type errors.

Prompt for AI agents
Address the following comment on vllm/multimodal/hasher.py at line 39:

<comment>Ensure every cache key is converted to str before appending to maintain the declared return type and avoid runtime type errors.</comment>

<file context>
@@ -23,6 +23,45 @@
 
 class MultiModalHasher:
 
+    @classmethod
+    def get_cache_keys_from_mm_data(cls, mm_data: Mapping[str, object]) -> list[str]:
+        """Extract cache keys from multimodal data, preferring UUIDs."""
+        cache_keys = []
+
+        # Check for UUID fields first
</file context>
Suggested change
-            cache_keys.append(uuids)
+            cache_keys.append(str(uuids))
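Expanded into a full function, the suggested fix might look like the sketch below. This is an illustrative reimplementation, not vLLM's actual hasher code; the modality-keyed `image_uuids` / `video_uuids` / `audio_uuids` layout is assumed from the review comments elsewhere in this thread, and every key is coerced to `str` so the declared `list[str]` return type holds even for non-string UUIDs:

```python
from collections.abc import Mapping


def get_cache_keys_from_mm_data(mm_data: Mapping[str, object]) -> list[str]:
    """Extract cache keys from multimodal data, preferring user-supplied UUIDs.

    Hypothetical sketch: coerces every key to str so the declared
    return type holds regardless of what the caller passed in.
    """
    cache_keys: list[str] = []
    for modality in ("image", "video", "audio"):
        uuids = mm_data.get(f"{modality}_uuids")
        if uuids is None:
            continue
        if isinstance(uuids, (list, tuple)):
            # Normalize each element individually.
            cache_keys.extend(str(u) for u in uuids)
        else:
            # Single scalar UUID; wrap and normalize.
            cache_keys.append(str(uuids))
    return cache_keys
```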


Unused parameter makes the API misleading – url is accepted but never used

Prompt for AI agents
Address the following comment on vllm/multimodal/utils.py at line 291:

<comment>Unused parameter makes the API misleading – url is accepted but never used</comment>

<file context>
@@ -286,6 +286,74 @@ def fetch_image_embedding(
 
         return image_embedding_io.load_base64("", data)
 
+    def fetch_with_cache_check(
+        self,
+        url: str,
+        uuid: Optional[str],
+        cache: Optional[object],
</file context>


Docstring advertises support for "UUID-only" messages, but the parser will raise an error for such input; either implement the feature or remove the statement.

Prompt for AI agents
Address the following comment on vllm/entrypoints/chat_utils.py at line 135:

<comment>Docstring advertises support for "UUID-only" messages, but the parser will raise an error for such input; either implement the feature or remove the statement.</comment>

<file context>
@@ -125,8 +125,20 @@ class CustomChatCompletionContentSimpleImageParam(TypedDict, total=False):
     {
         "image_url": "https://example.com/image.jpg"
     }

+    With UUID for caching optimization:
+    {
+        "image_url": "https://example.com/image.jpg",
+        "uuid": "abcde"
+    }
</file context>


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a UUID-based caching optimization for multimodal inputs. The changes span across entrypoint utils, multimodal processing, and hashing logic. While the intent is to speed up processing by re-using cached results for known UUIDs, the implementation appears to be incomplete and contains several critical flaws. The UUIDs provided in the API are currently ignored during parsing. The cache-hit path in the processing logic fails to correctly modify the prompt with placeholders for the multimodal data. Additionally, new utility functions for cache-aware media fetching are introduced but are not used. These issues need to be addressed for the feature to function correctly.

Comment on lines 140 to 141


critical

The newly added uuid field is not being utilized during message parsing. The parsing logic in _parse_chat_message_content_mm_part only extracts the image_url and discards the uuid. This makes the UUID-based caching optimization for cache population ineffective.

To fix this, the parsing logic needs to be updated to handle both url and uuid, and pass them down to the MediaConnector to leverage methods like fetch_image_with_cache_check. This issue also applies to audio and video parameters.

Comment on lines 1920 to 1923


critical

The UUID-based cache lookup optimization path is critically flawed. When a cache hit occurs, the logic in _reconstruct_from_uuid_cache is executed, which tokenizes the user prompt but fails to insert the necessary multimodal placeholder tokens (e.g., <image>). This means the model will not be aware of the multimodal inputs. The _create_minimal_placeholders function then creates placeholder ranges that point to arbitrary tokens in the original prompt, which is incorrect. The prompt must be modified to include the correct placeholders before being passed to the model, similar to how the non-cached path works.

Comment on lines 289 to 295


critical

The new fetch_with_cache_check method and its variants (fetch_image_with_cache_check, etc.) are not used anywhere in the codebase. For the UUID caching optimization to work when populating the cache (i.e., skipping media download if the item is already cached), these methods must be called from the content parsers in vllm/entrypoints/chat_utils.py. This seems to be an incomplete implementation.

@vincentzed vincentzed force-pushed the uuid-based-caching-optimization branch from 10b68b3 to 8ea4f42 on August 2, 2025 at 16:12
@vincentzed
Owner Author

/gemini re-review

@gemini-code-assist

Hello @vincentzed, I've received your request for a re-review. To trigger a new code review of the pull request in its current state, please use the /gemini review command. This will initiate a fresh review based on the latest changes. I am an automated assistant and cannot directly initiate a review from this comment, but the command will do so.

@vincentzed
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a caching optimization for multimodal data using UUIDs. The changes modify the chat utilities to accept UUIDs alongside media URLs, and update the multimodal processing pipeline to leverage these UUIDs for cache lookups, bypassing expensive preprocessing steps when a cache hit occurs.

My review has identified a few areas for improvement:

  • There is some dead code in hasher.py that should be removed.
  • A new helper function in processing.py has unused parameters and a misleading signature, which should be corrected.
  • There is a direct access to a private member of ProcessingCache, which breaks encapsulation and should be refactored to use a public method.

These changes will improve the code's clarity, maintainability, and adherence to good software design principles.

Comment on lines 47 to 63


high

The new method _hash_with_url_optimization appears to be unused within the codebase. Including unused code can lead to confusion and increase maintenance overhead. It's best to remove it. If it's intended for future use, it should be introduced in a separate pull request when it's actually needed.

Comment on lines 1847 to 1885


high

The method _create_uuid_hash_dict has an unused parameter cache_keys and an unused internal variable key_index. The function's logic only depends on mm_data, which makes the function signature and its docstring ("Convert UUID cache keys to MultiModalHashDict format.") misleading.

Please refactor the method to remove the unused cache_keys parameter and key_index variable. The docstring should also be updated to reflect that it creates the hash dict from mm_data. You will also need to update the call site at line 1922.

    def _create_uuid_hash_dict(
        self, 
        mm_data: MultiModalDataDict
    ) -> Mapping[str, list[str]]:
        """Create a hash dictionary from UUIDs in multimodal data."""
        hash_dict = {}
        
        for modality in ['image', 'video', 'audio']:
            uuid_key = f"{modality}_uuids"
            if uuid_key in mm_data:
                uuids = mm_data[uuid_key]
                if isinstance(uuids, list):
                    hash_dict[modality] = uuids
                else:
                    hash_dict[modality] = [uuids]
        
        return hash_dict


high

The code self.cache._cache.get(key) directly accesses the private member _cache of the ProcessingCache instance. This breaks encapsulation and makes the code brittle to future changes in ProcessingCache.

Please consider adding a public method to ProcessingCache for this purpose, for example get_by_key(key), and use it here. This would make the interaction with the cache clearer and safer.
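A minimal sketch of that refactor is below. The LRU-style internals are illustrative only (not vLLM's actual ProcessingCache); the point is the public get_by_key accessor that replaces direct reads of the private _cache member:

```python
from collections import OrderedDict
from typing import Optional


class ProcessingCache:
    """Illustrative cache with a public lookup method.

    Hypothetical sketch: only the get_by_key accessor reflects the
    review suggestion; capacity handling here is a stand-in.
    """

    def __init__(self, capacity: int = 256) -> None:
        self._cache: OrderedDict[str, object] = OrderedDict()
        self._capacity = capacity

    def put(self, key: str, value: object) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key)
        if len(self._cache) > self._capacity:
            # Evict the least-recently-used entry.
            self._cache.popitem(last=False)

    def get_by_key(self, key: str) -> Optional[object]:
        """Public lookup; returns None on a miss instead of raising."""
        value = self._cache.get(key)
        if value is not None:
            # Refresh recency on a hit.
            self._cache.move_to_end(key)
        return value
```

Callers then write `self.cache.get_by_key(key)` instead of `self.cache._cache.get(key)`, keeping the cache's internal representation free to change.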

@vincentzed
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a UUID-based caching optimization to improve performance for multimodal inputs. The changes span across several files, modifying data structures to include UUIDs and adding logic to handle them in the caching and processing layers.

While the backend caching logic seems well-thought-out, I've identified a critical issue in the frontend parsing logic within vllm/entrypoints/chat_utils.py. The current implementation does not correctly handle UUIDs from chat messages, which prevents the caching optimization from working as intended. The implementation in vllm/entrypoints/chat_utils.py fails to produce the correctly structured MultiModalDataDict from chat messages. This disconnect means the caching logic will not be triggered, and the primary goal of this PR is not met for chat entrypoints.

Comment on lines 1095 to 1235


critical

The logic for parsing multimodal message parts in this block has critical issues that prevent the UUID caching from working as intended:

  1. UUID-only references are not handled: The code checks for the presence of image_url, audio_url, or video_url to identify the modality. If a message part contains only a uuid (e.g., {"uuid": "some-id"}), as documented in CustomChatCompletionContentSimpleImageParam, it will fail these checks, and a ValueError will be raised. This contradicts the documented API change for UUID-only cache references.

  2. UUIDs are not used for caching: When a URL and a UUID are provided together, the UUID is packaged into a dictionary but is then effectively ignored by the mm_parser. The code proceeds to fetch the content from the URL, bypassing the intended caching mechanism. The mm_data dictionary returned to the engine is not populated with the ..._uuids keys required by the backend caching logic.

This means the core functionality of UUID-based caching optimization is not correctly implemented for chat entrypoints. The logic in this function and the associated _MultiModalParser classes needs to be revised to correctly handle UUIDs, populate the mm_data dictionary with ..._uuids keys, and avoid unnecessary data fetching when a UUID is present.
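As a rough illustration of the revision being asked for, a per-part parser could record UUIDs under the `image_uuids` key and skip fetching for UUID-only parts. The function and helper names here (parse_image_part, fetch_image) are hypothetical, not vLLM's actual API, and the mm_data layout is assumed from the review comments:

```python
from typing import Any


def fetch_image(url: str) -> bytes:
    """Placeholder for the real download path (not implemented here)."""
    raise NotImplementedError


def parse_image_part(part: dict[str, Any],
                     mm_data: dict[str, list[Any]]) -> None:
    """Hypothetical sketch: populate image_uuids and avoid fetching
    when a UUID is supplied, as the review requests."""
    image_url = part.get("image_url")
    uuid = part.get("uuid")
    if uuid is not None:
        # Record the UUID where the backend caching logic can find it.
        mm_data.setdefault("image_uuids", []).append(uuid)
        if image_url is None:
            # UUID-only reference: the cached entry stands in for the media.
            mm_data.setdefault("image", []).append(None)
            return
    if image_url is None:
        raise ValueError("image part needs an image_url or a uuid")
    # No UUID shortcut available: fall back to fetching the content.
    mm_data.setdefault("image", []).append(fetch_image(image_url))
```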

@vincentzed
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces UUID-based caching for multimodal inputs. The changes in vllm/multimodal/processing.py and vllm/multimodal/hasher.py correctly implement the caching logic assuming the input MultiModalDataDict is properly formatted with UUID information. However, the changes in vllm/entrypoints/chat_utils.py for parsing chat messages are incomplete. The UUID information is parsed from chat requests but is then discarded in the MultiModalContentParser and AsyncMultiModalContentParser, which prevents the caching mechanism from working for chat-based inputs. Additionally, the documentation for UUID-only cache references is inconsistent with the implementation. These issues are critical as they break the new feature for the chat endpoint.

…ssing

introduce uuid fields in multimodal data structures to enable efficient cache lookups and bypass expensive media fetching and processing when possible. this optimization reduces redundant operations by leveraging user-provided uuids for images, videos, and audio content.

- add `uuid` field to chat utils param types
- implement cache key extraction and url-only hashing in multimodal hasher
- extend multimodal inputs with uuid fields for caching hints
- enhance processor to handle uuid-only references and reconstruct inputs from cache
- update media connector with cache-aware fetch methods to skip processing on cache hits

this change improves performance for repeated media inputs by avoiding redundant downloads and decoding.

rmv

mypy

add and discard

update flags

fix
@vincentzed vincentzed force-pushed the uuid-based-caching-optimization branch from 977572d to 807345f on August 2, 2025 at 19:18
@vincentzed
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a caching optimization for multimodal data using UUIDs. The changes allow users to provide a UUID for image, audio, and video data, which can then be used to retrieve processed data from a cache, avoiding repeated fetching and processing. The implementation modifies the chat message parsing logic, data trackers, and multimodal data processing pipeline to support this new caching mechanism.

My review identifies a critical bug in the new UUID extraction logic that would prevent the feature from working correctly with typed content parts. I've also pointed out a design issue where UUID-only references are handled ambiguously, which could lead to incorrect behavior.

Comment on lines +1116 to +1122
    def _parse_image_url_with_uuid(parsed_part: dict[str, Any]) -> dict[str, str]:
        """Extract both URL and UUID from image_url part."""
        image_url_data = parsed_part.get("image_url", {})
        result = {"url": image_url_data.get("url", "")}
        if "uuid" in parsed_part:
            result["uuid"] = parsed_part["uuid"]
        return result


critical

The logic to extract the uuid is incorrect. It checks for "uuid" in parsed_part, but it should be checking inside image_url_data. When a multimodal part with a type field is processed, parsed_part is the entire dictionary for that part (e.g., {'type': 'image_url', 'image_url': {'url': '...', 'uuid': '...'}}). The uuid is nested within the image_url value.

This bug will prevent UUIDs from being correctly extracted for typed multimodal content parts, undermining the caching optimization. The same issue exists for _parse_audio_url_with_uuid and _parse_video_url_with_uuid.

Suggested change
     def _parse_image_url_with_uuid(parsed_part: dict[str, Any]) -> dict[str, str]:
         """Extract both URL and UUID from image_url part."""
         image_url_data = parsed_part.get("image_url", {})
         result = {"url": image_url_data.get("url", "")}
-        if "uuid" in parsed_part:
-            result["uuid"] = parsed_part["uuid"]
+        if "uuid" in image_url_data:
+            result["uuid"] = image_url_data["uuid"]
         return result

Comment on lines +1207 to +1210
    if part.get("uuid") is not None and len(part) == 1:
        # Pure UUID reference without URL - need to infer type from context
        # Default to image_url for now, but this should ideally be specified
        return "image_url", {"uuid": part["uuid"]}


high

When a UUID-only reference is provided (e.g., {"uuid": "..."}), the code defaults to treating it as an image_url. This is problematic because the UUID could be for an audio or video, leading to incorrect processing. The comment acknowledges this ("this should ideally be specified"), but the implementation can cause subtle bugs.

For a robust API, the modality of a UUID-only reference should be explicit, for example by requiring a type field (e.g., {"uuid": "...", "type": "audio"}). The current implementation's ambiguity can lead to unexpected behavior that is difficult to debug.
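A sketch of that explicit-modality design, assuming a hypothetical type field on UUID-only parts (the field name and the mapping to part types are illustrative, not vLLM's actual schema):

```python
from typing import Any

# Hypothetical mapping from a declared modality to the part type the
# parser already understands; real vLLM has its own part-type registry.
_UUID_PART_TYPES = {
    "image": "image_url",
    "audio": "audio_url",
    "video": "video_url",
}


def resolve_uuid_only_part(part: dict[str, Any]) -> tuple[str, dict[str, str]]:
    """Require an explicit modality rather than defaulting to image."""
    modality = part.get("type")
    if modality not in _UUID_PART_TYPES:
        raise ValueError(
            "UUID-only part must specify 'type' as one of "
            f"{sorted(_UUID_PART_TYPES)}, got {modality!r}")
    return _UUID_PART_TYPES[modality], {"uuid": part["uuid"]}
```

With this shape, `{"uuid": "...", "type": "audio"}` routes unambiguously to the audio parser, and a bare `{"uuid": "..."}` fails fast instead of being silently treated as an image.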

@github-actions

github-actions bot commented Nov 1, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 1, 2025
@github-actions

github-actions bot commented Dec 1, 2025

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Dec 1, 2025