
feat(multimodal): add media caching support with UUID-based lookup#4

Closed
vincentzed wants to merge 1 commit into main from
uiud-based-opm-opus

Conversation

@vincentzed
Owner

@vincentzed vincentzed commented Aug 2, 2025

  • Introduce global media cache for images, audio, and video
  • Add UUID field to multimodal content part parameters
  • Modify parsers to accept optional UUIDs for caching
  • Implement synchronous and asynchronous cache-aware fetching
  • Create MediaCache class with LRU eviction policy
  • Support content deduplication via UUID or URL hashing

Original Ticket:
Motivation.

The current processing flow of media inputs (image/video/audio/embeddings) has several inefficiencies when media is used repeatedly (e.g. across turns, or when the same media appears in different requests). When a media item is provided in the input, the processing flow looks like this:

MediaConnector runs to download the input (if a remote URL is provided), incurring network delays
MediaConnector loads the downloaded bytes into python objects (e.g. PIL.Image, audio/video decoding), incurring CPU delays
BaseMultiModalProcessor then hashes the entire media using its full bytes to do cache lookup from ProcessingCache, incurring CPU delays

All three steps run before any cache is checked, so repeated media still incur these costs even when there is a cache hit later on. Additionally, decoding long videos and hashing over the full bytes can be slow (milliseconds to tens of milliseconds) and inefficient. For media that are sent repeatedly, we want to bypass all three steps.

Proposed Change.
Proposal

Add a user provided “uuid” field in media inputs as the media’s identifier & hash key
Example:

{"type": "image_url", "image_url": {"url": image_url}, "uuid": "abcde"}
{"type": "image_embeds", "image_embeds": "xyz", "uuid": "abcde"}

For URL inputs without a uuid set, hash only the URL instead of the entire bytes. For other inputs, we fall back to full hashing of bytes. We should then allow empty media inputs if uuids are provided (i.e. if the client expects a cache hit).
During input processing, first check ProcessingCache to see if any media is already cached, and bypass MediaConnector altogether on cache hit.
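The key-selection order above can be sketched as follows (a minimal illustration; the helper name and signature are hypothetical, not part of this PR):

```python
import hashlib
from typing import Optional

def media_cache_key(uuid: Optional[str],
                    url: Optional[str],
                    data: Optional[bytes]) -> str:
    """Prefer the user uuid, then a cheap URL hash, then full-bytes hashing."""
    if uuid:
        return uuid  # user-provided identifier, no hashing at all
    if url:
        return hashlib.sha256(url.encode()).hexdigest()  # hash the URL only
    if data is not None:
        return hashlib.sha256(data).hexdigest()  # fallback: hash entire bytes
    raise ValueError("media input must carry a uuid, a url, or raw bytes")
```

Note that hashing the URL (a short string) is far cheaper than hashing the full media bytes, which is the point of the URL fallback.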

Potential Issues

Compatibility with OAI API
OAI API currently doesn’t support user-provided uuids, so adding this field would introduce more discrepancy between the APIs. However, given that this is an additional field, the impact should be minimal.

Race Condition Between Cache Lookup and Cache Read
(Needs confirmation.) Because the cache is checked while parsing contents but the actual read happens later, we could get a cache miss after the initial check (i.e. an entry is evicted between the lookup and the cache read). We could simply fail that single request if this happens.
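Putting the pieces together, here is a hedged sketch of the cache-first fetch path (all names are illustrative, not vLLM's actual API; `download` and `decode` stand in for MediaConnector's network and decoding steps). A request that expects a cache hit but misses, and carries no URL to fall back on, simply fails:

```python
from typing import Callable, Optional

def fetch_media(cache: dict,
                uuid: Optional[str],
                url: Optional[str],
                download: Callable[[str], bytes],
                decode: Callable[[bytes], object]) -> object:
    key = uuid or url
    if key is not None and key in cache:
        return cache[key]  # hit: skip download, decode, and full-bytes hashing
    if url is None:
        # Client expected a cache hit (uuid only) but the entry was evicted
        # or never stored; fail just this one request.
        raise ValueError(f"media with uuid {uuid!r} not cached and no URL given")
    media = decode(download(url))  # cold path: network + CPU cost
    if key is not None:
        cache[key] = media
    return media
```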


@github-actions

github-actions bot commented Aug 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts with a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@vincentzed
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a global media cache to improve performance for repeated media inputs. The changes are generally well-structured, but there are a few key areas for improvement. The new MediaCache class is not thread-safe, which is a critical issue for a server environment. There is also significant code duplication in the caching logic for different media types, which should be refactored for better maintainability. Finally, caching support is missing for PIL images and image embeddings, which is inconsistent with the PR's goals.

Comment on lines +4 to +72
from typing import Optional, Union
import hashlib
from vllm.utils import LRUCache, GiB_bytes
from vllm.logger import init_logger

logger = init_logger(__name__)

MediaData = Union[object, str, bytes]  # Can be image, audio, video, embeddings, etc.

class MediaCache:
    """
    Global cache for media data, using UUID or content hash as key.
    This allows reusing media data across multiple requests.
    """

    def __init__(self, capacity_gb: float = 1.0) -> None:
        self._cache = LRUCache[str, MediaData](
            capacity=int(GiB_bytes * capacity_gb)
        )

    def _get_url_hash(self, url: str) -> str:
        """Generate a hash for a URL to use as cache key."""
        return hashlib.sha256(url.encode()).hexdigest()

    def get(self, uuid: Optional[str], url: Optional[str] = None) -> Optional[MediaData]:
        """
        Get media data from cache.

        Args:
            uuid: User-provided UUID for the media
            url: URL of the media (used as fallback key if UUID not provided)

        Returns:
            Cached media data if found, None otherwise
        """
        if uuid:
            cache_key = uuid
        elif url:
            cache_key = self._get_url_hash(url)
        else:
            return None

        return self._cache.get(cache_key)

    def put(self, data: MediaData, uuid: Optional[str] = None, url: Optional[str] = None) -> str:
        """
        Store media data in cache.

        Args:
            data: The media data to cache
            uuid: User-provided UUID for the media
            url: URL of the media (used to generate key if UUID not provided)

        Returns:
            The cache key used
        """
        if uuid:
            cache_key = uuid
        elif url:
            cache_key = self._get_url_hash(url)
        else:
            raise ValueError("Either uuid or url must be provided")

        self._cache[cache_key] = data
        return cache_key

    def clear(self) -> None:
        """Clear all cached media data."""
        self._cache.clear()


critical

The MediaCache is used as a global singleton and will be accessed from multiple threads in a server environment. The current implementation using LRUCache is not thread-safe. Concurrent access to get, put, and clear methods can lead to race conditions and corrupt the internal state of the cache.

To prevent this, you should add a threading.Lock to synchronize access to the cache.

from typing import Optional, Union
import hashlib
import threading
from vllm.utils import LRUCache, GiB_bytes
from vllm.logger import init_logger

logger = init_logger(__name__)

MediaData = Union[object, str, bytes]  # Can be image, audio, video, embeddings, etc.

class MediaCache:
    """
    Global cache for media data, using UUID or content hash as key.
    This allows reusing media data across multiple requests.
    """
    
    def __init__(self, capacity_gb: float = 1.0) -> None:
        self._cache = LRUCache[str, MediaData](
            capacity=int(GiB_bytes * capacity_gb)
        )
        self._lock = threading.Lock()
    
    def _get_url_hash(self, url: str) -> str:
        """Generate a hash for a URL to use as cache key."""
        return hashlib.sha256(url.encode()).hexdigest()
    
    def get(self, uuid: Optional[str], url: Optional[str] = None) -> Optional[MediaData]:
        """
        Get media data from cache.
        
        Args:
            uuid: User-provided UUID for the media
            url: URL of the media (used as fallback key if UUID not provided)
        
        Returns:
            Cached media data if found, None otherwise
        """
        if uuid:
            cache_key = uuid
        elif url:
            cache_key = self._get_url_hash(url)
        else:
            return None
            
        with self._lock:
            return self._cache.get(cache_key)
    
    def put(self, data: MediaData, uuid: Optional[str] = None, url: Optional[str] = None) -> str:
        """
        Store media data in cache.
        
        Args:
            data: The media data to cache
            uuid: User-provided UUID for the media
            url: URL of the media (used to generate key if UUID not provided)
            
        Returns:
            The cache key used
        """
        if uuid:
            cache_key = uuid
        elif url:
            cache_key = self._get_url_hash(url)
        else:
            raise ValueError("Either uuid or url must be provided")
            
        with self._lock:
            self._cache[cache_key] = data
        return cache_key
    
    def clear(self) -> None:
        """Clear all cached media data."""
        with self._lock:
            self._cache.clear()


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces media caching support with UUID-based lookup to enhance the efficiency of multimodal content processing. The changes include adding UUID fields to content part parameters and implementing cache-aware fetching. The implementation looks good, but the exception messages could be improved to be more informative.

Comment on lines +742 to +743
# No URL provided but UUID given - expecting cache hit
raise ValueError(f"Image with UUID '{uuid}' not found in cache and no URL provided")


high

If no URL is provided but a UUID is given, a ValueError is raised, but the exception message could be improved to be more informative about the expected media type.

raise ValueError(f"Image with UUID '{uuid}' not found in cache and no URL provided. Please provide the image URL for the first time.")

Comment on lines +782 to +783
# No URL provided but UUID given - expecting cache hit
raise ValueError(f"Audio with UUID '{uuid}' not found in cache and no URL provided")


high

If no URL is provided but a UUID is given, a ValueError is raised, but the exception message could be improved to be more informative about the expected media type.

raise ValueError(f"Audio with UUID '{uuid}' not found in cache and no URL provided. Please provide the audio URL for the first time.")

Comment on lines +809 to +810
# No URL provided but UUID given - expecting cache hit
raise ValueError(f"Video with UUID '{uuid}' not found in cache and no URL provided")


high

If no URL is provided but a UUID is given, a ValueError is raised, but the exception message could be improved to be more informative about the expected media type.

raise ValueError(f"Video with UUID '{uuid}' not found in cache and no URL provided. Please provide the video URL for the first time.")

Comment on lines +850 to +851
# No URL provided but UUID given - expecting cache hit
raise ValueError(f"Image with UUID '{uuid}' not found in cache and no URL provided")


high

If no URL is provided but a UUID is given, a ValueError is raised, but the exception message could be improved to be more informative about the expected media type.

raise ValueError(f"Image with UUID '{uuid}' not found in cache and no URL provided. Please provide the image URL for the first time.")

    elif url:
        cache_key = self._get_url_hash(url)
    else:
        raise ValueError("Either uuid or url must be provided")


high

Consider raising a more specific exception type than ValueError, such as KeyError, to better reflect the nature of the error (cache miss).

raise KeyError("Either uuid or url must be provided")

@vincentzed
Owner Author

/gemini is there an existing file that is ~ to mediacache i.e not requiring a new file?

@gemini-code-assist

@vincentzed, that's a good question! The vllm/multimodal/media_cache.py file was introduced as a new component to specifically implement a global media cache for images, audio, and video, as outlined in the pull request description.

While the project may have other caching mechanisms (such as the ProcessingCache mentioned in the PR description), this new MediaCache class is designed to address specific inefficiencies related to repeated media inputs. Its purpose is to provide a dedicated, UUID-based or URL-hashed caching layer that can be checked before incurring costs like network downloads and decoding. This allows for content deduplication and significant performance improvements for multimodal inputs.

So, to directly answer your question, this new file was indeed necessary to introduce this specific media caching functionality and optimize the handling of multimodal data.


@cubic-dev-ai cubic-dev-ai bot left a comment


cubic analysis

2 issues found across 2 files

React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai to give feedback, ask questions, or re-run the review.


def __init__(self, capacity_gb: float = 1.0) -> None:
    self._cache = LRUCache[str, MediaData](
        capacity=int(GiB_bytes * capacity_gb)


The LRUCache is initialised without a getsizeof function, so eviction is based on item count instead of bytes, defeating the intended 1 GiB memory cap.

Prompt for AI agents
Address the following comment on vllm/multimodal/media_cache.py at line 21:

<comment>The LRUCache is initialised without a `getsizeof` function, so eviction is based on item count instead of bytes, defeating the intended 1 GiB memory cap.</comment>

<file context>
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+
+from typing import Optional, Union
+import hashlib
+from vllm.utils import LRUCache, GiB_bytes
+from vllm.logger import init_logger
+
+logger = init_logger(__name__)
</file context>
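To make eviction respect a byte budget as the reviewer suggests, one option is to track total payload size explicitly. The sketch below is hand-rolled for illustration and is not vLLM's `LRUCache` API (whether `vllm.utils.LRUCache` accepts a sizing function would need checking):

```python
from collections import OrderedDict
from typing import Optional

class SizedLRUCache:
    """LRU cache that evicts by total stored bytes, not item count."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self._used = 0
        self._data: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        if key in self._data:
            self._used -= len(self._data.pop(key))
        self._data[key] = value
        self._used += len(value)
        # Evict least-recently-used entries until under the byte budget,
        # always keeping at least the entry just inserted.
        while self._used > self.capacity_bytes and len(self._data) > 1:
            _, evicted = self._data.popitem(last=False)
            self._used -= len(evicted)
```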

    """Generate a hash for a URL to use as cache key."""
    return hashlib.sha256(url.encode()).hexdigest()

def get(self, uuid: Optional[str], url: Optional[str] = None) -> Optional[MediaData]:


uuid parameter is declared Optional but still mandatory, leading to runtime errors when only a URL is provided.

Prompt for AI agents
Address the following comment on vllm/multimodal/media_cache.py at line 28:

<comment>`uuid` parameter is declared Optional but still mandatory, leading to runtime errors when only a URL is provided.</comment>

<file context>
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+
+from typing import Optional, Union
+import hashlib
+from vllm.utils import LRUCache, GiB_bytes
+from vllm.logger import init_logger
+
+logger = init_logger(__name__)
</file context>
Suggested change
def get(self, uuid: Optional[str], url: Optional[str] = None) -> Optional[MediaData]:
def get(self, uuid: Optional[str] = None, url: Optional[str] = None) -> Optional[MediaData]:

@github-actions

github-actions bot commented Nov 1, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 1, 2025
@github-actions

github-actions bot commented Dec 1, 2025

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Dec 1, 2025
