DarkLight1337
Member

@DarkLight1337 DarkLight1337 commented Apr 11, 2025

For easier debugging and optimization, this PR introduces metrics logging and a reset API (reset_mm_cache) for the multi-modal processing cache. It is mostly based on the existing code for KV cache metrics.

Based on these stats, users can adjust the capacity of the multi-modal cache to achieve a better balance between memory usage and cache hit rate.

Notes

  • The interface of vllm.metrics.loggers.StatLoggerBase has been updated to accept mm_cache_stats.
  • In V1, the three internal caches (P0 processor, P0 mirror, P1 mirror) may become desynced if a request is in progress when reset_mm_cache is called. Since reset_mm_cache is meant to be just a debugging tool, you should only call it when the engine is not being used.
    • Also, for online serving this is only available if VLLM_SERVER_DEV_MODE=1.

Example logs

V0 Engine:

INFO 04-12 16:24:41 [metrics.py:518] Avg prompt throughput: 6051.4 tokens/s, Avg generation throughput: 10.3 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.2%, CPU KV cache usage: 0.0%.
INFO 04-12 16:24:41 [metrics.py:533] MM cache usage: 2.86% (13 items = 0.11 GiB)

V1 Engine:

INFO 04-12 16:32:05 [loggers.py:109] Engine 000: Avg prompt throughput: 12550.4 tokens/s, Avg generation throughput: 21.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 76.9%
INFO 04-12 16:32:05 [loggers.py:126] P0 Processor MM cache usage: 17.91% (81 items = 0.72 GiB), hit rate: 71.28%; P0 Mirrored MM cache usage: 17.91% (81 items = 0.72 GiB), hit rate: 71.28%; P1 Mirrored MM cache usage: 17.70% (80 items = 0.71 GiB), hit rate: 100.00%
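
The usage figures in these log lines are a ratio of occupied bytes to cache capacity. As a minimal sketch of that arithmetic (the function name, its parameters, and the 4 GiB capacity are assumptions for illustration, not the code in this PR):

```python
GIB = 2**30  # bytes per GiB


def format_mm_cache_usage(used_bytes: int, capacity_bytes: int, num_items: int) -> str:
    """Render a usage summary in the style of the example logs (hypothetical helper)."""
    # Guard against a zero-capacity (disabled) cache.
    usage = used_bytes / capacity_bytes if capacity_bytes else 0.0
    return (
        f"MM cache usage: {usage:.2%} "
        f"({num_items} items = {used_bytes / GIB:.2f} GiB)"
    )


# Assumed capacity of 4 GiB with 1 GiB occupied by 10 items.
line = format_mm_cache_usage(1 * GIB, 4 * GIB, 10)
print(line)
```

Read the other way, 0.72 GiB at 17.91% usage implies a capacity of roughly 4 GiB (0.72 / 0.1791 ≈ 4.02), which is the quantity to tune when trading memory for hit rate.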

Notes

  • In V1, the item count and memory usage of the three internal caches should remain in sync with each other, but since the stats are collected at different times, the logged metrics may show minor differences.
  • In V1, the items in the P0 processor and P0 mirror MM caches are the same instances, so the two caches share memory and nothing is duplicated.
  • In V1, the hit rate of the P1 mirror MM cache should always be 100%: the cache is only queried with mm_hash when there is a cache hit in P0, which implies a cache hit in P1 because every item added to the P0 cache is also passed to P1.
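
The 100% figure for P1 can be reproduced with a small simulation. This is a simplified sketch, not vLLM's implementation: it models only two caches (one sender-side, one receiver-side), and `CountingLRU` and `send` are hypothetical names.

```python
from collections import OrderedDict


class CountingLRU:
    """Tiny LRU cache that counts queries and hits (hypothetical, for illustration)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()
        self.queries = 0
        self.hits = 0

    def get(self, key):
        self.queries += 1
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        return None

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0


def send(p0: CountingLRU, p1: CountingLRU, mm_hash: str) -> str:
    """P0 always queries its cache; P1 is queried only after a P0 hit."""
    item = p0.get(mm_hash)
    if item is None:
        item = f"processed:{mm_hash}"  # stand-in for real multi-modal processing
        p0.put(mm_hash, item)
        p1.put(mm_hash, item)  # the full item is passed along, no query on P1
    else:
        # Only the hash is passed; P1 must already hold the item.
        assert p1.get(mm_hash) is not None
    return item


p0, p1 = CountingLRU(capacity=2), CountingLRU(capacity=2)
for h in ["a", "b", "a", "c", "a", "b"]:
    send(p0, p1, h)
print(f"P0 hit rate: {p0.hit_rate:.2%}, P1 hit rate: {p1.hit_rate:.2%}")
```

Because both caches see the same sequence of insertions and recency updates, they evict the same keys, so a P0 hit never turns into a P1 miss in this model.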

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added frontend multi-modality Related to multi-modality (#4194) v1 labels Apr 11, 2025
@DarkLight1337 DarkLight1337 force-pushed the log-mm-cache branch 3 times, most recently from fa40cbe to aea9da7 Compare April 11, 2025 16:02
@DarkLight1337 DarkLight1337 moved this to In Progress in Multi-modality Core Apr 12, 2025
@DarkLight1337 DarkLight1337 force-pushed the log-mm-cache branch 5 times, most recently from cb2eb10 to eea7385 Compare April 12, 2025 14:05
@DarkLight1337 DarkLight1337 marked this pull request as ready for review April 12, 2025 14:05
Member Author

Moved to vllm.v1.metrics.stats.CachingMetrics

@DarkLight1337
Member Author

DarkLight1337 commented Apr 12, 2025

Just ran a mistral-eval... there is a problem where the hit rate of the mirrored caches remains at zero because we don't actually call .get for those caches.

Edit: I have updated those caches to call .get now so they actually function as LRU caches. This may also fix #16273 (comment)

Edit 2: This change has been split out into #16593
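
To illustrate why skipping .get matters for LRU behavior, here is a minimal sketch using a plain OrderedDict rather than vLLM's cache class. It assumes, as is typical for LRU caches, that a lookup refreshes recency while a bare membership check does not; `simulate` and its parameter are hypothetical names.

```python
from collections import OrderedDict


def simulate(refresh_on_lookup: bool) -> list:
    """Insert a and b, look up a, then insert c into a cache of capacity 2."""
    capacity = 2
    cache: OrderedDict = OrderedDict()

    def put(key: str) -> None:
        cache[key] = key.upper()
        cache.move_to_end(key)
        while len(cache) > capacity:
            cache.popitem(last=False)  # evict least recently used

    def lookup(key: str) -> bool:
        found = key in cache
        if found and refresh_on_lookup:
            cache.move_to_end(key)  # what a real LRU `.get` would do
        return found

    put("a")
    put("b")
    lookup("a")
    put("c")
    return list(cache)


print(simulate(refresh_on_lookup=True))   # "a" survives, "b" is evicted
print(simulate(refresh_on_lookup=False))  # "a" is evicted despite being reused
```

Without the recency refresh the cache degrades to first-in-first-out eviction, and a lookup that bypasses .get also leaves any hit counters untouched, which matches the zero hit rate described above.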

mergify bot commented Apr 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DarkLight1337.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 25, 2025
@DarkLight1337
Member Author

DarkLight1337 commented Apr 25, 2025

@markmc do you know why the test is getting a duplicated timeseries error?

mergify bot commented Apr 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DarkLight1337.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 26, 2025
@mergify mergify bot removed the needs-rebase label Apr 26, 2025
@DarkLight1337
Member Author

Tests should pass now.

mergify bot commented Apr 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DarkLight1337.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 30, 2025
Signed-off-by: DarkLight1337 <[email protected]>
@mergify mergify bot removed the needs-rebase label Apr 30, 2025
Signed-off-by: DarkLight1337 <[email protected]>

mergify bot commented May 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DarkLight1337.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

github-actions bot commented Oct 2, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Oct 2, 2025
@DarkLight1337
Member Author

Closing as superseded by #26285

@github-project-automation github-project-automation bot moved this from In Progress to Done in Multi-modality Core Oct 6, 2025
@DarkLight1337 DarkLight1337 deleted the log-mm-cache branch October 6, 2025 10:22
Labels
  • ci/build
  • documentation: Improvements or additions to documentation
  • frontend
  • multi-modality: Related to multi-modality (#4194)
  • needs-rebase
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • speculative-decoding
  • stale: Over 90 days of inactivity
  • v1
Projects
Status: Done
3 participants