
Commit 30d4e43 (parent 35c5a5e)

Add prefix caching

Signed-off-by: WoosukKwon <[email protected]>

File tree

2 files changed: +10 −1 lines changed

_posts/2025-01-26-v1.md

Lines changed: 10 additions & 1 deletion
@@ -56,7 +56,16 @@ vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional d
 
 ## 3. Zero-Overhead Prefix Caching
 
-vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1’s prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%. **Thanks to this change, we now enable prefix caching by default in V1.**
+vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1’s prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%.
+
+Here are some benchmark results. In our experiments, we observed that V1’s prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance severalfold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
+
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_prefix_caching.png" width="100%">
+</picture>
+</p>
 
 ## 4. Clean Architecture for Tensor-Parallel Inference
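The diff above adds hash-based prefix caching with constant-time LRU eviction. As an illustration of the idea, here is a minimal sketch in Python: each fixed-size block of tokens is keyed by a hash of its contents plus its parent block's hash, and an `OrderedDict` provides O(1) move-to-end and eviction. The class and method names are hypothetical; this is not vLLM's actual implementation.

```python
from collections import OrderedDict
import hashlib


class PrefixCache:
    """Minimal sketch of hash-based prefix caching with O(1) LRU eviction.

    Illustrative only; names and structure are assumptions, not vLLM's code.
    """

    def __init__(self, capacity: int, block_size: int = 16):
        self.capacity = capacity
        self.block_size = block_size
        # OrderedDict gives O(1) move_to_end (touch) and popitem (evict LRU).
        self.blocks: "OrderedDict[str, list]" = OrderedDict()

    def _hash(self, parent: str, tokens: tuple) -> str:
        # A block's key covers its own tokens *and* the parent block's hash,
        # so the key identifies the entire prefix, not just one block.
        return hashlib.sha256(f"{parent}:{tokens}".encode()).hexdigest()

    def lookup_or_insert(self, token_ids: list) -> int:
        """Return how many tokens were served from cache; insert the rest."""
        hit_tokens, parent = 0, ""
        # Only full blocks are cached; a trailing partial block is skipped.
        full_len = len(token_ids) - len(token_ids) % self.block_size
        for i in range(0, full_len, self.block_size):
            block = tuple(token_ids[i:i + self.block_size])
            key = self._hash(parent, block)
            if key in self.blocks:
                self.blocks.move_to_end(key)  # mark as most recently used
                hit_tokens += self.block_size
            else:
                if len(self.blocks) >= self.capacity:
                    self.blocks.popitem(last=False)  # evict LRU in O(1)
                self.blocks[key] = list(block)
            parent = key
        return hit_tokens
```

Because every operation per block is a dict lookup or an `OrderedDict` move/pop, the cost stays constant per block even at a 0% hit rate, which is the property the post attributes to V1's redesign.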

