
Commit 30d4e43 (parent 35c5a5e)

Add prefix caching

Signed-off-by: WoosukKwon <[email protected]>

File tree

2 files changed: +10 −1 lines changed

_posts/2025-01-26-v1.md

Lines changed: 10 additions & 1 deletion
@@ -56,7 +56,16 @@ vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional d
 
 ## 3. Zero-Overhead Prefix Caching
 
-vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1’s prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%. **Thanks to this change, we now enable prefix caching by default in V1.**
+vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1’s prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%.
+
+Here are some benchmark results. In our experiments, we observed that V1’s prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance severalfold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
+
+<p align="center">
+<picture>
+<img src="/assets/figures/v1/v1_prefix_caching.png" width="100%">
+</picture>
+</p>
 
 ## 4. Clean Architecture for Tensor-Parallel Inference
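The diff above adds hash-based prefix caching with constant-time LRU eviction. As an illustration of the idea, here is a minimal sketch in Python: each fixed-size block of tokens is keyed by a hash of its contents plus its parent block's hash, and an `OrderedDict` provides O(1) move-to-end and eviction. The class and method names are hypothetical; this is not vLLM's actual implementation.

```python
from collections import OrderedDict
import hashlib


class PrefixCache:
    """Minimal sketch of hash-based prefix caching with O(1) LRU eviction.

    Illustrative only; names and structure are assumptions, not vLLM's code.
    """

    def __init__(self, capacity: int, block_size: int = 16):
        self.capacity = capacity
        self.block_size = block_size
        # OrderedDict gives O(1) move_to_end (touch) and popitem (evict LRU).
        self.blocks: "OrderedDict[str, list]" = OrderedDict()

    def _hash(self, parent: str, tokens: tuple) -> str:
        # A block's key covers its own tokens *and* the parent block's hash,
        # so the key identifies the entire prefix, not just one block.
        return hashlib.sha256(f"{parent}:{tokens}".encode()).hexdigest()

    def lookup_or_insert(self, token_ids: list) -> int:
        """Return how many tokens were served from cache; insert the rest."""
        hit_tokens, parent = 0, ""
        # Only full blocks are cached; a trailing partial block is skipped.
        full_len = len(token_ids) - len(token_ids) % self.block_size
        for i in range(0, full_len, self.block_size):
            block = tuple(token_ids[i:i + self.block_size])
            key = self._hash(parent, block)
            if key in self.blocks:
                self.blocks.move_to_end(key)  # mark as most recently used
                hit_tokens += self.block_size
            else:
                if len(self.blocks) >= self.capacity:
                    self.blocks.popitem(last=False)  # evict LRU in O(1)
                self.blocks[key] = list(block)
            parent = key
        return hit_tokens
```

Because every operation per block is a dict lookup or an `OrderedDict` move/pop, the cost stays constant per block even at a 0% hit rate, which is the property the post attributes to V1's redesign.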

