`_posts/2025-01-26-v1.md` (15 additions & 6 deletions)
vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional d…
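As an illustration of a scheduler that treats prompt (prefill) and output (decode) tokens uniformly, here is a minimal sketch. This is not vLLM's actual code; the `Request` fields and the `schedule` helper are invented for this example. Each step hands out a fixed token budget and records decisions as a plain `{request_id: num_tokens}` dictionary, so a decoding request naturally gets 1 token while a new request may get a chunk of its prompt.

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Hypothetical request state: prompt length and progress so far."""
    request_id: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0  # tokens already processed

def schedule(requests, token_budget):
    """Allocate up to token_budget tokens across requests for one step.

    Returns {request_id: num_tokens}. A request still working through its
    prompt gets a chunk of the remaining prompt (chunked prefill falls out
    naturally); a request past its prompt needs exactly 1 token to decode.
    """
    decisions = {}
    for req in requests:
        if token_budget == 0:
            break
        remaining = max(req.num_prompt_tokens - req.num_computed_tokens, 1)
        n = min(remaining, token_budget)
        decisions[req.request_id] = n
        token_budget -= n
    return decisions
```

With a budget of 5 tokens, a decoding request and a fresh 8-token prompt share the step as `{"a": 1, "b": 4}`: no separate prefill and decode phases are needed.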
vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes incurs significant CPU overhead, noticeably degrading performance when the cache hit rate is low; as a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1's prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%.
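The two ingredients above can be sketched in a few lines. This is not vLLM's actual implementation; the block size, the chained hashing scheme, and the `PrefixCache` class are assumptions for illustration. An `OrderedDict` gives O(1) lookup, O(1) recency updates, and O(1) eviction of the least-recently-used block, which is the kind of constant-time structure the paragraph alludes to.

```python
from collections import OrderedDict
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value)

def block_hashes(token_ids):
    """Hash each full block of tokens, chaining in the previous block's
    hash so that equal hashes imply an equal token prefix."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes

class PrefixCache:
    """Toy hash -> block-id cache with constant-time LRU eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, h):
        if h in self.blocks:
            self.blocks.move_to_end(h)  # mark as recently used, O(1)
            return self.blocks[h]
        return None

    def put(self, h, block_id):
        self.blocks[h] = block_id
        self.blocks.move_to_end(h)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict LRU entry, O(1)
```

Because both the hit path (`get`) and the miss path (`put` plus eviction) are constant-time dictionary operations, a 0% hit rate costs almost nothing beyond the hashing itself.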
Here are some benchmark results. In our experiments, we observed that V1's prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while improving performance several-fold when the cache hit rate is high. **Thanks to the near-zero overhead, we now enable prefix caching by default in V1.**
## 4. Clean Architecture for Tensor-Parallel Inference
The final piece of the puzzle for vLLM V1 was integrating FlashAttention 3.
# Performance
Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **1.7x** higher throughput compared to V0 (*without multi-step scheduling*).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs.