vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., `{request_id: num_tokens}`, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).
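To make this concrete, below is a minimal sketch of the idea (illustrative only, not vLLM's actual scheduler code; the `Request` fields and the simple first-come-first-served policy are assumptions for this example). Given a fixed token budget, the scheduler returns a `{request_id: num_tokens}` dictionary covering both prefilling and decoding requests in a single step:

```python
# A minimal sketch of dictionary-based scheduling under a fixed token budget.
# Not vLLM's implementation: Request and the FCFS policy are illustrative.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int        # user-given prompt tokens
    num_computed_tokens: int = 0  # prompt + output tokens processed so far


def schedule(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Decide how many tokens to process for each request this step."""
    decisions: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        # Remaining prompt tokens, or 1 new token if the request is decoding.
        remaining_prefill = req.num_prompt_tokens - req.num_computed_tokens
        want = remaining_prefill if remaining_prefill > 0 else 1
        # Chunked prefill falls out naturally: a long prompt simply gets
        # whatever slice of the budget is left this step.
        num_tokens = min(want, token_budget)
        decisions[req.request_id] = num_tokens
        token_budget -= num_tokens
    return decisions


# Example: two decoding requests scheduled alongside a long prompt that is
# being chunk-prefilled across multiple steps.
print(schedule(
    [Request("R2", num_prompt_tokens=30, num_computed_tokens=45),
     Request("R3", num_prompt_tokens=12, num_computed_tokens=12),
     Request("R1", num_prompt_tokens=8000, num_computed_tokens=2048)],
    token_budget=2048,
))  # -> {'R2': 1, 'R3': 1, 'R1': 2046}
```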
## 3. Zero-Overhead Prefix Caching
vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes incurs significant CPU overhead, noticeably degrading performance when the cache hit rate is low, so it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. As a result, V1's prefix caching introduces near-zero performance degradation, even when the cache hit rate is 0%. **Thanks to this change, we now enable prefix caching by default in V1.**
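As an illustration of the constant-time eviction idea, here is a simplified sketch (not V1's actual KV cache manager; `BLOCK_SIZE`, `CachedBlock` IDs, and `PrefixCache` are made-up names, and it ignores blocks pinned by running requests). A hash-keyed block table combined with an ordered free structure gives O(1) lookup, O(1) recency updates, and O(1) LRU eviction:

```python
# A simplified sketch of hash-based prefix caching with O(1) LRU eviction.
# Not vLLM's implementation: real KV-cache blocks hold GPU tensors, and real
# eviction only considers blocks not in use by running requests.
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)


class PrefixCache:
    def __init__(self, num_blocks: int):
        self.capacity = num_blocks
        # block_hash -> block_id, ordered from least to most recently used.
        self.blocks: OrderedDict[int, int] = OrderedDict()
        self.free_ids = list(range(num_blocks))

    def get_or_allocate(self, block_hash: int) -> int:
        # Cache hit: O(1) dict lookup, O(1) recency update.
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)
            return self.blocks[block_hash]
        # Cache miss: reuse a free block, or evict the LRU entry in O(1).
        if self.free_ids:
            block_id = self.free_ids.pop()
        else:
            _, block_id = self.blocks.popitem(last=False)  # evict LRU block
        self.blocks[block_hash] = block_id
        return block_id


cache = PrefixCache(num_blocks=2)
a = cache.get_or_allocate(hash(("sys prompt", 0)))
b = cache.get_or_allocate(hash(("sys prompt", 1)))
assert cache.get_or_allocate(hash(("sys prompt", 0))) == a  # hit, no eviction
cache.get_or_allocate(hash(("user prompt", 0)))  # cache full: evicts chunk 1
```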
# Performance
Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **x** higher throughput compared to V0 (*without multi-step scheduling*).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced multimodal support.
- **Llama 3.1 8B, 1xH100**
- **Llama 3.3 70B, 4xH100**
- **Qwen2-VL (VLM), 1xH100**
# Limitations & Future Work
The V1 re-architecture is a continued joint effort across the entire vLLM team and community:
- [Nick Hill](https://github.com/njhill) optimized the engine loop and API server.
- [Ricky Xu](https://github.com/rickyyx) and [Chen Zhang](https://github.com/heheda12345) helped refactor the KV cache manager.
- [Jie Li](https://github.com/jeejeelee) and [Michael Goin](https://github.com/mgoin) helped with MLLM support and optimization.
- [Aaron Pham](https://github.com/aarnphm) is implementing the structured decoding support.
- [Varun Sundar Rabindranath](https://github.com/varun-sundar-rabindranath) is implementing the multi-LoRA support.
- [Andrew Feldman](https://github.com/afeldman-nm) is implementing the log probs and prompt log probs support.
- [Lily Liu](https://github.com/LiuXiaoxuanPKU) is implementing the speculative decoding support.