
Commit 35c5a5e

WIP
Signed-off-by: WoosukKwon <[email protected]>
1 parent 7e26b21 commit 35c5a5e

File tree: 1 file changed (+12, -2 lines)

_posts/2025-01-26-v1.md

@@ -54,7 +54,7 @@ In the [v0.6.0 release](https://blog.vllm.ai/2024/09/05/perf-update.html), vLLM

vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., `{request_id: num_tokens}`, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).
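To make the scheduling representation concrete, here is a minimal sketch of a token-budget scheduler that emits the `{request_id: num_tokens}` mapping described above. The `Request` fields and the `schedule` function are illustrative assumptions, not vLLM V1's actual classes:

```python
# Minimal sketch of a token-budget scheduler that treats prompt and output
# tokens uniformly. Names (Request, schedule, token_budget) are illustrative,
# not vLLM V1's real classes.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_total_tokens: int      # prompt tokens + tokens generated so far
    num_computed_tokens: int   # tokens whose KV cache has already been computed


def schedule(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Return {request_id: num_tokens} for this step, within a fixed budget."""
    scheduled: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        remaining = req.num_total_tokens - req.num_computed_tokens
        if remaining <= 0:
            continue
        # A long prompt is chunked automatically: it gets however much of the
        # budget is left, and the rest is processed in later steps.
        num_tokens = min(remaining, token_budget)
        scheduled[req.request_id] = num_tokens
        token_budget -= num_tokens
    return scheduled


# Example: one partially-prefilled long prompt and one decoding request.
reqs = [Request("req-0", num_total_tokens=8192, num_computed_tokens=2048),
        Request("req-1", num_total_tokens=120, num_computed_tokens=119)]
print(schedule(reqs, token_budget=2048))  # {'req-0': 2048}; req-1 waits a step
```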
-## 3. Fast Prefix Caching
+## 3. Zero-Overhead Prefix Caching

vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, leading to rather decreased performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. This makes V1’s prefix caching introduce near-zero performance degradation, even when the cache hit rate is 0%. **Thanks to this change, we now enable prefix caching by default in V1.**
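To illustrate the constant-time eviction idea, here is a minimal sketch of a hash-based prefix cache whose LRU eviction is O(1) via `collections.OrderedDict`. The `PrefixCache` class and its methods are hypothetical stand-ins, not vLLM V1's actual data structures:

```python
# Minimal sketch of hash-based prefix caching with constant-time LRU eviction.
# OrderedDict gives O(1) move_to_end/popitem, so neither a cache hit nor an
# eviction ever scans the cache. Names are illustrative, not vLLM V1's.
from collections import OrderedDict
from typing import Optional


class PrefixCache:
    def __init__(self, num_blocks: int):
        # Maps block_hash -> block_id, ordered from least to most recently used.
        self.cached: OrderedDict[int, int] = OrderedDict()
        self.free_block_ids = list(range(num_blocks))

    def lookup(self, block_hash: int) -> Optional[int]:
        """Return the cached block for this prefix hash, if any (O(1))."""
        block_id = self.cached.get(block_hash)
        if block_id is not None:
            self.cached.move_to_end(block_hash)  # mark as most recently used
        return block_id

    def allocate(self, block_hash: int) -> int:
        """Allocate a block for a new prefix, evicting the LRU block if full."""
        if self.free_block_ids:
            block_id = self.free_block_ids.pop()
        else:
            _, block_id = self.cached.popitem(last=False)  # O(1) LRU eviction
        self.cached[block_hash] = block_id
        return block_id


# Example: a second request sharing the same prompt prefix hits the cache.
cache = PrefixCache(num_blocks=2)
h = hash(("system prompt", 0))  # hash of the first block of prompt tokens
cache.allocate(h)
assert cache.lookup(h) is not None
```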
@@ -104,7 +104,16 @@ The final piece of the puzzle for vLLM V1 was integrating [FlashAttention 3](htt

# Performance

-Thanks to the extensive improvements in vLLM V1, we have observed significant performance gains, achieving state-of-the-art throughput and latency.
+Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **x** higher throughput compared to V0 (*without multi-step scheduling*).
+These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
+The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs.
+
+- **Llama 3.1 8B, 1xH100**
+
+- **Llama 3.3 70B, 4xH100**
+
+- **Qwen2-VL (VLM), 1xH100**
+

# Limitations & Future Work

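As context for the benchmark bullets added above, here is a rough sketch of how a V0-vs-V1 throughput comparison could be run offline with vLLM's Python API. The `VLLM_USE_V1` environment variable is the opt-in for the V1 alpha; the model, prompt set, and measurement loop are placeholders, not the benchmark harness behind the blog's numbers:

```python
# Rough sketch of an offline throughput comparison between V0 and V1.
# LLM and SamplingParams are vLLM's public Python API; the workload below is a
# placeholder, not the exact benchmark setup used for the reported results.
import os
import time

os.environ["VLLM_USE_V1"] = "1"  # set to "0" to fall back to the V0 engine

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # e.g., on a single H100
prompts = ["Summarize the history of GPUs."] * 256
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens / elapsed:.1f} output tokens/s")
```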
@@ -152,6 +161,7 @@ The V1 re-architecture is a continued joint effort across the entire vLLM team a
- [Nick Hill](https://github.com/njhill) optimized the engine loop and API server.
- [Ricky Xu](https://github.com/rickyyx) and [Chen Zhang](https://github.com/heheda12345) helped refactor the KV cache manager.
- [Jie Li](https://github.com/jeejeelee) and [Michael Goin](https://github.com/mgoin) helped with MLLM support and optimization.
+- [Aaron Pham](https://github.com/aarnphm) is implementing the structured decoding support.
- [Varun Sundar Rabindranath](https://github.com/varun-sundar-rabindranath) is implementing the multi-LoRA support.
- [Andrew Feldman](https://github.com/afeldman-nm) is implementing the log probs and prompt log probs support.
- [Lily Liu](https://github.com/LiuXiaoxuanPKU) is implementing the speculative decoding support.
