vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., `{request_id: num_tokens}`, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).
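To make this concrete, below is a minimal sketch of the idea (illustrative only, not vLLM's actual scheduler code; the `Request` fields and the simple first-come-first-served policy are assumptions for this example). Given a fixed token budget, the scheduler returns a `{request_id: num_tokens}` dictionary covering both prefilling and decoding requests in a single step:

```python
# A minimal sketch of dictionary-based scheduling under a fixed token budget.
# Not vLLM's implementation: Request and the FCFS policy are illustrative.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int        # user-given prompt tokens
    num_computed_tokens: int = 0  # prompt + output tokens processed so far


def schedule(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Decide how many tokens to process for each request this step."""
    decisions: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        # Remaining prompt tokens, or 1 new token if the request is decoding.
        remaining_prefill = req.num_prompt_tokens - req.num_computed_tokens
        want = remaining_prefill if remaining_prefill > 0 else 1
        # Chunked prefill falls out naturally: a long prompt simply gets
        # whatever slice of the budget is left this step.
        num_tokens = min(want, token_budget)
        decisions[req.request_id] = num_tokens
        token_budget -= num_tokens
    return decisions


# Example: two decoding requests scheduled alongside a long prompt that is
# being chunk-prefilled across multiple steps.
print(schedule(
    [Request("R2", num_prompt_tokens=30, num_computed_tokens=45),
     Request("R3", num_prompt_tokens=12, num_computed_tokens=12),
     Request("R1", num_prompt_tokens=8000, num_computed_tokens=2048)],
    token_budget=2048,
))  # -> {'R2': 1, 'R3': 1, 'R1': 2046}
```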
## 3. Zero-Overhead Prefix Caching
vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes incurs significant CPU overhead, noticeably degrading performance when the cache hit rate is low, so it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. As a result, V1's prefix caching introduces near-zero performance degradation, even when the cache hit rate is 0%. **Thanks to this change, we now enable prefix caching by default in V1.**
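As an illustration of the constant-time eviction idea, here is a simplified sketch (not V1's actual KV cache manager; `BLOCK_SIZE`, `CachedBlock` IDs, and `PrefixCache` are made-up names, and it ignores blocks pinned by running requests). A hash-keyed block table combined with an ordered free structure gives O(1) lookup, O(1) recency updates, and O(1) LRU eviction:

```python
# A simplified sketch of hash-based prefix caching with O(1) LRU eviction.
# Not vLLM's implementation: real KV-cache blocks hold GPU tensors, and real
# eviction only considers blocks not in use by running requests.
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)


class PrefixCache:
    def __init__(self, num_blocks: int):
        self.capacity = num_blocks
        # block_hash -> block_id, ordered from least to most recently used.
        self.blocks: OrderedDict[int, int] = OrderedDict()
        self.free_ids = list(range(num_blocks))

    def get_or_allocate(self, block_hash: int) -> int:
        # Cache hit: O(1) dict lookup, O(1) recency update.
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)
            return self.blocks[block_hash]
        # Cache miss: reuse a free block, or evict the LRU entry in O(1).
        if self.free_ids:
            block_id = self.free_ids.pop()
        else:
            _, block_id = self.blocks.popitem(last=False)  # evict LRU block
        self.blocks[block_hash] = block_id
        return block_id


cache = PrefixCache(num_blocks=2)
a = cache.get_or_allocate(hash(("sys prompt", 0)))
b = cache.get_or_allocate(hash(("sys prompt", 1)))
assert cache.get_or_allocate(hash(("sys prompt", 0))) == a  # hit, no eviction
cache.get_or_allocate(hash(("user prompt", 0)))  # cache full: evicts chunk 1
```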
# Performance
Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to **x** higher throughput compared to V0 (*without multi-step scheduling*).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced multimodal support.
- **Llama 3.1 8B, 1xH100**
- **Llama 3.3 70B, 4xH100**
- **Qwen2-VL (VLM), 1xH100**
# Limitations & Future Work
The V1 re-architecture is a continued joint effort across the entire vLLM team and community:
- [Nick Hill](https://github.com/njhill) optimized the engine loop and API server.
- [Ricky Xu](https://github.com/rickyyx) and [Chen Zhang](https://github.com/heheda12345) helped refactor the KV cache manager.
- [Jie Li](https://github.com/jeejeelee) and [Michael Goin](https://github.com/mgoin) helped with MLLM support and optimization.
- [Aaron Pham](https://github.com/aarnphm) is implementing the structured decoding support.
- [Varun Sundar Rabindranath](https://github.com/varun-sundar-rabindranath) is implementing the multi-LoRA support.
- [Andrew Feldman](https://github.com/afeldman-nm) is implementing the log probs and prompt log probs support.
- [Lily Liu](https://github.com/LiuXiaoxuanPKU) is implementing the speculative decoding support.