
Commit e056d8e

txt
Signed-off-by: WoosukKwon <[email protected]>
1 parent 1aef324 commit e056d8e

File tree

1 file changed: +16 -4 lines changed


_posts/2025-01-26-v1.md

Lines changed: 16 additions & 4 deletions
@@ -1,6 +1,6 @@
 ---
 layout: post
-title: "vLLM V1 Alpha Release"
+title: "vLLM V1: A Major Upgrade to vLLM's Core Architecture"
 author: "vLLM Team"
 ---

@@ -116,22 +116,34 @@ Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-th
 These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
 The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1's enhanced support for VLMs.

-- **Llama 3.1 8B & Llama 3.3 70B**
+- **Text Models: Llama 3.1 8B & Llama 3.3 70B**

 <p align="center">
 <picture>
 <img src="/assets/figures/v1/v1_llama.png" width="100%">
 </picture>
 </p>

-- **Qwen2-VL (VLM), 1xH100**
+We measured the performance of vLLM V0 and V1 on Llama 3.1 8B and Llama 3.3 70B models using the ShareGPT dataset.
+V1 demonstrated consistently lower latency than V0, especially at high QPS, thanks to the higher throughput it achieves.
+Given that the kernels used for V0 and V1 are almost identical, the performance difference is mainly due to the architectural improvements (reduced CPU overheads) in V1.
+
+- **Vision-language Models: Qwen2-VL, 1xH100**

 <p align="center">
 <picture>
 <img src="/assets/figures/v1/v1_qwen2vl.png" width="50%">
 </picture>
 </p>

+We evaluated the performance on VLMs by testing Qwen2-VL using the [VisionArena dataset](https://arxiv.org/abs/2412.08687).
+V1 delivered even larger speedups over V0, thanks to its improved VLM support, driven by two key improvements: offloading input processing to a separate process and implementing more flexible scheduling for multimodal queries.
+
+- **Looking Forward**
+
+While these improvements are significant, we view them as just the beginning.
+Our new clean architecture provides a solid foundation for future enhancements, which we plan to roll out in the coming weeks.
+Stay tuned for more updates!

 # Limitations & Future Work

@@ -141,7 +153,7 @@ While vLLM V1 shows promising results, it is still in its alpha stage and lacks
 V1 supports decoder-only Transformers like Llama, mixture-of-experts (MoE) models like Mixtral, and several VLMs such as Qwen2-VL. All quantization methods are supported. However, V1 currently does not support encoder-decoder architectures like multimodal Llama 3.2, Mamba-based models like Jamba, or embedding models. Please check out [our documentation]() for a more detailed list of the supported models.

 **Feature Limitations:**
-V1 currently lacks support for log probs, prompt log probs sampling parameters, pipeline parallelism, structured decoding, speculative decoding, prometheus metrics, and LoRA. We are actively working to close this feature gap and add new optimizations. Please stay tuned!
+V1 currently lacks support for log probs, prompt log probs sampling parameters, pipeline parallelism, structured decoding, speculative decoding, Prometheus metrics, and LoRA. We are actively working to close this feature gap and add brand-new optimizations to the V1 engine.

 **Hardware Support:**
 V1 currently supports only Ampere or later NVIDIA GPUs. We are actively working to extend support to other hardware backends such as TPU.
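For the VLM case mentioned above (Qwen2-VL is among the supported models), a request can be issued through vLLM's offline API roughly as sketched below. The dict-style prompt with `multi_modal_data`, the local `example.jpg` image, and the simplified Qwen2-VL placeholder prompt are illustrative assumptions; the exact template should follow the model's chat format.

```python
# Sketch: a single multimodal request to Qwen2-VL through vLLM's offline API.
# Assumptions (not from this commit): dict-style prompt with multi_modal_data,
# a local example.jpg, and a simplified Qwen2-VL chat-template prompt.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", max_model_len=4096)

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("example.jpg")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Whether V0 or V1 serves the request, the call looks the same; per the description earlier in the diff, V1 moves the image preprocessing into a separate process and schedules such multimodal requests more flexibly.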
