# Qwen2.5-VL Usage Guide

This guide describes how to run the Qwen2.5-VL series with native BF16 precision on NVIDIA GPUs.
Since BF16 is the precision commonly used for Qwen2.5-VL training, running inference in BF16 preserves the best accuracy.

## Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
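
To quickly verify the installation, you can print the installed vLLM version:

```bash
# Sanity check: confirm vLLM is importable in the new environment
python -c "import vllm; print(vllm.__version__)"
```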

## Running Qwen2.5-VL with BF16 on 4xA100

There are two ways to parallelize the model over multiple GPUs: (1) tensor parallelism (TP) and (2) data parallelism (DP). Each has its own advantages: tensor parallelism is usually more beneficial for low-latency / low-load scenarios, while data parallelism works better under heavy loads with many concurrent requests.

To launch the online inference server for Qwen2.5-VL-72B:

```bash
# Start server with BF16 model on 4 GPUs using TP=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --limit-mm-per-prompt '{"image":2,"video":0}'
```
* You can set `--max-model-len` to save memory. By default the model's context length is 128K, but `--max-model-len=65536` is usually enough for most scenarios.
* You can set `--tensor-parallel-size` and `--data-parallel-size` to adjust the parallel strategy. However, TP should be larger than 2 on A100-80GB devices to avoid OOM with the 72B model.
* You can set `--limit-mm-per-prompt` to limit how many multimodal items are allowed per prompt. This is useful if you want to control the incoming traffic of multimodal requests.
* `--mm-encoder-tp-mode` is set to "data" to deploy the multimodal encoder in a DP fashion for better performance. This is because the multimodal encoder is very small compared to the language decoder (a 675M ViT vs. a 72B LM in Qwen2.5-VL-72B), so TP on the ViT provides little gain but incurs significant communication overhead.
* vLLM conservatively uses 90% of GPU memory by default. You can set `--gpu-memory-utilization=0.95` to maximize the KV cache size.

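Once the server is up, you can send multimodal requests through its OpenAI-compatible chat completions API. Below is a minimal `curl` sketch; the image URL is only a placeholder, so substitute your own.

```bash
# Example request against the server started above (localhost:8000 assumed)
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-VL-72B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }],
        "max_tokens": 128
    }'
```
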
For medium-sized models like Qwen2.5-VL-7B, data parallelism usually provides better performance, since it boosts throughput without the heavy communication cost of tensor parallelism. Here is an example of how to launch the server using DP=4:

```bash
# Start server with BF16 model on 4 GPUs using DP=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --data-parallel-size 4 \
    --limit-mm-per-prompt '{"image":2,"video":0}'
```
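
If you have more GPUs available, TP and DP can also be combined. The command below is a minimal sketch, assuming 4 GPUs split into two DP replicas of TP=2 each; adjust both sizes to your hardware and model size.

```bash
# Sketch: two data-parallel replicas, each sharded across 2 GPUs (TP=2 x DP=2 on 4 GPUs)
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --limit-mm-per-prompt '{"image":2,"video":0}'
```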

## Benchmarking

For benchmarking, first launch the server with prefix caching disabled by adding `--no-enable-prefix-caching` to the server command, as shown below.

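For example, the 72B TP=4 launch from the previous section becomes:

```bash
# 72B server launch with prefix caching disabled for benchmarking
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --limit-mm-per-prompt '{"image":2,"video":0}' \
    --no-enable-prefix-caching
```
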
### Qwen2.5-VL-72B Benchmark on the VisionArena-Chat Dataset

Once the server for the 72B model is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
    --host 0.0.0.0 \
    --port 8000 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --endpoint-type openai-chat \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 128
```
* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512, as in the sweep sketch below.

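A minimal sketch of such a sweep, assuming the 72B server from above is still running on port 8000:

```bash
# Sweep --num-prompts to measure throughput at different request batch sizes
for n in 1 16 32 64 128 256 512; do
    echo "=== num-prompts: $n ==="
    vllm bench serve \
        --host 0.0.0.0 \
        --port 8000 \
        --backend openai-chat \
        --endpoint /v1/chat/completions \
        --endpoint-type openai-chat \
        --model Qwen/Qwen2.5-VL-72B-Instruct \
        --dataset-name hf \
        --dataset-path lmarena-ai/VisionArena-Chat \
        --num-prompts "$n"
done
```
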
#### Expected Output

```shell
============ Serving Benchmark Result ============
Successful requests: 128
Benchmark duration (s): 33.40
Total input tokens: 9653
Total generated tokens: 14611
Request throughput (req/s): 3.83
Output token throughput (tok/s): 437.46
Total Token throughput (tok/s): 726.48
---------------Time to First Token----------------
Mean TTFT (ms): 13715.73
Median TTFT (ms): 13254.17
P99 TTFT (ms): 26364.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 171.89
Median TPOT (ms): 157.20
P99 TPOT (ms): 504.86
---------------Inter-token Latency----------------
Mean ITL (ms): 150.41
Median ITL (ms): 56.96
P99 ITL (ms): 614.47
==================================================
```

### Qwen2.5-VL-72B Benchmark on a Random Synthetic Dataset

Once the server for the 72B model is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --num-prompts 128
```
* Test different workloads by adjusting input/output lengths via the `--random-input-len` and `--random-output-len` arguments (see the sketch after this list):
  - **Prompt-heavy**: 8000 input / 1000 output
  - **Decode-heavy**: 1000 input / 8000 output
  - **Balanced**: 1000 input / 1000 output
* Test different batch sizes by changing `--num-prompts`, e.g., 1, 16, 32, 64, 128, 256, 512

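For example, a decode-heavy run only swaps the two length arguments; this sketch reuses the host, port, and prompt count from above:

```bash
# Decode-heavy workload: short prompts, long generations
vllm bench serve \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2.5-VL-72B-Instruct \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 8000 \
    --num-prompts 128
```
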
#### Expected Output

```shell
============ Serving Benchmark Result ============
Successful requests: 128
Benchmark duration (s): 778.74
Total input tokens: 1023598
Total generated tokens: 114351
Request throughput (req/s): 0.16
Output token throughput (tok/s): 146.84
Total Token throughput (tok/s): 1461.27
---------------Time to First Token----------------
Mean TTFT (ms): 305503.01
Median TTFT (ms): 371429.33
P99 TTFT (ms): 730584.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 308.99
Median TPOT (ms): 337.48
P99 TPOT (ms): 542.26
---------------Inter-token Latency----------------
Mean ITL (ms): 297.63
Median ITL (ms): 60.91
P99 ITL (ms): 558.30
==================================================
```

### Qwen2.5-VL-7B Benchmark on the VisionArena-Chat Dataset

Once the server for the 7B model is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
    --host 0.0.0.0 \
    --port 8000 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --endpoint-type openai-chat \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 128
```

#### Expected Output

```shell
============ Serving Benchmark Result ============
Successful requests: 128
Benchmark duration (s): 9.78
Total input tokens: 9653
Total generated tokens: 14227
Request throughput (req/s): 13.09
Output token throughput (tok/s): 1455.11
Total Token throughput (tok/s): 2442.40
---------------Time to First Token----------------
Mean TTFT (ms): 4432.91
Median TTFT (ms): 4751.45
P99 TTFT (ms): 7575.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 58.19
Median TPOT (ms): 45.30
P99 TPOT (ms): 354.21
---------------Inter-token Latency----------------
Mean ITL (ms): 43.86
Median ITL (ms): 17.22
P99 ITL (ms): 653.85
==================================================
```

### Qwen2.5-VL-7B Benchmark on a Random Synthetic Dataset

Once the server for the 7B model is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --num-prompts 128
```

#### Expected Output

```shell
============ Serving Benchmark Result ============
Successful requests: 128
Benchmark duration (s): 45.30
Total input tokens: 1023598
Total generated tokens: 116924
Request throughput (req/s): 2.83
Output token throughput (tok/s): 2581.01
Total Token throughput (tok/s): 25176.17
---------------Time to First Token----------------
Mean TTFT (ms): 10940.59
Median TTFT (ms): 10560.30
P99 TTFT (ms): 21984.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 41.64
Median TPOT (ms): 34.41
P99 TPOT (ms): 177.58
---------------Inter-token Latency----------------
Mean ITL (ms): 33.60
Median ITL (ms): 23.14
P99 ITL (ms): 196.22
==================================================
```

0 commit comments

Comments
 (0)