@@ -17,12 +17,12 @@ uv pip install -U vllm --torch-backend auto
### Weights
[OpenGVLab/InternVL3-8B-hf](https://huggingface.co/OpenGVLab/InternVL3-8B-hf)
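+
+ If you want to pre-download the weights instead of letting vLLM fetch them on first launch, a minimal sketch using the Hugging Face CLI (assuming `huggingface_hub[cli]` is installed) is:
+
+ ```bash
+ # cache the checkpoint locally before starting the server
+ huggingface-cli download OpenGVLab/InternVL3-8B-hf
+ ```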
- ### Running InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards) in eager mode
+ ### Running the InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards)
Launch the online inference server using TP=2:
```bash
export CUDA_VISIBLE_DEVICES=0,1
- vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
+ vllm serve OpenGVLab/InternVL3-8B-hf \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
@@ -31,11 +31,9 @@ vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
## Configs and Parameters
- `--enforce-eager` disables the CUDA Graph in PyTorch; otherwise, it will throw error `torch._dynamo.exc.Unsupported: Data-dependent branching` during testing. For more information about CUDA Graph, please check [Accelerating-pytorch-with-cuda-graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)
+ * You can set `--limit-mm-per-prompt` to limit how many multimodal data items are allowed per prompt. This is useful if you want to control the incoming traffic of multimodal requests, e.g., `--limit-mm-per-prompt '{"image":2, "video":0}'` (see the example request after this list).
- `--tensor-parallel-size` sets Tensor Parallel (TP).
-
- `--data-parallel-size` sets Data-parallel (DP).
+ * You can set `--tensor-parallel-size` and `--data-parallel-size` to adjust the parallelism strategy.
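+
+ Once the server is up, you can send a quick multimodal request to verify the setup. This is a minimal sketch: it assumes the server started above is reachable at `localhost:8000`, the image URL is a placeholder to replace with any image reachable from the server, and the number of `image_url` items per request stays within the configured `--limit-mm-per-prompt` budget.
+
+ ```bash
+ # smoke test against the OpenAI-compatible chat endpoint started above
+ # (replace the image URL with a real, reachable image)
+ curl -s http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "OpenGVLab/InternVL3-8B-hf",
+     "messages": [{
+       "role": "user",
+       "content": [
+         {"type": "text", "text": "Describe this image in one sentence."},
+         {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
+       ]
+     }],
+     "max_tokens": 64
+   }'
+ ```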
@@ -84,7 +82,9 @@ The result would be like this:
### Benchmarking Performance
- Take InternVL3-8B-hf as an example, using random multimodal dataset mentioned in [this vLLM PR](https://github.com/vllm-project/vllm/pull/23119):
+ #### InternVL3-8B-hf on Multimodal Random Dataset
+
+ Take InternVL3-8B-hf as an example, using the random multimodal dataset mentioned in [this vLLM PR](https://github.com/vllm-project/vllm/pull/23119):
```bash
# need to start vLLM service first
@@ -135,3 +135,44 @@ Median ITL (ms): 47.02
P99 ITL (ms): 116.90
==================================================
```
+
+ #### InternVL3-8B-hf on VisionArena-Chat Dataset
+
+ ```bash
+ # need to start vLLM service first
+ vllm bench serve \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --backend openai-chat \
+ --endpoint /v1/chat/completions \
+ --endpoint-type openai-chat \
+ --model OpenGVLab/InternVL3-8B-hf \
+ --dataset-name hf \
+ --dataset-path lmarena-ai/VisionArena-Chat \
+ --num-prompts 1000
+ ```
+ If the benchmark completes successfully, you will see output similar to the following:
+
+ ```
+ ============ Serving Benchmark Result ============
+ Successful requests: 1000
+ Benchmark duration (s): 597.45
+ Total input tokens: 109173
+ Total generated tokens: 109352
+ Request throughput (req/s): 1.67
+ Output token throughput (tok/s): 183.03
+ Total Token throughput (tok/s): 365.76
+ ---------------Time to First Token----------------
+ Mean TTFT (ms): 280208.05
+ Median TTFT (ms): 270322.52
+ P99 TTFT (ms): 582602.60
+ -----Time per Output Token (excl. 1st token)------
+ Mean TPOT (ms): 519.16
+ Median TPOT (ms): 539.03
+ P99 TPOT (ms): 596.74
+ ---------------Inter-token Latency----------------
+ Mean ITL (ms): 593.88
+ Median ITL (ms): 530.72
+ P99 ITL (ms): 4129.92
+ ==================================================
+ ```
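+
+ The TTFT numbers in the VisionArena run are dominated by queueing: by default the benchmark issues all 1000 prompts as fast as possible, so most requests wait in the queue before their first token. To measure latency under a bounded load instead, you can re-run the same command with the load-control options of `vllm bench serve`. This is a sketch; the last two flags below are assumed to be available in your vLLM version, so check `vllm bench serve --help` first.
+
+ ```bash
+ # same benchmark as above, but with the offered load bounded
+ vllm bench serve \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --backend openai-chat \
+ --endpoint /v1/chat/completions \
+ --endpoint-type openai-chat \
+ --model OpenGVLab/InternVL3-8B-hf \
+ --dataset-name hf \
+ --dataset-path lmarena-ai/VisionArena-Chat \
+ --num-prompts 1000 \
+ --request-rate 2 \
+ --max-concurrency 16
+ ```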