Commit ad28b25

Update InternVL3.md

1 parent b2e33b9 commit ad28b25
File tree: 1 file changed

OpenGVLab/InternVL3.md

Lines changed: 48 additions & 7 deletions
@@ -17,12 +17,12 @@ uv pip install -U vllm --torch-backend auto
 ### Weights
 [OpenGVLab/InternVL3-8B-hf](https://huggingface.co/OpenGVLab/InternVL3-8B-hf)
 
-### Running InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards) in eager mode
+### Running InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards)
 
 Launch the online inference server using TP=2:
 ```bash
 export CUDA_VISIBLE_DEVICES=0,1
-vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
+vllm serve OpenGVLab/InternVL3-8B-hf \
     --host 0.0.0.0 \
     --port 8000 \
     --tensor-parallel-size 2 \
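Once the server launched above is running, a quick way to verify the multimodal path end to end is to send an OpenAI-compatible chat request containing a single image. The snippet below is a minimal sketch, not part of the recipe itself; the image URL is only a placeholder, and the port must match the one passed to `vllm serve`:

```bash
# Smoke test against the OpenAI-compatible endpoint started above.
# The image URL is a placeholder; substitute any reachable image.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "OpenGVLab/InternVL3-8B-hf",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."}
            ]
        }],
        "max_tokens": 64
    }'
```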
@@ -31,11 +31,9 @@ vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
 
 ## Configs and Parameters
 
-`--enforce-eager` disables the CUDA Graph in PyTorch; otherwise, it will throw error `torch._dynamo.exc.Unsupported: Data-dependent branching` during testing. For more information about CUDA Graph, please check [Accelerating-pytorch-with-cuda-graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)
+* You can set `--limit-mm-per-prompt` to limit how many multimodal items are allowed for each prompt. This is useful if you want to control the incoming traffic of multimodal requests, e.g. `--limit-mm-per-prompt '{"image":2, "video":0}'`.
 
-`--tensor-parallel-size` sets Tensor Parallel (TP).
-
-`--data-parallel-size` sets Data-parallel (DP).
+* You can set `--tensor-parallel-size` and `--data-parallel-size` to adjust the parallel strategy; a combined example follows below.
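Putting these flags together, a full launch command might look like the sketch below; the specific limits and parallel sizes are illustrative, not tuned recommendations:

```bash
# Sketch: the parallelism and multimodal-limit flags discussed above, combined.
# Values are illustrative; adjust them for your hardware and expected traffic.
export CUDA_VISIBLE_DEVICES=0,1
vllm serve OpenGVLab/InternVL3-8B-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --limit-mm-per-prompt '{"image":2, "video":0}'
```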
@@ -84,7 +82,9 @@ The result would be like this:
 
 ### Benchmarking Performance
 
-Take InternVL3-8B-hf as an example, using random multimodal dataset mentioned in [this vLLM PR](https://github.com/vllm-project/vllm/pull/23119):
+#### InternVL3-8B-hf on Multimodal Random Dataset
+
+Take InternVL3-8B-hf as an example, using the random multimodal dataset mentioned in [this vLLM PR](https://github.com/vllm-project/vllm/pull/23119):
 
 ```bash
 # need to start vLLM service first
@@ -135,3 +135,44 @@ Median ITL (ms): 47.02
 P99 ITL (ms): 116.90
 ==================================================
 ```
+
+#### InternVL3-8B-hf on VisionArena-Chat Dataset
+
+```bash
+# need to start vLLM service first
+vllm bench serve \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --backend openai-chat \
+    --endpoint /v1/chat/completions \
+    --endpoint-type openai-chat \
+    --model OpenGVLab/InternVL3-8B-hf \
+    --dataset-name hf \
+    --dataset-path lmarena-ai/VisionArena-Chat \
+    --num-prompts 1000
+```
+If it works successfully, you will see output like the following:
+
+```
+============ Serving Benchmark Result ============
+Successful requests: 1000
+Benchmark duration (s): 597.45
+Total input tokens: 109173
+Total generated tokens: 109352
+Request throughput (req/s): 1.67
+Output token throughput (tok/s): 183.03
+Total Token throughput (tok/s): 365.76
+---------------Time to First Token----------------
+Mean TTFT (ms): 280208.05
+Median TTFT (ms): 270322.52
+P99 TTFT (ms): 582602.60
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 519.16
+Median TPOT (ms): 539.03
+P99 TPOT (ms): 596.74
+---------------Inter-token Latency----------------
+Mean ITL (ms): 593.88
+Median ITL (ms): 530.72
+P99 ITL (ms): 4129.92
+==================================================
+```
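The run above uses the default, unthrottled request rate, which largely explains why mean TTFT climbs into the hundreds of seconds. If you want to benchmark under a bounded load instead, recent vLLM versions expose rate and concurrency controls on the bench CLI; treat the two extra flags in the sketch below as assumptions and confirm them with `vllm bench serve --help` first:

```bash
# Same benchmark, but with load shaping.
# --request-rate and --max-concurrency are assumed to exist on this CLI
# (verify with `vllm bench serve --help` for your installed version).
vllm bench serve \
    --host 0.0.0.0 \
    --port 8000 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --endpoint-type openai-chat \
    --model OpenGVLab/InternVL3-8B-hf \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --request-rate 2 \
    --max-concurrency 32
```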

0 commit comments

Comments
 (0)