
Commit 9c0e095

committed: Update readme.
Signed-off-by: congw729 <[email protected]>
1 parent 4b2d034 · commit 9c0e095

File tree

2 files changed (+252, −0 lines)


InternVL/InternVL3.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
# InternVL3 Usage Guide

This guide describes how to run the InternVL3 series on NVIDIA GPUs.

[InternVL3](https://huggingface.co/collections/OpenGVLab/internvl3-67f7f690be79c2fe9d74fe9d) is a powerful multimodal model that combines vision and language understanding capabilities. This recipe provides step-by-step instructions for running InternVL3 with vLLM, optimized for various hardware configurations.

## Deployment Steps

### Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
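To confirm the environment before moving on, you can print the installed vLLM version (an optional sanity check):

```bash
# Optional: verify that vLLM imports cleanly and report its version.
python -c "import vllm; print(vllm.__version__)"
```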
### Weights

[OpenGVLab/InternVL3-8B-hf](https://huggingface.co/OpenGVLab/InternVL3-8B-hf)
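vLLM downloads the weights automatically on first launch. If you prefer to fetch them ahead of time, you can do so with the Hugging Face CLI (bundled with the `huggingface_hub` package that vLLM depends on):

```bash
# Optional: pre-fetch the model weights into the local Hugging Face cache.
huggingface-cli download OpenGVLab/InternVL3-8B-hf
```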
19+
20+
### Running InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards) in eager mode
21+
22+
Launch the online inference server using TP=2:
23+
```bash
24+
export CUDA_VISIBLE_DEVICES=0,1
25+
vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
26+
--host 0.0.0.0 \
27+
--port 8000 \
28+
--tensor-parallel-size 2 \
29+
--data-parallel-size 1
30+
```
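Startup can take a few minutes while the weights are downloaded and loaded. Once the server logs show it is ready, you can confirm the model is registered via the OpenAI-compatible API:

```bash
# List the models served by the running instance.
curl http://localhost:8000/v1/models
```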
## Configs and Parameters

`--enforce-eager` disables CUDA Graph capture in PyTorch; without it, the server throws `torch._dynamo.exc.Unsupported: Data-dependent branching` during testing. For more information about CUDA Graphs, see [Accelerating PyTorch with CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/).

`--tensor-parallel-size` sets the tensor-parallel (TP) degree.

`--data-parallel-size` sets the data-parallel (DP) degree.
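For example, on a node with four GPUs you could combine the two, serving two data-parallel replicas with two-way tensor parallelism each (an illustrative layout, not part of the original recipe):

```bash
# Hypothetical 4-GPU layout: 2 data-parallel replicas, each with TP=2.
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --data-parallel-size 2
```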
## Validation & Expected Behavior

### Basic Test

Open another terminal and run the following command:

```bash
# The vLLM server must be running first.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

The response should look similar to this:

```json
{
  "id": "cmpl-1ed0df81b56448afa597215a8725c686",
  "object": "text_completion",
  "created": 1755739470,
  "model": "OpenGVLab/InternVL3-8B-hf",
  "choices": [
    {
      "index": 0,
      "text": " The capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 35,
    "total_tokens": 43,
    "completion_tokens": 8,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
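Because InternVL3 is a multimodal model, you can also exercise the vision path through the OpenAI-compatible Chat Completions endpoint. A minimal sketch, assuming the server above is running; the image URL is only a placeholder:

```bash
# Multimodal request: ask the model to describe an image.
# Replace the placeholder URL with any image reachable from the server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL3-8B-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
      ]
    }],
    "max_tokens": 100
  }'
```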
### Benchmarking Performance

Take InternVL3-8B-hf as an example:

```bash
# The vLLM server must be running first.
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model OpenGVLab/InternVL3-8B-hf \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompts 50 \
  --ignore-eos
```

If the benchmark completes successfully, you will see output similar to the following:

```
============ Serving Benchmark Result ============
Successful requests:                     497
Benchmark duration (s):                  229.42
Total input tokens:                      507680
Total generated tokens:                  62259
Request throughput (req/s):              2.17
Output token throughput (tok/s):         271.37
Total Token throughput (tok/s):          2484.22
---------------Time to First Token----------------
Mean TTFT (ms):                          102429.40
Median TTFT (ms):                        99644.38
P99 TTFT (ms):                           213820.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          664.26
Median TPOT (ms):                        776.39
P99 TPOT (ms):                           848.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           661.73
Median ITL (ms):                         844.15
P99 ITL (ms):                            856.42
==================================================
```

OpenGVLab/InternVL3.md

Lines changed: 126 additions & 0 deletions
Identical in content to `InternVL/InternVL3.md` above.
