# InternVL3 Usage Guide

This guide describes how to run the InternVL3 series on NVIDIA GPUs.

[InternVL3](https://huggingface.co/collections/OpenGVLab/internvl3-67f7f690be79c2fe9d74fe9d) is a powerful multimodal model that combines vision and language understanding capabilities. This recipe provides step-by-step instructions for running InternVL3 with vLLM, optimized for various hardware configurations.

## Deployment Steps

### Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
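As an optional sanity check before serving, you can confirm that vLLM imports cleanly and print its version:

```bash
# Quick check that the install succeeded
python -c "import vllm; print(vllm.__version__)"
```
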
### Weights

[OpenGVLab/InternVL3-8B-hf](https://huggingface.co/OpenGVLab/InternVL3-8B-hf)
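Optionally, pre-download the checkpoint so the first server start does not block on the download. This sketch assumes the `huggingface-cli` tool is available (it ships with `huggingface_hub`, a vLLM dependency):

```bash
# Optional: fetch the weights into the local Hugging Face cache ahead of time
huggingface-cli download OpenGVLab/InternVL3-8B-hf
```
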
### Running the InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards) in eager mode

Launch the online inference server with TP=2:
```bash
export CUDA_VISIBLE_DEVICES=0,1
vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --data-parallel-size 1
```
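Loading the weights can take a while. Before sending requests, you can poll the OpenAI-compatible models endpoint to confirm the server is ready (a quick readiness check, not part of the launch itself):

```bash
# Returns the served model list once the server is up
curl http://localhost:8000/v1/models
```
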
## Configs and Parameters

`--enforce-eager` forces PyTorch eager mode and disables CUDA Graphs; without it, serving this model failed with `torch._dynamo.exc.Unsupported: Data-dependent branching` in testing. For more information about CUDA Graphs, see [Accelerating PyTorch with CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/).

`--tensor-parallel-size` sets the tensor-parallel (TP) degree, i.e. the number of GPUs each model replica is sharded across.

`--data-parallel-size` sets the data-parallel (DP) degree, i.e. the number of independent model replicas serving requests; an alternative DP layout for the same two GPUs is sketched below.

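For comparison, if a single 40 GB card has enough memory for the 8B weights plus KV cache (not verified here), the same two GPUs could instead host two independent single-GPU replicas, which can raise aggregate throughput at the cost of TP's lower per-request latency:

```bash
# Hypothetical alternative layout: two DP replicas, each on one GPU.
# Assumes the model fits in a single GPU's memory.
export CUDA_VISIBLE_DEVICES=0,1
vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --data-parallel-size 2
```
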
## Validation & Expected Behavior

### Basic Test
Open another terminal and run the following command:
```bash
# the vLLM server from the previous step must be running
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
The result should look similar to this:
```json
{
  "id": "cmpl-1ed0df81b56448afa597215a8725c686",
  "object": "text_completion",
  "created": 1755739470,
  "model": "OpenGVLab/InternVL3-8B-hf",
  "choices": [
    {
      "index": 0,
      "text": " The capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 35,
    "total_tokens": 43,
    "completion_tokens": 8,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
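Since InternVL3 is a vision-language model, it is also worth exercising the image path through the OpenAI-compatible chat endpoint. A minimal sketch, assuming the same server is running; the image URL is a placeholder to replace with a real, reachable image:

```bash
# Multimodal smoke test via /v1/chat/completions (image URL is a placeholder)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL3-8B-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}}
      ]
    }],
    "max_tokens": 100
  }'
```
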
### Benchmarking Performance

Take InternVL3-8B-hf as an example:

```bash
# the vLLM server must be running first
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model OpenGVLab/InternVL3-8B-hf \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 50 \
  --ignore-eos
```
If the benchmark runs successfully, you will see output similar to the following.

```
============ Serving Benchmark Result ============
Successful requests:                     497
Benchmark duration (s):                  229.42
Total input tokens:                      507680
Total generated tokens:                  62259
Request throughput (req/s):              2.17
Output token throughput (tok/s):         271.37
Total Token throughput (tok/s):          2484.22
---------------Time to First Token----------------
Mean TTFT (ms):                          102429.40
Median TTFT (ms):                        99644.38
P99 TTFT (ms):                           213820.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          664.26
Median TPOT (ms):                        776.39
P99 TPOT (ms):                           848.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           661.73
Median ITL (ms):                         844.15
P99 ITL (ms):                            856.42
==================================================
```
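
To keep results for comparison across configurations (e.g. TP=2 vs. DP=2), the benchmark can also persist its metrics to a JSON file. The `--save-result` and `--result-filename` flags and the filename below are assumptions based on recent vLLM versions; check `vllm bench serve --help` if yours differ:

```bash
# Same benchmark as above, additionally saving metrics to a JSON file
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model OpenGVLab/InternVL3-8B-hf \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 50 \
  --ignore-eos \
  --save-result \
  --result-filename internvl3_tp2_bench.json
```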