`Llama/Llama3.3-70B.md` (55 additions, 48 deletions)
@@ -6,7 +6,6 @@ This quick start recipe provides step-by-step instructions for running the Llama
 The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—building a docker image with vLLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.

-
 ## Access & Licensing

 ### License
@@ -34,31 +33,19 @@ For Hopper, FP8 offers the best performance for most workloads. For Blackwell, N

 ## Deployment Steps

- ### Build Docker Image
+ ### Pull Docker Image

- Build a docker image with vLLM using the official vLLM Dockerfile at a specific commit (`dc5e4a653c859573dfcca99f1b0141c2db9f94cc`) on the main branch. This commit contains more performance optimizations compared to the latest official vLLM docker image (`vllm/vllm-openai:latest`).
+ Pull the vLLM release docker image for a specific commit (`de533ab2a14192e461900a4950e2b426d99a6862`) on the main branch and tag it as `vllm/vllm-openai:deploy`. This commit contains more performance optimizations compared to the latest official vLLM docker image (`vllm/vllm-openai:latest`).

- `build_image.sh`
- ```
- # Clone the vLLM GitHub repo and check out the specific commit.
- git clone -b main --single-branch https://github.com/vllm-project/vllm.git
- Note: building the docker image may use lots of CPU threads and CPU memory. If you build the docker image on machines with fewer CPU cores or less CPU memory, please reduce the value of `max_jobs`.
+ docker tag public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:de533ab2a14192e461900a4950e2b426d99a6862 vllm/vllm-openai:deploy
+ ```
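The collapsed lines of this hunk hide the `docker pull` step itself; below is a minimal sketch of the full pull-and-tag script, assuming the image is pulled from the same registry path that appears in the `docker tag` line (the file name `pull_image.sh` is an assumption, not taken from the diff):

`pull_image.sh`
```
# Pull the vLLM post-merge CI image built at the pinned commit.
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:de533ab2a14192e461900a4950e2b426d99a6862

# Re-tag it locally so later commands can refer to a stable name.
docker tag public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:de533ab2a14192e461900a4950e2b426d99a6862 vllm/vllm-openai:deploy
```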

 ### Run Docker Container

@@ -73,6 +60,29 @@ Note: You can mount additional directories and paths using the `-v <local_path>:

 The `-e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME"` flags are added so that the models are downloaded using your HuggingFace token and the downloaded models can be cached in `$HF_HOME`. Refer to the [HuggingFace documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information about these environment variables, and to the [HuggingFace Quickstart guide](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) for the steps to generate your HuggingFace access token.
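The `docker run` command referenced by this hunk is not shown; a minimal sketch under the assumptions that the container is started from the `vllm/vllm-openai:deploy` image tagged above, exposes port 8000, and drops into a shell for the install steps that follow (the GPU, IPC, entrypoint, and mount flags here are illustrative, not copied from the recipe):

```
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME="$HF_HOME" \
  -v "$HF_HOME":"$HF_HOME" \
  --entrypoint /bin/bash \
  vllm/vllm-openai:deploy
```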

+ ### Install Latest NCCL
+
+ The default NCCL version in the docker container may lead to long NCCL initialization time on Blackwell architecture. Therefore, install `nvidia-nccl-cu12==2.26.2.post1` to fix it. Refer to [this GitHub issue](https://github.com/vllm-project/vllm/issues/20862) for more information.
+
+ `install_nccl.sh`
+ ```
+ pip uninstall -y nvidia-nccl-cu12
+ pip install nvidia-nccl-cu12==2.26.2.post1
+ ```
+
+ ### Install Latest FlashInfer
+
+ The default FlashInfer version (v0.2.4.post1) in the docker container has some functional issues. Therefore, reinstall FlashInfer at commit `9720182476ede910698f8d783c29b2ec91cec023` to fix it.
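The reinstall script itself is collapsed in this hunk; a minimal sketch, assuming FlashInfer is reinstalled from the GitHub repository at the pinned commit (the package name `flashinfer-python` and the exact pip invocation are assumptions, not taken from the diff):

```
# Remove the preinstalled FlashInfer build, then install from source at the pinned commit.
pip uninstall -y flashinfer-python
pip install git+https://github.com/flashinfer-ai/flashinfer.git@9720182476ede910698f8d783c29b2ec91cec023
```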
Below is an example command to launch the vLLM server with the Llama-3.3-70B-Instruct-FP4/FP8 model. The explanation of each flag is shown in the "Configs and Parameters" section.
@@ -83,15 +93,14 @@ Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruc
 # They will be removed when the performance optimizations have been verified and enabled by default.
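The full launch command is collapsed in this diff; a minimal sketch assembled only from the flags documented in the "Configs and Parameters" section below, assuming the FP4 checkpoint on Blackwell and port 8000 (tensor-parallel settings and the environment variables referred to in the comment above are omitted):

```
vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --async-scheduling \
  --no-enable-prefix-caching \
  --pipeline-parallel-size 1 \
  --compilation-config '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
```

On Hopper, the flag notes below indicate that `--compilation-config` and `--async-scheduling` should be dropped.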
After the server is set up, the client can now send prompt requests to the server and receive results.
@@ -133,15 +143,12 @@ You can specify the IP address and the port that you would like to run the serve

 Below are the config flags that we do not recommend changing or tuning:

- -`--tokenizer`: Specify the path to the model file.
- -`--quantization`: Must be `modelopt` for the FP8 model and `modelopt_fp4` for the FP4 model.
 -`--kv-cache-dtype`: KV cache data type. We recommend setting it to `fp8` for best performance.
 -`--trust-remote-code`: Trust the model code.
 -`--gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor. We recommend setting it to `0.9` to use up to 90% of the GPU memory.
- -`--compilation-config`: Configuration for the vLLM compilation stage. We recommend setting it to `'{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"full_cuda_graph":true}'` to enable all the necessary fusions for the best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
- - We are trying to enable these fusions by default so that this flag is no longer needed in the future.
- -`--enable-chunked-prefill`: Enable the chunked prefill stage. We recommend always adding this flag for best performance.
+ -`--compilation-config`: Configuration for the vLLM compilation stage. We recommend setting it to `'{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'` to enable all the necessary fusions for the best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
 -`--async-scheduling`: Enable asynchronous scheduling to reduce the host overheads between decoding steps. We recommend always adding this flag for best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
+ -`--enable-chunked-prefill`: Enable the chunked prefill stage. We recommend always adding this flag for best performance.
 -`--no-enable-prefix-caching`: Disable prefix caching. We recommend always adding this flag if running with a synthetic dataset for consistent performance measurement.
 -`--pipeline-parallel-size`: Pipeline parallelism size. We recommend setting it to `1` for best performance.
@@ -163,11 +170,11 @@ Refer to the "Balancing between Throughput and Latencies" about how to adjust th

 ### Basic Test

 After the vLLM server is set up and shows `Application startup complete`, you can send requests to the server.

 `run_basic_test.sh`
 ```
- curl http://0.0.0.0:8080/v1/completions -H "Content-Type: application/json" -d '{ "model": "nvidia/Llama-3.3-70B-Instruct-FP4", "prompt": "San Francisco is a", "max_tokens": 20, "temperature": 0 }'
+ curl http://0.0.0.0:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "nvidia/Llama-3.3-70B-Instruct-FP4", "prompt": "San Francisco is a", "max_tokens": 20, "temperature": 0 }'
 ```

 Here is an example response, showing that the vLLM server returns "*city that is known for its vibrant culture, stunning architecture, and breathtaking natural beauty. From the iconic...*", completing the input sequence with up to 20 tokens.
@@ -215,7 +222,7 @@ To benchmark the performance, you can use the `vllm bench serve` command.
 ```
 vllm bench serve \
 --host 0.0.0.0 \
- --port 8080 \
+ --port 8000 \
 --model nvidia/Llama-3.3-70B-Instruct-FP4 \
 --trust-remote-code \
 --dataset-name random \
@@ -237,9 +244,9 @@ Explanations for the flags:
 -`--num-prompts`: Total number of prompts used for performance benchmarking. We recommend setting it to at least five times the `--max-concurrency` value to measure steady-state performance.
 -`--save-result --result-filename`: Output location for the performance benchmarking result.
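The tail of the benchmark invocation is collapsed in this diff; a minimal end-to-end sketch using only the flags discussed in this section (the concurrency and prompt-count values are illustrative, chosen to respect the at-least-five-times guidance above, and the result filename is an assumption):

```
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model nvidia/Llama-3.3-70B-Instruct-FP4 \
  --trust-remote-code \
  --dataset-name random \
  --max-concurrency 64 \
  --num-prompts 320 \
  --save-result \
  --result-filename llama33_70b_fp4_benchmark.json
```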

- ### Interpreting `benchmark_serving.py` Output
+ ### Interpreting Performance Benchmarking Output

- Sample output by the `benchmark_serving.py` script:
+ Sample output from the `vllm bench serve` command:

 ```
 ============ Serving Benchmark Result ============
@@ -272,11 +279,11 @@ P99 E2EL (ms): xxx.xx
 Explanations for key metrics:

 -`Median Time to First Token (TTFT)`: The typical time elapsed from when a request is sent until the first output token is generated.
 -`Median Time Per Output Token (TPOT)`: The typical time required to generate each token after the first one.
 -`Median Inter-Token Latency (ITL)`: The typical time delay between the completion of one token and the completion of the next.
 -`Median End-to-End Latency (E2EL)`: The typical total time from when a request is submitted until the final token of the response is received.
 -`Output token throughput`: The rate at which the system generates output (generated) tokens.
 -`Total Token Throughput`: The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
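As a rough consistency check when reading these metrics, the per-request latencies are related approximately as shown below (assuming TPOT is averaged over every output token after the first):

```
E2EL ≈ TTFT + TPOT × (number of output tokens − 1)
```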