
Commit 6ea9d90

[Docs] Update Llama3/4 and GPT-OSS recipe for NVIDIA GPUs
Signed-off-by: Po-Han Huang <[email protected]>
1 parent 604e5e6 commit 6ea9d90

File tree: 3 files changed, +240 -96 lines changed

Llama/Llama3.3-70B.md

Lines changed: 32 additions & 41 deletions
@@ -6,7 +6,6 @@ This quick start recipe provides step-by-step instructions for running the Llama
 
 The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—building a docker image with vLLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.
 
-
 ## Access & Licensing
 
 ### License
@@ -34,32 +33,16 @@ For Hopper, FP8 offers the best performance for most workloads. For Blackwell, N
 
 ## Deployment Steps
 
-### Build Docker Image
+### Pull Docker Image
 
-Build a docker image with vLLM using the official vLLM Dockerfile at a specific commit (`dc5e4a653c859573dfcca99f1b0141c2db9f94cc`) on the main branch. This commit contains more performance optimizations compared to the latest official vLLM docker image (`vllm/vllm-openai:latest`).
+Pull the vLLM post-merge docker image for a specific commit (`a5203d04dffcbdb095651ca4bf06589409370301`) on the main branch and tag it as `vllm/vllm-openai:deploy`. This commit contains more performance optimizations compared to the latest official vLLM docker image (`vllm/vllm-openai:latest`).
 
-`build_image.sh`
+`pull_image.sh`
 ```
-# Clone the vLLM GitHub repo and checkout the spcific commit.
-git clone -b main --single-branch https://github.com/vllm-project/vllm.git
-cd vllm
-git checkout dc5e4a653c859573dfcca99f1b0141c2db9f94cc
-
-# Build the docker image using official vLLM Dockerfile.
-DOCKER_BUILDKIT=1 docker build . \
-    --file docker/Dockerfile \
-    --target vllm-openai \
-    --build-arg CUDA_VERSION=12.8.1 \
-    --build-arg max_jobs=32 \
-    --build-arg nvcc_threads=2 \
-    --build-arg RUN_WHEEL_CHECK=false \
-    --build-arg torch_cuda_arch_list="9.0+PTX 10.0+PTX" \
-    --build-arg vllm_fa_cmake_gpu_arches="90-real;100-real" \
-    -t vllm/vllm-openai:deploy
+docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:a5203d04dffcbdb095651ca4bf06589409370301
+docker tag public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:a5203d04dffcbdb095651ca4bf06589409370301 vllm/vllm-openai:deploy
 ```
 
-Note: building the docker image may use lots of CPU threads and CPU memory. If you build the docker image on machines with fewer CPU cores or less CPU memory, please reduce the value of `max_jobs`.
-
 ### Run Docker Container
 
 Run the docker container using the docker image `vllm/vllm-openai:deploy`.
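A quick way to confirm that the pull-and-tag step above succeeded before starting the container is to list the tagged image (a minimal check, not part of the recipe itself):

```
# The tagged image should be listed; its digest matches the pulled post-merge image.
docker images vllm/vllm-openai:deploy
```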
@@ -73,6 +56,16 @@ Note: You can mount additional directories and paths using the `-v <local_path>:
 
 The `-e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME"` flags are added so that the models are downloaded using your HuggingFace token and the downloaded models can be cached in $HF_HOME. Refer to [HuggingFace documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information about these environment variables and refer to [HuggingFace Quickstart guide](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) about steps to generate your HuggingFace access token.
 
+### Install Latest NCCL
+
+The default NCCL version in the docker container may lead to long NCCL initialization time on Blackwell architecture. Therefore, install `nvidia-nccl-cu12==2.26.2.post1` to fix it. Refer to [this GitHub issue](https://github.com/vllm-project/vllm/issues/20862) for more information.
+
+`install_nccl.sh`
+```
+pip uninstall -y nvidia-nccl-cu12
+pip install nvidia-nccl-cu12==2.26.2.post1
+```
+
 ### Launch the vLLM Server
 
 Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model. The explanation of each flag is shown in the "Configs and Parameters" section.
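The full `docker run` command sits outside this hunk; below is a minimal sketch of a container launch consistent with the notes above. The GPU, IPC, mount, and port flags are illustrative assumptions rather than the recipe's exact values:

```
# Illustrative interactive launch: expose all GPUs, pass the HuggingFace token and cache,
# and publish the port that the vLLM server will listen on later.
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME" \
  -v "$HF_HOME":"$HF_HOME" \
  -p 8080:8080 \
  --entrypoint /bin/bash \
  vllm/vllm-openai:deploy
```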
@@ -83,15 +76,12 @@ Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruc
 # They will be removed when the performance optimizations have been verified and enabled by default.
 COMPUTE_CAPABILITY=$(nvidia-smi -i 0 --query-gpu=compute_cap --format=csv,noheader)
 if [ "$COMPUTE_CAPABILITY" = "10.0" ]; then
-    # Use FlashInfer backend for attentions
-    export VLLM_ATTENTION_BACKEND=FLASHINFER
-    # Use FlashInfer trtllm-gen attention kernels
-    export VLLM_USE_TRTLLM_ATTENTION=1
     # Enable async scheduling
     ASYNC_SCHEDULING_FLAG="--async-scheduling"
-    # Enable FlashInfer fusions
-    FUSION_FLAG='{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"full_cuda_graph":true}'
+    # Enable vLLM fusions and cuda graphs
+    FUSION_FLAG='{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
     # Use FP4 for Blackwell architecture
+    # Change this to FP8 to run FP8 on Blackwell architecture
     DTYPE="FP4"
 else
     # Disable async scheduling on Hopper architecture due to vLLM limitations
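If it is unclear which branch of the snippet above a given node will take, the same query can be run on its own; `9.0` maps to Hopper and `10.0` to Blackwell in this recipe:

```
# Print compute capability and GPU name for GPU 0 (e.g., "9.0, NVIDIA H100 80GB HBM3").
nvidia-smi -i 0 --query-gpu=compute_cap,name --format=csv,noheader
```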
@@ -102,18 +92,20 @@ else
     DTYPE="FP8"
 fi
 
+# Disable prefix caching when running with synthetic dataset for consistent performance measurement.
+NO_PREFIX_CACHING_FLAG="--no-enable-prefix-caching"
+
 # Launch the vLLM server
 vllm serve nvidia/Llama-3.3-70B-Instruct-$DTYPE \
     --host 0.0.0.0 \
     --port 8080 \
-    --tokenizer nvidia/Llama-3.3-70B-Instruct-$DTYPE \
     --kv-cache-dtype fp8 \
     --trust-remote-code \
     --gpu-memory-utilization 0.9 \
-    --compilation-config $FUSION_FLAG \
-    $ASYNC_SCHEDULING_FLAG \
+    --compilation-config ${FUSION_FLAG} \
+    ${ASYNC_SCHEDULING_FLAG} \
     --enable-chunked-prefill \
-    --no-enable-prefix-caching \
+    ${NO_PREFIX_CACHING_FLAG} \
     --pipeline-parallel-size 1 \
     --tensor-parallel-size 1 \
     --max-num-seqs 512 \
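The serve command continues beyond this hunk; once it is running, readiness can be checked before sending any traffic. A small sketch assuming the host and port from the launch command above:

```
# Poll the health endpoint until the server returns HTTP 200
# (around the time the log prints "Application startup complete").
until curl -sf http://0.0.0.0:8080/health > /dev/null; do
  sleep 5
done
echo "vLLM server is ready"
```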
@@ -128,7 +120,7 @@ After the server is set up, the client can now send prompt requests to the serve
 
 You can specify the IP address and the port that you would like to run the server with using these flags:
 
-- `--host`: IP address of the server.
+- `--host`: IP address of the server.
 - `--port`: The port to listen to by the server.
 
 Below are the config flags that we do not recommend changing or tuning with:
@@ -138,10 +130,9 @@ Below are the config flags that we do not recommend changing or tuning with:
 - `--kv-cache-dtype`: Kv-cache data type. We recommend setting it to `fp8` for best performance.
 - `--trust-remote-code`: Trust the model code.
 - `--gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor. We recommend setting it to `0.9` to use up to 90% of the GPU memory.
-- `--compilation-config`: Configuration for vLLM compilation stage. We recommend setting it to `'{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"full_cuda_graph":true}'` to enable all the necessary fusions for the best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
-  - We are trying to enable these fusions by default so that this flag is no longer needed in the future.
-- `--enable-chunked-prefill`: Enable chunked prefill stage. We recommend always adding this flag for best performance.
+- `--compilation-config`: Configuration for vLLM compilation stage. We recommend setting it to `'{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'` to enable all the necessary fusions for the best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
 - `--async-scheduling`: Enable asynchronous scheduling to reduce the host overheads between decoding steps. We recommend always adding this flag for best performance on Blackwell architecture. However, this feature is not supported on Hopper architecture yet.
+- `--enable-chunked-prefill`: Enable chunked prefill stage. We recommend always adding this flag for best performance.
 - `--no-enable-prefix-caching`: Disable prefix caching. We recommend always adding this flag if running with a synthetic dataset for consistent performance measurement.
 - `--pipeline-parallel-size`: Pipeline parallelism size. We recommend setting it to `1` for best performance.
 
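Because `--compilation-config` is parsed as JSON, shell-quoting mistakes are easy to introduce when editing the recommended value; a quick sanity check of the string before launching (a convenience sketch, not part of the recipe):

```
# Exits non-zero with a parse error if the fusion config is not valid JSON.
FUSION_FLAG='{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
echo "$FUSION_FLAG" | python3 -m json.tool > /dev/null && echo "valid JSON"
```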
@@ -163,7 +154,7 @@ Refer to the "Balancing between Throughput and Latencies" about how to adjust th
 
 ### Basic Test
 
-After the vLLM server is set up and shows `Application startup complete`, you can send requests to the server
+After the vLLM server is set up and shows `Application startup complete`, you can send requests to the server
 
 `run_basic_test.sh`
 ```
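The body of `run_basic_test.sh` is unchanged by this commit and not shown in the hunk. A hedged example of such a request against the server launched above, using the OpenAI-compatible completions endpoint and assuming the FP8 model variant is being served:

```
# Send a single completion request; adjust the model name to match the variant being served.
curl http://0.0.0.0:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/Llama-3.3-70B-Instruct-FP8", "prompt": "What is the capital of France?", "max_tokens": 32}'
```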
@@ -237,9 +228,9 @@ Explanations for the flags:
 - `--num-prompts`: Total number of prompts used for performance benchmarking. We recommend setting it to at least five times the `--max-concurrency` value to measure the steady-state performance.
 - `--save-result --result-filename`: Output location for the performance benchmarking result.
 
-### Interpreting `benchmark_serving.py` Output
+### Interpreting Performance Benchmarking Output
 
-Sample output by the `benchmark_serving.py` script:
+Sample output by the `vllm bench serve` command:
 
 ```
 ============ Serving Benchmark Result ============
@@ -272,11 +263,11 @@ P99 E2EL (ms): xxx.xx
 Explanations for key metrics:
 
 - `Median Time to First Token (TTFT)`: The typical time elapsed from when a request is sent until the first output token is generated.
-- `Median Time Per Output Token (TPOT)`: The typical time required to generate each token after the first one.
+- `Median Time Per Output Token (TPOT)`: The typical time required to generate each token after the first one.
 - `Median Inter-Token Latency (ITL)`: The typical time delay between the completion of one token and the completion of the next.
 - `Median End-to-End Latency (E2EL)`: The typical total time from when a request is submitted until the final token of the response is received.
 - `Output token throughput`: The rate at which the system generates the output (generated) tokens.
-- `Total Token Throughput`: The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
+- `Total Token Throughput`: The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
 
 ### Balancing between Throughput and Latencies
 
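As a rough consistency check on these metrics, ignoring queueing and scheduling effects, E2EL is approximately TTFT + TPOT × (output length − 1). For example, with a 50 ms TTFT, a 10 ms TPOT, and 1024 output tokens (illustrative values only):

```
# Approximate end-to-end latency in milliseconds for the example above.
echo $(( 50 + 10 * (1024 - 1) ))   # prints 10280, i.e. about 10.3 seconds
```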