
Commit 16f8680

Docs lint
1 parent 043c93d commit 16f8680

1 file changed: 85 additions, 82 deletions

docs/dev-docker/README.md

@@ -1,6 +1,6 @@
 # vllm FP8 Latency and Throughput benchmarks on AMD MI300x
 
-Documentation for vLLM Inferencing on AMD Instinct platforms.
+Documentation for vLLM Inferencing on AMD Instinct platforms.
 
 ## Overview
 
@@ -10,11 +10,9 @@ This documentation shows some reference performance numbers and the steps to rep
 
 It includes:
 
-- ROCm™ 6.3
-
-- vLLM 0.6.3
-
-- PyTorch 2.6dev (nightly)
+- ROCm™ 6.3
+- vLLM 0.6.3
+- PyTorch 2.6dev (nightly)
 
 ## System configuration
 
@@ -39,40 +37,40 @@ The performance data below was measured on a server with MI300X accelerators wit
 | Power cap | 750 W |
 | SCLK/MCLK | 2100 Mhz / 1300 Mhz |
 
-## Pull latest
+## Pull latest
 
 You can pull the image with `docker pull rocm/vllm-dev:main`
 
 ### What is New
 
-- ROCm 6.3 support
-- Potential bug with Tunable Ops not saving due to a PyTorch issue
-
-
+- ROCm 6.3 support
+- Potential bug with Tunable Ops not saving due to a PyTorch issue
+
 GEMMs are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md).
 The tuned GEMMs are enabled automatically in the docker image, and all stored GEMM configs are kept in /app/_gemm_csv in the same image.
 
 ### Reproducing benchmark results
 
 ### Use pre-quantized models
 
-To make it easier to run fp8 Llama 3.1 models on MI300X, the quantized checkpoints are available on AMD Huggingface space as follows
+To make it easier to run fp8 Llama 3.1 models on MI300X, the quantized checkpoints are available in AMD's Hugging Face space:
 
-- https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV
-- https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV
-- https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
-- https://huggingface.co/amd/grok-1-FP8-KV
+- <https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV>
+- <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>
+- <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>
+- <https://huggingface.co/amd/grok-1-FP8-KV>
 
-Currently these models are private. Please join https://huggingface.co/amd to access.
+Currently these models are private. Please join <https://huggingface.co/amd> to access them.
 
 Download the model you want to run.
 
-These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For more information about Quark, please refer to https://quark.docs.amd.com/latest/quark_example_torch_llm_gen.html
+These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For more information about Quark, please refer to <https://quark.docs.amd.com/latest/quark_example_torch_llm_gen.html>
 
 ### Quantize your own models
-This step is optional for you to use quantized models on your own. Take Llama 3.1 405B as an example.
 
-Download the Model View the Llama-3.1-405B model at https://huggingface.co/meta-llama/Llama-3.1-405B. Ensure that you have been granted access, and apply for it if you do not have access.
+This step is optional; use it only if you want to quantize models yourself. Take Llama 3.1 405B as an example.
+
+Download the model: view the Llama-3.1-405B model at <https://huggingface.co/meta-llama/Llama-3.1-405B>. Ensure that you have been granted access, and apply for it if you do not have access.
 
 If you do not already have a HuggingFace token, open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token.
 
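For illustration, a minimal sketch of fetching one of the gated checkpoints with the Hugging Face CLI once the Read token exists; the HF_TOKEN variable and the /data/llm target directory are assumptions, not part of the README:

```
# Sketch only: assumes HF_TOKEN holds the Read token created above and that
# huggingface_hub (which provides huggingface-cli) is installed.
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download amd/Llama-3.1-405B-Instruct-FP8-KV \
    --local-dir /data/llm/Llama-3.1-405B-Instruct-FP8-KV
```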
@@ -100,27 +98,29 @@ Similarly, you can download Llama-3.1-70B and Llama-3.1-8B.
 
 Run the quantization script in the example folder using the following command line:
 export MODEL_DIR=[local model checkpoint folder] or meta-llama/Llama-3.1-405B-Instruct
+
 #### single GPU
-python3 quantize_quark.py \
-    --model_dir $MODEL_DIR \
-    --output_dir Llama-3.1-405B-Instruct-FP8-KV \
-    --quant_scheme w_fp8_a_fp8 \
-    --kv_cache_dtype fp8 \
-    --num_calib_data 128 \
-    --model_export quark_safetensors \
-    --no_weight_matrix_merge
-
-#### If model size is too large for single GPU, please use multi GPU instead.
-python3 quantize_quark.py \
-    --model_dir $MODEL_DIR \
-    --output_dir Llama-3.1-405B-Instruct-FP8-KV \
-    --quant_scheme w_fp8_a_fp8 \
-    --kv_cache_dtype fp8 \
-    --num_calib_data 128 \
-    --model_export quark_safetensors \
-    --no_weight_matrix_merge \
-    --multi_gpu
 
+python3 quantize_quark.py \
+    --model_dir $MODEL_DIR \
+    --output_dir Llama-3.1-405B-Instruct-FP8-KV \
+    --quant_scheme w_fp8_a_fp8 \
+    --kv_cache_dtype fp8 \
+    --num_calib_data 128 \
+    --model_export quark_safetensors \
+    --no_weight_matrix_merge
+
+#### If model size is too large for single GPU, please use multi GPU instead
+
+python3 quantize_quark.py \
+    --model_dir $MODEL_DIR \
+    --output_dir Llama-3.1-405B-Instruct-FP8-KV \
+    --quant_scheme w_fp8_a_fp8 \
+    --kv_cache_dtype fp8 \
+    --num_calib_data 128 \
+    --model_export quark_safetensors \
+    --no_weight_matrix_merge \
+    --multi_gpu
 
 ### Launch AMD vLLM Docker
 
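For illustration, a minimal sketch of launching the image; the device, IPC, and security flags are the usual ROCm container options and the /data/llm mount is a placeholder, not flags taken from this README:

```
# Sketch only: typical ROCm container launch for rocm/vllm-dev:main;
# adjust the volume mount to wherever the downloaded models live.
docker pull rocm/vllm-dev:main
docker run -it --rm \
    --ipc=host --network=host \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v /data/llm:/data/llm \
    rocm/vllm-dev:main
```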
@@ -135,7 +135,7 @@ Download and launch the docker,
 
 ### Benchmark with AMD vLLM Docker
 
-There are some system settings to be configured for optimum performance on MI300X.
+There are some system settings to be configured for optimum performance on MI300X.
 
 #### NUMA balancing setting
 
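For reference, a minimal sketch of checking and disabling automatic NUMA balancing on the host; this uses the standard Linux sysctl path and is an illustration rather than the README's exact commands:

```
# Sketch only: 1 means automatic NUMA balancing is enabled; set it to 0
# on the host (requires root) before benchmarking on MI300X.
cat /proc/sys/kernel/numa_balancing
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```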
@@ -160,15 +160,16 @@ Some environment variables enhance the performance of the vLLM kernels and PyTor
 export NCCL_MIN_NCHANNELS=112
 export VLLM_FP8_PADDING=1
 
-You can set both PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_TUNING to 1 to performance GEMM tuning for the 1st benchmark run.
-It will take some time to complete the tuning during the benchmark. After tuning, it will generate several csv files as the performance lookup database. For the subsequent benchmark runs, you can keep
+You can set both PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_TUNING to 1 to perform GEMM tuning for the first benchmark run.
+The tuning takes some time to complete during the benchmark. After tuning, it generates several csv files that serve as the performance lookup database. For the subsequent benchmark runs, you can keep
 
-PYTORCH_TUNABLEOP_ENABLED as 1 and set
-PYTORCH_TUNABLEOP_TUNING to 0 to use the selected kernels.
+PYTORCH_TUNABLEOP_ENABLED as 1 and set
+PYTORCH_TUNABLEOP_TUNING to 0 to use the selected kernels.
 
 ##### vLLM engine performance settings
-vLLM provides a number of engine options which can be changed to improve performance.
-Refer https://docs.vllm.ai/en/stable/models/engine_args.html for the complete list of vLLM engine options.
+
+vLLM provides a number of engine options which can be changed to improve performance.
+Refer to <https://docs.vllm.ai/en/stable/models/engine_args.html> for the complete list of vLLM engine options.
 Below is a list of options which are useful:
 - **--max-model-len** : Maximum context length supported by the model instance. Can be set to a lower value than model configuration value to improve performance and gpu memory utilization.
 - **--max-num-batched-tokens** : The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher gpu memory utilization. 65536 works well for Llama models.
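To make the Tunable Ops workflow above concrete, a minimal sketch of the two phases; the variable names come from the text and everything else is illustrative:

```
# Sketch only: first run tunes GEMMs and writes the csv lookup files.
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1
# ... run the first benchmark ...

# Subsequent runs reuse the selected kernels without re-tuning.
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=0
# ... run later benchmarks ...
```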
@@ -179,6 +180,7 @@ Below is a list of options which are useful:
 Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments.
 
 ##### Online Gemm Tuning
+
 Online Gemm tuning for small decode batch sizes can improve performance in some cases, e.g. Llama 70B up to batch size 8.
 
 If you want to do limited online tuning, use --enforce-eager and tune for particular batch sizes. See the example below.
@@ -239,8 +241,8 @@ If you want to run Meta-Llama-3.1-405B FP16, please run
     --input-len 128 \
     --output-len 128
 
-You can change various input-len, output-len, batch size and run the benchmark as well. When output-len is 1, it measures prefill latency (TTFT).
-Decoding latency (TPOT) can be calculated based on the measured latency.
+You can also vary input-len, output-len, and batch size and rerun the benchmark. When output-len is 1, the run measures prefill latency (TTFT).
+Decoding latency (TPOT) can then be calculated from the measured latencies.
 
 For more information about the parameters, please run
 
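As a rough illustration of how TPOT can be derived from two latency runs (the numbers below are placeholders, not measurements):

```
# Sketch only: TTFT comes from a run with --output-len 1; E2E from a run
# with --output-len 128. TPOT is then the per-token decode latency.
TTFT=1.90      # seconds, placeholder
E2E=4.52       # seconds, placeholder
OUT_LEN=128
python3 -c "print(($E2E - $TTFT) / ($OUT_LEN - 1))"   # approximate TPOT in s/token
```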
@@ -261,7 +263,7 @@ Benchmark Meta-Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and t
     --num-scheduler-steps 10 \
     --tensor-parallel-size 8 \
     --input-len 128 \
-    --output-len 128
+    --output-len 128
 
 If you want to run Meta-Llama-3.1-405B FP16, please run
 
@@ -294,23 +296,23 @@ For more information about the parameters, please run
 
 /app/vllm/benchmarks/benchmark_throughput.py -h
 
-Tensor parallelism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance.
+The tensor parallelism (TP) setting depends on the model size. For the Llama 3.1 70B and 8B models, TP 1 can also be used on MI300X. In general, TP 8 and TP 1 are recommended to achieve optimum performance.
 
 ##### Online Server Benchmark
-
+
 Make the following changes if required:
-
+
 /app/vllm/benchmarks/backend_request_func.py
-
+
 line 242 + "ignore_eos": True,
-
+
 /app/vllm/benchmarks/benchmark_serving.py
 line 245 - interval = np.random.exponential(1.0 / request_rate)
 line 245 + ## interval = np.random.exponential(1.0 / request_rate)
 line 246 + interval = 1.0 / request_rate
-
+
 Benchmark Meta-Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example,
-
+
 vllm serve /data/llm/Meta-Llama-3.1-70B-Instruct-FP8-KV \
     --swap-space 16 \
     --disable-log-requests \
@@ -322,11 +324,11 @@ Benchmark Meta-Llama-3.1-70B with input 4096 tokens, output 512 tokens and tenso
     --max-num-batched-tokens 65536 \
     --gpu-memory-utilization 0.99 \
     --num-scheduler-steps 10
-
+
 Change port (for example --port 8005) if port=8000 is currently being used by other processes.
-
+
 Run the client in a separate terminal. Use the port from the previous step; otherwise the default is port 8000.
-
+
 python /app/vllm/benchmarks/benchmark_serving.py \
     --port 8000 \
     --model /data/llm/Meta-Llama-3.1-70B-Instruct-FP8-KV \
@@ -336,18 +338,18 @@ run client in a separate terminal. Use port_id from previous step else port-id=8
     --request-rate 1 \
     --num-prompts 500 \
     --percentile-metrics ttft,tpot,itl,e2el
-
+
 Once all prompts are processed, terminate the server gracefully (ctrl+c).
-
+
 ##### CPX mode
-
+
 Currently only CPX-NPS1 mode is supported, so only tp=1 is supported in CPX mode.
 However, multiple instances can be started simultaneously (if needed) in CPX-NPS1 mode.
-
+
 Set the GPUs to CPX mode
-
+
 rocm-smi --setcomputepartition cpx
-
+
 Example of running Llama 3.1 8B on 1 CPX-NPS1 GPU with input 4096 and output 512. As mentioned above, tp=1.
 
 HIP_VISIBLE_DEVICES=0 \
@@ -363,42 +365,43 @@ Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512.
     --output-json <path/to/output.json> \
     --quantization fp8 \
     --gpu-memory-utilization 0.99
-
+
 Set the GPUs back to SPX mode.
 
 rocm-smi --setcomputepartition spx
 
 ### Speculative Decoding
 
-Speculative decoding is one of the key features in vLLM. It has been supported on MI300. Here below is an example of the performance benchmark w/wo speculative decoding for Llama 3.1 405B with Llama 3.1 8B as the draft model.
+Speculative decoding is one of the key features in vLLM and is supported on MI300. Below is an example of the performance benchmark with and without speculative decoding for Llama 3.1 405B, using Llama 3.1 8B as the draft model.
 
-Without Speculative Decoding -
+Without Speculative Decoding -
 
 python benchmark_latency.py --model /models/models--amd--Meta-Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128
 
-With Speculative Decoding -
+With Speculative Decoding -
 
 python benchmark_latency.py --model /models/models--amd--Meta-Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/ --num-speculative-tokens 5
 
-You should see some performance improvement about the e2e latency.
+You should see some improvement in end-to-end (e2e) latency.
 
 ### MMLU_PRO_Biology Accuracy Eval
-
+
 ### fp16
+
 vllm (pretrained=models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
-
+
 | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
 |biology| 0|custom-extract| 5|exact_match||0.8466|± |0.0135|
-
+
 ### fp8
+
 vllm (pretrained=models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,quantization=fp8,quantized_weights_path=/llama.safetensors,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 32
-
+
 | Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr|
 |-------|------:|--------------|-----:|-----------|---|----:|---|-----:|
 |biology| 0|custom-extract| 5|exact_match||0.848|± |0.0134|
 
-
 ## Performance
 
 ### LLaMA2/3 *MLPerf* 70B
@@ -408,18 +411,18 @@ Please refer to the MLPerf instructions for recreating the MLPerf numbers.
 ## Version
 
 ### Release Notes
+
 20240906a: Legacy quantization formats required `--quantization fp8_rocm` as a flag instead of `--quantization fp8`
 
 Updated:
 
-vLLM: https://github.com/ROCm/vllm/commit/2c60adc83981ada77a77b2adda78ef109d2e2e2b
+vLLM: <https://github.com/ROCm/vllm/commit/2c60adc83981ada77a77b2adda78ef109d2e2e2b>
+
 ### Docker Manifest
 
 To reproduce the release docker:
 
-```
-git clone https://github.com/ROCm/vllm.git
-cd vllm
-git checkout 2c60adc83981ada77a77b2adda78ef109d2e2e2b
-docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
-```
+git clone https://github.com/ROCm/vllm.git
+cd vllm
+git checkout 2c60adc83981ada77a77b2adda78ef109d2e2e2b
+docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
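As a quick, illustrative smoke test of the freshly built image (the device flags are the usual ROCm mappings and `<your_tag>` matches the build command above):

```
# Sketch only: verify the image starts and report the bundled versions.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --security-opt seccomp=unconfined \
    <your_tag> \
    python3 -c "import torch, vllm; print(torch.__version__, vllm.__version__)"
```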
