Currently these models are private. Please join <https://huggingface.co/amd> to access.
Download the model you want to run.
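For example, a minimal sketch using huggingface-cli, assuming the checkpoint is published under the amd organization (the exact repository name below is an assumption; substitute the model you were granted access to):

```bash
# Illustrative only: fetch an AMD FP8-quantized Llama 3.1 checkpoint into the local HF cache
huggingface-cli download amd/Llama-3.1-405B-Instruct-FP8-KV
```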
These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For more information about Quark, please refer to <https://quark.docs.amd.com/latest/quark_example_torch_llm_gen.html>
### Quantize your own models

This step is optional; follow it only if you want to quantize models on your own. Take Llama 3.1 405B as an example.
**Download the model.** View the Llama-3.1-405B model at <https://huggingface.co/meta-llama/Llama-3.1-405B>. Ensure that you have been granted access, and apply for it if you do not have access.
If you do not already have a HuggingFace token, open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token.
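One hedged way to use that token from the command line (huggingface-cli is assumed to be installed; any download method you prefer works equally well):

```bash
# Authenticate with the Read token created above (you will be prompted to paste it)
huggingface-cli login

# Download the gated model once access has been granted
huggingface-cli download meta-llama/Llama-3.1-405B
```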
Similarly, you can download Llama-3.1-70B and Llama-3.1-8B.
Run the quantization script in the example folder using the following command line:
export MODEL_DIR=[local model checkpoint folder]  # or meta-llama/Llama-3.1-405B-Instruct
#### Single GPU

python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir Llama-3.1-405B-Instruct-FP8-KV \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
        --num_calib_data 128 \
        --model_export quark_safetensors \
        --no_weight_matrix_merge

#### If the model is too large for a single GPU, use multi GPU instead

python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir Llama-3.1-405B-Instruct-FP8-KV \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
        --num_calib_data 128 \
        --model_export quark_safetensors \
        --no_weight_matrix_merge \
        --multi_gpu
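The exported folder (Llama-3.1-405B-Instruct-FP8-KV above) can then be used as the model path for serving and benchmarking. A minimal sketch, with illustrative flag values rather than this guide's exact command:

```bash
# Illustrative only: load the locally quantized FP8 checkpoint in vLLM
vllm serve ./Llama-3.1-405B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --quantization fp8 \
    --kv-cache-dtype fp8
```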
### Launch AMD vLLM Docker

Download and launch the docker,
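For orientation, a ROCm-capable container launch typically looks roughly like the following; the image name is a placeholder assumption, and the flags shown are the usual ROCm device/IPC options rather than this guide's exact command:

```bash
# Rough sketch of a ROCm container launch (image name is a placeholder)
docker run -it --network=host --ipc=host \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME:/workspace \
    <amd-vllm-docker-image>
```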
### Benchmark with AMD vLLM Docker

There are some system settings to be configured for optimum performance on MI300X.
#### NUMA balancing setting
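As a hedged sketch based on common MI300X tuning guidance (confirm the recommended value for your system), automatic NUMA balancing is usually disabled for benchmarking:

```bash
# Check the current setting (1 = enabled, 0 = disabled)
cat /proc/sys/kernel/numa_balancing

# Disable automatic NUMA balancing for the benchmark run (assumed recommendation)
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```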
Some environment variables enhance the performance of the vLLM kernels and PyTorch:
export NCCL_MIN_NCHANNELS=112
export VLLM_FP8_PADDING=1

You can set both PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_TUNING to 1 to perform GEMM tuning for the first benchmark run. The tuning takes some extra time during that run and generates several CSV files that act as the performance lookup database. For subsequent benchmark runs, keep PYTORCH_TUNABLEOP_ENABLED set to 1 and set PYTORCH_TUNABLEOP_TUNING to 0 to use the selected kernels.
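For example, a minimal sketch of the two configurations described above:

```bash
# First benchmark run: tune GEMMs and write the CSV lookup files
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1

# Subsequent runs: reuse the previously tuned kernels from the CSV files
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=0
```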
##### vLLM engine performance settings

vLLM provides a number of engine options which can be changed to improve performance. Refer to <https://docs.vllm.ai/en/stable/models/engine_args.html> for the complete list of vLLM engine options.
Below is a list of options which are useful:
- **--max-model-len**: Maximum context length supported by the model instance. Can be set to a lower value than the model's configured maximum to improve performance and GPU memory utilization.
- **--max-num-batched-tokens**: The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher GPU memory utilization; 65536 works well for Llama models.
Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments.
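As an illustration only (a hedged sketch, not this guide's own serving command; the model path and values are placeholders):

```bash
# Hypothetical: trade context length for memory and enlarge the prefill batch
vllm serve <model or local checkpoint path> \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 65536
```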
##### Online GEMM Tuning
Online GEMM tuning for small decode batch sizes can improve performance in some cases, e.g. Llama 70B up to batch size 8.
If you want to do limited online tuning, use --enforce-eager and tune for particular batch sizes. See the example below.
If you want to run Meta-Llama-3.1-405B FP16, please run
        --input-len 128 \
        --output-len 128

You can vary input-len, output-len, and batch size and run the benchmark as well. When output-len is 1, the benchmark measures prefill latency (TTFT). Decoding latency (TPOT) can be calculated from the measured latencies, approximately TPOT = (end-to-end latency - TTFT) / (output-len - 1).
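A worked sketch of that calculation, with purely illustrative numbers:

```bash
# Run the latency benchmark twice: once with --output-len 1 (average latency ~ TTFT)
# and once with the full --output-len (average latency ~ end-to-end latency). Then:
#   TPOT ~ (end-to-end latency - TTFT) / (output-len - 1)
# e.g. with made-up numbers in seconds:
python3 -c "e2e, ttft, out_len = 12.7, 1.5, 128; print((e2e - ttft) / (out_len - 1))"
```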
For more information about the parameters, please run
Benchmark Meta-Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and tensor parallelism 8 as an example,
        --num-scheduler-steps 10 \
        --tensor-parallel-size 8 \
        --input-len 128 \
        --output-len 128
If you want to run Meta-Llama-3.1-405B FP16, please run
For more information about the parameters, please run
/app/vllm/benchmarks/benchmark_throughput.py -h

The tensor parallelism (TP) parameter depends on the model size. For the Llama 3.1 70B and 8B models, TP 1 can also be used on MI300X. In general, TP 8 or TP 1 is recommended to achieve optimum performance.
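For instance, a hedged sketch of a TP 8 throughput run (the model name and token counts are assumptions, not this guide's exact command; the -h output above lists the supported flags):

```bash
# Illustrative: Llama 3.1 70B FP8 throughput with tensor parallelism 8
python3 /app/vllm/benchmarks/benchmark_throughput.py \
        --model amd/Llama-3.1-70B-Instruct-FP8-KV \
        --quantization fp8 \
        --tensor-parallel-size 8 \
        --input-len 128 \
        --output-len 128

# The 70B and 8B models can also be run with --tensor-parallel-size 1 on a single MI300X
```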
##### Online Server Benchmark
Make the following changes if required

/app/vllm/benchmarks/backend_request_func.py

line 242 + "ignore_eos": True,

/app/vllm/benchmarks/benchmark_serving.py

line 245 - interval = np.random.exponential(1.0 / request_rate)
line 245 + ## interval = np.random.exponential(1.0 / request_rate)
line 246 + interval = 1.0 / request_rate
Benchmark Meta-Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example,
Run the client in a separate terminal. Use the port_id from the previous step, else port-id=8000 (the vLLM default).
        --request-rate 1 \
        --num-prompts 500 \
        --percentile-metrics ttft,tpot,itl,e2el
Once all prompts are processed, terminate the server gracefully (ctrl+c).
##### CPX mode
Currently only CPX-NPS1 mode is supported, so ONLY tp=1 is supported in CPX mode. But multiple instances can be started simultaneously (if needed) in CPX-NPS1 mode; see the sketch after the single-instance example below.
Set GPUs in CPX mode
rocm-smi --setcomputepartition cpx
Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. As mentioned above, tp=1.
HIP_VISIBLE_DEVICES=0 \
        --output-json <path/to/output.json> \
        --quantization fp8 \
        --gpu-memory-utilization 0.99
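Since each CPX-NPS1 partition shows up as its own GPU, several such single-GPU runs can be launched in parallel. A minimal sketch, abbreviating the benchmark command shown above:

```bash
# Hedged sketch: one benchmark instance per CPX partition, pinned via HIP_VISIBLE_DEVICES
HIP_VISIBLE_DEVICES=0 <benchmark command as above> &
HIP_VISIBLE_DEVICES=1 <benchmark command as above> &
wait  # block until both instances finish
```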
Set GPU to SPX mode.
rocm-smi --setcomputepartition spx
### Speculative Decoding
Speculative decoding is one of the key features in vLLM and is supported on MI300. Below is an example of the performance benchmark with and without speculative decoding for Llama 3.1 405B, with Llama 3.1 8B as the draft model.
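As a rough sketch only, not this guide's benchmark command: depending on the vLLM version, speculative decoding is enabled through engine arguments such as --speculative-model and --num-speculative-tokens, and the model paths below are assumptions:

```bash
# Hypothetical latency benchmark with speculative decoding:
# the 8B draft model proposes tokens that the 405B target model verifies.
python3 /app/vllm/benchmarks/benchmark_latency.py \
        --model amd/Llama-3.1-405B-Instruct-FP8-KV \
        --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV \
        --num-speculative-tokens 5 \
        --tensor-parallel-size 8 \
        --input-len 128 \
        --output-len 128
```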
0 commit comments