@@ -12,7 +12,7 @@ The pre-built image includes:

- ROCm™ 6.4.1
- HipblasLT 0.15
- - vLLM 0.9.1
+ - vLLM 0.10.1
- PyTorch 2.7

## Pull latest Docker Image
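For reference, a minimal sketch of pulling the image and starting an interactive container (the device, IPC, and security flags below are the usual ones for ROCm containers; the volume mount is a placeholder for your model cache):

```bash
# Pull the validated image referenced in this guide.
docker pull rocm/vllm-dev:main

# Start an interactive container with GPU access.
# /path/to/models is a placeholder for a local model/cache directory.
docker run -it --rm \
  --ipc=host \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v /path/to/models:/models \
  rocm/vllm-dev:main
```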
@@ -21,15 +21,12 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main

## What is New

- - No need to specify the --compilation-config parameter, these options were turned on by default
- - Fixed llama3.1 405b CAR issue (no longer need --disable-custom-all-reduce)
- - Fixed +rms_norm custom kernel issue
- - Added quick reduce (set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable. Supported modes are FP, INT8, INT6, INT4)
- - Mitigated the commandr model causing GPU crash through a workaround until the driver issue is fixed
+ - vLLM version 0.10.1
+ - The VLLM_V1_USE_PREFILL_DECODE_ATTENTION flag is now enabled by default in the docker image (see the note below)
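If a comparison against the previous behavior is needed, the default can be overridden at runtime; this is a minimal sketch and assumes the variable acts as an on/off toggle:

```bash
# Assumption: setting the variable to 0 disables the prefill/decode attention path
# that the image now enables by default.
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0
```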

## Known Issues and Workarounds

- - AITER does not support fp8 kv cache
+ - None.

## Performance Results

@@ -42,14 +39,14 @@ The table below shows performance data where a local inference client is fed req

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
- | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 12638.9 |
- | | | | 128 | 4096 | 1500 | 1500 | 10756.8 |
- | | | | 500 | 2000 | 2000 | 2000 | 10691.7 |
- | | | | 2048 | 2048 | 1500 | 1500 | 7354.9 |
- | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3912.8 |
- | | | | 128 | 4096 | 1500 | 1500 | 3084.7 |
- | | | | 500 | 2000 | 2000 | 2000 | 2935.9 |
- | | | | 2048 | 2048 | 500 | 500 | 2191.5 |
+ | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13818.7 |
+ | | | | 128 | 4096 | 1500 | 1500 | 11612.0 |
+ | | | | 500 | 2000 | 2000 | 2000 | 11408.7 |
+ | | | | 2048 | 2048 | 1500 | 1500 | 7800.5 |
+ | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4134.0 |
+ | | | | 128 | 4096 | 1500 | 1500 | 3177.6 |
+ | | | | 500 | 2000 | 2000 | 2000 | 3034.1 |
+ | | | | 2048 | 2048 | 500 | 500 | 2214.2 |

*TP stands for Tensor Parallelism.*

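As an illustration, the first 70B row (128 input / 2048 output, 3200 prompts) corresponds roughly to an offline throughput run like the one below; treat the flag set as an approximation of the harness used for these numbers rather than the exact command:

```bash
# Approximate reproduction of the 70B, 128-in / 2048-out throughput row.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
  --model amd/Llama-3.1-70B-Instruct-FP8-KV \
  --kv-cache-dtype fp8 \
  -tp 8 \
  --input-len 128 \
  --output-len 2048 \
  --num-prompts 3200 \
  --max-num-seqs 3200
```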
@@ -61,38 +58,38 @@ The table below shows latency measurement, which typically involves assessing th

| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
|-------|-----------|---------|------------|-------|--------|----------------------|
- | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.236 |
- | | | | 2 | 128 | 2048 | 18.057 |
- | | | | 4 | 128 | 2048 | 18.450 |
- | | | | 8 | 128 | 2048 | 19.677 |
- | | | | 16 | 128 | 2048 | 22.072 |
- | | | | 32 | 128 | 2048 | 24.932 |
- | | | | 64 | 128 | 2048 | 33.287 |
- | | | | 128 | 128 | 2048 | 46.484 |
- | | | | 1 | 2048 | 2048 | 17.500 |
- | | | | 2 | 2048 | 2048 | 18.055 |
- | | | | 4 | 2048 | 2048 | 18.858 |
- | | | | 8 | 2048 | 2048 | 20.161 |
- | | | | 16 | 2048 | 2048 | 22.347 |
- | | | | 32 | 2048 | 2048 | 25.966 |
- | | | | 64 | 2048 | 2048 | 35.324 |
- | | | | 128 | 2048 | 2048 | 52.394 |
- | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.453 |
- | | | | 2 | 128 | 2048 | 49.268 |
- | | | | 4 | 128 | 2048 | 51.136 |
- | | | | 8 | 128 | 2048 | 54.226 |
- | | | | 16 | 128 | 2048 | 57.274 |
- | | | | 32 | 128 | 2048 | 68.901 |
- | | | | 64 | 128 | 2048 | 88.631 |
- | | | | 128 | 128 | 2048 | 117.027 |
- | | | | 1 | 2048 | 2048 | 48.362 |
- | | | | 2 | 2048 | 2048 | 49.121 |
- | | | | 4 | 2048 | 2048 | 52.347 |
- | | | | 8 | 2048 | 2048 | 54.471 |
- | | | | 16 | 2048 | 2048 | 57.841 |
- | | | | 32 | 2048 | 2048 | 70.538 |
- | | | | 64 | 2048 | 2048 | 91.452 |
- | | | | 128 | 2048 | 2048 | 125.471 |
+ | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.254 |
+ | | | | 2 | 128 | 2048 | 18.157 |
+ | | | | 4 | 128 | 2048 | 18.549 |
+ | | | | 8 | 128 | 2048 | 20.547 |
+ | | | | 16 | 128 | 2048 | 22.164 |
+ | | | | 32 | 128 | 2048 | 25.426 |
+ | | | | 64 | 128 | 2048 | 33.297 |
+ | | | | 128 | 128 | 2048 | 45.792 |
+ | | | | 1 | 2048 | 2048 | 15.299 |
+ | | | | 2 | 2048 | 2048 | 18.194 |
+ | | | | 4 | 2048 | 2048 | 18.942 |
+ | | | | 8 | 2048 | 2048 | 20.526 |
+ | | | | 16 | 2048 | 2048 | 23.211 |
+ | | | | 32 | 2048 | 2048 | 26.516 |
+ | | | | 64 | 2048 | 2048 | 34.824 |
+ | | | | 128 | 2048 | 2048 | 52.211 |
+ | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 47.150 |
+ | | | | 2 | 128 | 2048 | 50.933 |
+ | | | | 4 | 128 | 2048 | 52.521 |
+ | | | | 8 | 128 | 2048 | 55.233 |
+ | | | | 16 | 128 | 2048 | 59.065 |
+ | | | | 32 | 128 | 2048 | 68.786 |
+ | | | | 64 | 128 | 2048 | 88.094 |
+ | | | | 128 | 128 | 2048 | 118.512 |
+ | | | | 1 | 2048 | 2048 | 47.675 |
+ | | | | 2 | 2048 | 2048 | 50.788 |
+ | | | | 4 | 2048 | 2048 | 52.405 |
+ | | | | 8 | 2048 | 2048 | 55.459 |
+ | | | | 16 | 2048 | 2048 | 59.923 |
+ | | | | 32 | 2048 | 2048 | 70.388 |
+ | | | | 64 | 2048 | 2048 | 91.218 |
+ | | | | 128 | 2048 | 2048 | 127.004 |

*TP stands for Tensor Parallelism.*

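Similarly, a single latency row can be approximated with the benchmark_latency.py script described later in this guide; batch size, input, and output map directly onto the table columns:

```bash
# Approximate reproduction of the 405B, batch 1, 128-in / 2048-out latency row.
python /app/vllm/benchmarks/benchmark_latency.py \
  --model amd/Llama-3.1-405B-Instruct-FP8-KV \
  -tp 8 \
  --batch-size 1 \
  --input-len 128 \
  --output-len 2048
```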
@@ -201,12 +198,6 @@ Note: the `--multi_gpu` parameter can be omitted for small models that fit on a

Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information.

- ```bash
- export VLLM_USE_TRITON_FLASH_ATTN=0
- export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
-
- ```
-
### vLLM engine performance settings

vLLM provides a number of engine options which can be changed to improve performance. Refer to the [vLLM Engine Args](https://docs.vllm.ai/en/stable/usage/engine_args.html) documentation for the complete list of vLLM engine options.
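For example, a serving command that exercises a few commonly tuned engine options might look like the following; the values shown are illustrative placeholders, not tuned recommendations:

```bash
# Illustrative engine options; tune per model, sequence lengths, and workload.
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.9
```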
@@ -225,8 +216,6 @@ vLLM's benchmark_latency.py script measures end-to-end latency for a specified m
You can run latency tests for FP8 models with:

```bash
- export VLLM_USE_TRITON_FLASH_ATTN=0
- export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
BS=1
IN=128
@@ -265,8 +254,6 @@ vLLM's benchmark_throughput.py script measures offline throughput. It can eithe
You can run throughput tests for FP8 models with:

```bash
- export VLLM_USE_TRITON_FLASH_ATTN=0
- export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
IN=128
OUT=2048
@@ -313,7 +300,6 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py -h
Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example,

```bash
- export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
--swap-space 16 \
--disable-log-requests \
@@ -432,14 +418,12 @@ Speculative decoding is one of the key features in vLLM. It has been supported o
Without Speculative Decoding -

```bash
- export VLLM_USE_TRITON_FLASH_ATTN=0
python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128
```

With Speculative Decoding -

```bash
- export VLLM_USE_TRITON_FLASH_ATTN=0
python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5
```

@@ -456,7 +440,6 @@ Some use cases include:
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0
- export VLLM_ROCM_USE_AITER_RMSNORM=0
python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 128 --output-len 2048
```

@@ -493,7 +476,7 @@ To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
- git checkout b432b7a285aa0dcb9677380936ffa74931bb6d6f
+ git checkout 6663000a391911eba96d7864a26ac42b07f6ef29
docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```

@@ -510,6 +493,10 @@ Use AITER release candidate branch instead:

## Changelog

+ rocm6.4.1_vllm_0.10.1_20250909:
+ - vLLM version 0.10.1
+ - The VLLM_V1_USE_PREFILL_DECODE_ATTENTION flag is now enabled by default in the docker image
+
20250715_aiter:
- No need to specify the --compilation-config parameter, these options were turned on by default
- Fixed llama3.1 405b CAR issue (no longer need --disable-custom-all-reduce)