Commit cf0486d

Updated README.md with RC2 results and removed envvars (#660)
* Updated README.md with RC2 results and removed envvars
* Changed "What is new" section
1 parent 7ecc5da commit cf0486d

1 file changed: +49 -62 lines changed

docs/dev-docker/README.md

Lines changed: 49 additions & 62 deletions
@@ -12,7 +12,7 @@ The pre-built image includes:

- ROCm™ 6.4.1
- HipblasLT 0.15
-- vLLM 0.9.1
+- vLLM 0.10.1
- PyTorch 2.7

## Pull latest Docker Image
@@ -21,15 +21,12 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main

## What is New

-- No need to specify the --compilation-config parameter, these options were turned on by default
-- Fixed llama3.1 405b CAR issue (no longer need --disable-custom-all-reduce)
-- Fixed +rms_norm custom kernel issue
-- Added quick reduce (set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable. Supported modes are FP, INT8, INT6, INT4)
-- Mitigated the commandr model causing GPU crash through a workaround until the driver issue is fixed
+- vLLM version 0.10.1
+- Flag enabled by default in the docker image: VLLM_V1_USE_PREFILL_DECODE_ATTENTION

## Known Issues and Workarounds

-- AITER does not support fp8 kv cache
+- None.

## Performance Results

@@ -42,14 +39,14 @@ The table below shows performance data where a local inference client is fed req

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 12638.9 |
-| | | | 128 | 4096 | 1500 | 1500 | 10756.8 |
-| | | | 500 | 2000 | 2000 | 2000 | 10691.7 |
-| | | | 2048 | 2048 | 1500 | 1500 | 7354.9 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3912.8 |
-| | | | 128 | 4096 | 1500 | 1500 | 3084.7 |
-| | | | 500 | 2000 | 2000 | 2000 | 2935.9 |
-| | | | 2048 | 2048 | 500 | 500 | 2191.5 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13818.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 11612.0 |
+| | | | 500 | 2000 | 2000 | 2000 | 11408.7 |
+| | | | 2048 | 2048 | 1500 | 1500 | 7800.5 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4134.0 |
+| | | | 128 | 4096 | 1500 | 1500 | 3177.6 |
+| | | | 500 | 2000 | 2000 | 2000 | 3034.1 |
+| | | | 2048 | 2048 | 500 | 500 | 2214.2 |

*TP stands for Tensor Parallelism.*

@@ -61,38 +58,38 @@ The table below shows latency measurement, which typically involves assessing th

| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
|-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.236 |
-| | | | 2 | 128 | 2048 | 18.057 |
-| | | | 4 | 128 | 2048 | 18.450 |
-| | | | 8 | 128 | 2048 | 19.677 |
-| | | | 16 | 128 | 2048 | 22.072 |
-| | | | 32 | 128 | 2048 | 24.932 |
-| | | | 64 | 128 | 2048 | 33.287 |
-| | | | 128 | 128 | 2048 | 46.484 |
-| | | | 1 | 2048 | 2048 | 17.500 |
-| | | | 2 | 2048 | 2048 | 18.055 |
-| | | | 4 | 2048 | 2048 | 18.858 |
-| | | | 8 | 2048 | 2048 | 20.161 |
-| | | | 16 | 2048 | 2048 | 22.347 |
-| | | | 32 | 2048 | 2048 | 25.966 |
-| | | | 64 | 2048 | 2048 | 35.324 |
-| | | | 128 | 2048 | 2048 | 52.394 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.453 |
-| | | | 2 | 128 | 2048 | 49.268 |
-| | | | 4 | 128 | 2048 | 51.136 |
-| | | | 8 | 128 | 2048 | 54.226 |
-| | | | 16 | 128 | 2048 | 57.274 |
-| | | | 32 | 128 | 2048 | 68.901 |
-| | | | 64 | 128 | 2048 | 88.631 |
-| | | | 128 | 128 | 2048 | 117.027 |
-| | | | 1 | 2048 | 2048 | 48.362 |
-| | | | 2 | 2048 | 2048 | 49.121 |
-| | | | 4 | 2048 | 2048 | 52.347 |
-| | | | 8 | 2048 | 2048 | 54.471 |
-| | | | 16 | 2048 | 2048 | 57.841 |
-| | | | 32 | 2048 | 2048 | 70.538 |
-| | | | 64 | 2048 | 2048 | 91.452 |
-| | | | 128 | 2048 | 2048 | 125.471 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.254 |
+| | | | 2 | 128 | 2048 | 18.157 |
+| | | | 4 | 128 | 2048 | 18.549 |
+| | | | 8 | 128 | 2048 | 20.547 |
+| | | | 16 | 128 | 2048 | 22.164 |
+| | | | 32 | 128 | 2048 | 25.426 |
+| | | | 64 | 128 | 2048 | 33.297 |
+| | | | 128 | 128 | 2048 | 45.792 |
+| | | | 1 | 2048 | 2048 | 15.299 |
+| | | | 2 | 2048 | 2048 | 18.194 |
+| | | | 4 | 2048 | 2048 | 18.942 |
+| | | | 8 | 2048 | 2048 | 20.526 |
+| | | | 16 | 2048 | 2048 | 23.211 |
+| | | | 32 | 2048 | 2048 | 26.516 |
+| | | | 64 | 2048 | 2048 | 34.824 |
+| | | | 128 | 2048 | 2048 | 52.211 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 47.150 |
+| | | | 2 | 128 | 2048 | 50.933 |
+| | | | 4 | 128 | 2048 | 52.521 |
+| | | | 8 | 128 | 2048 | 55.233 |
+| | | | 16 | 128 | 2048 | 59.065 |
+| | | | 32 | 128 | 2048 | 68.786 |
+| | | | 64 | 128 | 2048 | 88.094 |
+| | | | 128 | 128 | 2048 | 118.512 |
+| | | | 1 | 2048 | 2048 | 47.675 |
+| | | | 2 | 2048 | 2048 | 50.788 |
+| | | | 4 | 2048 | 2048 | 52.405 |
+| | | | 8 | 2048 | 2048 | 55.459 |
+| | | | 16 | 2048 | 2048 | 59.923 |
+| | | | 32 | 2048 | 2048 | 70.388 |
+| | | | 64 | 2048 | 2048 | 91.218 |
+| | | | 128 | 2048 | 2048 | 127.004 |

*TP stands for Tensor Parallelism.*

@@ -201,12 +198,6 @@ Note: the `--multi_gpu` parameter can be omitted for small models that fit on a

Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information.

-```bash
-export VLLM_USE_TRITON_FLASH_ATTN=0
-export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
-
-```
-
### vLLM engine performance settings

vLLM provides a number of engine options which can be changed to improve performance. Refer to the [vLLM Engine Args](https://docs.vllm.ai/en/stable/usage/engine_args.html) documentation for the complete list of vLLM engine options.
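
For context on the two exports removed in this hunk: per the commit message they are now defaults baked into the image, so the README no longer asks users to set them. If a particular run still needs different values, a minimal sketch (an assumed invocation, not part of this commit or the README) is to override them when starting the container:

```bash
# Sketch only: override the attention-related defaults at container start.
# The docker run device/ipc flags below are the usual ROCm options and are
# assumptions here, not taken from this diff.
docker run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
  rocm/vllm-dev:main
```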
@@ -225,8 +216,6 @@ vLLM's benchmark_latency.py script measures end-to-end latency for a specified m
You can run latency tests for FP8 models with:

```bash
-export VLLM_USE_TRITON_FLASH_ATTN=0
-export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
BS=1
IN=128
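# Sketch only, not part of this diff: the snippet above is cut off by the diff
# context after IN=128. An assumed completion, for illustration, defines the
# output length and runs the latency benchmark with the variables set above:
OUT=2048
python3 /app/vllm/benchmarks/benchmark_latency.py \
  --model $MODEL \
  -tp 8 \
  --batch-size $BS \
  --input-len $IN \
  --output-len $OUT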
@@ -265,8 +254,6 @@ vLLM's benchmark_throughput.py script measures offline throughput. It can eithe
You can run throughput tests for FP8 models with:

```bash
-export VLLM_USE_TRITON_FLASH_ATTN=0
-export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
IN=128
OUT=2048
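# Sketch only, not part of this diff: the snippet above is cut off by the diff
# context after OUT=2048. An assumed completion, for illustration, runs the
# offline throughput benchmark with the variables set above; the prompt counts
# mirror the 128-in/2048-out row of the throughput table and are assumptions.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
  --model $MODEL \
  -tp 8 \
  --input-len $IN \
  --output-len $OUT \
  --num-prompts 1500 \
  --max-num-seqs 1500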
@@ -313,7 +300,6 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py -h
Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example,

```bash
-export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
--swap-space 16 \
--disable-log-requests \
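# Sketch only, not part of this diff: the serve snippet above is truncated by
# the diff context after --disable-log-requests. An assumed complete invocation,
# for illustration (tensor parallelism 8 and a max length covering the
# 4096-in / 512-out example above):
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
  --swap-space 16 \
  --disable-log-requests \
  --tensor-parallel-size 8 \
  --max-model-len 8192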
@@ -432,14 +418,12 @@ Speculative decoding is one of the key features in vLLM. It has been supported o
Without Speculative Decoding -

```bash
-export VLLM_USE_TRITON_FLASH_ATTN=0
python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128
```

With Speculative Decoding -

```bash
-export VLLM_USE_TRITON_FLASH_ATTN=0
python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5
```

@@ -456,7 +440,6 @@ Some use cases include:
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0
-export VLLM_ROCM_USE_AITER_RMSNORM=0
python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 128 --output-len 2048
```

@@ -493,7 +476,7 @@ To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
-git checkout b432b7a285aa0dcb9677380936ffa74931bb6d6f
+git checkout 6663000a391911eba96d7864a26ac42b07f6ef29
docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```

@@ -510,6 +493,10 @@ Use AITER release candidate branch instead:

## Changelog

+rocm6.4.1_vllm_0.10.1_20250909:
+- vLLM version 0.10.1
+- Flag enabled by default in the docker image: VLLM_V1_USE_PREFILL_DECODE_ATTENTION
+
20250715_aiter:
- No need to specify the --compilation-config parameter, these options were turned on by default
- Fixed llama3.1 405b CAR issue (no longer need --disable-custom-all-reduce)
