Commit 7bb0618 (1 parent: f4a992c)

Added benchmark results and commit hash (ROCm#556)

* Added benchmark results and commit hash
* Added release notes/changelog
* Update README.md

File tree

1 file changed: 45 additions (+), 44 deletions (-)

docs/dev-docker/README.md
@@ -21,9 +21,7 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
 
 ## What is New
 
-- Out of memory bug fix
-- PyTorch fixes
-- Tunable ops fixes
+- AITER V1 engine performance improvement
 
 ## Known Issues and Workarounds
 - None
@@ -39,14 +37,14 @@ The table below shows performance data where a local inference client is fed req
 
 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16892.6 |
-| | | | 128 | 4096 | 1500 | 1500 | 13916.7 |
-| | | | 500 | 2000 | 2000 | 2000 | 13616.1 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8491.8 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4380.3 |
-| | | | 128 | 4096 | 1500 | 1500 | 3404.2 |
-| | | | 500 | 2000 | 2000 | 2000 | 3251.3 |
-| | | | 2048 | 2048 | 500 | 500 | 2249.3 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16622.2 |
+| | | | 128 | 4096 | 1500 | 1500 | 13779.8 |
+| | | | 500 | 2000 | 2000 | 2000 | 13424.9 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8356.5 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4243.9 |
+| | | | 128 | 4096 | 1500 | 1500 | 3394.4 |
+| | | | 500 | 2000 | 2000 | 2000 | 3201.8 |
+| | | | 2048 | 2048 | 500 | 500 | 2208.0 |
 
 *TP stands for Tensor Parallelism.*
 
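For orientation, the throughput column above is total generated tokens divided by wall-clock serving time. A minimal sketch of that arithmetic (the 400-second elapsed time below is hypothetical, chosen only to illustrate the calculation, not a measured value):

```python
def throughput_tokens_per_s(num_prompts: int, output_len: int, elapsed_s: float) -> float:
    """Total generated tokens divided by wall-clock serving time."""
    return num_prompts * output_len / elapsed_s

# 3200 prompts x 2048 output tokens (the first 70B row) over a
# hypothetical 400 s run:
print(throughput_tokens_per_s(3200, 2048, 400.0))  # 16384.0
```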
@@ -56,38 +54,38 @@ The table below shows latency measurement, which typically involves assessing th
 
 | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.591 |
-| | | | 2 | 128 | 2048 | 16.865 |
-| | | | 4 | 128 | 2048 | 17.295 |
-| | | | 8 | 128 | 2048 | 18.939 |
-| | | | 16 | 128 | 2048 | 20.891 |
-| | | | 32 | 128 | 2048 | 23.402 |
-| | | | 64 | 128 | 2048 | 30.633 |
-| | | | 128 | 128 | 2048 | 43.898 |
-| | | | 1 | 2048 | 2048 | 15.678 |
-| | | | 2 | 2048 | 2048 | 16.892 |
-| | | | 4 | 2048 | 2048 | 17.781 |
-| | | | 8 | 2048 | 2048 | 19.536 |
-| | | | 16 | 2048 | 2048 | 22.521 |
-| | | | 32 | 2048 | 2048 | 26.729 |
-| | | | 64 | 2048 | 2048 | 36.794 |
-| | | | 128 | 2048 | 2048 | 56.371 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.446 |
-| | | | 2 | 128 | 2048 | 46.223 |
-| | | | 4 | 128 | 2048 | 47.833 |
-| | | | 8 | 128 | 2048 | 52.085 |
-| | | | 16 | 128 | 2048 | 54.378 |
-| | | | 32 | 128 | 2048 | 63.108 |
-| | | | 64 | 128 | 2048 | 81.764 |
-| | | | 128 | 128 | 2048 | 109.479 |
-| | | | 1 | 2048 | 2048 | 46.001 |
-| | | | 2 | 2048 | 2048 | 46.720 |
-| | | | 4 | 2048 | 2048 | 49.250 |
-| | | | 8 | 2048 | 2048 | 54.495 |
-| | | | 16 | 2048 | 2048 | 59.539 |
-| | | | 32 | 2048 | 2048 | 73.906 |
-| | | | 64 | 2048 | 2048 | 103.847 |
-| | | | 128 | 2048 | 2048 | 151.613 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.851 |
+| | | | 2 | 128 | 2048 | 16.995 |
+| | | | 4 | 128 | 2048 | 17.578 |
+| | | | 8 | 128 | 2048 | 19.277 |
+| | | | 16 | 128 | 2048 | 21.111 |
+| | | | 32 | 128 | 2048 | 23.902 |
+| | | | 64 | 128 | 2048 | 30.976 |
+| | | | 128 | 128 | 2048 | 44.107 |
+| | | | 1 | 2048 | 2048 | 15.981 |
+| | | | 2 | 2048 | 2048 | 17.322 |
+| | | | 4 | 2048 | 2048 | 18.025 |
+| | | | 8 | 2048 | 2048 | 20.218 |
+| | | | 16 | 2048 | 2048 | 22.690 |
+| | | | 32 | 2048 | 2048 | 27.407 |
+| | | | 64 | 2048 | 2048 | 37.099 |
+| | | | 128 | 2048 | 2048 | 56.659 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.929 |
+| | | | 2 | 128 | 2048 | 46.871 |
+| | | | 4 | 128 | 2048 | 48.763 |
+| | | | 8 | 128 | 2048 | 51.621 |
+| | | | 16 | 128 | 2048 | 54.822 |
+| | | | 32 | 128 | 2048 | 63.642 |
+| | | | 64 | 128 | 2048 | 82.256 |
+| | | | 128 | 128 | 2048 | 110.142 |
+| | | | 1 | 2048 | 2048 | 46.489 |
+| | | | 2 | 2048 | 2048 | 47.465 |
+| | | | 4 | 2048 | 2048 | 49.906 |
+| | | | 8 | 2048 | 2048 | 54.252 |
+| | | | 16 | 2048 | 2048 | 60.275 |
+| | | | 32 | 2048 | 2048 | 74.346 |
+| | | | 64 | 2048 | 2048 | 104.508 |
+| | | | 128 | 2048 | 2048 | 154.134 |
 
 *TP stands for Tensor Parallelism.*
 
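A useful way to read the latency table is to convert each row into an aggregate decode rate, batch_size × output ÷ latency: latency grows much more slowly than batch size, so batching raises total throughput sharply. A small sketch using two measured rows from the updated 70B table (2048-token input):

```python
def decode_tokens_per_s(batch_size: int, output_len: int, latency_s: float) -> float:
    """Aggregate generated tokens per second implied by one latency row."""
    return batch_size * output_len / latency_s

# Batch 1 finishes 2048 output tokens in 15.981 s; batch 128 does
# 128x the work in 56.659 s, only ~3.5x the time.
single = decode_tokens_per_s(1, 2048, 15.981)    # ~128 tokens/s
batched = decode_tokens_per_s(128, 2048, 56.659) # ~4627 tokens/s
print(round(single, 1), round(batched, 1))
```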
@@ -489,7 +487,7 @@ To reproduce the release docker:
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout d60b5a337a552b6f74f511462d4ba67ea0ac4402
+git checkout 91a56009841e11b84a2aeb9cc5aa305ab2808ede
 docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```
 
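Once built, a ROCm image like this is typically launched with the standard ROCm device mappings. A sketch only: these flags are the usual ROCm container options, not taken from this README, and `<your_tag>` is the tag chosen in the build step above; adjust shared-memory size and volume mounts for your system.

```shell
# Launch the freshly built image with GPU access (environment-dependent;
# verify flags against your ROCm installation's documentation).
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --shm-size 16G \
  --security-opt seccomp=unconfined \
  <your_tag>
```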
@@ -506,6 +504,9 @@ Use AITER release candidate branch instead:
 
 ## Changelog
 
+20250521_aiter:
+- AITER V1 engine performance improvement
+
 20250513_aiter:
 - Out of memory bug fix
 - PyTorch fixes
