
Commit 7ec024e

Updated results, removed max_seq_len_to_capture (#727)
* Updated results, removed max_seq_len_to_capture
* Updated Python version and Docker Manifest section
1 parent f6761f8 commit 7ec024e

1 file changed: +63 -51 lines changed

docs/dev-docker/README.md

Lines changed: 63 additions & 51 deletions
```diff
@@ -10,19 +10,20 @@ This documentation includes information for running the popular Llama 3.1 series

 The pre-built image includes:

-- ROCm™ 6.4.1
+- ROCm™ 7.0.0
 - HipblasLT 0.15
-- vLLM 0.10.1
-- PyTorch 2.7
+- vLLM 0.10.2
+- PyTorch 2.9

 ## Pull latest Docker Image

 Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`

 ## What is New

-- vLLM version 0.10.1
-- Flag enabled by default in the docker -VLLM_V1_USE_PREFILL_DECODE_ATTENTION
+- Support for FP4 models
+- GPT-OSS support
+- Support for MI35x

 ## Known Issues and Workarounds
```
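
The pull instruction in the hunk above only fetches the image. As a minimal launch sketch, the container can be started as shown below; the device, group, and shared-memory flags are typical ROCm container settings assumed here, not values taken from this commit, so adjust them for your system.

```bash
# Minimal sketch with assumed flags (not from this commit): pull the
# validated image and start an interactive container with GPU access.
docker pull rocm/vllm-dev:main
docker run -it --rm \
    --network host \
    --device /dev/kfd \
    --device /dev/dri \
    --group-add video \
    --ipc host \
    --shm-size 16G \
    -v "$PWD":/workspace \
    rocm/vllm-dev:main
```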

```diff
@@ -39,14 +40,14 @@ The table below shows performance data where a local inference client is fed req

 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13818.7 |
-| | | | 128 | 4096 | 1500 | 1500 | 11612.0 |
-| | | | 500 | 2000 | 2000 | 2000 | 11408.7 |
-| | | | 2048 | 2048 | 1500 | 1500 | 7800.5 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4134.0 |
-| | | | 128 | 4096 | 1500 | 1500 | 3177.6 |
-| | | | 500 | 2000 | 2000 | 2000 | 3034.1 |
-| | | | 2048 | 2048 | 500 | 500 | 2214.2 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13212.5 |
+| | | | 128 | 4096 | 1500 | 1500 | 11312.8 |
+| | | | 500 | 2000 | 2000 | 2000 | 11376.7 |
+| | | | 2048 | 2048 | 1500 | 1500 | 7252.1 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4201.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 3176.3 |
+| | | | 500 | 2000 | 2000 | 2000 | 2992.0 |
+| | | | 2048 | 2048 | 500 | 500 | 2153.7 |

 *TP stands for Tensor Parallelism.*
```

```diff
@@ -58,38 +59,38 @@ The table below shows latency measurement, which typically involves assessing th

 | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.254 |
-| | | | 2 | 128 | 2048 | 18.157 |
-| | | | 4 | 128 | 2048 | 18.549 |
-| | | | 8 | 128 | 2048 | 20.547 |
-| | | | 16 | 128 | 2048 | 22.164 |
-| | | | 32 | 128 | 2048 | 25.426 |
-| | | | 64 | 128 | 2048 | 33.297 |
-| | | | 128 | 128 | 2048 | 45.792 |
-| | | | 1 | 2048 | 2048 | 15.299 |
-| | | | 2 | 2048 | 2048 | 18.194 |
-| | | | 4 | 2048 | 2048 | 18.942 |
-| | | | 8 | 2048 | 2048 | 20.526 |
-| | | | 16 | 2048 | 2048 | 23.211 |
-| | | | 32 | 2048 | 2048 | 26.516 |
-| | | | 64 | 2048 | 2048 | 34.824 |
-| | | | 128 | 2048 | 2048 | 52.211 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 47.150 |
-| | | | 2 | 128 | 2048 | 50.933 |
-| | | | 4 | 128 | 2048 | 52.521 |
-| | | | 8 | 128 | 2048 | 55.233 |
-| | | | 16 | 128 | 2048 | 59.065 |
-| | | | 32 | 128 | 2048 | 68.786 |
-| | | | 64 | 128 | 2048 | 88.094 |
-| | | | 128 | 128 | 2048 | 118.512 |
-| | | | 1 | 2048 | 2048 | 47.675 |
-| | | | 2 | 2048 | 2048 | 50.788 |
-| | | | 4 | 2048 | 2048 | 52.405 |
-| | | | 8 | 2048 | 2048 | 55.459 |
-| | | | 16 | 2048 | 2048 | 59.923 |
-| | | | 32 | 2048 | 2048 | 70.388 |
-| | | | 64 | 2048 | 2048 | 91.218 |
-| | | | 128 | 2048 | 2048 | 127.004 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.882 |
+| | | | 2 | 128 | 2048 | 17.934 |
+| | | | 4 | 128 | 2048 | 18.487 |
+| | | | 8 | 128 | 2048 | 20.251 |
+| | | | 16 | 128 | 2048 | 22.307 |
+| | | | 32 | 128 | 2048 | 29.933 |
+| | | | 64 | 128 | 2048 | 32.359 |
+| | | | 128 | 128 | 2048 | 45.419 |
+| | | | 1 | 2048 | 2048 | 15.959 |
+| | | | 2 | 2048 | 2048 | 18.177 |
+| | | | 4 | 2048 | 2048 | 18.684 |
+| | | | 8 | 2048 | 2048 | 20.716 |
+| | | | 16 | 2048 | 2048 | 23.136 |
+| | | | 32 | 2048 | 2048 | 26.969 |
+| | | | 64 | 2048 | 2048 | 34.359 |
+| | | | 128 | 2048 | 2048 | 52.351 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 49.098 |
+| | | | 2 | 128 | 2048 | 51.009 |
+| | | | 4 | 128 | 2048 | 52.979 |
+| | | | 8 | 128 | 2048 | 55.675 |
+| | | | 16 | 128 | 2048 | 58.982 |
+| | | | 32 | 128 | 2048 | 67.889 |
+| | | | 64 | 128 | 2048 | 86.844 |
+| | | | 128 | 128 | 2048 | 117.440 |
+| | | | 1 | 2048 | 2048 | 49.033 |
+| | | | 2 | 2048 | 2048 | 51.316 |
+| | | | 4 | 2048 | 2048 | 52.947 |
+| | | | 8 | 2048 | 2048 | 55.863 |
+| | | | 16 | 2048 | 2048 | 60.103 |
+| | | | 32 | 2048 | 2048 | 69.632 |
+| | | | 64 | 2048 | 2048 | 89.826 |
+| | | | 128 | 2048 | 2048 | 126.433 |

 *TP stands for Tensor Parallelism.*
```

```diff
@@ -206,7 +207,6 @@ Below is a list of a few of the key vLLM engine arguments for performance; these
 - **--max-model-len** : Maximum context length supported by the model instance. Can be set to a lower value than model configuration value to improve performance and gpu memory utilization.
 - **--max-num-batched-tokens** : The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher gpu memory utilization. 65536 works well for LLama models.
 - **--max-num-seqs** : The maximum decode batch size (default 256). Using larger values will allow more prompts to be processed concurrently, resulting in increased throughput (possibly at the expense of higher latency). If the value is too large, there may not be enough GPU memory for the KV cache, resulting in requests getting preempted. The optimal value will depend on the GPU memory, model size, and maximum context length.
-- **--max-seq-len-to-capture** : Maximum sequence length for which Hip-graphs are captured and utilized. It's recommended to use Hip-graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as LLama. Set this parameter to max-model-len or maximum context length supported by the model for best performance.
 - **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. Increasing the value (potentially as high as 0.99) will increase the amount of memory available for KV cache. When running in graph mode (i.e. not using `--enforce-eager`), it may be necessary to use a slightly smaller value of 0.92 - 0.95 to ensure adequate memory is available for the HIP graph.

 ### Latency Benchmark
```
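
As a rough illustration of how the arguments described in the hunk above combine on a command line, here is a hedged `vllm serve` sketch; the model name and every numeric value are placeholders, not tuned recommendations from this README.

```bash
# Illustrative sketch only: placeholder model and values, not tuned settings.
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 65536 \
    --max-num-seqs 1024 \
    --gpu-memory-utilization 0.95
```

When running in graph mode, keeping `--gpu-memory-utilization` around 0.92 - 0.95 (as noted above) leaves headroom for the HIP graph.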
```diff
@@ -271,7 +271,6 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \
     --model $MODEL \
     --max-model-len 8192 \
     --max-num-batched-tokens 131072 \
-    --max-seq-len-to-capture 131072 \
     --input-len $IN \
     --output-len $OUT \
     --tensor-parallel-size $TP \
```
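
The benchmark invocation above relies on shell variables for the sweep; illustrative assignments (assumed values, not prescribed by this commit) might look like:

```bash
# Assumed example values for the variables used by benchmark_throughput.py above.
MODEL=amd/Llama-3.1-70B-Instruct-FP8-KV   # model to benchmark
IN=128                                    # input (prompt) length in tokens
OUT=2048                                  # output (generation) length in tokens
TP=8                                      # tensor-parallel size
```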
````diff
@@ -471,13 +470,21 @@ Please refer to the [Benchmarking Machine Learning using ROCm and AMD GPUs: Repr

 ## Docker Manifest

-To reproduce the release docker:
+Clone the vLLM repository:

 ```bash
-git clone https://github.com/ROCm/vllm.git
+git clone https://github.com/vllm-project/vllm.git
 cd vllm
-git checkout 6663000a391911eba96d7864a26ac42b07f6ef29
-docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
+```
+
+Use the following command to build the image directly from the specified commit.
+
+```bash
+docker build -f docker/Dockerfile.rocm \
+    --build-arg REMOTE_VLLM=1 \
+    --build-arg VLLM_REPO=https://github.com/ROCm/vllm \
+    --build-arg VLLM_BRANCH="790d22168820507f3105fef29596549378cfe399" \
+    -t vllm-rocm .
 ```

 ### Building AITER Image
````
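
After the build in the hunk above completes, a quick sanity check might be to confirm the image starts and reports its vLLM version; this sketch assumes the `vllm-rocm` tag used in the hunk and that `python3` with an importable vLLM is on the container path.

```bash
# Assumed check: print the vLLM version baked into the freshly built image.
docker run --rm vllm-rocm python3 -c "import vllm; print(vllm.__version__)"
```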
```diff
@@ -493,6 +500,11 @@ Use AITER release candidate branch instead:

 ## Changelog

+rocm7.0.0_vllm_0.10.2_20251002:
+- Support for FP4 models
+- GPT-OSS support
+- Support for MI35x
+
 rocm6.4.1_vllm_0.10.1_20250909:
 - vLLM version 0.10.1
 - Flag enabled by default in the docker -VLLM_V1_USE_PREFILL_DECODE_ATTENTION
```
