Updated README.md (ROCm#546)

Mcirino1 · gshtras · web-flow · commit 16d2b92ebcf9 · 2025-05-19T10:35:04.000-05:00
* Updated README.md

Waiting on benchmark results, do not publish yet

* Changed "OOM" to "Out of memory"

* Added throughput results

* Added latency results

* Trying to fix syntax

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
@@ -12,22 +12,21 @@ The pre-built image includes:
 
 - ROCm™ 6.3.1
 - HipblasLT 0.15
-- vLLM 0.8.3
-- PyTorch 2.7dev (nightly)
+- vLLM 0.8.5
+- PyTorch 2.7
 
 ## Pull latest Docker Image
 
 Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
 
 ## What is New
 
-- [Improved DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
-- Initial Gemma-3 enablement
-- Detokenizer disablement
-- Torch.compile support
+- Out of memory bug fix
+- PyTorch fixes
+- Tunable ops fixes
 
 ## Known Issues and Workarounds
-- Mem fault encountered when running the model meta 405 fp8. To workaround this issue, set PYTORCH_TUNABLEOP_ENABLED=0
+- None
 
 ## Performance Results
 
@@ -40,14 +39,14 @@ The table below shows performance data where a local inference client is fed req
 
 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16364.9  |
-|       |           |         | 128   | 4096   | 1500        | 1500         | 12171.0               |
-|       |           |         | 500   | 2000   | 2000        | 2000         | 13290.4               |
-|       |           |         | 2048  | 2048   | 1500        | 1500         | 8216.5                |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4331.6 |
-|       |           |         | 128   | 4096   | 1500        | 1500         | 3409.9                |
-|       |           |         | 500   | 2000   | 2000        | 2000         | 3184.0                |
-|       |           |         | 2048  | 2048   | 500         | 500          | 2154.3                |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16892.6  |
+|       |           |         | 128   | 4096   | 1500        | 1500         | 13916.7               |
+|       |           |         | 500   | 2000   | 2000        | 2000         | 13616.1               |
+|       |           |         | 2048  | 2048   | 1500        | 1500         | 8491.8                |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4380.3 |
+|       |           |         | 128   | 4096   | 1500        | 1500         | 3404.2                |
+|       |           |         | 500   | 2000   | 2000        | 2000         | 3251.3                |
+|       |           |         | 2048  | 2048   | 500         | 500          | 2249.3                |
 
 *TP stands for Tensor Parallelism.*
 
@@ -57,42 +56,42 @@ The table below shows latency measurement, which typically involves assessing th
 
 | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.411 |
-| | | | 2 | 128 | 2048 | 18.750 |
-| | | | 4 | 128 | 2048 | 19.059 |
-| | | | 8 | 128 | 2048 | 20.857  |
-| | | | 16 | 128 | 2048 | 22.670 |
-| | | | 32 | 128 | 2048 | 25.495 |
-| | | | 64 | 128 | 2048 | 34.187 |
-| | | | 128 | 128 | 2048 | 48.754 |
-| | | | 1 | 2048 | 2048 | 17.699 |
-| | | | 2 | 2048 | 2048 | 18.919 |
-| | | | 4 | 2048 | 2048 | 19.220 |
-| | | | 8 | 2048 | 2048 | 21.545 |
-| | | | 16 | 2048 | 2048 | 24.329 |
-| | | | 32 | 2048 | 2048 | 29.461 |
-| | | | 64 | 2048 | 2048 | 40.148 |
-| | | | 128 | 2048 | 2048 | 61.382 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.601 |
-| | | | 2 | 128 | 2048 | 46.947 |
-| | | | 4 | 128 | 2048 | 48.971 |
-| | | | 8 | 128 | 2048 | 53.021 |
-| | | | 16 | 128 | 2048 | 55.836 |
-| | | | 32 | 128 | 2048 | 64.947 |
-| | | | 64 | 128 | 2048 | 81.408 |
-| | | | 128 | 128 | 2048 | 115.296 |
-| | | | 1 | 2048 | 2048 | 46.998 |
-| | | | 2 | 2048 | 2048 | 47.619 |
-| | | | 4 | 2048 | 2048 | 51.086 |
-| | | | 8 | 2048 | 2048 | 55.706 |
-| | | | 16 | 2048 | 2048 | 61.049 |
-| | | | 32 | 2048 | 2048 | 75.842 |
-| | | | 64 | 2048 | 2048 | 103.074 |
-| | | | 128 | 2048 | 2048 | 157.705 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.591 |
+| | | | 2 | 128 | 2048 | 16.865 |
+| | | | 4 | 128 | 2048 | 17.295 |
+| | | | 8 | 128 | 2048 | 18.939 |
+| | | | 16 | 128 | 2048 | 20.891 |
+| | | | 32 | 128 | 2048 | 23.402 |
+| | | | 64 | 128 | 2048 | 30.633 |
+| | | | 128 | 128 | 2048 | 43.898 |
+| | | | 1 | 2048 | 2048 | 15.678 |
+| | | | 2 | 2048 | 2048 | 16.892 |
+| | | | 4 | 2048 | 2048 | 17.781 |
+| | | | 8 | 2048 | 2048 | 19.536 |
+| | | | 16 | 2048 | 2048 | 22.521 |
+| | | | 32 | 2048 | 2048 | 26.729 |
+| | | | 64 | 2048 | 2048 | 36.794 |
+| | | | 128 | 2048 | 2048 | 56.371 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.446 |
+| | | | 2 | 128 | 2048 | 46.223 |
+| | | | 4 | 128 | 2048 | 47.833 |
+| | | | 8 | 128 | 2048 | 52.085 |
+| | | | 16 | 128 | 2048 | 54.378 |
+| | | | 32 | 128 | 2048 | 63.108 |
+| | | | 64 | 128 | 2048 | 81.764 |
+| | | | 128 | 128 | 2048 | 109.479 |
+| | | | 1 | 2048 | 2048 | 46.001 |
+| | | | 2 | 2048 | 2048 | 46.720 |
+| | | | 4 | 2048 | 2048 | 49.250 |
+| | | | 8 | 2048 | 2048 | 54.495 |
+| | | | 16 | 2048 | 2048 | 59.539 |
+| | | | 32 | 2048 | 2048 | 73.906 |
+| | | | 64 | 2048 | 2048 | 103.847 |
+| | | | 128 | 2048 | 2048 | 151.613 |
 
 *TP stands for Tensor Parallelism.*
 
-Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9554 Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
+Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9575F Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
 
 ## Reproducing Benchmarked Results
 
@@ -490,7 +489,7 @@ To reproduce the release docker:
 ```bash
     git clone https://github.com/ROCm/vllm.git
     cd vllm
-    git checkout b8498bc4a1c2aae1e25cfc780db0eadbc4716c67
+    git checkout d60b5a337a552b6f74f511462d4ba67ea0ac4402
     docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```
 
@@ -507,6 +506,11 @@ Use AITER release candidate branch instead:
 
 ## Changelog
 
+20250513_aiter:
+- Out of memory bug fix
+- PyTorch fixes
+- Tunable ops fixes
+
 20250410_aiter:
 - 2-stage MoE
 - MLA from AITER
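
To put the updated throughput numbers in this diff into perspective, the before/after pairs can be turned into relative changes. The snippet below is a minimal sketch, not part of the README or the benchmark tooling: the values are copied from the Llama 3.1 70B rows above, and the names `THROUGHPUT_70B`, `percent_change`, and `summarize` are hypothetical helpers introduced only for illustration.

```python
# Throughput (tokens/s) for Llama 3.1 70B FP8, TP=8, copied from the diff above.
# Keys are (input length, output length); values are (before, after).
# The structure and names here are illustrative assumptions.
THROUGHPUT_70B = {
    (128, 2048): (16364.9, 16892.6),
    (128, 4096): (12171.0, 13916.7),
    (500, 2000): (13290.4, 13616.1),
    (2048, 2048): (8216.5, 8491.8),
}


def percent_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def summarize(pairs: dict) -> dict:
    """Map each (input, output) shape to its percent change, rounded to 0.1."""
    return {
        shape: round(percent_change(before, after), 1)
        for shape, (before, after) in pairs.items()
    }


if __name__ == "__main__":
    for (inp, out), pct in summarize(THROUGHPUT_70B).items():
        print(f"input={inp:>4} output={out:>4}: {pct:+.1f}%")
```

The same `percent_change` helper works for the latency table, where a negative result means the new image is faster. Note that the benchmark host CPU also changed in this commit (EPYC 9554 to EPYC 9575F), so the differences may not be attributable to the software update alone.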