@@ -14,7 +14,7 @@ author: "Guest Post by Embedded LLM and Hot Aisle Inc."
&nbsp; &nbsp;
<img src="/assets/figures/vllm-serving-amd/405b2.png" width="35%">
</picture><br>
- vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (FP16, 32 QPS).
+ vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (BF16, 32 QPS).
</p>

<p align="center">
@@ -24,7 +24,7 @@ vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (FP16, 32 Q
&nbsp; &nbsp;
<img src="/assets/figures/vllm-serving-amd/70b2.png" width="35%">
</picture><br>
- vLLM vs. TGI performance comparison for Llama 3.1 70B on 8 x MI300X (FP16, 32 QPS).
+ vLLM vs. TGI performance comparison for Llama 3.1 70B on 8 x MI300X (BF16, 32 QPS).
</p>

### Introduction
@@ -49,7 +49,7 @@ Even in the default configuration, vLLM shows superior performance compared to T
<picture>
<img src="/assets/figures/vllm-serving-amd/introduction/Mean TTFT (ms).png" width="70%">
</picture><br>
- vLLM vs. TGI performance for Llama 3.1 405B on 8 x MI300X (FP16, QPS 16, 32, 1000; see Appendix for commands).
+ vLLM vs. TGI performance for Llama 3.1 405B on 8 x MI300X (BF16, QPS 16, 32, 1000; see Appendix for commands).
</p>

### How to run vLLM with Optimal Performance
@@ -335,5 +335,5 @@ We have built the ROCm compatible vLLM docker from Dockerfile.rocm found in the
| ------------- | ------------- |
| vLLM Default Configuration | `VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct -tp 8 --max-num-seqs 1024 --max-num-batched-tokens 1024 ` |
| TGI Default Configuration | `ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --num-shard 8 --sharded true --max-concurrent-requests 1024 --model-id Llama-3.1-405B-Instruct` |
- | vLLM (This Guide) | `VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct-FP8 -tp 8 --max-seq-len-to-capture 16384 --enable-chunked-prefill=False --num-scheduler-step 15 --max-num-seqs 1024 ` |
+ | vLLM (This Guide) | `VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct -tp 8 --max-seq-len-to-capture 16384 --enable-chunked-prefill=False --num-scheduler-step 15 --max-num-seqs 1024 ` |
| TGI (This Guide) | `ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --num-shard 8 --sharded true --max-concurrent-requests 1024 --max-total-tokens 131072 --max-input-tokens 131000 --model-id Llama-3.1-405B-Instruct` |