---
title: "Exploring inference memory saturation effect: H100 vs MI300x"
date: 2024-12-05
description: "This benchmark explores how GPU memory saturation affects LLM inference performance and cost, comparing NVIDIA H100 and AMD MI300x."
slug: h100-mi300x-inference-benchmark
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-v2.png?raw=true
categories:
  - Benchmarks
  - AMD
  - NVIDIA
---

# Exploring inference memory saturation effect: H100 vs MI300x

GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates the
impact of memory saturation on inference using NVIDIA's H100 and AMD's MI300x with Llama 3.1 405B FP8.

We examine the effect of limited parallel computational resources on throughput and Time to First Token (TTFT).
Additionally, we compare deployment strategies: running two Llama 3.1 405B FP8 replicas on 4xMI300x each versus a
single replica on either 4xMI300x or 8xMI300x.

Finally, we extrapolate performance projections for upcoming GPUs such as NVIDIA's H200 and B200 and AMD's MI325x and MI350x.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-v2.png?raw=true" width="600" />

This benchmark is made possible through the generous support of our friends at
[Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"} and
[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"},
who provided high-end hardware.

<!-- more -->

## Benchmark setup

1. AMD 8xMI300x
    * 2x Intel Xeon Platinum 8470, 52C/104T, 16GT/s, 105M Cache (350W)
    * 8x AMD MI300x GPU OAM, 192GB, 750W
    * 32x 64GB RDIMM, 4800MT/s
2. NVIDIA 8xH100 SXM5
    * 2x Intel Xeon Platinum 8480+, 56C/112T, 16GT/s, 105M Cache (350W)
    * 8x NVIDIA H100 SXM5 GPU, 80GB, 700W
    * 32x 64GB DDR5
### Benchmark modes

1. **Online inference**: Benchmarked at QPS 16, 32, and 1000 using
   the [ShareGPT :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered){:target="_blank"} dataset. Execution used
   vLLM’s [benchmark_serving.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py).
2. **Offline inference**: Benchmarked with varying input/output lengths across different batch sizes, using vLLM’s [benchmark_throughput.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py).

| Profile         | Input prompt lengths | Batch size              |
|-----------------|----------------------|-------------------------|
| **Short/Small** | 4 to 1024            |                         |
| **Short/Large** | 128                  | 256                     |
| **Large/Large** | 32784                | 64 (MI300x) / 16 (H100) |

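For reference, the invocations below sketch how these scripts are typically run against such a setup. The exact flag names depend on the vLLM version, and the ShareGPT JSON path, prompt counts, and output length are placeholders rather than the values used in this benchmark.

```shell
# Online serving benchmark against a running vLLM server (illustrative sketch).
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-405B-FP8 \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 16 \
    --num-prompts 1000

# Offline throughput benchmark, e.g. the Large/Large profile on MI300x.
# The output length here is a placeholder; the post does not state the exact value.
python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-405B-FP8 \
    --input-len 32784 \
    --output-len 128 \
    --num-prompts 64 \
    --tensor-parallel-size 8
```
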
## Observations

### Cost per token

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-cpt.png?raw=true" width="750">

As prompt and batch sizes grow, the NVIDIA H100 reaches its memory limits, causing a sharp drop in cost-effectiveness. In
contrast, a single Llama 3.1 405B FP8 replica on 8xMI300x is the most cost-efficient configuration for large prompts.

For large prompts, two parallel replicas running on 4xMI300x lose their cost advantage compared to a single replica on
8xMI300x. The latter offers 51% more memory for the KV cache, improving throughput and reducing cost per token.

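The intuition behind the extra KV-cache headroom is that the two-replica setup stores the model weights twice. Below is a rough back-of-the-envelope, assuming ~405 GB for the FP8 weights (1 byte per parameter) and ignoring the activation and graph-capture memory that vLLM reserves, which is why the exact ratio differs from the measured 51%:

```shell
# Approximate memory left for the KV cache (GB); weights are stored once per replica.
echo "single replica on 8xMI300x: $(( 8 * 192 - 405 )) GB"        # ~1131 GB
echo "two replicas on 4xMI300x:   $(( 2 * (4 * 192 - 405) )) GB"  # ~726 GB in total
```
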
<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-online-requests.png?raw=true" width="750">

While 4xMI300x is a cost-effective alternative to 8xH100 for smaller load profiles, it underperforms in online serving.
8xH100 SXM5 processes 74% more requests per second and reduces TTFT by at least 50% at all QPS levels.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-online-ttft.png?raw=true" width="750">

### Throughput

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-throughput.png?raw=true" width="750">

With large prompts and batch sizes, two replicas on 4xMI300x hit memory saturation once the KV cache for the total
tokens in flight (prompt length × batch size) exceeds the available GPU memory. This forces the inference engine to
recompute KV tensors on the fly or offload them to CPU memory, degrading throughput.

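To put rough numbers on this, here is an estimate of the KV-cache footprint for the Large/Large profile, assuming the published Llama 3.1 405B configuration (126 layers, 8 KV heads, head dimension 128) and an FP8 KV cache at 1 byte per element:

```shell
# KV-cache bytes per token: 2 (K and V) x layers x KV heads x head dim x 1 byte (FP8)
echo "$(( 2 * 126 * 8 * 128 )) bytes per token"                            # ~258 KB
# Large/Large profile on MI300x: 32784-token prompts x batch size 64
echo "$(( 32784 * 64 * 2 * 126 * 8 * 128 / 1000000000 )) GB of KV cache"   # ~541 GB
```

Roughly 541 GB of KV cache fits comfortably within the ~1.1 TB of post-weights headroom of a single 8xMI300x replica, but not within the ~360 GB available to each 4xMI300x replica, which is consistent with the saturation described above.
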
In a [Lambda Labs](https://lambdalabs.com/blog/partner-spotlight-evaluating-nvidia-h200-gpus-for-ai-inference-with-baseten)
benchmark, an 8xH200 setup processed 3.4 times the tokens per second of an 8xH100. Extrapolating to our
setup, an 8xH200 would process around 2,186 tokens per second (3.4 × 643), though still fewer than 8xMI300x.

|                           | AMD MI300x | NVIDIA H200 |
|---------------------------|------------|-------------|
| **GPU Memory**            | 192 GB     | 141 GB      |
| **Memory Type**           | HBM3       | HBM3e       |
| **Peak Memory Bandwidth** | 5.3 TB/s   | 4.8 TB/s    |
| **TFLOPS (FP8)**          | 2610       | 1979        |

#### Replicas on 4xMI300x

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-throughput-2048.png?raw=true" width="750">

Running two replicas on 4xMI300x delivers better throughput for small to medium prompts than a single replica on
8xMI300x.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-throughput-32784.png?raw=true" width="750">

This boost comes from serving the Llama 3.1 405B model with two independent replicas, each spread across four GPUs. For
small prompts, a single replica underutilizes the GPUs, while running two replicas doubles the aggregate batch size,
improving GPU utilization and efficiency.

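A minimal sketch of such a two-replica deployment on a single 8xMI300x node is shown below. This is not the exact setup used in the benchmark: the GPU visibility variable and ports are assumptions, and the router or load balancer needed in front of the two endpoints is omitted.

```shell
# Two independent 4-GPU replicas on one node (sketch; CUDA_VISIBLE_DEVICES is
# also honored by ROCm builds of PyTorch). A load balancer in front of
# ports 8000/8001 is assumed and not shown.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve amd/Llama-3.1-405B-Instruct-FP8-KV \
    -tp 4 --port 8000 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve amd/Llama-3.1-405B-Instruct-FP8-KV \
    -tp 4 --port 8001 &
```
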
### Time To First Token

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-ttft-qps-1000.png?raw=true" width="750">

The 4xMI300x setup provides 768 GB of memory (4 GPUs × 192 GB each), compared to 640 GB with 8xH100 (8 GPUs × 80 GB
each). However, at 1000 QPS, TTFT for 4xMI300x is over twice as long as for 8xH100.

This difference arises during the prefill stage, where the KV tensors for input prompts are computed. Since these tensors
are processed in parallel across GPUs, the 8xH100 configuration distributes the load more effectively, reducing computation time.

Despite offering more memory, 4xMI300x lacks the parallelism of 8xH100, leading to longer TTFT.

### Time to Serve 1 Request

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-time-1-request.png?raw=true" width="750">

Processing a single large prompt request with 8xMI300x takes around 11.25 seconds. This latency is mainly due to
computational demands during the prefill phase, where KV tensors are computed.

Optimizations like [automatic prefix caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html){:target="_blank"}
could help reduce this time, but are outside the scope of this benchmark.

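If prompts share long common prefixes, prefix caching can be switched on at serve time. The sketch below reflects the current vLLM CLI flag and was not used in this benchmark:

```shell
# Not used in this benchmark: reuse cached KV blocks for shared prompt prefixes.
vllm serve amd/Llama-3.1-405B-Instruct-FP8-KV -tp 8 --enable-prefix-caching
```
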
## Benchmark notes

### Benchmark setup

The script used in this benchmark was designed for large prompts in offline inference. A different script tailored for
online inference would provide more accurate insights.

### Batch size

We compared throughput at batch size 16 for 8xH100 and batch size 64 for 8xMI300x. The 8xH100 setup begins to struggle
with batch size 16 due to memory saturation, resulting in slower generation times.

### Model checkpoints

For AMD MI300x, we used [`amd/Llama-3.1-405B-Instruct-FP8-KV` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV){:target="_blank"}
to achieve optimal performance, relying on AMD for quantization.

### vLLM configuration

To maximize inference performance on AMD MI300x, we adjusted specific arguments:

<div class="termy">

```shell
# VLLM_USE_TRITON_FLASH_ATTN=0 falls back from the Triton flash-attention kernel
# to the ROCm implementation; VLLM_RPC_TIMEOUT (ms) is raised so long prefills
# don't hit the engine RPC timeout.
$ VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve \
    meta-llama/Llama-3.1-405B-FP8 -tp 8 \
    --max-seq-len-to-capture 16384 \
    --served-model-name meta-llama/Llama-3.1-405B-FP8 \
    --enable-chunked-prefill=False \
    --num-scheduler-steps 15 \
    --max-num-seqs 1024
```

</div>

Our benchmark focused on testing inference with tensor parallelism. Integrating tensor and pipeline parallelism could
provide additional insights.

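For anyone who wants to explore that direction, vLLM exposes both degrees of parallelism through its CLI. A minimal sketch (not benchmarked here) for a single 8-GPU node:

```shell
# 4-way tensor parallelism within each stage, 2-way pipeline parallelism across stages.
vllm serve meta-llama/Llama-3.1-405B-FP8 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2
```
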
## On B200, MI325x, and MI350x

The MI325x offers 64 GB more HBM and 0.7 TB/s higher bandwidth than the MI300x. However, because it has the same FP8 TFLOPS, it
doesn't provide significant compute gains, positioning it against NVIDIA's H200.

The NVIDIA B200 outperforms the MI300x and MI325x with more TFLOPS and higher peak memory bandwidth, promising lower TTFT
thanks to reduced compute time for KV tensors during prefill, as well as faster memory transfers during the decode stage.
We expect the B200 to challenge the MI325x, as long as memory saturation is avoided.

Notably, future GPUs from AMD and NVIDIA are expected to support FP4 and FP6, improving throughput, latency, and
cost-efficiency.

|                           | AMD MI300x | AMD MI325x | AMD MI350x    | NVIDIA B200   |
|---------------------------|------------|------------|---------------|---------------|
| **GPU Memory**            | 192 GB     | 256 GB     | 288 GB        | 192 GB        |
| **Memory Type**           | HBM3       | HBM3e      |               | HBM3e         |
| **Peak Memory Bandwidth** | 5.3 TB/s   | 6 TB/s     |               | 8 TB/s        |
| **TFLOPS (FP8)**          | 2610       | 2610       |               | 4500          |
| **Low precision**         | FP8        | FP8        | FP4, FP6, FP8 | FP4, FP6, FP8 |

## Thanks to our friends

### Hot Aisle

[Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"} sponsored this benchmark by providing access to 8xMI300x hardware. We’re deeply grateful for their support.

If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Hot Aisle. With `dstack`, accessing
your cluster via SSH is seamless and straightforward.

### Lambda

[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} sponsored this benchmark with credits for on-demand 8xH100 instances.
We’re truly thankful for their support.

For top-tier cloud compute with NVIDIA GPUs, Lambda is an excellent choice. Once set up, you can easily provision
compute, manage clusters, and orchestrate your AI workloads using `dstack`.
