---
title: "Exploring inference memory saturation effect: H100 vs MI300x"
date: 2024-12-05
description: "This benchmark explores how GPU memory saturation affects LLM inference performance and cost, comparing NVIDIA H100 and AMD MI300x."
slug: h100-mi300x-inference-benchmark
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-v2.png?raw=true
categories:
  - Benchmarks
  - AMD
  - NVIDIA
---

# Exploring inference memory saturation effect: H100 vs MI300x

GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory
saturation’s impact on inference using NVIDIA's H100 and AMD's MI300x with Llama 3.1 405B FP8.

We examine the effect of limited parallel computational resources on throughput and Time to First Token (TTFT).
Additionally, we compare deployment strategies: running two Llama 3.1 405B FP8 replicas (each on 4xMI300x) versus a
single replica on 4xMI300x and on 8xMI300x.

Finally, we extrapolate performance to upcoming GPUs such as NVIDIA's H200 and B200, and AMD's MI325x and MI350x.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-v2.png?raw=true" width="600" />

This benchmark is made possible through the generous support of our friends at
[Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"} and
[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"},
who provided high-end hardware.

<!-- more -->

## Benchmark setup

1. AMD 8xMI300x
    * 2x Intel Xeon Platinum 8470, 52C/104T, 16GT/s, 105M Cache (350W)
    * 8x AMD MI300x GPU OAM, 192GB, 750W
    * 32x 64GB RDIMM, 4800MT/s
2. NVIDIA 8xH100 SXM5
    * 2x Intel Xeon Platinum 8480+, 56C/112T, 16GT/s, 105M Cache (350W)
    * 8x NVIDIA H100 SXM5 GPU, 80GB, 700W
    * 32x 64GB DDR5

### Benchmark modes

1. **Online inference**: Benchmarked across QPS 16, 32, and 1000 using
   the [ShareGPT :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered){:target="_blank"} dataset. Execution used
   vLLM’s [benchmark_serving.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py).
2. **Offline inference**: Benchmarked with varying input/output lengths across different batch sizes, using vLLM’s [benchmark_throughput.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py). Example invocations for both modes are sketched after the table below.

|                 | Input prompt length (tokens) | Batch size              |
|-----------------|------------------------------|-------------------------|
| **Short/Small** | 4 to 1024                    |                         |
| **Short/Large** | 128                          | 256                     |
| **Large/Large** | 32784                        | 64 (MI300x) / 16 (H100) |
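
For reference, the sketch below shows roughly how such runs are launched with vLLM's benchmark scripts. The exact flags
and values varied per configuration; the dataset path, request rate, prompt count, and output length shown here are
illustrative assumptions rather than the precise settings used in this benchmark.

<div class="termy">

```shell
# Online: replay ShareGPT prompts against a running vLLM server at a fixed request rate (QPS)
$ python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-405B-FP8 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 16 \
    --num-prompts 1000

# Offline: synthetic prompts with fixed input and output lengths; the prompt count plays the role of the batch
$ python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-405B-FP8 \
    --input-len 32784 \
    --output-len 128 \
    --num-prompts 64 \
    --tensor-parallel-size 8
```

</div>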

## Observations

### Cost per token

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-cpt.png?raw=true" width="750">

As prompt and batch sizes grow, the NVIDIA H100 reaches memory limits, causing a sharp drop in cost-effectiveness. In
contrast, a single FP8 replica on 8xMI300x is the most cost-efficient configuration for large prompts.

For large prompts, two parallel replicas running on 4xMI300x lose their cost advantage compared to a single replica on
8xMI300x. The latter offers 51% more memory for the KV cache, improving throughput and reducing cost per token.
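
As a rough sanity check on that memory argument, the KV-cache budget can be approximated as total GPU memory minus the
model weights (about 405 GB at FP8). The figures below are a back-of-the-envelope sketch that ignores activations and
runtime overhead, which is why the gap comes out slightly above the 51% reported above.

<div class="termy">

```shell
# Approximate KV-cache budget = total GPU memory - FP8 weights (~405 GB)
$ echo "$(( 8 * 192 - 405 )) GB"   # single replica on 8xMI300x
1131 GB
$ echo "$(( 4 * 192 - 405 )) GB"   # each of the two replicas on 4xMI300x
363 GB
```

</div>

Aggregated over both 4xMI300x replicas, that is roughly 726 GB of KV-cache memory versus roughly 1131 GB for the single
8xMI300x replica, since the second replica spends another ~405 GB on a duplicate copy of the weights.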

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-online-requests.png?raw=true" width="750">

While 4xMI300x is a cost-effective alternative to 8xH100 for smaller load profiles, it underperforms in online serving.
8xH100 SXM5 processes 74% more requests per second and reduces TTFT by at least 50% at all QPS levels.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-online-ttft.png?raw=true" width="750">

### Throughput

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-throughput.png?raw=true" width="750">

With large prompts and batch sizes, two replicas on 4xMI300x GPUs hit memory saturation once the total token count
(prompt length × batch size) exceeds what the available KV-cache memory can hold. This forces the inference engine to
compute KV tensors on the fly or offload them to CPU memory, degrading throughput.

In [Lambda Labs](https://lambdalabs.com/blog/partner-spotlight-evaluating-nvidia-h200-gpus-for-ai-inference-with-baseten)’
benchmark, an 8xH200 setup processed 3.4 times more tokens per second than an 8xH100. Extrapolating to our
setup, an 8xH200 would process around 2,186 tokens per second (3.4 × 643), though still lower than 8xMI300x.

|                           | AMD MI300x | NVIDIA H200 |
|---------------------------|------------|-------------|
| **GPU Memory**            | 192 GB     | 141 GB      |
| **Memory Type**           | HBM3       | HBM3e       |
| **Peak Memory Bandwidth** | 5.3TB/s    | 4.8TB/s     |
| **TFLOPS (FP8)**          | 2610       | 1979        |

#### Replicas on 4xMI300x

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-throughput-2048.png?raw=true" width="750">

Running two replicas on 4xMI300x delivers better throughput for small to medium prompts than a single replica on
8xMI300x.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-throughput-32784.png?raw=true" width="750">

This boost comes from distributing the Llama 3.1 405B model across four GPUs, enabling parallel execution. For
small prompts, a single replica underutilizes the GPUs; running two replicas doubles the aggregate batch size,
improving GPU utilization and efficiency.
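
As a sketch, such a two-replica layout boils down to two vLLM servers, each pinned to four GPUs and placed behind any
HTTP load balancer. The device-pinning variable, ports, and backgrounding below are illustrative assumptions (ROCm
builds typically honor `HIP_VISIBLE_DEVICES`), not the exact launch procedure used in this benchmark.

<div class="termy">

```shell
# Replica 1 on GPUs 0-3
$ HIP_VISIBLE_DEVICES=0,1,2,3 vllm serve amd/Llama-3.1-405B-Instruct-FP8-KV \
    -tp 4 --port 8000 &

# Replica 2 on GPUs 4-7
$ HIP_VISIBLE_DEVICES=4,5,6,7 vllm serve amd/Llama-3.1-405B-Instruct-FP8-KV \
    -tp 4 --port 8001 &
```

</div>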

### Time To First Token

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-ttft-qps-1000.png?raw=true" width="750">

The 4xMI300x setup provides 768 GB of memory (4 GPUs × 192 GB each), compared to 640 GB with 8xH100 (8 GPUs × 80 GB
each). However, at 1000 QPS, TTFT for 4xMI300x is over twice as long as for 8xH100.

This difference arises during the prefill stage, where KV tensors for the input prompts are computed. Because prefill
parallelizes across GPUs, the 8xH100 configuration distributes this compute over twice as many GPUs, reducing
computation time.

Despite offering more memory, 4xMI300x lacks the parallelism of 8xH100, leading to longer TTFT.

### Time to Serve 1 Request

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h100-mi300x-inference-benchmark-time-1-request.png?raw=true" width="750">

Processing a single large prompt request with 8xMI300x takes around 11.25 seconds. This latency is mainly due to
computational demands during the prefill phase, where KV tensors are computed.

Optimizations like [automatic prefix caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html){:target="_blank"}
could help reduce this time, but are outside the scope of this benchmark.
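
For workloads where many requests share a common prefix (system prompts, long documents queried repeatedly), enabling
it in vLLM is a single flag; how much it helps depends on the actual prefix overlap, which we did not measure. The
command below is a minimal sketch, not part of the benchmark configuration.

<div class="termy">

```shell
# Reuse cached KV blocks for repeated prompt prefixes
$ vllm serve meta-llama/Llama-3.1-405B-FP8 -tp 8 \
    --enable-prefix-caching
```

</div>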

## Benchmark notes

### Benchmark setup

The script used in this benchmark was designed for large prompts in offline inference. A different script tailored for
online inference would provide more accurate insights.

### Batch size

We compared throughput at batch size 16 for 8xH100 and batch size 64 for 8xMI300x. The 8xH100 setup begins to struggle
with batch size 16 due to memory saturation, resulting in slower generation times.

### Model checkpoints

For AMD MI300x, we used [`amd/Llama-3.1-405B-Instruct-FP8-KV` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV){:target="_blank"}
to achieve optimal performance, relying on AMD for quantization.

### vLLM configuration

To maximize inference performance on AMD MI300x, we adjusted the following vLLM arguments:

<div class="termy">

```shell
$ VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve \
    meta-llama/Llama-3.1-405B-FP8 -tp 8 \
    --max-seq-len-to-capture 16384 \
    --served-model-name meta-llama/Llama-3.1-405B-FP8 \
    --enable-chunked-prefill=False \
    --num-scheduler-steps 15 \
    --max-num-seqs 1024
```

</div>

Our benchmark focused on testing inference with tensor parallelism. Integrating tensor and pipeline parallelism could
provide additional insights.
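
For example, a hybrid layout could shard each layer across four GPUs while splitting the layer stack into two pipeline
stages. The sketch below only illustrates the relevant vLLM flags; it is not a configuration we benchmarked.

<div class="termy">

```shell
# Hybrid parallelism: tensor parallelism within 4 GPUs, pipeline parallelism across 2 stages (4 x 2 = 8 GPUs)
$ vllm serve meta-llama/Llama-3.1-405B-FP8 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2
```

</div>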

## On B200, MI325x, and MI350x

The MI325x offers 64GB more HBM and 0.7TB/s higher bandwidth than MI300x. However, because it has the same FP8 TFLOPS, it
doesn't provide significant compute gains, positioning it against NVIDIA's H200.

The NVIDIA B200 outperforms MI300x and MI325x with more TFLOPS and higher peak memory bandwidth, resulting in lower TTFT
by reducing compute time for KV tensors and memory transfer times during the decode stage. We expect the B200 to
challenge MI325x, as long as memory saturation is avoided.

Notably, future GPUs from AMD and NVIDIA are expected to support FP4 and FP6, improving throughput, latency, and
cost-efficiency.

|                           | AMD MI300x | AMD MI325x | AMD MI350x    | NVIDIA B200   |
|---------------------------|------------|------------|---------------|---------------|
| **GPU Memory**            | 192 GB     | 256 GB     | 288 GB        | 192 GB        |
| **Memory Type**           | HBM3       | HBM3e      |               | HBM3e         |
| **Peak Memory Bandwidth** | 5.3TB/s    | 6TB/s      |               | 8TB/s         |
| **TFLOPS (FP8)**          | 2610       | 2610       |               | 4500          |
| **Low precision**         | FP8        | FP8        | FP4, FP6, FP8 | FP4, FP6, FP8 |

## Thanks to our friends

### Hot Aisle

[Hot Aisle :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/){:target="_blank"} sponsored this benchmark by providing access to 8x MI300x hardware. We’re deeply grateful for their support.

If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Hot Aisle. With `dstack`, accessing
your cluster via SSH is seamless and straightforward.

### Lambda

[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} sponsored this benchmark with credits for on-demand 8x H100 instances.
We’re truly thankful for their support.

For top-tier cloud compute with NVIDIA GPUs, Lambda is an excellent choice. Once set up, you can easily provision
compute, manage clusters, and orchestrate your AI workloads using `dstack`.