
Commit 23adfbc: Add blog post about PTPC FP8 on ROCm

Signed-off-by: tanpinsiang <[email protected]>
1 parent 3128f43 commit 23adfbc

File tree

11 files changed: +304 -0 lines changed


_posts/2025-02-24-ptpc-fp8-rocm.md

Lines changed: 300 additions & 0 deletions
@@ -0,0 +1,300 @@
---
layout: post
title: "Boosting vLLM Performance on AMD ROCm: PTPC-FP8 Quantization Unleashes Speed and Accuracy"
author: "AMD and Embedded LLM"
image: /assets/figures/ptpc/PTPC-tumbnail.png
thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png
share-img: /assets/figures/ptpc/PTPC-tumbnail.png
---
# **Boosting vLLM Performance on AMD ROCm: PTPC-FP8 Quantization Unleashes Speed and Accuracy**

**TL;DR**: vLLM on AMD ROCm now has better FP8 performance!

* **What's new?** [PTPC-FP8 quantization](https://github.com/vllm-project/vllm/pull/12501) is now supported in vLLM (v0.7.3+) on AMD ROCm.
* **Why is it good?** You get speeds similar to other FP8 methods, but with accuracy much closer to the original (BF16) model quality, making it the most accurate FP8 option currently available in vLLM on ROCm.
* **How to use it:**
  1. Install ROCm.
  2. Get the latest vLLM (v0.7.3 or newer).
  3. Add the `--quantization ptpc_fp8` flag when running your Hugging Face model. No need to pre-quantize!

<img align="center" src="/assets/figures/ptpc/PTPC121.png" alt="What is PTPC-FP8" width="90%" height="90%">

**What is PTPC-FP8?** It's a method for quantizing both weights *and* activations to FP8. It uses per-token scaling for activations and per-channel scaling for weights, giving you better accuracy than traditional per-tensor FP8.
## **Introduction**

Large Language Models (LLMs) are revolutionizing how we interact with technology, but their immense computational demands can be a barrier. What if you could run these powerful models faster and more efficiently on your AMD GPUs, without sacrificing accuracy? Now you can! This post introduces a breakthrough: PTPC-FP8 quantization in vLLM, optimized for AMD's ROCm platform. Get ready for near-BF16 accuracy at FP8 speeds, directly using Hugging Face models – no pre-quantization needed! We'll show you how it works, benchmark its performance, and get you started.

**The Challenge of LLM Quantization and the PTPC-FP8 Solution**

Running large language models is computationally expensive. FP8 (8-bit floating-point) offers a compelling solution by reducing memory footprint and accelerating matrix multiplications, but traditional quantization approaches face a critical challenge with LLMs.

**The Outlier Problem**

LLMs develop activation outliers as they scale beyond certain sizes. These unusually large values create significant quantization challenges:

- Most values receive few effective bits of precision when using per-tensor quantization (see the short sketch after this list)
- Outliers appear persistently in specific channels across different tokens
- While weights are relatively uniform and easy to quantize, activations are not
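
A few lines of PyTorch make the first point concrete. This is a toy sketch, not vLLM code: the tensor shapes, the outlier channel index, and the `FP8_MAX` constant (taken from the OCP `float8_e4m3fn` format) are illustrative assumptions.

```python
import torch

# Toy illustration: one persistent outlier channel dictates the per-tensor scale.
x = torch.randn(8, 4096)   # "typical" activations, roughly N(0, 1)
x[:, 42] *= 100.0          # a single outlier channel, as observed in large LLMs

FP8_MAX = 448.0  # max magnitude of float8_e4m3fn (assumption for illustration)
scale_with_outlier = x.abs().max() / FP8_MAX
scale_without_outlier = x[:, torch.arange(x.shape[1]) != 42].abs().max() / FP8_MAX

# The shared per-tensor scale is tens of times coarser than the non-outlier
# values need, so most activations get far fewer effective bits of precision.
print(scale_with_outlier / scale_without_outlier)
```
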
**PTPC: A Precision-Targeted Approach**

PTPC-FP8 (Per-Token-Activation, Per-Channel-Weight FP8) addresses this challenge by using tailored scaling factors based on three key observations:

1. Outliers consistently appear in the same channels
2. Channel magnitudes within a token vary widely
3. The same channel's magnitude across different tokens remains relatively stable

This insight led to a dual-granularity approach:

* **Per-Token Activation Quantization**: Each input token receives its own scaling factor
* **Per-Channel Weight Quantization**: Each weight column gets a unique scaling factor

<img align="right" src="/assets/figures/ptpc/PTPC-Diagram.png" alt="Per-Token Activation + Per-Channel Weight Quantization" width="50%" height="50%">

**Understanding the Diagram**

The illustration shows two quantization approaches:

**Tensor Dimensions (Both Methods):**

- **X**: Input activation tensor (T×Ci)
- **W**: Weight tensor (Ci×Co)
- **T**: Token sequence length
- **Ci/Co**: Input/output channels
- **\***: Matrix multiplication

**Scaling Factors:**

- **Top (Per-Tensor)**: Single scalars ΔX[1] and ΔW[1] for entire tensors
- **Bottom (PTPC)**: Vector ΔX[T×1] with one scale per token and ΔW[1×Co] with one scale per output channel

This granular scaling approach allows PTPC-FP8 to achieve accuracy close to BF16 while maintaining the speed and memory benefits of 8-bit computation.
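
To make the two granularities concrete, here is a minimal PyTorch sketch of how the scales ΔX and ΔW could be derived. It is illustrative only rather than vLLM's kernel code, and the `FP8_MAX` constant assumes the OCP `float8_e4m3fn` format (MI300 hardware uses its own native FP8 variant).

```python
import torch

FP8_MAX = 448.0  # max magnitude representable in float8_e4m3fn

def quantize_per_token(x: torch.Tensor):
    """x: [T, Ci] activations -> FP8 tensor plus one scale per token (ΔX of shape [T, 1])."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

def quantize_per_channel(w: torch.Tensor):
    """w: [Ci, Co] weights -> FP8 tensor plus one scale per output channel (ΔW of shape [1, Co])."""
    scale = w.abs().amax(dim=0, keepdim=True).clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

x_fp8, dx = quantize_per_token(torch.randn(16, 4096))       # ΔX: [16, 1]
w_fp8, dw = quantize_per_channel(torch.randn(4096, 11008))  # ΔW: [1, 11008]
```
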
## **Deep Dive: How PTPC-FP8 Works in vLLM (and the Fused Kernel)**

**Unlocking FP8 Speed: The Fused Rowwise Scaled GEMM**

PTPC-FP8's fine-grained scaling could slow things down without proper optimization. The key to maintaining speed is AMD ROCm's implementation of a **fused FP8 rowwise scaled GEMM** operation.

**The Challenge: 2-Step vs. Fused Approach**

Without optimization, matrix multiplication with per-token and per-channel scaling would require two costly steps:

```python
# Naive 2-step approach:
output = torch._scaled_mm(input, weight)         # Step 1: FP8 GEMM
output = output * token_scales * channel_scales  # Step 2: Apply scaling factors
```

This creates a performance bottleneck:

- Write large intermediate results to memory
- Read them back for scaling operations
- Waste memory bandwidth and compute cycles

**The Solution: Fusion**

The fused approach combines matrix multiplication and scaling into a single hardware operation:

```python
# Optimized fused operation:
output = torch._scaled_mm(input, weight,
                          scale_a=token_scales,
                          scale_b=channel_scales)
```

<img align="center" src="/assets/figures/ptpc/FusedGEMM.svg" alt="Fused GEMM Operation" width="90%" height="90%">

**Why This Matters**

This fusion leverages AMD GPUs' specialized hardware (particularly on MI300X with native FP8 support):

- **Memory Efficiency**: Scaling happens within on-chip memory before writing results
- **Computational Efficiency**: Eliminates redundant operations
- **Performance Boost**: Our tests show up to 2.5× speedup compared to the naive implementation

The fused operation makes PTPC-FP8 practical for real-world deployment, eliminating the performance penalty of using more granular scaling factors while maintaining accuracy benefits.
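
For reference, here is a self-contained sketch of what a rowwise-scaled FP8 GEMM call can look like through PyTorch. Treat it as an assumption-laden illustration rather than vLLM's actual code path: `torch._scaled_mm` is a private PyTorch API whose signature has shifted across releases, and rowwise scales require hardware with native FP8 support (such as MI300X).

```python
import torch

# Sketch only: rowwise-scaled FP8 GEMM via the private torch._scaled_mm API
# (requires a GPU with native FP8 support; shapes and values are made up).
M, K, N = 16, 4096, 4096
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)      # activations, row-major
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # weights, column-major [K, N]

token_scales = torch.rand(M, 1, device="cuda", dtype=torch.float32)    # one scale per token
channel_scales = torch.rand(1, N, device="cuda", dtype=torch.float32)  # one scale per output channel

out = torch._scaled_mm(a, b,
                       scale_a=token_scales,
                       scale_b=channel_scales,
                       out_dtype=torch.bfloat16)  # scaling fused into the GEMM epilogue
```
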
## **Benchmarks and Results (MI300X GPUs)**

**Benchmarking PTPC-FP8: Speed and Accuracy on MI300X**

We extensively benchmarked PTPC-FP8 using vLLM on AMD MI300X GPUs (commit `4ea48fb35cf67d61a1c3f18e3981c362e1d8e26f`). Here's what we found:

**1\. Throughput Comparison (PTPC-FP8 vs. Per-Tensor FP8):**

* **Model:** Llama-3.1-70B-Instruct
* **Dataset:** ShareGPT
* **GPU:** 1x MI300X
* **Result:** PTPC-FP8 achieves virtually identical throughput to per-tensor FP8 (even slightly *better* – a 1.01x improvement). This demonstrates that the fused kernel completely overcomes the potential overhead of PTPC-FP8's more complex scaling.

<img align="center" src="/assets/figures/ptpc/PTPCReqs.svg" alt="Throughput in Reqs/s across various input-output sequence length of Llama-3.1-70B-Instruct" width="90%" height="50%">

<img align="center" src="/assets/figures/ptpc/PTPCSpeedup.svg" alt="Request/s throughput gain over FP8 per-tensor quantization across different input and output token lengths" width="90%" height="50%">

**2.1. Accuracy: Perplexity (Lower is Better)**

* **Model:** Llama-3.1-8B-Instruct
* **Dataset:** Wikitext
* **Setup:** 2× MI300X GPUs with tensor parallelism

**Understanding Perplexity: The Prediction Power Test**

Think of perplexity as a measure of how "confused" the model is when predicting text. Like a student taking a quiz:

- **Lower perplexity = Better predictions** (the model confidently assigns high probability to the correct next words)
- **Higher perplexity = More uncertainty** (the model is frequently surprised by what comes next)

A small increase in perplexity (even 0.1) can indicate meaningful degradation in model quality, especially for large language models that have been extensively optimized.
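
Concretely, perplexity is the exponential of the average negative log-likelihood per word. A toy calculation with made-up log-probabilities (not benchmark data):

```python
import math

# Made-up per-word log-probabilities (natural log) for a 4-word continuation.
log_probs = [-2.1, -0.7, -1.5, -3.0]

perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(f"perplexity = {perplexity:.2f}")  # ~6.2: as "confused" as picking among ~6 equally likely words
```
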
**Results: PTPC-FP8 Maintains BF16-Like Quality**

<img align="right" src="/assets/figures/ptpc/PerplexityBits.png" alt="bits and byte perplexity" width="50%" height="50%">

<img align="right" src="/assets/figures/ptpc/Perplexitywords.png" alt="Word Perplexity Comparison" width="50%" height="50%">

| Precision | Word Perplexity | % Degradation |
|:----------|:----------------|:--------------|
| BF16 (baseline) | 9.4281 | - |
| PTPC-FP8 | 9.5093 | 0.86% |
| Standard FP8 | 9.5124 | 0.89% |

As shown in both the table and chart:

1. **PTPC-FP8 outperforms standard FP8** quantization (9.5093 vs 9.5124)
2. **The gap to BF16 is minimal** - only 0.86% degradation from the full-precision baseline
3. **Byte-level metrics** (bits_per_byte and byte_perplexity) show the same pattern of results

**Why This Matters:** While standard FP8 already provides decent results, PTPC-FP8's lower perplexity indicates it better preserves the model's ability to make accurate predictions. This is especially important for complex reasoning and generation tasks, where small quality drops can compound into noticeable differences in output quality.

**2.2. Accuracy on GSM8K: Testing Mathematical Reasoning**

**What is GSM8K and Why It Matters**

GSM8K tests a model's ability to solve grade school math word problems – one of the most challenging tasks for LLMs. Unlike simple text prediction, these problems require:

- Multi-step reasoning
- Numerical accuracy
- Logical consistency

This benchmark provides a strong indicator of whether quantization preserves a model's reasoning abilities.

**Understanding the Results**

We measured accuracy using two methods (a simplified sketch of each follows the list):

- **Flexible-extract**: Accepts answers if the correct number appears anywhere in the response
- **Strict-match**: Requires the exact answer in the expected format
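
Roughly speaking, the two answer filters behave like the sketch below. This is a simplification for illustration; the exact regular expressions live in lm-evaluation-harness's GSM8K task configuration and may differ in detail.

```python
import re

response = "She sells 16 - 3 - 4 = 9 eggs a day, so she makes 9 * 2 = 18 dollars. #### 18"

# Flexible-extract: take the last number appearing anywhere in the response.
numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
flexible_answer = numbers[-1] if numbers else None   # "18"

# Strict-match: require the GSM8K-style "#### <answer>" format.
m = re.search(r"####\s*(-?\d+)", response)
strict_answer = m.group(1) if m else None            # "18"
```
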
<img align="center" src="/assets/figures/ptpc/GSM8K8B.png" alt="Accuracy Comparison on Llama-3.1-8B" width="80%" height="80%">

**8B Model Results at a Glance:**

| Method | Strict-match Accuracy | % of BF16 Performance |
|:-------|:----------------------|:----------------------|
| BF16 (baseline) | 73.2% | 100% |
| PTPC-FP8 | 70.8% | 96.7% |
| Standard FP8 | 69.2% | 94.5% |

**70B Model Results:**

<img align="center" src="/assets/figures/ptpc/GSM8K70B.png" alt="Accuracy Comparison on Llama-3.1-70B" width="80%" height="80%">

For the larger 70B model:

- PTPC-FP8 achieves **87.3%** strict-match accuracy
- This is actually **slightly better** than BF16's 86.3%
- Both outperform standard FP8 in strict-match conditions

**Why These Results Matter**

1. **Preservation of reasoning abilities**: Mathematical reasoning is often the first capability to degrade with quantization

2. **PTPC-FP8 consistently outperforms standard FP8** across both model sizes

3. **Near-BF16 quality** with substantially reduced memory and improved performance

4. **Scaling advantage**: The performance gap between quantization methods narrows as model size increases, suggesting PTPC-FP8 is especially valuable for large models

These results demonstrate that PTPC-FP8 quantization preserves the model's ability to perform complex reasoning tasks while delivering the speed and efficiency benefits of 8-bit precision.
## **Getting Started**

1. **Install ROCm:** Make sure you have a recent version.
2. **Get the latest vLLM:** Clone the repository, build the ROCm Docker image, and start exploring this new feature:

```bash
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
$ docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v <path/to/model>:/app/model \
    vllm-rocm \
    bash
```

3. **Run vLLM with the `--quantization ptpc_fp8` flag:**

```bash
VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve <your-model> --max-seq-len-to-capture 16384 --enable-chunked-prefill=False --num-scheduler-steps 15 --max-num-seqs 1024 --quantization ptpc_fp8
```

(Replace `<your-model>` with any Hugging Face model; vLLM will quantize the weights on the fly, so no pre-quantized checkpoint is needed.)
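
If you prefer offline batch inference to a server, the same method can be selected through vLLM's Python API. A minimal sketch (the model name and sampling settings are placeholders; the key part is `quantization="ptpc_fp8"`):

```python
from vllm import LLM, SamplingParams

# Weights are quantized on the fly at load time; no pre-quantized checkpoint needed.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model
    quantization="ptpc_fp8",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```
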
## **Conclusion: The Accuracy-Speed Sweet Spot**

PTPC-FP8 quantization in vLLM on AMD ROCm represents a significant step towards democratizing access to powerful LLMs. By making near-BF16 accuracy achievable at FP8 speeds, we're breaking down the computational barriers that have limited wider adoption. This advancement empowers a broader community – from individual researchers to resource-constrained organizations – to leverage the power of large language models on accessible AMD hardware. We invite you to explore PTPC-FP8, share your experiences, contribute to the vLLM project, and help us build a future where efficient and accurate AI is available to everyone.

## **Appendix**

**lm-evaluation-harness Commands (Wikitext Perplexity):**
```bash
# Unquantized (Bfloat16)
MODEL=meta-llama/Llama-3.1-8B-Instruct
HIP_VISIBLE_DEVICES=0,1 lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,tensor_parallel_size=2,kv_cache_dtype=auto,max_model_len=2048,gpu_memory_utilization=0.6 \
  --tasks wikitext --batch_size 16

# Per-Tensor FP8 Quantization
MODEL=meta-llama/Llama-3.1-8B-Instruct
HIP_VISIBLE_DEVICES=0,1 lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,tensor_parallel_size=2,quantization=fp8,kv_cache_dtype=fp8_e4m3,max_model_len=2048,gpu_memory_utilization=0.6 \
  --tasks wikitext --batch_size 16

# Per-Token-Activation Per-Channel-Weight FP8 Quantization
MODEL=meta-llama/Llama-3.1-8B-Instruct
HIP_VISIBLE_DEVICES=0,1 lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,tensor_parallel_size=2,quantization=ptpc_fp8,kv_cache_dtype=fp8_e4m3,max_model_len=2048,gpu_memory_utilization=0.6 \
  --tasks wikitext --batch_size 16
```

**lm-evaluation-harness Commands (GSM8K, 8B Model - adjust for 70B):**

```bash
# FP8 (Per-Tensor)
MODEL=/app/model/Llama-3.1-8B-Instruct/  # Or Llama-3.1-70B-Instruct
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,quantization=fp8,kv_cache_dtype=fp8_e4m3 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

# PTPC FP8
MODEL=/app/model/Llama-3.1-8B-Instruct/  # Or Llama-3.1-70B-Instruct
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,quantization=ptpc_fp8,kv_cache_dtype=fp8_e4m3 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

# BF16
MODEL=/app/model/Llama-3.1-8B-Instruct/  # Or Llama-3.1-70B-Instruct
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,kv_cache_dtype=auto \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
```

assets/figures/ptpc/FusedGEMM.svg

Lines changed: 2 additions & 0 deletions

assets/figures/ptpc/GSM8K70B.png

43.9 KB

assets/figures/ptpc/GSM8K8B.png

43 KB

assets/figures/ptpc/PTPC-Diagram.png

58 KB

assets/figures/ptpc/PTPC-tumbnail.png

47.6 KB

assets/figures/ptpc/PTPC121.png

26.7 KB

assets/figures/ptpc/PTPCReqs.svg

Lines changed: 1 addition & 0 deletions
