
Commit 711a2d6

tjtanaa authored and tanpinsiang committed

fix headers

Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tanpinsiang <[email protected]>

1 parent 1a0c1c3 commit 711a2d6

File tree: 1 file changed (+21, -24)

_posts/2025-02-24-ptpc-fp8-rocm.md

Lines changed: 21 additions & 24 deletions
@@ -21,23 +21,23 @@ share-img: /assets/figures/ptpc/PTPC-tumbnail.png

 **What is PTPC-FP8?** It's a method for FP8 weights *and* activations quantization. It uses per-token scaling for activations and per-channel scaling for weights, giving you better accuracy than traditional per-tensor FP8.

-## **Introduction**
+## Introduction

 Large Language Models (LLMs) are revolutionizing how we interact with technology, but their immense computational demands can be a barrier. What if you could run these powerful models faster and more efficiently on your AMD GPUs, without sacrificing accuracy? Now you can! This post introduces a breakthrough: PTPC-FP8 quantization in vLLM, optimized for AMD's ROCm platform. Get ready for near-BF16 accuracy at FP8 speeds, directly using Hugging Face models – no pre-quantization needed! We'll show you how it works, benchmark its performance, and get you started.

-**The Challenge of LLM Quantization and the PTPC-FP8 Solution**
+### The Challenge of LLM Quantization and the PTPC-FP8 Solution

 Running large language models is computationally expensive. FP8 (8-bit floating-point) offers a compelling solution by reducing memory footprint and accelerating matrix multiplications, but traditional quantization approaches face a critical challenge with LLMs.

-**The Outlier Problem**
+#### The Outlier Problem

 LLMs develop activation outliers as they scale beyond certain sizes. These unusually large values create significant quantization challenges:

 - Most values receive few effective bits of precision when using per-tensor quantization
 - Outliers appear persistently in specific channels across different tokens
 - While weights are relatively uniform and easy to quantize, activations are not

-**PTPC: A Precision-Targeted Approach**
+#### PTPC: A Precision-Targeted Approach

 PTPC-FP8 (Per-Token-Activation, Per-Channel-Weight FP8) addresses this challenge by using tailored scaling factors based on three key observations:

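To make the dual granularity described in the hunk above concrete, here is a minimal, hedged PyTorch sketch (my own illustration, not code from the post or from vLLM) of how one FP8 scale per activation token and one per weight output channel could be computed; all tensor names and shapes are invented for the example.

```python
import torch

# Illustrative tensors (names and shapes are made up for this sketch)
x = torch.randn(4, 8)        # activations: [num_tokens, hidden_dim]
w = torch.randn(16, 8)       # weights:     [out_channels, hidden_dim]

FP8 = torch.float8_e4m3fnuz  # FP8 format with native support on MI300X
FP8_MAX = torch.finfo(FP8).max

# Per-token scaling: one scale per activation row (token)
x_scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX   # [4, 1]
# Per-channel scaling: one scale per weight output channel
w_scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX   # [16, 1]

# Quantize: each row is mapped into the FP8 range by its own scale,
# so a single outlier-heavy token no longer compresses everyone else's precision
x_fp8 = (x / x_scale).to(FP8)
w_fp8 = (w / w_scale).to(FP8)
```
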
@@ -51,7 +51,7 @@ This insight led to a dual-granularity approach:

 <img align="right" src="/assets/figures/ptpc/PTPC-Diagram.png" alt="Per-Token Activation + Per-Channel Weight Quantization" width="50%" height="50%">

-**Understanding the Diagram**
+#### Understanding the Diagram

 The illustration shows two quantization approaches:

@@ -68,12 +68,11 @@ The illustration shows two quantization approaches:

 This granular scaling approach allows PTPC-FP8 to achieve accuracy close to BF16 while maintaining the speed and memory benefits of 8-bit computation.

-## **Deep Dive: How PTPC-FP8 Works in vLLM (and the Fused Kernel)**
-**Unlocking FP8 Speed: The Fused Rowwise Scaled GEMM**
+## Deep Dive: How PTPC-FP8 Works in vLLM (and the Fused Kernel)

 PTPC-FP8's fine-grained scaling could slow things down without proper optimization. The key to maintaining speed is AMD ROCm's implementation of a **fused FP8 rowwise scaled GEMM** operation.

-**The Challenge: 2-Step vs. Fused Approach**
+### The Challenge: 2-Step vs. Fused Approach

 Without optimization, matrix multiplication with per-token and per-channel scaling would require two costly steps:

@@ -88,7 +87,7 @@ This creates a performance bottleneck:
 - Read them back for scaling operations
 - Waste memory bandwidth and compute cycles

-**The Solution: Fusion**
+### The Solution: Fusion

 The fused approach combines matrix multiplication and scaling into a single hardware operation:

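For intuition, here is a hedged sketch of the unfused 2-step flow this hunk contrasts against, reusing the hypothetical `x_fp8`, `w_fp8`, `x_scale`, and `w_scale` tensors from the earlier sketch: a full-width intermediate is materialized and then re-read just to apply the scales.

```python
# Step 1: a plain GEMM on the quantized values produces an unscaled,
#         full-precision intermediate (FP8 matmul is emulated here by
#         upcasting, since torch.mm has no FP8 path on CPU).
tmp = torch.mm(x_fp8.float(), w_fp8.t().float())             # [4, 16] FP32 intermediate

# Step 2: the intermediate is read back and rescaled by the outer
#         product of per-token and per-channel scales.
out_2step = (tmp * x_scale * w_scale.t()).to(torch.bfloat16)
```
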
@@ -101,7 +100,7 @@ output = torch._scaled_mm(input, weight,

 <img align="center" src="/assets/figures/ptpc/FusedGEMM.svg" alt="Fused GEMM Operation" width="90%" height="90%">

-**Why This Matters**
+### Why This Matters

 This fusion leverages AMD GPUs' specialized hardware (particularly on MI300X with native FP8 support):

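And here is the corresponding fused call, again continuing the tensors from the sketches above. `torch._scaled_mm` is the operator shown in the post's own snippet, but it is a private API whose exact argument list and rowwise-scale support vary across PyTorch releases, so treat this as a version-dependent sketch rather than a drop-in line.

```python
# Fused GEMM + rescale: the per-row and per-column scales are applied inside
# the kernel, so no full-precision intermediate is written out and read back.
# Requires a GPU with FP8 support (e.g. MI300X) and a recent PyTorch.
x_fp8, w_fp8 = x_fp8.cuda(), w_fp8.cuda()
x_scale, w_scale = x_scale.cuda(), w_scale.cuda()

out_fused = torch._scaled_mm(
    x_fp8,                     # [4, 8]  FP8 activations, row-major
    w_fp8.t(),                 # [8, 16] FP8 weights, column-major view
    scale_a=x_scale,           # [4, 1]  per-token scales
    scale_b=w_scale.t(),       # [1, 16] per-channel scales
    out_dtype=torch.bfloat16,  # dequantized output comes straight out of the kernel
)
```
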
@@ -111,13 +110,11 @@ This fusion leverages AMD GPUs' specialized hardware (particularly on MI300X wit

 The fused operation makes PTPC-FP8 practical for real-world deployment, eliminating the performance penalty of using more granular scaling factors while maintaining accuracy benefits.

-## **Benchmarks and Results (MI300X GPUs)**
-
-**Benchmarking PTPC-FP8: Speed and Accuracy on MI300X**
+## Benchmarking PTPC-FP8: Speed and Accuracy on MI300X

 We extensively benchmarked PTPC-FP8 using vLLM on AMD MI300X GPUs (commit `4ea48fb35cf67d61a1c3f18e3981c362e1d8e26f`). Here's what we found:

-**1\. Throughput Comparison (PTPC-FP8 vs. Per-Tensor FP8):**
+### 1. Throughput Comparison (PTPC-FP8 vs. Per-Tensor FP8):

 * **Model:** Llama-3.1-70B-Instruct
 * **Dataset:** SharedGPT
@@ -129,21 +126,21 @@ We extensively benchmarked PTPC-FP8 using vLLM on AMD MI300X GPUs (commit `4ea48
 <img align="center" src="/assets/figures/ptpc/PTPCSpeedup.svg" alt="Request/s Throughput gain over FP8 per-tensor quantization
 across different input token length - output token length" width="90%" height="50%">

-**2.1. Accuracy: Perplexity (Lower is Better)**
+### 2.1. Accuracy: Perplexity (Lower is Better)

 * **Model:** Llama-3.1-8B-Instruct
 * **Dataset:** Wikitext
 * **Setup:** 2× MI300X GPUs with tensor parallelism

-**Understanding Perplexity: The Prediction Power Test**
+#### Understanding Perplexity: The Prediction Power Test

 Think of perplexity as a measure of how "confused" the model is when predicting text. Like a student taking a quiz:
 - **Lower perplexity = Better predictions** (the model confidently assigns high probability to the correct next words)
 - **Higher perplexity = More uncertainty** (the model is frequently surprised by what comes next)

 A small increase in perplexity (even 0.1) can indicate meaningful degradation in model quality, especially for large language models that have been extensively optimized.

-**Results: PTPC-FP8 Maintains BF16-Like Quality**
+#### Results: PTPC-FP8 Maintains BF16-Like Quality

 <img align="right" src="/assets/figures/ptpc/PerplexityBits.png" alt="bits and byte perplexity" width="50%" height="50%">

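As a small gloss on what the perplexity numbers in this hunk summarize (my own illustration, not part of the post): perplexity is the exponential of the average negative log-likelihood the model assigns to each reference token.

```python
import math

# Hypothetical per-token log-probabilities a model assigned to some reference text
token_logprobs = [-0.21, -1.35, -0.07, -2.60, -0.48]

# Perplexity = exp(average negative log-likelihood); lower means less "surprise"
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(round(perplexity, 2))  # ≈ 2.57 for these made-up numbers
```
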
@@ -163,9 +160,9 @@ As shown in both the table and chart:

 **Why This Matters:** While standard FP8 already provides decent results, PTPC-FP8's lower perplexity indicates it better preserves the model's ability to make accurate predictions. This is especially important for complex reasoning and generation tasks, where small quality drops can compound into noticeable differences in output quality.

-**2.2. Accuracy on GSM8K: Testing Mathematical Reasoning**
+### 2.2. Accuracy on GSM8K: Testing Mathematical Reasoning

-**What is GSM8K and Why It Matters**
+#### What is GSM8K and Why It Matters

 GSM8K tests a model's ability to solve grade school math word problems – one of the most challenging tasks for LLMs. Unlike simple text prediction, these problems require:
 - Multi-step reasoning
@@ -174,7 +171,7 @@ GSM8K tests a model's ability to solve grade school math word problems – one o

 This benchmark provides a strong indicator of whether quantization preserves a model's reasoning abilities.

-**Understanding the Results**
+#### Understanding the Results

 We measured accuracy using two methods:
 - **Flexible-extract**: Accepts answers if the correct number appears anywhere in the response
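
To illustrate the difference between the flexible-extract and strict-match scoring modes mentioned in this section, here is a simplified sketch; the regexes are invented for the example and are not the actual patterns used by lm-evaluation-harness.

```python
import re

response = "Adding the apples gives 12, so the final answer is 12."
target = "12"

# Flexible-extract (roughly): correct if the target number appears anywhere
flexible_hit = target in re.findall(r"-?\d+(?:\.\d+)?", response)

# Strict-match (roughly): correct only if the number appears in the expected
# final-answer position and format
match = re.search(r"final answer is (-?\d+(?:\.\d+)?)", response)
strict_hit = match is not None and match.group(1) == target

print(flexible_hit, strict_hit)  # -> True True
```
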
@@ -199,7 +196,7 @@ For the larger 70B model:
 - This is actually **slightly better** than BF16's 86.3%
 - Both outperform standard FP8 in strict-match conditions

-**Why These Results Matter**
+#### Why These Results Matter

 1. **Preservation of reasoning abilities**: Mathematical reasoning is often the first capability to degrade with quantization

@@ -211,7 +208,7 @@ For the larger 70B model:

 These results demonstrate that PTPC-FP8 quantization preserves the model's ability to perform complex reasoning tasks while delivering the speed and efficiency benefits of 8-bit precision.

-## **Getting Started**
+## Getting Started

 1. **Install ROCm:** Make sure you have a recent version.
 2. Clone the latest vLLM commit now! Set up and start exploring this new feature!
@@ -241,11 +238,11 @@ VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve <your-model> --max-seq-len-to-capture 16

 (Replace `<your-model>` with any Hugging Face model; it will automatically quantize the weights on the fly.)

-## **Conclusion: The Accuracy-Speed Sweet Spot**
+## Conclusion: The Accuracy-Speed Sweet Spot

 PTPC-FP8 quantization in vLLM on AMD ROCm represents a significant step towards democratizing access to powerful LLMs. By making near-BF16 accuracy achievable at FP8 speeds, we're breaking down the computational barriers that have limited wider adoption. This advancement empowers a broader community – from individual researchers to resource-constrained organizations – to leverage the power of large language models on accessible AMD hardware. We invite you to explore PTPC-FP8, share your experiences, contribute to the vLLM project, and help us build a future where efficient and accurate AI is available to everyone.

-## **Appendix**
+## Appendix

 **lm-evaluation-harness Commands:**
