
Commit d4275ec

Attempt better styling for toc
1 parent 4d74a7a commit d4275ec

1 file changed (+29, −15 lines)


page.md

Lines changed: 29 additions & 15 deletions
@@ -35,17 +35,27 @@ license: "MIT"
 license_url: "https://opensource.org/license/mit"
 ---

-Table of contents:

-- [fp8: Why?](#fp8-why)
-- [fp8: How?](#fp8-how)
+Quick Jump:
+<div>
+<ol>
+<li><a href="#why">Why?</a></li>
+<li><a href="#how">How?</a></li>
+<li><a href="#saving-a-quantized-checkpoint">Saving a quantized checkpoint</a></li>
+</ol>
+</div>
+
+- [Why?](#why)
+- [How?](#how)
 - [Note on executing fp8 models](#note-on-executing-fp8-models)
 - [fp8 bit format](#fp8-bit-format)
 - [Quantization - scaling to lower precision loss \& handle large values](#quantization---scaling-to-lower-precision-loss--handle-large-values)
 - [Finer grained scale - weight block size](#finer-grained-scale---weight-block-size)
-- [Saving an inference compatible model checkpoint](#saving-an-inference-compatible-model-checkpoint)
+- [Saving a quantized checkpoint](#saving-a-quantized-checkpoint)
+  - [Add the scales to `Linear` layers](#add-the-scales-to-linear-layers)
+  - [Update model config](#update-model-config)

-# fp8: Why?
+# Why?

 tl;dr:

@@ -60,20 +70,20 @@ Starting with NVIDIA H100 GPU, GPUs have *hardware support* for 8 bit floating p

 1. Model takes less GPU RAM => more space for kv cache. Modern inference libraries (like vllm/sglang) will have higher and more stable performance with more space for the kv cache
 2. Model parameters are half as big => less GPU memory bandwidth
-3. Depending on the GPU, fp8 FLOPS are just higher than bf16 FLOPS. E.g. See [H100 specifications](https://www.nvidia.com/en-us/data-center/h100/); bfloat16 has ~2k teraFLOPS and fp8 has ~4k teraFLOPS
+3. Depending on the GPU, fp8 FLOPS are just higher than `bf16` FLOPS. E.g. see the [H100 specifications](https://www.nvidia.com/en-us/data-center/h100/); bfloat16 has ~2k teraFLOPS and fp8 has ~4k teraFLOPS


-# fp8: How?
+# How?

 ## Note on executing fp8 models

-When we talk about fp8 models, we typically only are talking about the **weights being fp8**. The actual execution of the model is still done in `bf16`. So all the **intermediate tensors are still in bf16**, and it's the underlying CUDA kernels that are taking in bf16 tensors and fp8 weights.
+When we talk about `fp8` models, we are typically only talking about the **weights being `fp8`**. The actual execution of the model is still done in `bf16`. So all the **intermediate tensors are still in `bf16`**, and it's the underlying CUDA kernels that take in `bf16` tensors and `fp8` weights.

 **fp8 models still use `bf16` kv cache by default** (since the kv cache stores kv values, which are intermediate tensors).

 ## fp8 bit format

-There are a number of different fp8 formats; the most common is `float8_e4m3fn`. Here are some facts about it:
+There are a number of different `fp8` formats; the most common is `float8_e4m3fn`. Here are some facts about it:

 1. This format has `1` sign bit, `4` bits for exponent (`e4`), and `3` bits for mantissa (`m3`)
 2. Values can be between `[-448, +448]`
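
As a quick aside (an illustration, not part of this commit), the `float8_e4m3fn` facts above are easy to verify in PyTorch; a minimal sketch:

```python
import torch

# float8_e4m3fn: 8 bits total, 1 sign + 4 exponent + 3 mantissa, range [-448, +448]
info = torch.finfo(torch.float8_e4m3fn)
print(info.bits, info.min, info.max)  # 8 -448.0 448.0

# Weights are stored in fp8, activations stay in bf16; a naive (unfused) reference:
w_fp8 = torch.randn(64, 64, dtype=torch.bfloat16).to(torch.float8_e4m3fn)
x_bf16 = torch.randn(8, 64, dtype=torch.bfloat16)
y = x_bf16 @ w_fp8.to(torch.bfloat16)  # real kernels keep w in fp8 and fuse the upcast
```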
@@ -101,8 +111,8 @@ And here is how all the representable values are distributed (notice how there a

 So this leaves us with two questions for quantization:

-1. `bf16` can store values between `[-3.38953e+38, +3.38953e+38]`, how do we fit that into fp8 range of `[-448, +448]`?
-2. How do we take advantage of the distribution of values in fp8?
+1. `bf16` can store values between `[-3.38953e+38, +3.38953e+38]`; how do we fit that into the `fp8` range of `[-448, +448]`?
+2. How do we take advantage of the distribution of values in `fp8`?

 ## Quantization - scaling to lower precision loss & handle large values

@@ -126,7 +136,7 @@ x_dequantized = x.to(torch.bfloat16) * scale

 ## Finer grained scale - weight block size

-Above I showed the scale being a single value, but you can also have it be a tensor. If you look at some popular open source fp8 models they typically use this option.
+Above I showed the scale being a single value, but you can also have it be a tensor. If you look at some popular open source `fp8` models, they typically use this option.

 Why would you do this? To theoretically preserve accuracy, though if the values in your tensor are all relatively close together you won't get much benefit.

@@ -142,11 +152,13 @@ scale = x.abs().amax(dim=[1, 3]) / 448
 assert scale.shape == torch.Size([N // n, K // k])
 ```

-# Saving an inference compatible model checkpoint
+# Saving a quantized checkpoint

 For compatibility with things like vLLM, there are a couple of things we need to do:

-1. Add the `weight_scale` as a parameter to each of the `Linear` layers. This basically means just replace the `Linear` layer with this `PackedLinear` class, where `weight` is the `fp8` tensor, and `weight_scale` is the scale.
+## Add the scales to `Linear` layers
+
+We need to add the previously computed `weight_scale` as a parameter to each of the `Linear` layers. This basically means replacing each `Linear` layer with the custom `PackedLinear` class below, where `weight` is the `fp8` tensor and `weight_scale` is the scale from the previous sections.

 ```python
 class PackedLinear(torch.nn.Module):
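
As a hedged sketch of the swap itself (not the author's code): assuming `PackedLinear` takes `(weight, weight_scale)` as the visible constructor line suggests, and using a simple per-tensor scale for brevity, the replacement could look like this.

```python
import torch

# Assumed reconstruction of PackedLinear (only part of it is visible in this diff);
# the real class may differ.
class PackedLinear(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, weight_scale: torch.Tensor):
        super().__init__()
        self.weight = torch.nn.Parameter(weight, requires_grad=False)
        self.weight_scale = torch.nn.Parameter(weight_scale, requires_grad=False)

# Hypothetical helper: quantize one Linear layer and wrap it.
def quantize_linear(linear: torch.nn.Linear) -> PackedLinear:
    w = linear.weight.data
    weight_scale = w.abs().amax().to(torch.float32) / 448
    weight_fp8 = (w / weight_scale).to(torch.float8_e4m3fn)
    return PackedLinear(weight_fp8, weight_scale)

# Swap top-level Linear children in place (a real pass would recurse through submodules).
model = torch.nn.Sequential(torch.nn.Linear(512, 512, dtype=torch.bfloat16))
for name, child in list(model.named_children()):
    if isinstance(child, torch.nn.Linear):
        setattr(model, name, quantize_linear(child))
```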
@@ -156,7 +168,9 @@ class PackedLinear(torch.nn.Module):
         self.weight_scale = torch.nn.Parameter(weight_scale, requires_grad=False)
 ```

-2. Add a `quantization_config` into the model's config. This will also appear in the `config.json` file in the huggingface repo of the model.
+## Update model config
+
+This part is really easy: just add a `quantization_config` to the model's config. This will also appear in the `config.json` file in the Hugging Face repo of the model.

 ```python
 model.config.quantization_config = {
