
Commit 873426d

chengzeyi and stevhliu authored
Update docs/source/en/optimization/para_attn.md
Co-authored-by: Steven Liu <[email protected]>
1 parent 6d30ba1 commit 873426d

1 file changed: +7 -7 lines changed


docs/source/en/optimization/para_attn.md

Lines changed: 7 additions & 7 deletions
````diff
@@ -142,13 +142,13 @@ First Block Cache reduced the inference speed to 2271.06 seconds compared to the
 
 ### FP8 Quantization
 
-To further speed up the inference and reduce memory usage, we can quantize the model into FP8 with dynamic quantization.
-We must quantize both the activation and weight of the transformer model to utilize the 8-bit **Tensor Cores** on NVIDIA GPUs.
-Here, we use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model respectively.
-The default quantization method is per tensor quantization. If your GPU supports row-wise quantization, you can also try it for better accuracy.
-[diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao) provides a really good tutorial on how to quantize models in `diffusers` and achieve a good speedup.
-Here, we simply install the latest `torchao` that is capable of quantizing FLUX.1-dev and HunyuanVideo.
-If you are not familiar with `torchao` quantization, you can refer to this [documentation](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md).
+FP8 with dynamic quantization further speeds up inference and reduces memory usage. Both the activations and weights must be quantized in order to use the 8-bit [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).
+
+Use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model.
+
+The default quantization method is per tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy.
+
+Install [torchao](https://github.com/pytorch/ao/tree/main) with the command below.
 
 ```bash
 pip3 install -U torch torchao
````
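
The two configs named in the new text come from torchao's `quantize_` API. Below is a minimal sketch of how they could be applied to a FLUX.1-dev pipeline; this is not the code from `para_attn.md`, and the model ID, prompt, and step count are illustrative.

```python
# Minimal sketch, not the code from para_attn.md. Applies the two torchao
# configs named in the diff; assumes a CUDA GPU with FP8 support and a
# recent torchao release that exports these configs.
import torch
from diffusers import FluxPipeline
from torchao.quantization import (
    float8_dynamic_activation_float8_weight,
    float8_weight_only,
    quantize_,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Weight-only FP8 for the text encoder: weights are stored in float8 while
# activations stay in bfloat16.
quantize_(pipe.text_encoder, float8_weight_only())

# Dynamic FP8 for the transformer: activations are quantized on the fly
# alongside the float8 weights so matmuls can run on the 8-bit Tensor Cores.
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

# Per-tensor is the default granularity. On GPUs that support it, row-wise
# quantization may improve accuracy (assumed API, check your torchao version):
# from torchao.quantization import PerRow
# quantize_(pipe.transformer, float8_dynamic_activation_float8_weight(granularity=PerRow()))

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
image.save("flux-fp8.png")
```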

0 commit comments