docs/source/en/optimization/para_attn.md
If you are not familiar with `torchao` quantization, you can refer to this [docu

```
pip3 install -U torch torchao
```
[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) with `mode="max-autotune-no-cudagraphs"` or `mode="max-autotune"` generates and benchmarks candidate kernels and selects the best one for inference. Compilation happens the first time the model is called, so that call can take a long time, but it is worth it once the model has been compiled.
This example only quantizes the transformer model, but you can also quantize the text encoder to reduce memory usage even more.
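This guide's workflow quantizes with `torchao`; as a self-contained stand-in that runs on CPU, the sketch below uses PyTorch's built-in dynamic quantization (`torch.ao.quantization.quantize_dynamic`) on a tiny model. The model here is illustrative, not the pipeline's transformer:

```python
import torch
import torch.nn as nn

# A tiny stand-in for a transformer block; in the guide you would
# quantize the pipeline's transformer (and optionally its text encoder).
model = nn.Sequential(
    nn.Linear(64, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
)

# Dynamic quantization: weights are stored as int8, activations are
# quantized on the fly at inference time, shrinking memory usage.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 64)
with torch.no_grad():
    out = quantized(x)
print(out.shape)  # torch.Size([2, 64])
```

The same idea applies to the text encoder: quantizing it trades a small amount of output fidelity for a further reduction in memory.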
> [!TIP]
> Dynamic quantization can significantly change the distribution of the model output, so you need to change the `residual_diff_threshold` to a larger value for it to take effect.