Commit d4f10ea

[Diffusion fast] add doc for diffusion fast (#6311)
* add doc for diffusion fast
* add entry to _toctree
* Apply suggestions from code review
* fix titlew
* fix: title entry
* add note about fuse_qkv_projections
1 parent 3aba99a commit d4f10ea

File tree

2 files changed: +320 -0 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -19,6 +19,8 @@
     title: Train a diffusion model
   - local: tutorials/using_peft_for_inference
     title: Inference with PEFT
+  - local: tutorials/fast_diffusion
+    title: Accelerate inference of text-to-image diffusion models
   title: Tutorials
 - sections:
   - sections:
```
docs/source/en/tutorials/fast_diffusion.md

Lines changed: 318 additions & 0 deletions
@@ -0,0 +1,318 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Accelerate inference of text-to-image diffusion models

Diffusion models are known to be slower than their counterparts, GANs, because of the iterative and sequential reverse diffusion process. Recent works try to address this limitation with:

* progressive timestep distillation (such as [LCM LoRA](../using-diffusers/inference_with_lcm_lora.md))
* model compression (such as [SSD-1B](https://huggingface.co/segmind/SSD-1B))
* reusing adjacent features of the denoiser (such as [DeepCache](https://github.com/horseee/DeepCache))

In this tutorial, we instead focus on leveraging the power of PyTorch 2 to reduce the inference latency of a text-to-image diffusion pipeline. We will use [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl.md) as a case study, but the techniques we discuss should extend to other text-to-image diffusion pipelines.

## Setup

Make sure you're on the latest version of `diffusers`:

```bash
pip install -U diffusers
```

Then upgrade the other required libraries too:

```bash
pip install -U transformers accelerate peft
```

To benefit from the fastest kernels, use PyTorch nightly. You can find the installation instructions [here](https://pytorch.org/).

To report the numbers shown below, we used an 80GB 400W A100 with its clock rate set to the maximum.

_This tutorial doesn't present the benchmarking code; it focuses on how to perform the optimizations instead. For the full benchmarking code, refer to [https://github.com/huggingface/diffusion-fast](https://github.com/huggingface/diffusion-fast)._

## Baseline

Let's start with a baseline. Disable reduced precision and [`scaled_dot_product_attention`](../optimization/torch2.0.md):

```python
from diffusers import StableDiffusionXLPipeline

# Load the pipeline in full precision and place its model components on CUDA.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")

# Run the attention ops without efficiency optimizations (vanilla attention processor, no SDPA).
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```

This takes 7.36 seconds:

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_0.png" width=500>

</div>

## Running inference in bfloat16

Enable the first optimization: run the inference in a reduced precision, bfloat16.

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Run the attention ops without efficiency optimizations (vanilla attention processor, no SDPA).
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```

bfloat16 reduces the latency from 7.36 seconds to 4.63 seconds:

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_1.png" width=500>

</div>

**Why bfloat16?**

* Using a reduced numerical precision (such as float16 or bfloat16) to run inference doesn't affect the generation quality but significantly improves latency.
* The benefits of bfloat16 compared to float16 are hardware-dependent; modern generations of GPUs tend to favor bfloat16.
* Furthermore, in our experiments, we found bfloat16 to be much more resilient than float16 when used with quantization.

We have a [dedicated guide](../optimization/fp16.md) for running inference in a reduced precision.
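
If you are unsure whether your GPU has native bfloat16 support, you can check at runtime and fall back to float16. This is a minimal sketch, not part of the original benchmark setup:

```python
import torch

# Pick bfloat16 only when the current GPU supports it natively; otherwise fall back to float16.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16

print(f"Using dtype: {dtype}")
```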

## Running attention efficiently

Attention blocks are compute-intensive to run. But with PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0.md), we can run them efficiently.

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```

`scaled_dot_product_attention` improves the latency from 4.63 seconds to 3.31 seconds.

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_2.png" width=500>

</div>

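To get a feel for what the pipeline uses under the hood, here is a minimal, self-contained sketch (not from the original tutorial) of calling `scaled_dot_product_attention` directly:

```python
import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, sequence_length, head_dim).
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

# PyTorch dispatches to an efficient fused kernel (e.g. Flash Attention) when one is available.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

Diffusers dispatches to this operator by default on PyTorch 2, which is why the snippet above no longer calls `set_default_attn_processor`.
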
## Use faster kernels with torch.compile

Compile the UNet and the VAE to benefit from faster kernels. First, configure a few compiler flags:

```python
from diffusers import StableDiffusionXLPipeline
import torch

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
```

For the full list of compiler flags, refer to [this file](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py).

It is also important to change the memory layout of the UNet and the VAE to "channels_last" when compiling them. This ensures maximum speed:

```python
pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
```

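You can verify that the conversion took effect with a quick sanity check (not part of the original tutorial; it assumes the UNet exposes its input convolution as `conv_in`, as the SDXL UNet does):

```python
# True if the weight is now stored in channels_last (NHWC) order.
print(pipe.unet.conv_in.weight.is_contiguous(memory_format=torch.channels_last))
```
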
Then, compile and perform inference:

```python
# Compile the UNet and VAE.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# First call to `pipe` will be slow, subsequent ones will be faster.
image = pipe(prompt, num_inference_steps=30).images[0]
```

`torch.compile` offers different backends and modes. As we're aiming for maximum inference speed, we opt for the inductor backend using the "max-autotune" mode. "max-autotune" uses CUDA graphs and optimizes the compilation graph specifically for latency. Specifying `fullgraph=True` ensures that there are no graph breaks in the underlying model, letting `torch.compile` be used to its fullest potential.

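Because compilation happens lazily on the first call, warm the pipeline up before measuring latency yourself. The following is a minimal timing sketch, not the benchmarking code behind the numbers in this tutorial (that lives in the diffusion-fast repository); it reuses the `pipe` and `prompt` defined above:

```python
import time

# Warm-up runs: the first call triggers compilation and is much slower.
for _ in range(3):
    _ = pipe(prompt, num_inference_steps=30).images[0]

torch.cuda.synchronize()
start = time.perf_counter()
_ = pipe(prompt, num_inference_steps=30).images[0]
torch.cuda.synchronize()
print(f"Latency: {time.perf_counter() - start:.2f} seconds")
```
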
Using SDPA attention and compiling both the UNet and the VAE reduces the latency from 3.31 seconds to 2.54 seconds.

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_3.png" width=500>

</div>

## Combine the projection matrices of attention

Both the UNet and the VAE used in SDXL make use of Transformer-like blocks. A Transformer block consists of attention blocks and feed-forward blocks.

In an attention block, the input is projected into three sub-spaces using three different projection matrices: Q, K, and V. In the naive implementation, these projections are performed separately on the input. But we can horizontally combine the projection matrices into a single matrix and perform the projection in one shot. This increases the size of the matmuls of the input projections and improves the impact of quantization (discussed next), as illustrated by the sketch below.

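Conceptually, fusing the projections means concatenating the three weight matrices and running one larger matmul. The following illustrative sketch shows the equivalence with plain `torch.nn.Linear` layers; it is not how Diffusers implements the fusion internally:

```python
import torch

hidden_dim, seq_len = 640, 1024
x = torch.randn(1, seq_len, hidden_dim)

# Three separate projections (naive implementation).
to_q = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
to_k = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
to_v = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
q, k, v = to_q(x), to_k(x), to_v(x)

# One fused projection: stack the weights and split the output.
to_qkv = torch.nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
to_qkv.weight.data = torch.cat([to_q.weight, to_k.weight, to_v.weight], dim=0)
q_f, k_f, v_f = to_qkv(x).chunk(3, dim=-1)

# The fused projection produces the same result with a single, larger matmul.
print(torch.allclose(q, q_f), torch.allclose(k, k_f), torch.allclose(v, v_f))
```
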
Enabling this kind of computation in Diffusers just takes a single line of code:

```python
pipe.fuse_qkv_projections()
```

It provides a minor boost from 2.54 seconds to 2.52 seconds.

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_4.png" width=500>

</div>

<Tip warning={true}>

Support for `fuse_qkv_projections()` is limited and experimental. As such, it's not available for many non-SD pipelines such as [Kandinsky](../using-diffusers/kandinsky.md). You can refer to [this PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to support this kind of computation.

</Tip>

## Dynamic quantization

Apply [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to both the UNet and the VAE. Quantization adds conversion overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization). If the matmuls are too small, these techniques may degrade performance.

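To build intuition for what dynamic int8 quantization does to a linear layer, here is a small illustrative sketch of the underlying arithmetic (floats stand in for the optimized int8 matmul kernels that torchao actually uses):

```python
import torch

def int8_symmetric(t):
    # Map a float tensor to int8 with a single symmetric scale.
    scale = t.abs().max() / 127
    return torch.clamp((t / scale).round(), -127, 127).to(torch.int8), scale

w = torch.randn(1280, 1280)  # weight of a linear layer, quantized ahead of time
x = torch.randn(16, 1280)    # activations, quantized "dynamically" at runtime

w_int8, w_scale = int8_symmetric(w)
x_int8, x_scale = int8_symmetric(x)

# Real kernels multiply the int8 tensors directly (accumulating in int32);
# floats are used here only to keep the sketch simple.
y_quant = (x_int8.float() @ w_int8.float().t()) * (x_scale * w_scale)
y_full = x @ w.t()
print((y_quant - y_full).abs().max() / y_full.abs().max())  # small relative error
```
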
<Tip>

Through experimentation, we found that certain linear layers in the UNet and the VAE don't benefit from dynamic int8 quantization. You can check out the full code for filtering those layers [here](https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16) (referred to as `dynamic_quant_filter_fn` below).

</Tip>

You will leverage the ultra-lightweight pure PyTorch library [torchao](https://github.com/pytorch-labs/ao) and its user-friendly APIs for quantization.

First, configure all the compiler flags:

```python
from diffusers import StableDiffusionXLPipeline
import torch

# Notice the two new flags at the end.
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True
```

Define the filtering functions:

```python
def dynamic_quant_filter_fn(mod, *args):
    return (
        isinstance(mod, torch.nn.Linear)
        and mod.in_features > 16
        and (mod.in_features, mod.out_features)
        not in [
            (1280, 640),
            (1920, 1280),
            (1920, 640),
            (2048, 1280),
            (2048, 2560),
            (2560, 1280),
            (256, 128),
            (2816, 1280),
            (320, 640),
            (512, 1536),
            (512, 256),
            (512, 512),
            (640, 1280),
            (640, 1920),
            (640, 320),
            (640, 5120),
            (640, 640),
            (960, 320),
            (960, 640),
        ]
    )


def conv_filter_fn(mod, *args):
    return (
        isinstance(mod, torch.nn.Conv2d) and mod.kernel_size == (1, 1) and 128 in [mod.in_channels, mod.out_channels]
    )
```

Then apply all the optimizations discussed so far:

```python
# SDPA + bfloat16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Combine attention projection matrices.
pipe.fuse_qkv_projections()

# Change the memory layout.
pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
```

Since this quantization support is limited to linear layers only, we also turn suitable pointwise (1x1) convolution layers into linear layers to maximize the benefit:

```python
from torchao import swap_conv2d_1x1_to_linear

swap_conv2d_1x1_to_linear(pipe.unet, conv_filter_fn)
swap_conv2d_1x1_to_linear(pipe.vae, conv_filter_fn)
```

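This swap is safe because a 1x1 convolution is just a linear layer applied independently at every spatial position. A quick numerical check of that equivalence (illustrative, not part of the tutorial):

```python
import torch

conv = torch.nn.Conv2d(128, 256, kernel_size=1, bias=False)
linear = torch.nn.Linear(128, 256, bias=False)
# Conv weight has shape (256, 128, 1, 1); drop the trailing 1x1 dims to get a linear weight.
linear.weight.data = conv.weight.data.squeeze(-1).squeeze(-1)

x = torch.randn(1, 128, 32, 32)
y_conv = conv(x)
# Move channels last, apply the linear layer per pixel, and move channels back.
y_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_linear, atol=1e-6))  # True
```
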
Apply dynamic quantization:

```python
from torchao import apply_dynamic_quant

apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)
```

Finally, compile and perform inference:

```python
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```

Applying dynamic quantization improves the latency from 2.52 seconds to 2.43 seconds.

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_5.png" width=500>

</div>
