
Commit a6d2fc2

[docs] Refresh effective and efficient doc (#12134)
* refresh
* init
* feedback
1 parent bc2762c commit a6d2fc2

File tree: 2 files changed, +76 -211 lines changed


docs/source/en/_toctree.yml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 - local: quicktour
   title: Quicktour
 - local: stable_diffusion
-  title: Effective and efficient diffusion
+  title: Basic performance

 - title: DiffusionPipeline
   isExpanded: false

docs/source/en/stable_diffusion.md

Lines changed: 75 additions & 210 deletions
@@ -10,252 +10,117 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Effective and efficient diffusion
-
 [[open-in-colab]]

-Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again.
-
-This is why it's important to get the most *computational* (speed) and *memory* (GPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster.
-
-This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`].
-
-Begin by loading the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) model:
+# Basic performance

-```python
-from diffusers import DiffusionPipeline
-
-model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
-pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
-```
+Diffusion is a random process that is computationally demanding. You may need to run the [`DiffusionPipeline`] several times before getting a desired output. That's why it's important to carefully balance generation speed and memory usage in order to iterate faster.

-The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt:
+This guide recommends some basic performance tips for using the [`DiffusionPipeline`]. Refer to the Inference Optimization docs, such as [Accelerate inference](./optimization/fp16) or [Reduce memory usage](./optimization/memory), for more detailed performance guides.

-```python
-prompt = "portrait photo of a old warrior chief"
-```
-
-## Speed
-
-<Tip>
-
-💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)!
-
-</Tip>
-
-One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module:
-
-```python
-pipeline = pipeline.to("cuda")
-```
+## Memory usage

-To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reusing_seeds):
+Reducing the amount of memory used indirectly speeds up generation and can help a model fit on device.

-```python
+```py
 import torch
+from diffusers import DiffusionPipeline

-generator = torch.Generator("cuda").manual_seed(0)
-```
-
-Now you can generate an image:
+pipeline = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.bfloat16
+).to("cuda")
+pipeline.enable_model_cpu_offload()

-```python
-image = pipeline(prompt, generator=generator).images[0]
-image
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
+print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
 ```
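
The added snippet reports `torch.cuda.max_memory_allocated()` after generation. As a sketch outside the diff (assuming the same SDXL checkpoint and a CUDA device), resetting the peak-memory counters first keeps the reading scoped to a single generation:

```py
import torch
from diffusers import DiffusionPipeline

# Assumed setup mirroring the snippet above: SDXL base checkpoint, bfloat16, CUDA.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
).to("cuda")
pipeline.enable_model_cpu_offload()

prompt = "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California"

# Clear previously recorded peaks so the figure below reflects only this call.
torch.cuda.reset_peak_memory_stats()
pipeline(prompt).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```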

-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_1.png">
-</div>
+## Inference speed

-This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.
+Denoising is the most computationally demanding process during diffusion. Methods that optimize this process accelerate inference. Try the following methods for a speedup.

-Let's start by loading the model in `float16` and generate an image:
+- Add `.to("cuda")` to place the pipeline on a GPU. Placing a model on an accelerator, like a GPU, increases speed because it performs computations in parallel.
+- Set `torch_dtype=torch.bfloat16` to execute the pipeline in half-precision. Reducing the data type precision increases speed because it takes less time to perform computations in a lower precision.

-```python
+```py
 import torch
+import time
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

-pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
-pipeline = pipeline.to("cuda")
-generator = torch.Generator("cuda").manual_seed(0)
-image = pipeline(prompt, generator=generator).images[0]
-image
+pipeline = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.bfloat16
+).to("cuda")
 ```

-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_2.png">
-</div>
-
-This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before!
-
-<Tip>
-
-💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality.
-
-</Tip>
-
-Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method:
-
-```python
-pipeline.scheduler.compatibles
-[
-    diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
-    diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
-    diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
-    diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler,
-    diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
-    diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
-    diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
-    diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
-    diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler,
-    diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler,
-    diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
-    diffusers.schedulers.scheduling_pndm.PNDMScheduler,
-    diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
-    diffusers.schedulers.scheduling_ddim.DDIMScheduler,
-]
-```
-
-The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`~ConfigMixin.from_config`] method to load a new scheduler:
-
-```python
-from diffusers import DPMSolverMultistepScheduler
+- Use a faster scheduler, such as [`DPMSolverMultistepScheduler`], which only requires ~20-25 steps.
+- Set `num_inference_steps` to a lower value. Reducing the number of inference steps reduces the overall number of computations. However, this can result in lower generation quality.

+```py
 pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
-```
-
-Now set the `num_inference_steps` to 20:
-
-```python
-generator = torch.Generator("cuda").manual_seed(0)
-image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
-image
-```
-
-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_3.png">
-</div>
-
-Great, you've managed to cut the inference time to just 4 seconds! ⚡️
-
-## Memory
-
-The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM).

-Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result.
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""

-```python
-def get_inputs(batch_size=1):
-    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
-    prompts = batch_size * [prompt]
-    num_inference_steps = 20
+start_time = time.perf_counter()
+image = pipeline(prompt).images[0]
+end_time = time.perf_counter()

-    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
+print(f"Image generation took {end_time - start_time:.3f} seconds")
 ```
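
The second bullet above mentions lowering `num_inference_steps`, but the added snippet never passes it. A minimal sketch of what that could look like, with 20 as an illustrative value rather than one taken from the commit:

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Assumed setup mirroring the snippet above.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

prompt = "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California"

# Fewer denoising steps finish sooner but may lose some detail; 20 is illustrative.
image = pipeline(prompt, num_inference_steps=20).images[0]
```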

-Start with `batch_size=4` and see how much memory you've consumed:
+## Generation quality

-```python
-from diffusers.utils import make_image_grid
+Many modern diffusion models deliver high-quality images out-of-the-box. However, you can still improve generation quality by trying the following.

-images = pipeline(**get_inputs(batch_size=4)).images
-make_image_grid(images, 2, 2)
-```
-
-Unless you have a GPU with more vRAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function:
-
-```python
-pipeline.enable_attention_slicing()
-```
-
-Now try increasing the `batch_size` to 8!
-
-```python
-images = pipeline(**get_inputs(batch_size=8)).images
-make_image_grid(images, rows=2, cols=4)
-```
-
-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_5.png">
-</div>
-
-Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality.
-
-## Quality
-
-In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images.
-
-### Better checkpoints
-
-The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results.
-
-As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in!
+- Try a more detailed and descriptive prompt. Include details such as the image medium, subject, style, and aesthetic. A negative prompt may also help by guiding a model away from undesirable features with words like "low quality" or "blurry".

-### Better pipeline components
+```py
+import torch
+from diffusers import DiffusionPipeline

-You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images:
+pipeline = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.bfloat16
+).to("cuda")

-```python
-from diffusers import AutoencoderKL
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+negative_prompt = "low quality, blurry, ugly, poor details"
+pipeline(prompt, negative_prompt=negative_prompt).images[0]
+```

-vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
-pipeline.vae = vae
-images = pipeline(**get_inputs(batch_size=8)).images
-make_image_grid(images, rows=2, cols=4)
-```
-
-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_6.png">
-</div>
-
-### Better prompt engineering
-
-The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are:
-
-- How is the image or similar images of the one I want to generate stored on the internet?
-- What additional detail can I give that steers the model towards the style I want?
-
-With this in mind, let's improve the prompt to include color and higher quality details:
-
-```python
-prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
-prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta"
-```
-
-Generate a batch of images with the new prompt:
+For more details about creating better prompts, take a look at the [Prompt techniques](./using-diffusers/weighted_prompts) doc.

-```python
-images = pipeline(**get_inputs(batch_size=8)).images
-make_image_grid(images, rows=2, cols=4)
-```
+- Try a different scheduler, like [`HeunDiscreteScheduler`] or [`LMSDiscreteScheduler`], which trades generation speed for quality.

-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_7.png">
-</div>
+```py
+import torch
+from diffusers import DiffusionPipeline, HeunDiscreteScheduler

-Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject:
+pipeline = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.bfloat16
+).to("cuda")
+pipeline.scheduler = HeunDiscreteScheduler.from_config(pipeline.scheduler.config)

-```python
-prompts = [
-    "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-    "portrait photo of an old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-    "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-    "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-]
-
-generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
-images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
-make_image_grid(images, 2, 2)
-```
-
-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_8.png">
-</div>
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+negative_prompt = "low quality, blurry, ugly, poor details"
+pipeline(prompt, negative_prompt=negative_prompt).images[0]
+```

 ## Next steps

-In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources:
-
-- Learn how [PyTorch 2.0](./optimization/fp16) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster!
-- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption.
-- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16).
+Diffusers offers more advanced and powerful optimizations such as [group-offloading](./optimization/memory#group-offloading) and [regional compilation](./optimization/fp16#regional-compilation). To learn more about how to maximize performance, take a look at the Inference Optimization section.
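
The new closing paragraph points to group offloading and regional compilation without showing code. As a rough, related illustration (not part of this commit), the UNet can be compiled with `torch.compile`; the more targeted regional variant is covered in the linked guide:

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
).to("cuda")

# Compiling the denoiser pays a one-time cost on the first call and speeds up
# later generations; it stands in here for the regional compilation mentioned above.
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead")

prompt = "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California"
image = pipeline(prompt).images[0]
```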
