
Commit 04717fd

DN6, sayakpaul, and yiyixuxu authored
Add Stable Diffusion 3 (#8483)
* up * add sd3 * update * update * add tests * fix copies * fix docs * update * add dreambooth lora * add LoRA * update * update * update * update * import fix * update * Update src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py Co-authored-by: YiYi Xu <[email protected]> * import fix 2 * update * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/models/autoencoders/autoencoder_kl.py Co-authored-by: YiYi Xu <[email protected]> * update * update * update * fix ckpt id * fix more ids * update * missing doc * Update src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py Co-authored-by: YiYi Xu <[email protected]> * Update src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py Co-authored-by: YiYi Xu <[email protected]> * Update docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md Co-authored-by: Sayak Paul <[email protected]> * Update docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md 
Co-authored-by: Sayak Paul <[email protected]> * update' * fix * update * Update src/diffusers/models/autoencoders/autoencoder_kl.py * Update src/diffusers/models/autoencoders/autoencoder_kl.py * note on gated access. * requirements * licensing --------- Co-authored-by: sayakpaul <[email protected]> Co-authored-by: YiYi Xu <[email protected]>
1 parent 6fd458e commit 04717fd

37 files changed

+8303
-65
lines changed

docs/source/en/_toctree.yml

Lines changed: 22 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -107,7 +107,8 @@
107107
title: Create a dataset for training
108108
- local: training/adapt_a_model
109109
title: Adapt a model to a new task
110-
- sections:
110+
- isExpanded: false
111+
sections:
111112
- local: training/unconditional_training
112113
title: Unconditional image generation
113114
- local: training/text2image
@@ -125,8 +126,8 @@
125126
- local: training/instructpix2pix
126127
title: InstructPix2Pix
127128
title: Models
128-
isExpanded: false
129-
- sections:
129+
- isExpanded: false
130+
sections:
130131
- local: training/text_inversion
131132
title: Textual Inversion
132133
- local: training/dreambooth
@@ -140,7 +141,6 @@
140141
- local: training/ddpo
141142
title: Reinforcement learning training with DDPO
142143
title: Methods
143-
isExpanded: false
144144
title: Training
145145
- sections:
146146
- local: optimization/fp16
@@ -187,16 +187,17 @@
187187
title: Evaluating Diffusion Models
188188
title: Conceptual Guides
189189
- sections:
190-
- sections:
190+
- isExpanded: false
191+
sections:
191192
- local: api/configuration
192193
title: Configuration
193194
- local: api/logging
194195
title: Logging
195196
- local: api/outputs
196197
title: Outputs
197198
title: Main Classes
198-
isExpanded: false
199-
- sections:
199+
- isExpanded: false
200+
sections:
200201
- local: api/loaders/ip_adapter
201202
title: IP-Adapter
202203
- local: api/loaders/lora
@@ -210,8 +211,8 @@
210211
- local: api/loaders/peft
211212
title: PEFT
212213
title: Loaders
213-
isExpanded: false
214-
- sections:
214+
- isExpanded: false
215+
sections:
215216
- local: api/models/overview
216217
title: Overview
217218
- local: api/models/unet
@@ -246,13 +247,15 @@
246247
title: HunyuanDiT2DModel
247248
- local: api/models/transformer_temporal
248249
title: TransformerTemporalModel
250+
- local: api/models/sd3_transformer2d
251+
title: SD3Transformer2DModel
249252
- local: api/models/prior_transformer
250253
title: PriorTransformer
251254
- local: api/models/controlnet
252255
title: ControlNetModel
253256
title: Models
254-
isExpanded: false
255-
- sections:
257+
- isExpanded: false
258+
sections:
256259
- local: api/pipelines/overview
257260
title: Overview
258261
- local: api/pipelines/amused
@@ -350,6 +353,8 @@
350353
title: Safe Stable Diffusion
351354
- local: api/pipelines/stable_diffusion/stable_diffusion_2
352355
title: Stable Diffusion 2
356+
- local: api/pipelines/stable_diffusion/stable_diffusion_3
357+
title: Stable Diffusion 3
353358
- local: api/pipelines/stable_diffusion/stable_diffusion_xl
354359
title: Stable Diffusion XL
355360
- local: api/pipelines/stable_diffusion/sdxl_turbo
@@ -382,8 +387,8 @@
382387
- local: api/pipelines/wuerstchen
383388
title: Wuerstchen
384389
title: Pipelines
385-
isExpanded: false
386-
- sections:
390+
- isExpanded: false
391+
sections:
387392
- local: api/schedulers/overview
388393
title: Overview
389394
- local: api/schedulers/cm_stochastic_iterative
@@ -414,6 +419,8 @@
414419
title: EulerAncestralDiscreteScheduler
415420
- local: api/schedulers/euler
416421
title: EulerDiscreteScheduler
422+
- local: api/schedulers/flow_match_euler_discrete
423+
title: FlowMatchEulerDiscreteScheduler
417424
- local: api/schedulers/heun
418425
title: HeunDiscreteScheduler
419426
- local: api/schedulers/ipndm
@@ -443,8 +450,8 @@
443450
- local: api/schedulers/vq_diffusion
444451
title: VQDiffusionScheduler
445452
title: Schedulers
446-
isExpanded: false
447-
- sections:
453+
- isExpanded: false
454+
sections:
448455
- local: api/internal_classes_overview
449456
title: Overview
450457
- local: api/attnprocessor
@@ -460,5 +467,4 @@
460467
- local: api/video_processor
461468
title: Video Processor
462469
title: Internal classes
463-
isExpanded: false
464470
title: API
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
-->
12+
13+
# SD3 Transformer Model
14+
15+
The Transformer model introduced in [Stable Diffusion 3](https://hf.co/papers/2403.03206). Its novelty lies in the MMDiT transformer block.
16+
17+
## SD3Transformer2DModel
18+
19+
[[autodoc]] SD3Transformer2DModel
Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
1+
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
-->
12+
13+
# Stable Diffusion 3
14+
15+
Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/pdf/2403.03206.pdf) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.
16+
17+
The abstract from the paper is:
18+
19+
*Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.*
20+
21+
22+
## Usage Example
23+
24+
_As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate._
25+
26+
Use the command below to log in:
27+
28+
```bash
29+
huggingface-cli login
30+
```
31+
32+
<Tip>
33+
34+
The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the `torch.float16` data type for additional memory savings.
35+
36+
</Tip>
37+
38+
39+
```python
40+
import torch
41+
from diffusers import StableDiffusion3Pipeline
42+
43+
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
44+
pipe.to("cuda")
45+
46+
image = pipe(
47+
prompt="a photo of a cat holding a sign that says hello world",
48+
negative_prompt="",
49+
num_inference_steps=28,
50+
height=1024,
51+
width=1024,
52+
guidance_scale=7.0,
53+
).images[0]
54+
55+
image.save("sd3_hello_world.png")
56+
```
57+
58+
## Memory Optimizations for SD3
59+
60+
SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24 GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low-resource hardware.
61+
62+
### Running Inference with Model Offloading
63+
64+
The most basic memory optimization available in Diffusers offloads the components of the model to the CPU during inference to save memory, at the cost of a slight increase in inference latency. Model offloading moves a model component onto the GPU only when it needs to be executed and keeps the remaining components on the CPU.
65+
66+
```python
67+
import torch
68+
from diffusers import StableDiffusion3Pipeline
69+
70+
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
71+
pipe.enable_model_cpu_offload()
72+
73+
image = pipe(
74+
prompt="a photo of a cat holding a sign that says hello world",
75+
negative_prompt="",
76+
num_inference_steps=28,
77+
height=1024,
78+
width=1024,
79+
guidance_scale=7.0,
80+
).images[0]
81+
82+
image.save("sd3_hello_world.png")
83+
```
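
To make the mechanism concrete, here is a toy, framework-free sketch of the idea described in this section: only the component that is currently executing sits on the accelerator. The `Component` class, the `run_with_offload` helper, and all sizes are hypothetical and are not part of Diffusers; in practice `enable_model_cpu_offload()` handles the device moves for you.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for a pipeline component (text encoder, transformer, VAE)."""
    name: str
    size_gb: float        # illustrative sizes, not measured values
    device: str = "cpu"


def run_with_offload(components):
    """Run components one after another, keeping at most one on the GPU.

    Returns the peak amount of simulated GPU memory held by weights.
    """
    peak_gb = 0.0
    for component in components:
        component.device = "cuda"   # move in just before execution
        resident = sum(c.size_gb for c in components if c.device == "cuda")
        peak_gb = max(peak_gb, resident)
        # ... the component's forward pass would run here ...
        component.device = "cpu"    # move back out afterwards
    return peak_gb


# Illustrative numbers only.
parts = [
    Component("text_encoders", 10.0),
    Component("transformer", 4.0),
    Component("vae", 0.2),
]
peak = run_with_offload(parts)
total = sum(c.size_gb for c in parts)
print(f"peak with offload: {peak:.1f} GB, all resident: {total:.1f} GB")
```

In this simplified model the peak is set by the largest single component rather than by the sum of all components, which is the reason offloading trades some latency for a lower memory ceiling.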
84+
85+
### Dropping the T5 Text Encoder during Inference
86+
87+
Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance.
88+
89+
```python
90+
import torch
91+
from diffusers import StableDiffusion3Pipeline
92+
93+
pipe = StableDiffusion3Pipeline.from_pretrained(
94+
"stabilityai/stable-diffusion-3-medium-diffusers",
95+
text_encoder_3=None,
96+
tokenizer_3=None,
97+
torch_dtype=torch.float16
98+
)
99+
pipe.to("cuda")
100+
101+
image = pipe(
102+
prompt="a photo of a cat holding a sign that says hello world",
103+
negative_prompt="",
104+
num_inference_steps=28,
105+
height=1024,
106+
width=1024,
107+
guidance_scale=7.0,
108+
).images[0]
109+
110+
image.save("sd3_hello_world-no-T5.png")
111+
```
112+
113+
### Using a Quantized Version of the T5 Text Encoder
114+
115+
You can use the `bitsandbytes` library to load and quantize the T5-XXL text encoder to 8-bit precision. This allows you to keep using all three text encoders while only slightly impacting performance.
116+
117+
First install the `bitsandbytes` library.
118+
119+
```shell
120+
pip install bitsandbytes
121+
```
122+
123+
Then load the T5-XXL model using the `BitsAndBytesConfig`.
124+
125+
```python
126+
import torch
127+
from diffusers import StableDiffusion3Pipeline
128+
from transformers import T5EncoderModel, BitsAndBytesConfig
129+
130+
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
131+
132+
model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
133+
text_encoder = T5EncoderModel.from_pretrained(
134+
model_id,
135+
subfolder="text_encoder_3",
136+
quantization_config=quantization_config,
137+
)
138+
pipe = StableDiffusion3Pipeline.from_pretrained(
139+
model_id,
140+
text_encoder_3=text_encoder,
141+
device_map="balanced",
142+
torch_dtype=torch.float16
143+
)
144+
145+
image = pipe(
146+
prompt="a photo of a cat holding a sign that says hello world",
147+
negative_prompt="",
148+
num_inference_steps=28,
149+
height=1024,
150+
width=1024,
151+
guidance_scale=7.0,
152+
).images[0]
153+
154+
image.save("sd3_hello_world-8bit-T5.png")
155+
```
156+
157+
You can find the end-to-end script [here](https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1).
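
As a rough sanity check on the savings, the weight memory of the T5-XXL encoder can be estimated from the 4.7B parameter count quoted earlier. This back-of-the-envelope sketch counts weights only (no activations, and none of the bookkeeping overhead a quantization library adds), so treat the numbers as approximate. The `weight_memory_gib` helper is made up for illustration.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


t5_xxl_params = 4.7e9  # parameter count quoted in this document

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype:>8}: {weight_memory_gib(t5_xxl_params, dtype):.1f} GiB")
```

Under these assumptions the encoder weights take roughly 8.8 GiB in `float16` and about half that in 8-bit, which is why quantizing or dropping T5-XXL has such a large effect on the overall footprint.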
158+
159+
## Performance Optimizations for SD3
160+
161+
### Using Torch Compile to Speed Up Inference
162+
163+
Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline.
164+
165+
```python
166+
import torch
167+
from diffusers import StableDiffusion3Pipeline
168+
169+
torch.set_float32_matmul_precision("high")
170+
171+
torch._inductor.config.conv_1x1_as_mm = True
172+
torch._inductor.config.coordinate_descent_tuning = True
173+
torch._inductor.config.epilogue_fusion = False
174+
torch._inductor.config.coordinate_descent_check_all_directions = True
175+
176+
pipe = StableDiffusion3Pipeline.from_pretrained(
177+
"stabilityai/stable-diffusion-3-medium-diffusers",
178+
torch_dtype=torch.float16
179+
).to("cuda")
180+
pipe.set_progress_bar_config(disable=True)
181+
182+
pipe.transformer.to(memory_format=torch.channels_last)
183+
pipe.vae.to(memory_format=torch.channels_last)
184+
185+
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
186+
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)
187+
188+
# Warm Up
189+
prompt = "a photo of a cat holding a sign that says hello world"
190+
for _ in range(3):
191+
_ = pipe(prompt=prompt, generator=torch.manual_seed(1))
192+
193+
# Run Inference
194+
image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
195+
image.save("sd3_hello_world.png")
196+
```
197+
198+
Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f454900813da5d42ca97).
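
The warm-up loop matters because the first calls to a compiled module include compilation and autotuning time. A minimal, library-agnostic timing helper along the following lines keeps those calls out of the measurement. `benchmark` is a hypothetical name, and the lambda stands in for a call such as `pipe(prompt=prompt)`.

```python
import time


def benchmark(fn, warmup=3, runs=5):
    """Return the mean wall-clock seconds per call, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # compilation and autotuning cost lands here, untimed
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload; replace the lambda with the real pipeline call.
calls = []
mean_seconds = benchmark(lambda: calls.append(1), warmup=3, runs=5)
print(f"{len(calls)} calls, {mean_seconds:.6f} s per timed call")
```

When timing GPU work, synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock; otherwise asynchronous kernel launches can make the timed calls look faster than they are.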
199+
200+
## Loading the original checkpoints via `from_single_file`
201+
202+
The `SD3Transformer2DModel` and `StableDiffusion3Pipeline` classes support loading the original checkpoints via the `from_single_file` method. This method allows you to load the original checkpoint files that were used to train the models.
203+
204+
### Loading the original checkpoints for the `SD3Transformer2DModel`
205+
206+
```python
207+
from diffusers import SD3Transformer2DModel
208+
209+
model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors")
210+
```
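
The URL passed to `from_single_file` above is an ordinary Hub file link. The sketch below only illustrates which pieces of information such a link carries (repository, revision, and file name); it is not how Diffusers parses the URL internally, and `split_hub_blob_url` is a hypothetical helper that assumes a revision name without slashes.

```python
from urllib.parse import urlparse


def split_hub_blob_url(url):
    """Split https://huggingface.co/<org>/<repo>/blob/<revision>/<path> into parts."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a Hub blob URL: {url}")
    repo_id = "/".join(parts[:2])    # "<org>/<repo>"
    revision = parts[3]              # branch, tag, or commit
    filename = "/".join(parts[4:])   # path of the file inside the repository
    return repo_id, revision, filename


url = "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors"
repo_id, revision, filename = split_hub_blob_url(url)
print(repo_id, revision, filename)
```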
211+
212+
### Loading the single checkpoint for the `StableDiffusion3Pipeline`
213+
214+
```python
215+
import torch
from diffusers import StableDiffusion3Pipeline
216+
from transformers import T5EncoderModel
217+
218+
text_encoder_3 = T5EncoderModel.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", subfolder="text_encoder_3", torch_dtype=torch.float16)
219+
pipe = StableDiffusion3Pipeline.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors", torch_dtype=torch.float16, text_encoder_3=text_encoder_3)
220+
```
221+
222+
<Tip>
223+
`from_single_file` support for the `fp8` version of the checkpoints is coming soon. Watch this space.
224+
</Tip>
225+
226+
## StableDiffusion3Pipeline
227+
228+
[[autodoc]] StableDiffusion3Pipeline
229+
- all
230+
- __call__
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
-->
12+
13+
# FlowMatchEulerDiscreteScheduler
14+
15+
`FlowMatchEulerDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://arxiv.org/abs/2403.03206).
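
As an informal illustration of the idea, the sketch below integrates a rectified-flow trajectory with plain Euler steps. It assumes the straight-line interpolation `x_sigma = (1 - sigma) * x0 + sigma * noise` and a perfect velocity prediction `noise - x0`; in a real pipeline the velocity comes from the transformer. The function names are invented for the example and are not the scheduler's API.

```python
def euler_step(sample, velocity, sigma, sigma_next):
    """One Euler step along the flow: x_next = x + (sigma_next - sigma) * v."""
    return [x + (sigma_next - sigma) * v for x, v in zip(sample, velocity)]


def sample_rectified_flow(x0, noise, num_steps=4):
    """Walk from pure noise (sigma=1) to data (sigma=0) using the exact velocity."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 ... 0.0
    sample = list(noise)
    # On a straight path the velocity is constant; a model would predict it instead.
    velocity = [n - d for n, d in zip(noise, x0)]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        sample = euler_step(sample, velocity, sigma, sigma_next)
    return sample


data = [0.5, -1.0, 2.0]
noise = [1.5, 0.25, -0.75]
result = sample_rectified_flow(data, noise)
print(result)
```

Because the path between noise and data is a straight line in this idealized setting, Euler integration recovers the data regardless of the number of steps. With a learned velocity the path is only approximately straight, so the step count still affects quality.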
16+
17+
## FlowMatchEulerDiscreteScheduler
18+
[[autodoc]] FlowMatchEulerDiscreteScheduler
