
Commit d868ddb

Merge branch 'main' into sd3.5_IPAdapter

2 parents: ab0d904 + 6131a93

File tree

28 files changed: +3419 −85 lines


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -314,6 +314,8 @@
       title: AutoencoderKLMochi
   - local: api/models/asymmetricautoencoderkl
     title: AsymmetricAutoencoderKL
+  - local: api/models/autoencoder_dc
+    title: AutoencoderDC
   - local: api/models/consistency_decoder_vae
     title: ConsistencyDecoderVAE
   - local: api/models/autoencoder_oobleck
```
docs/source/en/api/models/autoencoder_dc.md

Lines changed: 50 additions & 0 deletions

````md
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderDC

The 2D autoencoder model used in [SANA](https://huggingface.co/papers/2410.10629) and introduced in [DCAE](https://huggingface.co/papers/2410.10733) by Junyu Chen\*, Han Cai\*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han from the MIT HAN Lab.

The abstract from the paper is:

*We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at [this https URL](https://github.com/mit-han-lab/efficientvit).*

The following DCAE models are released and supported in Diffusers.

| Diffusers format | Original format |
|:----------------:|:---------------:|
| [`mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-sana-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0) |
| [`mit-han-lab/dc-ae-f32c32-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0) |
| [`mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0) |
| [`mit-han-lab/dc-ae-f64c128-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0) |
| [`mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0) |
| [`mit-han-lab/dc-ae-f128c512-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0) |
| [`mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0) |

Load a model in Diffusers format with [`~ModelMixin.from_pretrained`].

```python
import torch
from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda")
```

## AutoencoderDC

[[autodoc]] AutoencoderDC
  - encode
  - decode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
````
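For orientation, here is a minimal encode/decode round trip with the new model (a sketch; it assumes the `encode` output exposes `.latent` and the `decode` output exposes `.sample`, as with other Diffusers autoencoder output classes):

```python
import torch
from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained(
    "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32
).to("cuda")

# f32c32: 32x spatial compression with 32 latent channels,
# so a 512x512 image maps to a 16x16 latent with 32 channels.
image = torch.randn(1, 3, 512, 512, device="cuda")  # stand-in for a real image tensor
with torch.no_grad():
    latent = ae.encode(image).latent           # shape: (1, 32, 16, 16)
    reconstruction = ae.decode(latent).sample  # shape: (1, 3, 512, 512)
```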

examples/controlnet/README_sd3.md

Lines changed: 27 additions & 4 deletions
```diff
@@ -1,6 +1,6 @@
-# ControlNet training example for Stable Diffusion 3 (SD3)
+# ControlNet training example for Stable Diffusion 3/3.5 (SD3/3.5)
 
-The `train_controlnet_sd3.py` script shows how to implement the ControlNet training procedure and adapt it for [Stable Diffusion 3](https://arxiv.org/abs/2403.03206).
+The `train_controlnet_sd3.py` script shows how to implement the ControlNet training procedure and adapt it for [Stable Diffusion 3](https://arxiv.org/abs/2403.03206) and [Stable Diffusion 3.5](https://stability.ai/news/introducing-stable-diffusion-3-5).
 
 ## Running locally with PyTorch
 
```

````diff
@@ -51,9 +51,9 @@ Please download the dataset and unzip it in the directory `fill50k` in the `exam
 
 ## Training
 
-First download the SD3 model from [Hugging Face Hub](https://huggingface.co/stabilityai/stable-diffusion-3-medium). We will use it as a base model for the ControlNet training.
+First download the SD3 model from [Hugging Face Hub](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) or the SD3.5 model from [Hugging Face Hub](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium). We will use it as the base model for the ControlNet training.
 > [!NOTE]
-> As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:
+> As the models are gated, before using them with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) or the [Stable Diffusion 3.5 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:
 
 ```bash
 huggingface-cli login
````
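If you would rather authenticate from Python than from the shell, the `huggingface_hub` library offers the equivalent login call (a small sketch of the same step):

```python
# Programmatic equivalent of `huggingface-cli login`; prompts for a
# Hugging Face access token, or accepts one via the `token` argument.
from huggingface_hub import login

login()  # or login(token="hf_...")
```
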
````diff
@@ -90,6 +90,8 @@ accelerate launch train_controlnet_sd3.py \
   --gradient_accumulation_steps=4
 ```
 
+To train a ControlNet model for Stable Diffusion 3.5, replace `MODEL_DIR` with `stabilityai/stable-diffusion-3.5-medium`.
+
 To better track our training experiments, we're using the flags `validation_image`, `validation_prompt`, and `validation_steps` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
 
 Our experiments were conducted on a single 40GB A100 GPU.
````

````diff
@@ -124,6 +126,8 @@ image = pipe(
 image.save("./output.png")
 ```
 
+Similarly, for SD3.5, replace `base_model_path` with `stabilityai/stable-diffusion-3.5-medium` and `controlnet_path` with `DavyMorgan/sd35-controlnet-out`.
+
 ## Notes
 
 ### GPU usage
````
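Spelled out, the SD3.5 inference call mirrors the SD3 snippet referenced above with those two substitutions applied. A sketch (the checkpoint names come from the diff; the prompt and sampler settings are illustrative):

```python
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
from diffusers.utils import load_image

base_model_path = "stabilityai/stable-diffusion-3.5-medium"
controlnet_path = "DavyMorgan/sd35-controlnet-out"

controlnet = SD3ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    base_model_path, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png"
)
image = pipe(
    "pale golden rod circle with old lace background",  # prompt from the example results below
    control_image=control_image,
    num_inference_steps=28,  # illustrative settings
    guidance_scale=7.0,
).images[0]
image.save("./output.png")
```
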
```diff
@@ -135,6 +139,8 @@ Make sure to use the right GPU when configuring the [accelerator](https://huggin
 
 ## Example results
 
+### SD3
+
 #### After 500 steps with batch size 8
 
 | | |
```

```diff
@@ -150,3 +156,20 @@ Make sure to use the right GPU when configuring the [accelerator](https://huggin
 || pale golden rod circle with old lace background |
 ![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png) | ![pale golden rod circle with old lace background](https://huggingface.co/datasets/DavyMorgan/sd3-controlnet-results/resolve/main/step-6500.png) |
 
+### SD3.5
+
+#### After 500 steps with batch size 8
+
+| | |
+|-------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------:|
+|| pale golden rod circle with old lace background |
+![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png) | ![pale golden rod circle with old lace background](https://huggingface.co/datasets/DavyMorgan/sd3-controlnet-results/resolve/main/step-500-3.5.png) |
+
+#### After 3000 steps with batch size 8
+
+| | |
+|-------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------:|
+|| pale golden rod circle with old lace background |
+![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png) | ![pale golden rod circle with old lace background](https://huggingface.co/datasets/DavyMorgan/sd3-controlnet-results/resolve/main/step-3000-3.5.png) |
+
```

examples/controlnet/test_controlnet.py

Lines changed: 21 additions & 0 deletions
```diff
@@ -138,6 +138,27 @@ def test_controlnet_sd3(self):
         self.assertTrue(os.path.isfile(os.path.join(tmpdir, "diffusion_pytorch_model.safetensors")))
 
 
+class ControlNetSD35(ExamplesTestsAccelerate):
+    def test_controlnet_sd35(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            test_args = f"""
+                examples/controlnet/train_controlnet_sd3.py
+                --pretrained_model_name_or_path=hf-internal-testing/tiny-sd35-pipe
+                --dataset_name=hf-internal-testing/fill10
+                --output_dir={tmpdir}
+                --resolution=64
+                --train_batch_size=1
+                --gradient_accumulation_steps=1
+                --controlnet_model_name_or_path=DavyMorgan/tiny-controlnet-sd35
+                --max_train_steps=4
+                --checkpointing_steps=2
+                """.split()
+
+            run_command(self._launch_args + test_args)
+
+            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "diffusion_pytorch_model.safetensors")))
+
+
 class ControlNetflux(ExamplesTestsAccelerate):
     def test_controlnet_flux(self):
         with tempfile.TemporaryDirectory() as tmpdir:
```
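To exercise just the new test class locally, something like the following should work (a sketch, assuming the suite is driven by pytest as the other example tests are):

```python
# Invoke pytest programmatically, selecting only the new SD3.5 test class.
import pytest

raise SystemExit(pytest.main(["examples/controlnet/test_controlnet.py", "-k", "ControlNetSD35", "-s"]))
```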

examples/controlnet/train_controlnet_sd3.py

Lines changed: 18 additions & 2 deletions
```diff
@@ -263,6 +263,12 @@ def parse_args(input_args=None):
         help="Path to pretrained controlnet model or model identifier from huggingface.co/models."
         " If not specified controlnet weights are initialized from unet.",
     )
+    parser.add_argument(
+        "--num_extra_conditioning_channels",
+        type=int,
+        default=0,
+        help="Number of extra conditioning channels for controlnet.",
+    )
     parser.add_argument(
         "--revision",
         type=str,
@@ -539,6 +545,9 @@ def parse_args(input_args=None):
         default=77,
         help="Maximum sequence length to use with the T5 text encoder",
     )
+    parser.add_argument(
+        "--dataset_preprocess_batch_size", type=int, default=1000, help="Batch size for preprocessing dataset."
+    )
     parser.add_argument(
         "--validation_prompt",
         type=str,
```

```diff
@@ -986,7 +995,9 @@ def main(args):
         controlnet = SD3ControlNetModel.from_pretrained(args.controlnet_model_name_or_path)
     else:
         logger.info("Initializing controlnet weights from transformer")
-        controlnet = SD3ControlNetModel.from_transformer(transformer)
+        controlnet = SD3ControlNetModel.from_transformer(
+            transformer, num_extra_conditioning_channels=args.num_extra_conditioning_channels
+        )
 
     transformer.requires_grad_(False)
     vae.requires_grad_(False)
```
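In isolation, the new flag just threads through to `from_transformer`. A sketch of initializing a ControlNet with one extra conditioning channel (the model ID and the mask use case are illustrative assumptions):

```python
import torch
from diffusers import SD3ControlNetModel, SD3Transformer2DModel

transformer = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", subfolder="transformer", torch_dtype=torch.float32
)
# One extra channel on top of the usual conditioning input, e.g. for an
# additional mask plane; 0 (the default) reproduces the old behavior.
controlnet = SD3ControlNetModel.from_transformer(transformer, num_extra_conditioning_channels=1)
```
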
```diff
@@ -1123,7 +1134,12 @@ def compute_text_embeddings(batch, text_encoders, tokenizers):
     # fingerprint used by the cache for the other processes to load the result
     # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
     new_fingerprint = Hasher.hash(args)
-    train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
+    train_dataset = train_dataset.map(
+        compute_embeddings_fn,
+        batched=True,
+        batch_size=args.dataset_preprocess_batch_size,
+        new_fingerprint=new_fingerprint,
+    )
 
     del text_encoder_one, text_encoder_two, text_encoder_three
     del tokenizer_one, tokenizer_two, tokenizer_three
```
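For what `--dataset_preprocess_batch_size` controls: `datasets.Dataset.map` hands the embedding function `batch_size` examples at a time, so the flag trades preprocessing speed against peak memory. A toy sketch (the column and function names are made up for illustration):

```python
from datasets import Dataset

ds = Dataset.from_dict({"caption": ["a cat", "a dog", "a tree", "a house"]})

def fake_embed(batch):
    # stand-in for compute_embeddings_fn: one output row per input row
    return {"length": [len(c) for c in batch["caption"]]}

# batch_size=2 -> fake_embed is called with 2 examples at a time
ds = ds.map(fake_embed, batched=True, batch_size=2)
print(ds["length"])  # [5, 5, 6, 7]
```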
