Commit c4df147

Merge branch 'main' into dduf

2 parents 81bd097 + 6131a93

32 files changed (+3624, -152 lines)

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -314,6 +314,8 @@
       title: AutoencoderKLMochi
     - local: api/models/asymmetricautoencoderkl
       title: AsymmetricAutoencoderKL
+    - local: api/models/autoencoder_dc
+      title: AutoencoderDC
     - local: api/models/consistency_decoder_vae
       title: ConsistencyDecoderVAE
     - local: api/models/autoencoder_oobleck
docs/source/en/api/models/autoencoder_dc.md (new file)

Lines changed: 50 additions & 0 deletions
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderDC

The 2D Autoencoder model used in [SANA](https://huggingface.co/papers/2410.10629) and introduced in [DCAE](https://huggingface.co/papers/2410.10733) by Junyu Chen\*, Han Cai\*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han from MIT HAN Lab.

The abstract from the paper is:

*We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at [this https URL](https://github.com/mit-han-lab/efficientvit).*

The following DCAE models are released and supported in Diffusers.

| Diffusers format | Original format |
|:----------------:|:---------------:|
| [`mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-sana-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0) |
| [`mit-han-lab/dc-ae-f32c32-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0) |
| [`mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0) |
| [`mit-han-lab/dc-ae-f64c128-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0) |
| [`mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0) |
| [`mit-han-lab/dc-ae-f128c512-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0) |
| [`mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0) |
Load a model in Diffusers format with [`~ModelMixin.from_pretrained`].

```python
import torch

from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda")
```
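
A minimal encode/decode round-trip sketch follows. It assumes a CUDA device, and that `encode` returns an output exposing a `latent` attribute while `decode` returns a [`~models.autoencoders.vae.DecoderOutput`] with a `sample` attribute, matching the methods documented below; the input size and shape comments are illustrative for an f32c32 checkpoint.

```python
import torch

from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained(
    "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32
).to("cuda")

# Dummy RGB batch; f32c32 checkpoints compress 32x spatially into 32 latent channels.
image = torch.randn(1, 3, 512, 512, device="cuda")

with torch.no_grad():
    latent = ae.encode(image).latent          # expected shape: (1, 32, 16, 16)
    reconstructed = ae.decode(latent).sample  # expected shape: (1, 3, 512, 512)
```
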
## AutoencoderDC

[[autodoc]] AutoencoderDC
  - encode
  - decode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

docs/source/en/quantization/bitsandbytes.md

Lines changed: 205 additions & 49 deletions
@@ -17,6 +17,12 @@ specific language governing permissions and limitations under the License.

 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.

+This guide demonstrates how quantization can enable running
+[FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
+on less than 16GB of VRAM and even on a free Google
+Colab instance.
+
+![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png)

 To use bitsandbytes, make sure you have the following libraries installed:

@@ -31,70 +37,167 @@ Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixi

 Quantizing a model in 8-bit halves the memory-usage:

+bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
+[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
+
+For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.
+
+> [!TIP]
+> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.
+
 ```py
-from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+from diffusers import FluxTransformer2DModel
+from transformers import T5EncoderModel

-model_8bit = FluxTransformer2DModel.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,)
+
+text_encoder_2_8bit = T5EncoderModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="text_encoder_2",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,)
+
+transformer_8bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
-    quantization_config=quantization_config
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
 )
 ```

-By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
+By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.

-```py
-from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
+```diff
+transformer_8bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quant_config,
++   torch_dtype=torch.float32,
+)
+```

-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+Let's generate an image using our quantized models.

-model_8bit = FluxTransformer2DModel.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    subfolder="transformer",
-    quantization_config=quantization_config,
-    torch_dtype=torch.float32
+Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the
+CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
+
+```py
+pipe = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    transformer=transformer_8bit,
+    text_encoder_2=text_encoder_2_8bit,
+    torch_dtype=torch.float16,
+    device_map="auto",
 )
-model_8bit.transformer_blocks.layers[-1].norm2.weight.dtype
+
+pipe_kwargs = {
+    "prompt": "A cat holding a sign that says hello world",
+    "height": 1024,
+    "width": 1024,
+    "guidance_scale": 3.5,
+    "num_inference_steps": 50,
+    "max_sequence_length": 512,
+}
+
+image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
 ```

-Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
+</div>
+
+When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.
+
+Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].

 </hfoption>
 <hfoption id="4-bit">

 Quantizing a model in 4-bit reduces your memory-usage by 4x:

+bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
+[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
+
+For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.
+
+> [!TIP]
+> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.
+
 ```py
-from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

-quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+from diffusers import FluxTransformer2DModel
+from transformers import T5EncoderModel

-model_4bit = FluxTransformer2DModel.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,)
+
+text_encoder_2_4bit = T5EncoderModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="text_encoder_2",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,)
+
+transformer_4bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
-    quantization_config=quantization_config
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
 )
 ```

-By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
+By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.

-```py
-from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
+```diff
+transformer_4bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quant_config,
++   torch_dtype=torch.float32,
+)
+```

-quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+Let's generate an image using our quantized models.

-model_4bit = FluxTransformer2DModel.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    subfolder="transformer",
-    quantization_config=quantization_config,
-    torch_dtype=torch.float32
+Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
+
+```py
+pipe = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    transformer=transformer_4bit,
+    text_encoder_2=text_encoder_2_4bit,
+    torch_dtype=torch.float16,
+    device_map="auto",
 )
-model_4bit.transformer_blocks.layers[-1].norm2.weight.dtype
+
+pipe_kwargs = {
+    "prompt": "A cat holding a sign that says hello world",
+    "height": 1024,
+    "width": 1024,
+    "guidance_scale": 3.5,
+    "num_inference_steps": 50,
+    "max_sequence_length": 512,
+}
+
+image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0]
 ```

-Call [`~ModelMixin.push_to_hub`] after loading it in 4-bit precision. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/4bit.png"/>
+</div>
+
+When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.
+
+Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].

 </hfoption>
 </hfoptions>
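
The push/save paragraphs in the hunk above have no accompanying snippet, so here is a minimal sketch for serializing and reloading the quantized transformer. It continues from the 8-bit example; the local directory and Hub repo id are placeholders rather than values from this diff.

```py
import torch

from diffusers import FluxTransformer2DModel

# Serialize the quantized transformer locally; the quantization settings are
# written to the accompanying config.json.
transformer_8bit.save_pretrained("flux-transformer-8bit")

# Or push it to the Hub (placeholder repo id).
transformer_8bit.push_to_hub("your-username/flux-transformer-8bit")

# Reloading picks the quantization config up from config.json, so no
# BitsAndBytesConfig has to be passed again.
transformer_reloaded = FluxTransformer2DModel.from_pretrained(
    "flux-transformer-8bit", torch_dtype=torch.float16
)
```
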
@@ -199,17 +302,34 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dty
 NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

 ```py
-from diffusers import BitsAndBytesConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+
+from diffusers import FluxTransformer2DModel
+from transformers import T5EncoderModel

-nf4_config = BitsAndBytesConfig(
+quant_config = TransformersBitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_quant_type="nf4",
 )

-model_nf4 = SD3Transformer2DModel.from_pretrained(
-    "stabilityai/stable-diffusion-3-medium-diffusers",
+text_encoder_2_4bit = T5EncoderModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="text_encoder_2",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+)
+
+transformer_4bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
-    quantization_config=nf4_config,
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
 )
 ```
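
The NF4, compute-dtype, and nested-quantization options described in this guide compose into a single config. A minimal sketch of such a combination (the specific settings are illustrative, not taken from this diff):

```py
import torch

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import FluxTransformer2DModel

# NF4 weights, bfloat16 compute, and double quantization in one config.
quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```
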

@@ -220,38 +340,74 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa
 Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.

 ```py
-from diffusers import BitsAndBytesConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+
+from diffusers import FluxTransformer2DModel
+from transformers import T5EncoderModel

-double_quant_config = BitsAndBytesConfig(
+quant_config = TransformersBitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_use_double_quant=True,
 )

-double_quant_model = SD3Transformer2DModel.from_pretrained(
-    "stabilityai/stable-diffusion-3-medium-diffusers",
+text_encoder_2_4bit = T5EncoderModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="text_encoder_2",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_use_double_quant=True,
+)
+
+transformer_4bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
-    quantization_config=double_quant_config,
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
 )
 ```

 ## Dequantizing `bitsandbytes` models

-Once quantized, you can dequantize the model to the original precision but this might result in a small quality loss of the model. Make sure you have enough GPU RAM to fit the dequantized model.
+Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.

 ```python
-from diffusers import BitsAndBytesConfig
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

-double_quant_config = BitsAndBytesConfig(
+from diffusers import FluxTransformer2DModel
+from transformers import T5EncoderModel
+
+quant_config = TransformersBitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_use_double_quant=True,
 )

-double_quant_model = SD3Transformer2DModel.from_pretrained(
-    "stabilityai/stable-diffusion-3-medium-diffusers",
+text_encoder_2_4bit = T5EncoderModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="text_encoder_2",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_use_double_quant=True,
+)
+
+transformer_4bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
-    quantization_config=double_quant_config,
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
 )
-model.dequantize()
+
+text_encoder_2_4bit.dequantize()
+transformer_4bit.dequantize()
 ```
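
The "enough GPU RAM" caveat is easy to check empirically. A minimal variation of the snippet above that reads PyTorch's allocator statistics around the dequantization calls (assumes the quantized models have already been placed on a single CUDA device):

```py
import torch

# Memory held by tensors before dequantizing the 4-bit models.
print(f"Allocated before dequantizing: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

text_encoder_2_4bit.dequantize()
transformer_4bit.dequantize()

# Memory held after the weights have been restored to higher precision.
print(f"Allocated after dequantizing: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
```
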
 ## Resources
