<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderDC
The 2D Autoencoder model used in [SANA](https://huggingface.co/papers/2410.10629) and introduced in [DCAE](https://huggingface.co/papers/2410.10733) by authors Junyu Chen\*, Han Cai\*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han from MIT HAN Lab.
The abstract from the paper is:
*We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at [this https URL](https://github.com/mit-han-lab/efficientvit).*
The following DCAE models are released and supported in Diffusers.
To use bitsandbytes, make sure you have the `bitsandbytes`, `transformers`, and `accelerate` libraries installed.

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`].

<hfoption id="8-bit">
Quantizing a model in 8-bit halves the memory usage:
bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.
> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.
```py
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
```
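For example, here is a sketch of quantizing both models in 8-bit with the configs imported above; the `black-forest-labs/FLUX.1-dev` checkpoint and the variable names are illustrative choices, not part of the original snippet:

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import T5EncoderModel

model_id = "black-forest-labs/FLUX.1-dev"  # illustrative checkpoint

# Quantize the text encoder with the Transformers config
quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True)
text_encoder_2_8bit = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Quantize the transformer with the Diffusers config
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```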
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
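For instance, a minimal sketch of keeping the non-quantized modules in `torch.float32` (the checkpoint name is illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative checkpoint
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float32,  # non-quantized modules such as LayerNorm stay in float32
)
```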
Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.
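As a sketch, assuming the quantized `transformer_8bit` and `text_encoder_2_8bit` from the example above (the pipeline class and checkpoint are illustrative):

```py
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative checkpoint
    transformer=transformer_8bit,
    text_encoder_2=text_encoder_2_8bit,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # or pipe.to("cuda") when there is enough GPU memory
```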
</hfoption>
<hfoption id="4-bit">
Quantizing a model in 4-bit reduces your memory usage by 4x:
bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.
> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.
```py
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
```
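A sketch of the 4-bit counterpart, mirroring the 8-bit example above (the checkpoint and variable names are illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import T5EncoderModel

model_id = "black-forest-labs/FLUX.1-dev"  # illustrative checkpoint

# Quantize the text encoder with the Transformers config
quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True)
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Quantize the transformer with the Diffusers config
quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```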
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
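For example, assuming the `transformer_4bit` model from the sketch above (the repository and directory names are illustrative):

```py
# Push the quantized weights and the quantization config.json to the Hub
transformer_4bit.push_to_hub("your-username/flux-transformer-4bit")

# Or serialize the 4-bit model locally
transformer_4bit.save_pretrained("./flux-transformer-4bit")
```

</hfoption>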
NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
```py
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
```
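A minimal sketch of an NF4 configuration (the checkpoint is an illustrative choice):

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

nf4_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # store the 4-bit weights in the NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)

transformer_nf4 = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative checkpoint
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```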
For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance.
Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.
```py
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
```
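A sketch of enabling nested (double) quantization on top of an NF4 config (checkpoint illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

double_quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # second quantization pass over the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative checkpoint
    subfolder="transformer",
    quantization_config=double_quant_config,
    torch_dtype=torch.bfloat16,
)
```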
Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.
```python
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
```
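A sketch, assuming a 4-bit quantized model as above and assuming the quantized model exposes a `dequantize()` method (checkpoint illustrative):

```python
import torch
from diffusers import FluxTransformer2DModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative checkpoint
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

# Convert the weights back to the original precision (may cost a small amount of quality)
transformer_4bit.dequantize()
```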