
Commit 5ef13c7

Merge branch 'main' into refac_training_utils
2 parents 220e27f + 1b64772 commit 5ef13c7

File tree

100 files changed (+4316, -220 lines)

.github/workflows/pr_test_peft_backend.yml

Lines changed: 6 additions & 4 deletions
@@ -92,12 +92,14 @@ jobs:
       run: |
         python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
         python -m uv pip install -e [quality,test]
+        # TODO (sayakpaul, DN6): revisit `--no-deps`
         if [ "${{ matrix.lib-versions }}" == "main" ]; then
-          python -m pip install -U peft@git+https://github.com/huggingface/peft.git
-          python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
-          pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
+          python -m pip install -U peft@git+https://github.com/huggingface/peft.git --no-deps
+          python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers.git --no-deps
+          pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
         else
-          python -m uv pip install -U peft transformers accelerate
+          python -m uv pip install -U peft --no-deps
+          python -m uv pip install -U transformers accelerate --no-deps
         fi

     - name: Environment

docs/source/en/_toctree.yml

Lines changed: 8 additions & 0 deletions
@@ -150,6 +150,12 @@
       title: Reinforcement learning training with DDPO
     title: Methods
   title: Training
+- sections:
+  - local: quantization/overview
+    title: Getting Started
+  - local: quantization/bitsandbytes
+    title: bitsandbytes
+  title: Quantization Methods
 - sections:
   - local: optimization/fp16
     title: Speed up inference
@@ -209,6 +215,8 @@
     title: Logging
   - local: api/outputs
     title: Outputs
+  - local: api/quantization
+    title: Quantization
   title: Main Classes
 - isExpanded: false
   sections:

docs/source/en/api/pipelines/controlnet_flux.md

Lines changed: 9 additions & 1 deletion
@@ -1,4 +1,4 @@
-<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.
+<!--Copyright 2024 The HuggingFace Team, The InstantX Team, and the XLabs Team. All rights reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -31,6 +31,14 @@ This controlnet code is implemented by [The InstantX Team](https://huggingface.c
 | Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth) |
 | Union | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Union) |
 
+XLabs ControlNets are also supported; they were contributed by the [XLabs team](https://huggingface.co/XLabs-AI).
+
+| ControlNet type | Developer | Link |
+| -------- | ---------- | ---- |
+| Canny | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-canny-diffusers) |
+| Depth | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-depth-diffusers) |
+| HED | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-hed-diffusers) |
+
 
 <Tip>
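
For reference, a hedged sketch (not part of this commit) of loading one of the XLabs checkpoints listed above with the Flux ControlNet pipeline; the prompt and the control-image path are placeholders:

```py
import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.utils import load_image

# Load the XLabs Canny ControlNet and plug it into the Flux ControlNet pipeline.
controlnet = FluxControlNetModel.from_pretrained(
    "XLabs-AI/flux-controlnet-canny-diffusers", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

control_image = load_image("path/to/canny_edge_map.png")  # placeholder path
image = pipe(
    "a futuristic cityscape at dusk",          # illustrative prompt
    control_image=control_image,
    controlnet_conditioning_scale=0.7,         # illustrative strength
).images[0]
```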

docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

Lines changed: 5 additions & 0 deletions
@@ -54,6 +54,11 @@ image = pipe(
 image.save("sd3_hello_world.png")
 ```
 
+**Note:** Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all of the optimizations and techniques mentioned here apply to it as well. In total there are three official models in the SD3 family:
+- [`stabilityai/stable-diffusion-3-medium-diffusers`](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers)
+- [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large)
+- [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large-turbo)
+
 ## Memory Optimisations for SD3
 
 SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low-resource hardware.
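
To make the note concrete, here is a minimal sketch (not part of this commit) that runs one of the 3.5 checkpoints through the same pipeline class; the prompt and sampler settings are illustrative:

```py
import torch
from diffusers import StableDiffusion3Pipeline

# The SD3 pipeline class also loads the SD 3.5 checkpoints.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a cat holding a sign that says hello world",
    num_inference_steps=28,   # illustrative settings
    guidance_scale=4.5,
).images[0]
image.save("sd3_5_hello_world.png")
```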

docs/source/en/api/quantization.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@ (new file)

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Quantization

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Diffusers supports 8-bit and 4-bit quantization with [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index).

Quantization techniques that aren't supported in Transformers can be added with the [`DiffusersQuantizer`] class.

<Tip>

Learn how to quantize models in the [Quantization](../quantization/overview) guide.

</Tip>

## BitsAndBytesConfig

[[autodoc]] BitsAndBytesConfig

## DiffusersQuantizer

[[autodoc]] quantizers.base.DiffusersQuantizer
docs/source/en/quantization/bitsandbytes.md

Lines changed: 267 additions & 0 deletions

@@ -0,0 +1,267 @@ (new file)

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# bitsandbytes

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
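
As a rough conceptual illustration only (this is not bitsandbytes' actual kernel, just the general absmax idea), 8-bit quantization maps values into the int8 range with a per-tensor scale and scales them back for the matmul:

```py
# Conceptual sketch only -- not bitsandbytes internals.
# Absmax quantization: map fp16 values into [-127, 127] and back.
import torch

w = torch.randn(4, 4, dtype=torch.float16)
scale = 127 / w.abs().max()                    # per-tensor scaling factor
w_int8 = (w * scale).round().to(torch.int8)    # quantize to int8
w_dequant = w_int8.to(torch.float16) / scale   # dequantize for the matmul
print((w - w_dequant).abs().max())             # small rounding error remains
```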

To use bitsandbytes, make sure you have the following libraries installed:

```bash
pip install diffusers transformers accelerate bitsandbytes -U
```

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

<hfoptions id="bnb">
<hfoption id="8-bit">

Quantizing a model in 8-bit halves the memory usage:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
model_8bit.transformer_blocks.layers[-1].norm2.weight.dtype
```

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights.

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
# Push the quantized weights to the Hub (the repository id below is a placeholder).
model_8bit.push_to_hub("your-username/FLUX.1-dev-bnb-8bit")
```

</hfoption>
<hfoption id="4-bit">

Quantizing a model in 4-bit reduces the memory usage by 4x:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
model_4bit.transformer_blocks.layers[-1].norm2.weight.dtype
```

Call [`~ModelMixin.push_to_hub`] after loading it in 4-bit precision. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
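
For example, a short sketch (the repository id and local directory below are placeholders, not values from this guide):

```py
# Push the serialized 4-bit weights to the Hub, or save them locally.
model_4bit.push_to_hub("your-username/FLUX.1-dev-bnb-4bit")  # placeholder repo id
model_4bit.save_pretrained("flux-transformer-4bit")          # placeholder local path
```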

</hfoption>
</hfoptions>

<Tip warning={true}>

Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.

</Tip>
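
For example, you can attach trainable LoRA layers on top of the frozen quantized weights. A minimal sketch (the rank, alpha, and target module names below are illustrative assumptions, not values from this guide):

```py
# Sketch: train extra LoRA parameters on top of a frozen 4-bit base model.
# Rank, alpha, and target modules are illustrative assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
model_4bit.add_adapter(lora_config)  # only the LoRA weights receive gradients
```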

Check your memory footprint with the `get_memory_footprint` method:

```py
print(model_4bit.get_memory_footprint())
```

Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters:

```py
from diffusers import FluxTransformer2DModel

model_4bit = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/flux.1-dev-nf4-pkg", subfolder="transformer"
)
```
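
The quantized transformer can then be plugged into the regular pipeline for inference. A hedged sketch (the prompt and the offloading choice are illustrative):

```py
import torch
from diffusers import FluxPipeline

# Pass the quantized transformer to the pipeline; the other components load as usual.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=model_4bit,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("a tiny astronaut hatching from an egg on the moon").images[0]
image.save("flux_nf4.png")
```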

## 8-bit (LLM.int8() algorithm)

<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!

</Tip>

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

### Skip module conversion

For some models, quantizing every module to 8-bit isn't necessary and can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

## 4-bit (QLoRA algorithm)

<Tip>

Learn more about the details of 4-bit quantization in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.

### Compute data type

To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=nf4_config,
)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
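
For example, a sketch that keeps the compute dtype and `torch_dtype` aligned (bfloat16 here is an illustrative choice):

```py
import torch
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

# Keep the 4-bit compute dtype and the dtype of the non-quantized modules consistent.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_nf4 = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```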

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

double_quant_model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=double_quant_config,
)
```

## Dequantizing `bitsandbytes` models

Once quantized, you can dequantize the model to the original precision, but this might result in a small loss of model quality. Make sure you have enough GPU RAM to fit the dequantized model.

```python
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

double_quant_model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=double_quant_config,
)
double_quant_model.dequantize()
```

0 commit comments
