
Commit daccd75
chore: review suggestions
1 parent 1d96c52


docs/source/en/quantization/bitsandbytes.md

Lines changed: 30 additions & 28 deletions
@@ -40,16 +40,20 @@ Quantizing a model in 8-bit halves the memory-usage:
 bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
 [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
 
+> [!Note]
+> Depending on the GPU, set your `torch_dtype`. Ada and higher-series GPUs support `torch.bfloat16`, and we suggest using it when applicable.
+
+> [!Note]
+> We do not quantize the `CLIPTextModel` and the `AutoencoderKL` because of their small size, and because `AutoencoderKL` has very few `torch.nn.Linear` layers.
+
 ```py
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
 from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
 
 from diffusers import FluxTransformer2DModel
 from transformers import T5EncoderModel
 
-quant_config = TransformersBitsAndBytesConfig(
-    load_in_8bit=True,
-)
+quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True)
 
 text_encoder_2_8bit = T5EncoderModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
@@ -58,9 +62,7 @@ text_encoder_2_8bit = T5EncoderModel.from_pretrained(
     torch_dtype=torch.float16,
 )
 
-quant_config = DiffusersBitsAndBytesConfig(
-    load_in_8bit=True,
-)
+quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
 
 transformer_8bit = FluxTransformer2DModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
@@ -72,12 +74,12 @@ transformer_8bit = FluxTransformer2DModel.from_pretrained(
 
 By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
 
-```py
+```diff
 transformer_8bit = FluxTransformer2DModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     quantization_config=quant_config,
-    torch_dtype=torch.float32,
++   torch_dtype=torch.float32,
 )
 ```
 
@@ -104,18 +106,17 @@ pipe_kwargs = {
     "max_sequence_length": 512,
 }
 
-image = pipe(
-    generator=torch.Generator("cpu").manual_seed(0),
-    **pipe_kwargs,
-).images[0]
-
-image.resize((224, 224))
+image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]
 ```
 
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
 </div>
 
+> [!Note]
+> When memory permits, you can move the pipeline (`pipe` here) directly to the GPU with the `.to("cuda")` API.
+> You can also use `enable_model_cpu_offload()` to optimize GPU VRAM usage.
+
 Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
 
 </hfoption>
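To illustrate the note added above, a minimal sketch of the two placement options, assuming the `pipe` object from the surrounding example; use one or the other, not both:

```py
# Option 1: when GPU memory permits, move the whole pipeline to the GPU.
pipe.to("cuda")

# Option 2: trade some speed for VRAM by offloading idle components to the CPU.
pipe.enable_model_cpu_offload()
```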
@@ -126,16 +127,20 @@ Quantizing a model in 4-bit reduces your memory-usage by 4x:
 bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
 [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
 
+> [!Note]
+> Depending on the GPU, set your `torch_dtype`. Ada and higher-series GPUs support `torch.bfloat16`, and we suggest using it when applicable.
+
+> [!Note]
+> We do not quantize the `CLIPTextModel` and the `AutoencoderKL` because of their small size, and because `AutoencoderKL` has very few `torch.nn.Linear` layers.
+
 ```py
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
 from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
 
 from diffusers import FluxTransformer2DModel
 from transformers import T5EncoderModel
 
-quant_config = TransformersBitsAndBytesConfig(
-    load_in_4bit=True,
-)
+quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True)
 
 text_encoder_2_4bit = T5EncoderModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
@@ -144,9 +149,7 @@ text_encoder_2_4bit = T5EncoderModel.from_pretrained(
     torch_dtype=torch.float16,
 )
 
-quant_config = DiffusersBitsAndBytesConfig(
-    load_in_4bit=True,
-)
+quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
 
 transformer_4bit = FluxTransformer2DModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
@@ -158,12 +161,12 @@ transformer_4bit = FluxTransformer2DModel.from_pretrained(
 
 By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
 
-```py
+```diff
 transformer_4bit = FluxTransformer2DModel.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     quantization_config=quant_config,
-    torch_dtype=torch.float32,
++   torch_dtype=torch.float32,
 )
 ```
 
@@ -189,18 +192,17 @@ pipe_kwargs = {
     "max_sequence_length": 512,
 }
 
-image = pipe(
-    generator=torch.Generator("cpu").manual_seed(0),
-    **pipe_kwargs,
-).images[0]
-
-image.resize((224, 224))
+image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]
 ```
 
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/4bit.png"/>
 </div>
 
+> [!Note]
+> When memory permits, you can move the pipeline (`pipe` here) directly to the GPU with the `.to("cuda")` API.
+> You can also use `enable_model_cpu_offload()` to optimize GPU VRAM usage.
+
 Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
 
 </hfoption>
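As a brief illustration of the serialization step described above, assuming the `transformer_4bit` object from the surrounding example; the repository id and local path are placeholders:

```py
# Push the quantized transformer to the Hub; the quantization config.json is pushed first, then the weights.
transformer_4bit.push_to_hub("your-username/flux-transformer-4bit")

# Or save the serialized 4-bit weights locally.
transformer_4bit.save_pretrained("./flux-transformer-4bit")
```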
