docs/source/en/quantization/bitsandbytes.md (21 additions & 52 deletions)
# bitsandbytes
[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
This guide demonstrates how quantization can enable running large diffusion models with far less memory.
Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
<hfoptions id="bnb">
<hfoption id="8-bit">
Quantizing a model in 8-bit halves the memory usage.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
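The example below is a minimal sketch of 8-bit loading; the FLUX.1-dev model ID and `subfolder` values are used for illustration.

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# Quantize the Diffusers transformer to 8-bit.
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model ID
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)

# Quantize the Transformers text encoder with the analogous config.
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)
```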
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
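For instance, a sketch that keeps the non-quantized modules in fp32 instead (model ID illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model ID
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    # keep non-quantized modules such as torch.nn.LayerNorm in fp32 instead of the fp16 default
    torch_dtype=torch.float32,
)
```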
Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
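A sketch, assuming the `transformer_8bit` model from above; the repo ID is a placeholder.

```py
# Push the quantized weights and quantization config to the Hub
# ("your-username/flux.1-dev-transformer-8bit" is a placeholder repo ID).
transformer_8bit.push_to_hub("your-username/flux.1-dev-transformer-8bit")

# Or serialize the 8-bit model locally.
transformer_8bit.save_pretrained("./flux.1-dev-transformer-8bit")
```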
</hfoption>
<hfoption id="4-bit">
Quantizing a model in 4-bit reduces your memory usage by 4x.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
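A minimal sketch of 4-bit loading, again with illustrative model IDs:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# 4-bit quantization of the Diffusers transformer.
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)

# 4-bit quantization of the Transformers text encoder.
text_encoder_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)
```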
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
Let's generate an image using our quantized models.

Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
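Below is a sketch of assembling a pipeline from the 4-bit components above; the prompt is illustrative, and the `device_map="balanced"` pipeline strategy is an assumption you may need to adapt to your setup.

```py
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,      # quantized components from the previous example
    text_encoder_2=text_encoder_4bit,
    torch_dtype=torch.float16,
    device_map="balanced",             # distribute the remaining components across available devices
)

image = pipe("a photo of a cat holding a sign that says hello", num_inference_steps=28).images[0]
image.save("quantized_flux.png")
```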
Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
</hfoption>
</hfoptions>
Check your memory footprint with the `get_memory_footprint` method:

```py
print(model.get_memory_footprint())
```
Quantized models can be loaded with the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters.
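For example, a sketch that reloads a previously saved 8-bit checkpoint (the repo ID is a placeholder):

```py
from diffusers import FluxTransformer2DModel

# The quantization configuration stored in the checkpoint's config.json is picked up automatically.
model_8bit = FluxTransformer2DModel.from_pretrained(
    "your-username/flux.1-dev-transformer-8bit"  # placeholder repo ID for a saved 8-bit checkpoint
)
```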
<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).

</Tip>

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.
### Outlier threshold
An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`].
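A sketch with an illustrative threshold value and model ID:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=10,  # illustrative value; the default is 6
)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```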
For some models, you don't need to quantize every module to 8-bit, which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`].
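A sketch, assuming the Stable Diffusion 3 medium checkpoint as the model ID:

```py
import torch
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["proj_out"],  # keep this module in fp16 instead of int8
)
transformer_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # illustrative model ID
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```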
<Tip>

Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
### Compute data type
To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`].
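For example, a sketch with illustrative settings:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 instead of the fp32 default
)
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model ID
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```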
NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`].
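A sketch that applies NF4 to both the transformer and the text encoder (model ID illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# NF4 quantization for the Diffusers transformer.
transformer_nf4 = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    torch_dtype=torch.float16,
)

# The same NF4 setting for the Transformers text encoder.
text_encoder_nf4 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    torch_dtype=torch.float16,
)
```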
For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
### Nested quantization
Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.
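A sketch with double quantization enabled (model ID illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # second quantization step, saving roughly 0.4 bits/parameter
)
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```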
Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.
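A sketch of dequantizing a 4-bit model; the `dequantize()` call is assumed here to be available on bnb-quantized Diffusers models, and the model ID is illustrative.

```python
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)

# Dequantize back to the original precision (requires enough GPU RAM for the full fp16 weights).
transformer_4bit.dequantize()
```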