
Commit 1d96c52

review suggestions

1 parent 0bf4c49 commit 1d96c52

File tree

1 file changed: +21 -52 lines changed

docs/source/en/quantization/bitsandbytes.md

Lines changed: 21 additions & 52 deletions
@@ -13,13 +13,9 @@ specific language governing permissions and limitations under the License.
 
 # bitsandbytes
 
-[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing
-a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8,
-converts the non-outlier values back to fp16, and then adds them together to return the weights in
-fp16. This reduces the degradative effect outlier values have on a model's performance.
+[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.
 
-4-bit quantization compresses a model even further, and it is commonly used with
-[QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
+4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
 
 This guide demonstrates how quantization can enable running
 [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)
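The prose above describes the mixed-precision decomposition conceptually. As a toy sketch of that idea using only plain PyTorch (an illustration, not the bitsandbytes kernel; everything is kept in float32 so it runs anywhere):

```py
import torch

# Toy illustration of the decomposition described above -- NOT the bitsandbytes kernel.
# Feature columns containing an outlier take the high-precision path (fp16 in practice,
# float32 here so the toy runs on CPU); the rest are quantized to int8, multiplied,
# and dequantized before the two partial results are summed.
torch.manual_seed(0)
X = torch.randn(4, 8)    # activations
W = torch.randn(8, 16)   # weights
X[0, 3] = 20.0           # inject an outlier
threshold = 6.0

outlier_cols = (X.abs() > threshold).any(dim=0)

def absmax_int8(t):
    # per-tensor absmax quantization (real kernels use finer-grained scaling)
    scale = t.abs().amax() / 127.0
    return torch.round(t / scale).to(torch.int8), scale

Xq, sx = absmax_int8(X[:, ~outlier_cols])
Wq, sw = absmax_int8(W[~outlier_cols, :])

outlier_part = X[:, outlier_cols] @ W[outlier_cols, :]   # high-precision path
int8_part = (Xq.float() @ Wq.float()) * sx * sw          # int8 path, dequantized
Y = outlier_part + int8_part                             # combined output
```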
@@ -34,14 +30,12 @@ To use bitsandbytes, make sure you have the following libraries installed:
 pip install diffusers transformers accelerate bitsandbytes -U
 ```
 
-Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`].
-This works for any model in any modality, as long as it supports loading with
-[Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
+Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
 
 <hfoptions id="bnb">
 <hfoption id="8-bit">
 
-Quantizing a model in 8-bit halves the memory-usage.
+Quantizing a model in 8-bit halves the memory-usage:
 
 bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
 [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
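The 8-bit example this hunk belongs to is truncated. A minimal sketch of quantizing both models, assuming the `black-forest-labs/FLUX.1-dev` checkpoint and illustrative variable names (not a verbatim copy of the elided doc code):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# Quantize the T5 text encoder (a Transformers model) to 8-bit
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)

# Quantize the Flux transformer (a Diffusers model) to 8-bit
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)
```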
@@ -76,8 +70,7 @@ transformer_8bit = FluxTransformer2DModel.from_pretrained(
 )
 ```
 
-By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`.
-You can change the data type of these modules with the `torch_dtype` parameter.
+By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
 
 ```py
 transformer_8bit = FluxTransformer2DModel.from_pretrained(
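The code block opened at the end of this hunk is cut off. A hedged completion showing the `torch_dtype` override (the choice of `torch.float32` is illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Keep the non-quantized modules (e.g. torch.nn.LayerNorm) in float32 instead of float16
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32,
)
```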
@@ -123,14 +116,12 @@ image.resize((224, 224))
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
 </div>
 
-Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method.
-The quantization `config.json` file is pushed first, followed by the quantized model weights.
-You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
+Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
 
 </hfoption>
 <hfoption id="4-bit">
 
-Quantizing a model in 4-bit reduces your memory-usage by 4x.
+Quantizing a model in 4-bit reduces your memory-usage by 4x:
 
 bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the
 [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
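A short sketch of the save/push workflow described above, reusing the hypothetical `transformer_8bit` from the earlier sketch (the Hub repo id is a placeholder):

```py
# Push the quantized transformer to the Hub: the quantization config.json is pushed first,
# then the quantized weights. "<your-username>/flux.1-dev-transformer-8bit" is a placeholder.
transformer_8bit.push_to_hub("<your-username>/flux.1-dev-transformer-8bit")

# Or serialize the 8-bit model locally
transformer_8bit.save_pretrained("./flux.1-dev-transformer-8bit")
```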
@@ -165,8 +156,7 @@ transformer_4bit = FluxTransformer2DModel.from_pretrained(
 )
 ```
 
-By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`.
-You can change the data type of these modules with the `torch_dtype` parameter.
+By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
 
 ```py
 transformer_4bit = FluxTransformer2DModel.from_pretrained(
@@ -179,8 +169,7 @@ transformer_4bit = FluxTransformer2DModel.from_pretrained(
 
 Let's generate an image using our quantized models.
 
-Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the
-CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
+Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
 
 ```py
 pipe = FluxPipeline.from_pretrained(
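The pipeline call is cut off at the end of this hunk. A sketch of wiring the quantized components into the pipeline, assuming a 4-bit `transformer_4bit` and a similarly quantized T5 encoder named `text_encoder_4bit` created the same way as the earlier 8-bit sketch; this sketch uses `device_map="balanced"` for the pipeline, and the elided doc code may differ:

```py
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,        # quantized transformer (assumed from the 4-bit example)
    text_encoder_2=text_encoder_4bit,    # quantized T5 encoder (assumed from the 4-bit example)
    torch_dtype=torch.float16,
    device_map="balanced",               # place the remaining components across available devices
)

image = pipe(
    "a photo of an astronaut riding a horse on the moon",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
```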
@@ -212,9 +201,7 @@ image.resize((224, 224))
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/4bit.png"/>
 </div>
 
-Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method.
-The quantization `config.json` file is pushed first, followed by the quantized model weights.
-You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
+Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
 
 </hfoption>
 </hfoptions>
@@ -231,8 +218,7 @@ Check your memory footprint with the `get_memory_footprint` method:
 print(model.get_memory_footprint())
 ```
 
-Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to
-specify the `quantization_config` parameters:
+Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters:
 
 ```py
 from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
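The loading example is cut off after the import. A sketch of reloading a serialized 8-bit model without re-specifying `quantization_config`, assuming the local directory from the earlier save sketch:

```py
from diffusers import FluxTransformer2DModel

# The quantization settings are read from the saved config.json,
# so no BitsAndBytesConfig needs to be passed again.
model_8bit = FluxTransformer2DModel.from_pretrained("./flux.1-dev-transformer-8bit")
print(model_8bit.get_memory_footprint())
```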
@@ -252,19 +238,13 @@ Learn more about the details of 8-bit quantization in this [blog post](https://h
 
 </Tip>
 
-This section explores some of the specific features of 8-bit models, such as outlier thresholds and
-skipping module conversion.
+This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.
 
 ### Outlier threshold
 
-An "outlier" is a hidden state value greater than a certain threshold, and these values are computed
-in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be
-very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5,
-but beyond that, there is a significant performance penalty. A good default threshold value is 6,
-but a lower threshold may be needed for more unstable models (small models or finetuning).
+An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
 
-To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold`
-parameter in [`BitsAndBytesConfig`]:
+To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:
 
 ```py
 from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
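A hedged completion of the truncated threshold example (the threshold value of 10 is illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=10,  # experiment around the default of 6
)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
)
```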
@@ -282,9 +262,7 @@ model_8bit = FluxTransformer2DModel.from_pretrained(
 
 ### Skip module conversion
 
-For some models, you don't need to quantize every module to 8-bit which can actually cause instability.
-For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3),
-the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:
+For some models, you don't need to quantize every module to 8-bit which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:
 
 ```py
 from diffusers import SD3Transformer2DModel, BitsAndBytesConfig
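A hedged completion of the truncated skip-modules example, assuming the `stabilityai/stable-diffusion-3-medium-diffusers` checkpoint:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["proj_out"],  # keep this module in its original precision
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```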
@@ -309,14 +287,12 @@ Learn more about its details in this [blog post](https://huggingface.co/blog/4bi
 
 </Tip>
 
-This section explores some of the specific features of 4-bit models, such as changing the compute
-data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
+This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
 
 
 ### Compute data type
 
-To speed up computation, you can change the data type from float32 (the default value) to bf16 using
-the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:
+To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:
 
 ```py
 import torch
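A hedged completion of the truncated compute-dtype example:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 instead of the default float32
)
```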
@@ -327,9 +303,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dty
 
 ### Normal Float 4 (NF4)
 
-NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for
-weights initialized from a normal distribution. You should use NF4 for training 4-bit base models.
-This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
+NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
 
 ```py
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
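A hedged completion of the truncated NF4 example (the default `bnb_4bit_quant_type` is `"fp4"`):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

nf4_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # use the NF4 data type instead of the default "fp4"
)

transformer_nf4 = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.float16,
)
```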
@@ -363,15 +337,11 @@ transformer_4bit = FluxTransformer2DModel.from_pretrained(
 )
 ```
 
-For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to
-remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and
-`torch_dtype` values.
+For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values.
 
 ### Nested quantization
 
-Nested quantization is a technique that can save additional memory at no additional performance cost.
-This feature performs a second quantization of the already quantized weights to save an additional
-0.4 bits/parameter.
+Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.
 
 ```py
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
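A hedged completion of the truncated nested-quantization example, using the `bnb_4bit_use_double_quant` flag:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

double_quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants too (~0.4 bits/param saved)
)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=double_quant_config,
    torch_dtype=torch.float16,
)
```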
@@ -407,8 +377,7 @@ transformer_4bit = FluxTransformer2DModel.from_pretrained(
 
 ## Dequantizing `bitsandbytes` models
 
-Once quantized, you can dequantize a model to its original precision, but this might result in a
-small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.
+Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.
 
 ```python
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
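The dequantization example is cut off after the import. Assuming the quantized `transformer_4bit` from the sketches above, it typically reduces to a single call; the `dequantize` method mirrors the Transformers bitsandbytes API, so treat this as an assumption if your diffusers version differs:

```py
# Dequantize back to the original precision; requires enough GPU memory for the full-precision model.
transformer_4bit.dequantize()
```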
