<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# bitsandbytes

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8-bit and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the result in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
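
The mixed-precision decomposition can be sketched in plain PyTorch. This is only an illustration of the idea described above, not the actual bitsandbytes kernels, and the function and variable names are made up for the example:

```py
import torch

def mixed_precision_matmul(x, w, threshold=6.0):
    # Feature columns of the hidden states with any value above the threshold are outliers.
    outliers = (x.abs() > threshold).any(dim=0)

    # Outlier features are multiplied in higher precision (fp16 in bitsandbytes).
    out_hi = x[:, outliers] @ w[outliers, :]

    # Remaining features are quantized to int8 with absmax scaling
    # (row-wise for the activations, column-wise for the weights).
    x_lo, w_lo = x[:, ~outliers], w[~outliers, :]
    sx = x_lo.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    sw = w_lo.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127
    x_q = (x_lo / sx).round().to(torch.int8)
    w_q = (w_lo / sw).round().to(torch.int8)

    # int8 matmul (accumulated in int32), dequantized and added to the outlier part.
    out_lo = (x_q.to(torch.int32) @ w_q.to(torch.int32)).float() * sx * sw
    return out_hi + out_lo

x = torch.randn(4, 64)
x[:, 0] = 40.0  # inject an outlier feature
w = torch.randn(64, 16)
print((mixed_precision_matmul(x, w) - x @ w).abs().max())  # small quantization error
```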

To use bitsandbytes, make sure you have the following libraries installed:

```bash
pip install diffusers transformers accelerate bitsandbytes -U
```

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

<hfoptions id="bnb">
<hfoption id="8-bit">

Quantizing a model in 8-bit roughly halves the memory usage compared to loading it in fp16:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```

By default, all other modules, such as `torch.nn.LayerNorm`, are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
# non-quantized modules, like this norm layer, keep their weights in torch.float32
model_8bit.transformer_blocks[-1].attn.norm_q.weight.dtype
```

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights.

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```
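
For example, with a placeholder repository name:

```py
# "<your-username>/FLUX.1-dev-bnb-8bit" is a placeholder -- replace it with your own repository id
model_8bit.push_to_hub("<your-username>/FLUX.1-dev-bnb-8bit")
```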

</hfoption>
<hfoption id="4-bit">

Quantizing a model in 4-bit reduces memory usage by roughly 4x compared to loading it in fp16:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```

By default, all other modules, such as `torch.nn.LayerNorm`, are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
# non-quantized modules, like this norm layer, keep their weights in torch.float32
model_4bit.transformer_blocks[-1].attn.norm_q.weight.dtype
```

Call [`~ModelMixin.push_to_hub`] after loading the model in 4-bit precision to share it on the Hub. You can also save the serialized 4-bit weights locally with [`~ModelMixin.save_pretrained`].
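
For example (the repository id and local path below are placeholders):

```py
model_4bit.push_to_hub("<your-username>/FLUX.1-dev-bnb-4bit")  # placeholder repository id
model_4bit.save_pretrained("./flux-transformer-4bit")  # placeholder local path
```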

</hfoption>
</hfoptions>

<Tip warning={true}>

Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.

</Tip>

Check the memory footprint of a quantized model, such as `model_4bit` from above, with the `get_memory_footprint` method:

```py
print(model_4bit.get_memory_footprint())
```

Models that have already been quantized and pushed to the Hub can be loaded with [`~ModelMixin.from_pretrained`] without specifying a `quantization_config`:

```py
from diffusers import FluxTransformer2DModel

model_4bit = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/flux.1-dev-nf4-pkg", subfolder="transformer"
)
```
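
The quantized transformer can then be passed to a pipeline like any other model. A minimal sketch, assuming the FLUX.1-dev pipeline and enough memory for the remaining components:

```py
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=model_4bit,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()

image = pipeline("a photo of an astronaut riding a horse on the moon").images[0]
```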

## 8-bit (LLM.int8() algorithm)

<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!

</Tip>

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

### Skip module conversion

Not every module needs to be quantized to 8-bit, and for some models quantizing certain modules can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

## 4-bit (QLoRA algorithm)

<Tip>

Learn more about the details of 4-bit quantization in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.

### Compute data type

To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```
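
The config is then passed to [`~ModelMixin.from_pretrained`] in the same way as the earlier examples, for instance with the SD3 transformer used elsewhere on this page:

```py
from diffusers import SD3Transformer2DModel

model_4bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```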

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=nf4_config,
)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
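
For example, a configuration that keeps these values aligned might look like the following sketch, which combines the options shown above:

```py
import torch
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_nf4 = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```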

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. It performs a second quantization of the first round's quantization constants, saving an additional ~0.4 bits per parameter.

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

double_quant_model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=double_quant_config,
)
```

## Dequantizing `bitsandbytes` models

Once quantized, you can dequantize a model back to its original precision, but this might result in a small quality loss. Make sure you have enough GPU RAM to fit the dequantized model.

```python
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

double_quant_model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=double_quant_config,
)
double_quant_model.dequantize()
```