docs/source/en/quantization/bitsandbytes.md (21 additions & 52 deletions)
# bitsandbytes
[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
This guide demonstrates how quantization can enable running large diffusion models with far less memory.
Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
<hfoptions id="bnb">
<hfoption id="8-bit">
Quantizing a model in 8-bit halves the memory usage.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
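The example below is a minimal sketch of 8-bit loading; the FLUX.1-dev model ID and `subfolder` values are used for illustration.

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# Quantize the Diffusers transformer to 8-bit.
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model ID
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)

# Quantize the Transformers text encoder with the analogous config.
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)
```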
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
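For instance, a sketch that keeps the non-quantized modules in fp32 instead (model ID illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model ID
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    # keep non-quantized modules such as torch.nn.LayerNorm in fp32 instead of the fp16 default
    torch_dtype=torch.float32,
)
```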
Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
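A sketch, assuming the `transformer_8bit` model from above; the repo ID is a placeholder.

```py
# Push the quantized weights and quantization config to the Hub
# ("your-username/flux.1-dev-transformer-8bit" is a placeholder repo ID).
transformer_8bit.push_to_hub("your-username/flux.1-dev-transformer-8bit")

# Or serialize the 8-bit model locally.
transformer_8bit.save_pretrained("./flux.1-dev-transformer-8bit")
```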
</hfoption>
<hfoption id="4-bit">
Quantizing a model in 4-bit reduces your memory usage by 4x.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].
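A minimal sketch of 4-bit loading, again with illustrative model IDs:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# 4-bit quantization of the Diffusers transformer.
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)

# 4-bit quantization of the Transformers text encoder.
text_encoder_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)
```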
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.
Let's generate an image using our quantized models.

Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
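Below is a sketch of assembling a pipeline from the 4-bit components above; the prompt is illustrative, and the `device_map="balanced"` pipeline strategy is an assumption you may need to adapt to your setup.

```py
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,      # quantized components from the previous example
    text_encoder_2=text_encoder_4bit,
    torch_dtype=torch.float16,
    device_map="balanced",             # distribute the remaining components across available devices
)

image = pipe("a photo of a cat holding a sign that says hello", num_inference_steps=28).images[0]
image.save("quantized_flux.png")
```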
Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
</hfoption>
</hfoptions>
Check your memory footprint with the `get_memory_footprint` method:

```py
print(model.get_memory_footprint())
```
Quantized models can be loaded with the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters.
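For example, a sketch that reloads a previously saved 8-bit checkpoint (the repo ID is a placeholder):

```py
from diffusers import FluxTransformer2DModel

# The quantization configuration stored in the checkpoint's config.json is picked up automatically.
model_8bit = FluxTransformer2DModel.from_pretrained(
    "your-username/flux.1-dev-transformer-8bit"  # placeholder repo ID for a saved 8-bit checkpoint
)
```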
<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).

</Tip>

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.
### Outlier threshold
An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`].
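A sketch with an illustrative threshold value and model ID:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=10,  # illustrative value; the default is 6
)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```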
For some models, you don't need to quantize every module to 8-bit, which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`].
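A sketch, assuming the Stable Diffusion 3 medium checkpoint as the model ID:

```py
import torch
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["proj_out"],  # keep this module in fp16 instead of int8
)
transformer_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # illustrative model ID
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```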
<Tip>

Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
### Compute data type
To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`].
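For example, a sketch with illustrative settings:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 instead of the fp32 default
)
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model ID
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```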
NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`].
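A sketch that applies NF4 to both the transformer and the text encoder (model ID illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import T5EncoderModel, BitsAndBytesConfig as TransformersBitsAndBytesConfig

# NF4 quantization for the Diffusers transformer.
transformer_nf4 = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    torch_dtype=torch.float16,
)

# The same NF4 setting for the Transformers text encoder.
text_encoder_nf4 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    torch_dtype=torch.float16,
)
```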
For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
### Nested quantization
Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.
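A sketch with double quantization enabled (model ID illustrative):

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # second quantization step, saving roughly 0.4 bits/parameter
)
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```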
Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.
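A sketch of dequantizing a 4-bit model; the `dequantize()` call is assumed here to be available on bnb-quantized Diffusers models, and the model ID is illustrative.

```python
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
)

# Dequantize back to the original precision (requires enough GPU RAM for the full fp16 weights).
transformer_4bit.dequantize()
```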