docs/source/en/quantization/bitsandbytes.md (1 addition, 1 deletion)
@@ -32,7 +32,7 @@ By default, all the other modules such as `torch.nn.LayerNorm` are converted to

This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

-> [!TIP]
+> [!NOTE]
> For Ada and higher-series GPUs, change `torch_dtype` to `torch.bfloat16`.
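For context, a minimal sketch of the kind of 4-bit loading this note applies to, with `torch_dtype=torch.bfloat16` for Ada and newer GPUs. The model repository and 4-bit settings here are assumptions for illustration, not part of this diff.

```python
import torch
from diffusers import AutoModel, BitsAndBytesConfig

# Assumed example: quantize a FLUX transformer to 4-bit NF4 and compute in bfloat16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,       # bfloat16 on Ada and newer GPUs, per the note above
)
```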
docs/source/en/quantization/gguf.md (49 additions, 58 deletions)
@@ -13,74 +13,80 @@ specific language governing permissions and limitations under the License.

# GGUF

-The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of blockwise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with model classes. Loading GGUF checkpoints via pipelines is currently not supported.
+GGUF is a binary file format for storing and loading [GGML](https://github.com/ggerganov/ggml) models for inference. It's designed to support various blockwise quantization options, single-file deployment, and fast loading and saving.

-The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.
+Diffusers only supports loading GGUF *model* files as opposed to an entire GGUF pipeline checkpoint.

-Before starting, please install gguf in your environment.
+<details>
+<summary>Supported quantization types</summary>

-```shell
-pip install -U gguf
-```
+- BF16
+- Q4_0
+- Q4_1
+- Q5_0
+- Q5_1
+- Q8_0
+- Q2_K
+- Q3_K
+- Q4_K
+- Q5_K
+- Q6_K

-Since GGUF is a single-file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].
+</details>
-When using GGUF checkpoints, the quantized weights remain in a low memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.
+Make sure gguf is installed.
+
+```bash
+pip install -U gguf
+```

-The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).
+Load GGUF files with [`~loaders.FromSingleFileMixin.from_single_file`] and pass [`GGUFQuantizationConfig`] to configure the `compute_dtype`. Quantized weights remain in a low memory data type and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model.

```python
import torch
+from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig
-from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+image = pipeline(prompt).images[0]
image.save("flux-gguf.png")
```
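The example in this hunk is only partially visible. For reference, a hedged sketch of the full loading flow it describes follows; the GGUF checkpoint URL, the prompt, and the use of `AutoModel.from_single_file` (taken from the updated import above) are assumptions rather than the diff's exact example.

```python
import torch
from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig

# Assumed community GGUF checkpoint for the FLUX.1 [dev] transformer (Q2_K variant).
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = AutoModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
image = pipeline(prompt).images[0]
image.save("flux-gguf.png")
```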
-## Using Optimized CUDA Kernels with GGUF
+## CUDA kernels

-Optimized CUDA kernels can accelerate GGUF quantized model inference by approximately 10%. This functionality requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the kernels library:
+Optimized CUDA kernels can accelerate GGUF model inference by ~10%. It requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the [kernels](https://huggingface.co/docs/kernels/index) library.

-```shell
+```bash
pip install -U kernels
```

-Once installed, set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to use optimized kernels when available. Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable `DIFFUSERS_GGUF_CUDA_KERNELS=false`.
+Set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to enable optimized kernels. CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images.
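As a small illustration of the opt-in flag, set the environment variable before the GGUF model is loaded (for example at the top of a script); this snippet is a sketch, not part of the diff.

```python
import os

# Opt in to the optimized CUDA kernels before loading any GGUF-quantized model.
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true"

# ...load and run the GGUF model as shown above...

# Set the variable to "false" (or leave it unset) to fall back to the default dequantization path.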
GGUF files stored in the [Diffusers format](../using-diffusers/other-formats) require the model's `config` path. If the model config is inside a subfolder, provide the `subfolder` argument as well.

```py
import torch
+from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig
-from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
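# A hedged sketch of the Diffusers-format case described above, continuing the truncated
# example; the GGUF repository, file name, and config path are assumptions, not part of this diff.
transformer = AutoModel.from_single_file(
    "https://huggingface.co/<org>/<diffusers-format-gguf-repo>/blob/main/transformer/diffusion_pytorch_model-Q4_K_S.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config="black-forest-labs/FLUX.1-dev",   # repository that holds the model config
    subfolder="transformer",                 # where the config lives inside that repository
    torch_dtype=torch.bfloat16,
)
```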
docs/source/en/quantization/overview.md (1 addition, 1 deletion)
@@ -31,7 +31,7 @@ Initialize [`~quantizers.PipelineQuantizationConfig`] with these parameters.

- `quant_backend` specifies which quantization backend to use. Supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
- `quant_kwargs` specifies the quantization arguments to use.

-> [!TIP]
+> [!NOTE]
> The `quant_kwargs` arguments differ for each backend. Refer to the [Quantization API](../api/quantization) docs to view the specific arguments for each backend.

- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Quantize the most compute-intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one, such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
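The example referenced in that last bullet is not shown in this hunk. A hedged sketch of what it might look like follows; the backend choice and `quant_kwargs` values are assumptions.

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize the transformer and the T5 text encoder (text_encoder_2); CLIP (text_encoder) stays intact.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)
```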
docs/source/en/quantization/quanto.md (64 additions, 79 deletions)
@@ -13,136 +13,121 @@ specific language governing permissions and limitations under the License.

# Quanto

-[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). It has been designed with versatility and simplicity in mind:
+[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/index). It has been designed with versatility and simplicity in mind:

- All features are available in eager mode (works with non-traceable models)
- Supports quantization aware training
- Quantized models are compatible with `torch.compile`
- Quantized models are device agnostic (e.g. CUDA, XPU, MPS, CPU)

-In order to use the Quanto backend, you will first need to install `optimum-quanto>=0.2.6` and `accelerate`.
+Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, Diffusers currently only supports quantizing the weights in the `nn.Linear` layers of a model.

-```shell
-pip install optimum-quanto accelerate
-```
+Make sure Quanto and [Accelerate](https://huggingface.co/docs/accelerate/index) are installed.
+
+```bash
+pip install -U optimum-quanto accelerate
+```
-Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto.
+Create and pass `weights_dtype` to [`QuantoConfig`] to configure the target data type to quantize a model to. The example below quantizes the model to `float8`. Check [`QuantoConfig`] for a list of supported weight types.

```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import AutoModel, QuantoConfig, FluxPipeline

cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+image = pipeline(prompt).images[0]
+image.save("flux-quanto.png")
```
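Since the hunk above truncates the example, here is a hedged sketch of the `float8` flow it describes; the model repository and offloading call are assumptions, not the diff's exact code.

```python
import torch
from diffusers import AutoModel, QuantoConfig, FluxPipeline

# Quantize the FLUX transformer weights to float8 with Quanto.
quant_config = QuantoConfig(weights_dtype="float8")
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
image = pipeline(prompt).images[0]
image.save("flux-quanto.png")
```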
-## Skipping Quantization on specific modules
-
-It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`. Please ensure that the modules passed in to this argument match the keys of the modules in the `state_dict`.
+[`QuantoConfig`] also works with single files with [`~loaders.FromOriginalModelMixin.from_single_file`].

```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
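# A hedged sketch of the single-file flow described above, continuing the truncated example;
# the checkpoint URL is an assumption, and FluxTransformer2DModel is used purely for illustration.
quant_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```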
Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method.

+## Skipping quantization on specific modules

-The serialization and loading requirements are different for models quantized directly with the Quanto library and models quantized
-with Diffusers using Quanto as the backend. It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained`.
+Use `modules_to_not_convert` to skip quantization on specific modules. The modules passed to this argument must match the module keys in `state_dict`.

```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
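# A hedged sketch of skipping quantization on specific modules, continuing the truncated example;
# the module key below ("proj_out") is an assumption and must match a key in the model's state_dict.
quant_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)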
transformer.save_pretrained("<your quantized model save path>")
-
-# you can reload your quantized model with
-model = FluxTransformer2DModel.from_pretrained("<your quantized model save path>")
```
-## Using `torch.compile` with Quanto
-
-Currently the Quanto backend supports `torch.compile` for the following quantization types:
+## Saving quantized models

-- `int8` weights
+Save a Quanto model with [`~ModelMixin.save_pretrained`]. Models quantized directly with the Quanto library - not as a backend in Diffusers - can't be loaded in Diffusers with [`~ModelMixin.from_pretrained`].

```python
import torch
-from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig
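# A hedged sketch of the save/reload round trip described above, continuing the truncated example;
# the checkpoint repository is an assumption and the save path is a placeholder.
quant_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
transformer.save_pretrained("<your quantized model save path>")

# Reload the Diffusers-quantized model later with from_pretrained.
transformer = FluxTransformer2DModel.from_pretrained("<your quantized model save path>")
```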