docs/source/en/quantization/bitsandbytes.md (1 addition, 1 deletion)
@@ -32,7 +32,7 @@ By default, all the other modules such as `torch.nn.LayerNorm` are converted to

This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

-> [!TIP]
+> [!NOTE]
> For Ada and higher-series GPUs, change `torch_dtype` to `torch.bfloat16`.
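For context, a minimal sketch of the kind of 4-bit loading this note applies to, with `torch_dtype=torch.bfloat16` for Ada and newer GPUs. The model repository and 4-bit settings here are assumptions for illustration, not part of this diff.

```python
import torch
from diffusers import AutoModel, BitsAndBytesConfig

# Assumed example: quantize a FLUX transformer to 4-bit NF4 and compute in bfloat16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,       # bfloat16 on Ada and newer GPUs, per the note above
)
```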
docs/source/en/quantization/gguf.md (49 additions, 58 deletions)
@@ -13,74 +13,80 @@ specific language governing permissions and limitations under the License.

# GGUF

-The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of blockwise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with model classes. Loading GGUF checkpoints via pipelines is currently not supported.
+GGUF is a binary file format for storing and loading [GGML](https://github.com/ggerganov/ggml) models for inference. It's designed to support various blockwise quantization options, single-file deployment, and fast loading and saving.

-The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.
+Diffusers only supports loading GGUF *model* files as opposed to an entire GGUF pipeline checkpoint.

-Before starting, please install gguf in your environment.
+<details>
+<summary>Supported quantization types</summary>

-```shell
-pip install -U gguf
-```
+- BF16
+- Q4_0
+- Q4_1
+- Q5_0
+- Q5_1
+- Q8_0
+- Q2_K
+- Q3_K
+- Q4_K
+- Q5_K
+- Q6_K

-Since GGUF is a single-file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].
+</details>
-When using GGUF checkpoints, the quantized weights remain in a low memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.
+Make sure gguf is installed.
+
+```bash
+pip install -U gguf
+```

-The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).
+Load GGUF files with [`~loaders.FromSingleFileMixin.from_single_file`] and pass [`GGUFQuantizationConfig`] to configure the `compute_dtype`. Quantized weights remain in a low memory data type and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model.

```python
import torch
+from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig
-from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+image = pipeline(prompt).images[0]
image.save("flux-gguf.png")
```
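The example in this hunk is only partially visible. For reference, a hedged sketch of the full loading flow it describes follows; the GGUF checkpoint URL, the prompt, and the use of `AutoModel.from_single_file` (taken from the updated import above) are assumptions rather than the diff's exact example.

```python
import torch
from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig

# Assumed community GGUF checkpoint for the FLUX.1 [dev] transformer (Q2_K variant).
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = AutoModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
image = pipeline(prompt).images[0]
image.save("flux-gguf.png")
```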
-## Using Optimized CUDA Kernels with GGUF
+## CUDA kernels

-Optimized CUDA kernels can accelerate GGUF quantized model inference by approximately 10%. This functionality requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the kernels library:
+Optimized CUDA kernels can accelerate GGUF model inference by ~10%. It requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the [kernels](https://huggingface.co/docs/kernels/index) library.

-```shell
+```bash
pip install -U kernels
```

-Once installed, set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to use optimized kernels when available. Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable `DIFFUSERS_GGUF_CUDA_KERNELS=false`.
+Set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to enable optimized kernels. CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images.
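As a small illustration of the opt-in flag, set the environment variable before the GGUF model is loaded (for example at the top of a script); this snippet is a sketch, not part of the diff.

```python
import os

# Opt in to the optimized CUDA kernels before loading any GGUF-quantized model.
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true"

# ...load and run the GGUF model as shown above...

# Set the variable to "false" (or leave it unset) to fall back to the default dequantization path.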
GGUF files stored in the [Diffusers format](../using-diffusers/other-formats) require the model's `config` path. If the model config is inside a subfolder, provide the `subfolder` argument as well.

```py
import torch
+from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig
-from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
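# A hedged sketch of the Diffusers-format case described above, continuing the truncated
# example; the GGUF repository, file name, and config path are assumptions, not part of this diff.
transformer = AutoModel.from_single_file(
    "https://huggingface.co/<org>/<diffusers-format-gguf-repo>/blob/main/transformer/diffusion_pytorch_model-Q4_K_S.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config="black-forest-labs/FLUX.1-dev",   # repository that holds the model config
    subfolder="transformer",                 # where the config lives inside that repository
    torch_dtype=torch.bfloat16,
)
```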
docs/source/en/quantization/overview.md (1 addition, 1 deletion)
@@ -31,7 +31,7 @@ Initialize [`~quantizers.PipelineQuantizationConfig`] with these parameters.

- `quant_backend` specifies which quantization backend to use. Supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
- `quant_kwargs` specifies the quantization arguments to use.

-> [!TIP]
+> [!NOTE]
> The `quant_kwargs` arguments differ for each backend. Refer to the [Quantization API](../api/quantization) docs to view the specific arguments for each backend.

- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Quantize the most compute-intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one, such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
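The example referenced in that last bullet is not shown in this hunk. A hedged sketch of what it might look like follows; the backend choice and `quant_kwargs` values are assumptions.

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize the transformer and the T5 text encoder (text_encoder_2); CLIP (text_encoder) stays intact.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)
```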
docs/source/en/quantization/quanto.md (64 additions, 79 deletions)
@@ -13,136 +13,121 @@ specific language governing permissions and limitations under the License.

# Quanto

-[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). It has been designed with versatility and simplicity in mind:
+[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/index). It has been designed with versatility and simplicity in mind:

- All features are available in eager mode (works with non-traceable models)
- Supports quantization aware training
- Quantized models are compatible with `torch.compile`
- Quantized models are device agnostic (e.g. CUDA, XPU, MPS, CPU)

-In order to use the Quanto backend, you will first need to install `optimum-quanto>=0.2.6` and `accelerate`.
+Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, Diffusers currently only supports quantizing the weights in the `nn.Linear` layers of a model.

-```shell
-pip install optimum-quanto accelerate
-```
+Make sure Quanto and [Accelerate](https://huggingface.co/docs/accelerate/index) are installed.
+
+```bash
+pip install -U optimum-quanto accelerate
+```
-Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto.
+Create and pass `weights_dtype` to [`QuantoConfig`] to configure the target data type to quantize a model to. The example below quantizes the model to `float8`. Check [`QuantoConfig`] for a list of supported weight types.

```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import AutoModel, QuantoConfig, FluxPipeline

cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+image = pipeline(prompt).images[0]
+image.save("flux-quanto.png")
```
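Since the hunk above truncates the example, here is a hedged sketch of the `float8` flow it describes; the model repository and offloading call are assumptions, not the diff's exact code.

```python
import torch
from diffusers import AutoModel, QuantoConfig, FluxPipeline

# Quantize the FLUX transformer weights to float8 with Quanto.
quant_config = QuantoConfig(weights_dtype="float8")
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
image = pipeline(prompt).images[0]
image.save("flux-quanto.png")
```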
-## Skipping Quantization on specific modules
-
-It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`. Please ensure that the modules passed in to this argument match the keys of the modules in the `state_dict`.
+[`QuantoConfig`] also works with single files with [`~loaders.FromOriginalModelMixin.from_single_file`].

```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
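# A hedged sketch of the single-file flow described above, continuing the truncated example;
# the checkpoint URL is an assumption, and FluxTransformer2DModel is used purely for illustration.
quant_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```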
Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method.

+## Skipping quantization on specific modules

-The serialization and loading requirements are different for models quantized directly with the Quanto library and models quantized
-with Diffusers using Quanto as the backend. It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained`.
+Use `modules_to_not_convert` to skip quantization on specific modules. The modules passed to this argument must match the module keys in `state_dict`.

```python
import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
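# A hedged sketch of skipping quantization on specific modules, continuing the truncated example;
# the module key below ("proj_out") is an assumption and must match a key in the model's state_dict.
quant_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)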
transformer.save_pretrained("<your quantized model save path>")
-
-# you can reload your quantized model with
-model = FluxTransformer2DModel.from_pretrained("<your quantized model save path>")
```
-## Using `torch.compile` with Quanto
-
-Currently the Quanto backend supports `torch.compile` for the following quantization types:
+## Saving quantized models

-- `int8` weights
+Save a Quanto model with [`~ModelMixin.save_pretrained`]. Models quantized directly with the Quanto library - not as a backend in Diffusers - can't be loaded in Diffusers with [`~ModelMixin.from_pretrained`].

```python
import torch
-from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig
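# A hedged sketch of the save/reload round trip described above, continuing the truncated example;
# the checkpoint repository is an assumption and the save path is a placeholder.
quant_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # assumed checkpoint
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
transformer.save_pretrained("<your quantized model save path>")

# Reload the Diffusers-quantized model later with from_pretrained.
transformer = FluxTransformer2DModel.from_pretrained("<your quantized model save path>")
```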