
Commit 3c70be2

update

1 parent c82df2c commit 3c70be2

4 files changed (+115, -139 lines)

docs/source/en/quantization/bitsandbytes.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -32,7 +32,7 @@ By default, all the other modules such as `torch.nn.LayerNorm` are converted to
 
 This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
 
-> [!TIP]
+> [!NOTE]
 > For Ada and higher-series GPUs, change `torch_dtype` to `torch.bfloat16`.
 
 <hfoptions id="bnb">
```
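For context on the note above: Ada GPUs report CUDA compute capability 8.9, so the `torch_dtype` choice can be made conditional on the detected device. A minimal sketch (not part of this commit), assuming the `BitsAndBytesConfig` and `AutoModel` classes exported by Diffusers:

```python
import torch
from diffusers import AutoModel, BitsAndBytesConfig

# Ada (compute capability 8.9) and newer GPUs handle bfloat16 well;
# older GPUs fall back to float16.
dtype = torch.bfloat16 if torch.cuda.get_device_capability() >= (8, 9) else torch.float16

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=dtype),
    torch_dtype=dtype,
)
```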

docs/source/en/quantization/gguf.md

Lines changed: 49 additions & 58 deletions

````diff
@@ -13,74 +13,80 @@ specific language governing permissions and limitations under the License.
 
 # GGUF
 
-The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block wise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported.
+GGUF is a binary file format for storing and loading [GGML](https://github.com/ggerganov/ggml) models for inference. It's designed to support various blockwise quantization options, single-file deployment, and fast loading and saving.
 
-The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.
+Diffusers only supports loading GGUF *model* files as opposed to an entire GGUF pipeline checkpoint.
 
-Before starting please install gguf in your environment
+<details>
+<summary>Supported quantization types</summary>
 
-```shell
-pip install -U gguf
-```
+- BF16
+- Q4_0
+- Q4_1
+- Q5_0
+- Q5_1
+- Q8_0
+- Q2_K
+- Q3_K
+- Q4_K
+- Q5_K
+- Q6_K
 
-Since GGUF is a single file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].
+</details>
 
-When using GGUF checkpoints, the quantized weights remain in a low memory `dtype`(typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.
+Make sure gguf is installed.
+
+```bash
+pip install -U gguf
+```
 
-The functions used for dynamic dequantizatation are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the Pytorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).
+Load GGUF files with [`~loaders.FromSingleFileMixin.from_single_file`] and pass [`GGUFQuantizationConfig`] to configure the `compute_dtype`. Quantized weights remain in a low memory data type and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model.
 
 ```python
 import torch
+from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig
 
-from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
-
-ckpt_path = (
-    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
-)
-transformer = FluxTransformer2DModel.from_single_file(
-    ckpt_path,
+transformer = AutoModel.from_single_file(
+    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf",
     quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
     torch_dtype=torch.bfloat16,
 )
-pipe = FluxPipeline.from_pretrained(
+pipeline = FluxPipeline.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     transformer=transformer,
     torch_dtype=torch.bfloat16,
+    device_map="cuda"
 )
-pipe.enable_model_cpu_offload()
-prompt = "A cat holding a sign that says hello world"
-image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+image = pipeline(prompt).images[0]
 image.save("flux-gguf.png")
 ```
 
-## Using Optimized CUDA Kernels with GGUF
+## CUDA kernels
 
-Optimized CUDA kernels can accelerate GGUF quantized model inference by approximately 10%. This functionality requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the kernels library:
+Optimized CUDA kernels can accelerate GGUF model inference by ~10%. It requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the [kernels](https://huggingface.co/docs/kernels/index) library.
 
-```shell
+```bash
 pip install -U kernels
 ```
 
-Once installed, set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to use optimized kernels when available. Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable `DIFFUSERS_GGUF_CUDA_KERNELS=false`.
+Set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to enable optimized kernels. CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images.
 
-## Supported Quantization Types
+```python
+import os
 
-- BF16
-- Q4_0
-- Q4_1
-- Q5_0
-- Q5_1
-- Q8_0
-- Q2_K
-- Q3_K
-- Q4_K
-- Q5_K
-- Q6_K
+# Enable CUDA kernels for ~10% speedup
+os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true"
+# Disable CUDA kernels
+# os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "false"
+```
 
 ## Convert to GGUF
 
-Use the Space below to convert a Diffusers checkpoint into the GGUF format for inference.
-run conversion:
+Use the Space below to convert a Diffusers checkpoint into a GGUF file.
 
 <iframe
   src="https://diffusers-internal-dev-diffusers-to-gguf.hf.space"
@@ -89,32 +95,17 @@ run conversion:
   height="450"
 ></iframe>
 
+GGUF files stored in the [Diffusers format](../using-diffusers/other-formats) require the model's `config` path. If the model config is inside a subfolder, provide the `subfolder` argument as well.
 
 ```py
 import torch
+from diffusers import FluxPipeline, AutoModel, GGUFQuantizationConfig
 
-from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
-
-ckpt_path = (
-    "https://huggingface.co/sayakpaul/different-lora-from-civitai/blob/main/flux_dev_diffusers-q4_0.gguf"
-)
-transformer = FluxTransformer2DModel.from_single_file(
-    ckpt_path,
+transformer = AutoModel.from_single_file(
+    "https://huggingface.co/sayakpaul/different-lora-from-civitai/blob/main/flux_dev_diffusers-q4_0.gguf",
     quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
     config="black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     torch_dtype=torch.bfloat16,
 )
-pipe = FluxPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    transformer=transformer,
-    torch_dtype=torch.bfloat16,
-)
-pipe.enable_model_cpu_offload()
-prompt = "A cat holding a sign that says hello world"
-image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
-image.save("flux-gguf.png")
-```
-
-When using Diffusers format GGUF checkpoints, it's a must to provide the model `config` path. If the
-model config resides in a `subfolder`, that needs to be specified, too.
+```
````
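One detail worth noting about the CUDA kernels hunk: the capability requirement mentioned in the prose can be checked at runtime before setting the environment variable. A small sketch of that gating logic (not part of the commit):

```python
import os
import torch

# The optimized GGUF kernels need a GPU with compute capability greater than 7,
# so only enable them when such a device is actually present.
has_capable_gpu = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] > 7
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true" if has_capable_gpu else "false"
```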

docs/source/en/quantization/overview.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -31,7 +31,7 @@ Initialize [`~quantizers.PipelineQuantizationConfig`] with these parameters.
 - `quant_backend` specifies which quantization backend to use. Supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
 - `quant_kwargs` specifies the quantization arguments to use.
 
-> [!TIP]
+> [!NOTE]
 > The `quant_kwargs` arguments differ for each backend. Refer to the [Quantization API](../api/quantization) docs to view the specific arguments for each backend.
 
 - `components_to_quantize` specifies which component(s) of the pipeline to quantize. Quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
```
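For reference, the three parameters described around this hunk come together when building the config. A minimal sketch (not part of the commit), assuming `PipelineQuantizationConfig` is importable from `diffusers.quantizers` and that the bitsandbytes backend is installed:

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# quant_backend picks the backend, quant_kwargs configures it, and
# components_to_quantize limits quantization to the listed pipeline components.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)
```

Here `components_to_quantize` covers the transformer and the T5 text encoder (`text_encoder_2` in [`FluxPipeline`]), matching the example the surrounding text describes, while the CLIP encoder is left intact.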

docs/source/en/quantization/quanto.md

Lines changed: 64 additions & 79 deletions

````diff
@@ -13,136 +13,121 @@ specific language governing permissions and limitations under the License.
 
 # Quanto
 
-[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). It has been designed with versatility and simplicity in mind:
+[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/index). It has been designed with versatility and simplicity in mind:
 
 - All features are available in eager mode (works with non-traceable models)
 - Supports quantization aware training
 - Quantized models are compatible with `torch.compile`
 - Quantized models are Device agnostic (e.g CUDA,XPU,MPS,CPU)
 
-In order to use the Quanto backend, you will first need to install `optimum-quanto>=0.2.6` and `accelerate`
+Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model.
 
-```shell
-pip install optimum-quanto accelerate
+Make sure Quanto and [Accelerate](https://hf.co/docs/accelerate/index) are installed.
+
+```bash
+pip install -U optimum-quanto accelerate
 ```
 
-Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto.
+Create and pass `weights_dtype` to [`QuantoConfig`] to configure the target data type to quantize a model to. The example below quantizes the model to `float8`. Check [`QuantoConfig`] for a list of supported weight types.
 
 ```python
 import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import AutoModel, QuantoConfig, FluxPipeline
 
-model_id = "black-forest-labs/FLUX.1-dev"
 quantization_config = QuantoConfig(weights_dtype="float8")
-transformer = FluxTransformer2DModel.from_pretrained(
-    model_id,
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     quantization_config=quantization_config,
     torch_dtype=torch.bfloat16,
 )
 
-pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch_dtype)
-pipe.to("cuda")
-
-prompt = "A cat holding a sign that says hello world"
-image = pipe(
-    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
-).images[0]
-image.save("output.png")
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    transformer=transformer,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+image = pipeline(prompt).images[0]
+image.save("flux-quanto.png")
 ```
 
-## Skipping Quantization on specific modules
-
-It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`. Please ensure that the modules passed in to this argument match the keys of the modules in the `state_dict`
+[`QuantoConfig`] also works with single files with [`~loaders.FromOriginalModelMixin.from_single_file`].
 
 ```python
 import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import AutoModel, QuantoConfig
 
-model_id = "black-forest-labs/FLUX.1-dev"
-quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
-transformer = FluxTransformer2DModel.from_pretrained(
-    model_id,
-    subfolder="transformer",
-    quantization_config=quantization_config,
-    torch_dtype=torch.bfloat16,
+quantization_config = QuantoConfig(weights_dtype="float8")
+transformer = AutoModel.from_single_file(
+    "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16
 )
 ```
 
-## Using `from_single_file` with the Quanto Backend
+## torch.compile
 
-`QuantoConfig` is compatible with `~FromOriginalModelMixin.from_single_file`.
+Quanto supports torch.compile for `int8` weights only.
 
 ```python
 import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import FluxPipeline, AutoModel, QuantoConfig
 
-ckpt_path = "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors"
-quantization_config = QuantoConfig(weights_dtype="float8")
-transformer = FluxTransformer2DModel.from_single_file(ckpt_path, quantization_config=quantization_config, torch_dtype=torch.bfloat16)
+quantization_config = QuantoConfig(weights_dtype="int8")
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
+)
+transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    transformer=transformer,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
 ```
 
-## Saving Quantized models
-
-Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method.
+## Skipping quantization on specific modules
 
-The serialization and loading requirements are different for models quantized directly with the Quanto library and models quantized
-with Diffusers using Quanto as the backend. It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained`
+Use `modules_to_not_convert` to skip quantization on specific modules. The modules passed to this argument must match the module keys in `state_dict`.
 
 ```python
 import torch
-from diffusers import FluxTransformer2DModel, QuantoConfig
+from diffusers import AutoModel, QuantoConfig
 
-model_id = "black-forest-labs/FLUX.1-dev"
-quantization_config = QuantoConfig(weights_dtype="float8")
-transformer = FluxTransformer2DModel.from_pretrained(
-    model_id,
+quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
     subfolder="transformer",
     quantization_config=quantization_config,
     torch_dtype=torch.bfloat16,
 )
-# save quantized model to reuse
-transformer.save_pretrained("<your quantized model save path>")
-
-# you can reload your quantized model with
-model = FluxTransformer2DModel.from_pretrained("<your quantized model save path>")
 ```
 
-## Using `torch.compile` with Quanto
-
-Currently the Quanto backend supports `torch.compile` for the following quantization types:
+## Saving quantized models
 
-- `int8` weights
+Save a Quanto model with [`~ModelMixin.save_pretrained`]. Models quantized directly with the Quanto library - not as a backend in Diffusers - can't be loaded in Diffusers with [`~ModelMixin.from_pretrained`].
 
 ```python
 import torch
-from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig
-
-model_id = "black-forest-labs/FLUX.1-dev"
-quantization_config = QuantoConfig(weights_dtype="int8")
-transformer = FluxTransformer2DModel.from_pretrained(
-    model_id,
-    subfolder="transformer",
-    quantization_config=quantization_config,
-    torch_dtype=torch.bfloat16,
-)
-transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)
+from diffusers import AutoModel, QuantoConfig
 
-pipe = FluxPipeline.from_pretrained(
-    model_id, transformer=transformer, torch_dtype=torch_dtype
+quantization_config = QuantoConfig(weights_dtype="float8")
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
 )
-pipe.to("cuda")
-images = pipe("A cat holding a sign that says hello").images[0]
-images.save("flux-quanto-compile.png")
-```
-
-## Supported Quantization Types
-
-### Weights
-
-- float8
-- int8
-- int4
-- int2
-
+transformer.save_pretrained("path/to/saved/model")
 
+# Reload quantized model
+model = AutoModel.from_pretrained("path/to/saved/model")
+```
````
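The `modules_to_not_convert` section above notes that the names must match the module keys in the model's `state_dict`. A small sketch (not part of the commit) of one way to list the `nn.Linear` module names on the same FLUX transformer before deciding which modules to skip:

```python
import torch
from diffusers import AutoModel

# Load the transformer without quantization and list its Linear submodule names.
# These module names are what `modules_to_not_convert` expects (e.g. "proj_out").
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
linear_names = sorted(
    name for name, module in transformer.named_modules()
    if isinstance(module, torch.nn.Linear)
)
print(linear_names[:10])
```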
