Merged
b5eeaa4
update
DN6 Oct 21, 2024
71897b1
update
DN6 Oct 21, 2024
89ea1ee
update
DN6 Oct 24, 2024
f0bcd94
update
DN6 Oct 24, 2024
60d1385
update
DN6 Oct 29, 2024
22ed0b0
update
DN6 Oct 31, 2024
2e6d340
update
DN6 Nov 3, 2024
b5f927c
update
DN6 Nov 11, 2024
b9666c7
Merge branch 'main' into gguf-support
DN6 Nov 11, 2024
6dc5d22
update
DN6 Nov 13, 2024
428e44b
update
DN6 Nov 15, 2024
d7f09f2
update
DN6 Nov 19, 2024
1649936
update
DN6 Nov 19, 2024
28d3a64
update
DN6 Nov 19, 2024
c34a451
update
DN6 Nov 21, 2024
84493db
update
DN6 Nov 21, 2024
50bd784
update
DN6 Nov 21, 2024
8f604b3
Merge branch 'main' into gguf-support
DN6 Dec 3, 2024
afd5d7d
update
DN6 Dec 4, 2024
e1b964a
Merge branch 'main' into gguf-support
sayakpaul Dec 4, 2024
0ed31bc
update
DN6 Dec 4, 2024
af381ad
update
DN6 Dec 4, 2024
52a1bcb
update
DN6 Dec 4, 2024
66ae46e
Merge branch 'gguf-support' of https://github.com/huggingface/diffuse…
DN6 Dec 4, 2024
67f1700
update
DN6 Dec 4, 2024
8abfa55
update
DN6 Dec 5, 2024
d4b88d7
update
DN6 Dec 5, 2024
30f13ed
update
DN6 Dec 5, 2024
9310035
update
DN6 Dec 5, 2024
e9303a0
update
DN6 Dec 5, 2024
e56c266
update
DN6 Dec 5, 2024
1209c3a
Update src/diffusers/quantizers/gguf/utils.py
DN6 Dec 5, 2024
db9b6f3
update
DN6 Dec 5, 2024
4c0360a
Merge branch 'gguf-support' of https://github.com/huggingface/diffuse…
DN6 Dec 5, 2024
aa7659b
Merge branch 'main' into gguf-support
DN6 Dec 5, 2024
78c7861
update
DN6 Dec 5, 2024
33eb431
update
DN6 Dec 5, 2024
9651ddc
update
DN6 Dec 5, 2024
746fd2f
update
DN6 Dec 5, 2024
e027d46
update
DN6 Dec 5, 2024
9db2396
update
DN6 Dec 6, 2024
7ee89f4
update
DN6 Dec 6, 2024
edf3e54
update
DN6 Dec 6, 2024
d3eb54f
update
DN6 Dec 6, 2024
82606cb
Merge branch 'main' into gguf-support
sayakpaul Dec 9, 2024
4f34f14
Update docs/source/en/quantization/gguf.md
DN6 Dec 11, 2024
090efdb
update
DN6 Dec 11, 2024
391b5a9
Merge branch 'main' into gguf-support
DN6 Dec 17, 2024
e67c25a
update
DN6 Dec 17, 2024
e710bde
update
DN6 Dec 17, 2024
f59e07a
update
DN6 Dec 17, 2024
4 changes: 3 additions & 1 deletion .github/workflows/nightly_tests.yml
@@ -356,6 +356,8 @@ jobs:
config:
- backend: "bitsandbytes"
test_location: "bnb"
- backend: "gguf"
test_location: "gguf"
runs-on:
group: aws-g6e-xlarge-plus
container:
@@ -519,4 +521,4 @@ jobs:
# if: always()
# run: |
# pip install slack_sdk tabulate
# python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
# python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -157,6 +157,8 @@
title: Getting Started
- local: quantization/bitsandbytes
title: bitsandbytes
- local: quantization/gguf
title: gguf
title: Quantization Methods
- sections:
- local: optimization/fp16
4 changes: 4 additions & 0 deletions docs/source/en/api/quantization.md
@@ -28,6 +28,10 @@ Learn how to quantize models in the [Quantization](../quantization/overview) gui

[[autodoc]] BitsAndBytesConfig

## GGUFQuantizationConfig

[[autodoc]] GGUFQuantizationConfig

## DiffusersQuantizer

[[autodoc]] quantizers.base.DiffusersQuantizer
68 changes: 68 additions & 0 deletions docs/source/en/quantization/gguf.md
@@ -0,0 +1,68 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# GGUF

The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block-wise quantization options. Diffusers supports loading pre-quantized checkpoints saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported.

The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.

Before starting, please install gguf in your environment:

```shell
pip install -U gguf
```

Since GGUF is a single-file format, we use `from_single_file` to load the model and pass in a `GGUFQuantizationConfig` at load time.

When using GGUF checkpoints, the quantized weights remain in a low-memory `dtype`, typically `torch.uint8`, and are dynamically dequantized and cast to the configured `compute_dtype` when running a forward pass through each module in the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype` for the forward pass of each module. The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF).

Review comment:

A lot of the pytorch dequantization code is based on the numpy code from llama.cpp written by @compilade - I believe he should be credited here as well :)


```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```

## Supported Quantization Types

- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
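
To check which quantization type a given GGUF checkpoint actually uses before loading it, you can inspect the file with the `gguf` Python package. The snippet below is a minimal sketch, assuming the checkpoint has been downloaded locally (the path `flux1-dev-Q2_K.gguf` is only an example) and that your installed `gguf` version exposes `GGUFReader`:

```python
from collections import Counter

from gguf import GGUFReader  # installed via `pip install -U gguf`

# Path to a locally downloaded GGUF checkpoint (example name, adjust as needed).
reader = GGUFReader("flux1-dev-Q2_K.gguf")

# Count how many tensors are stored in each quantization type.
type_counts = Counter(tensor.tensor_type.name for tensor in reader.tensors)
for tensor_type, count in sorted(type_counts.items()):
    print(f"{tensor_type}: {count} tensors")
```

Mixed results are normal: GGUF checkpoints commonly keep a few sensitive tensors (for example biases or norm weights) in higher precision alongside the main quantized weights.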

8 changes: 6 additions & 2 deletions docs/source/en/quantization/overview.md
@@ -17,7 +17,7 @@ Quantization techniques focus on representing data with less information while a

<Tip>

Interested in adding a new quantization method to Transformers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method.
Interested in adding a new quantization method to Diffusers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method.

</Tip>

@@ -32,4 +32,8 @@ If you are new to the quantization field, we recommend you to check out these be

## When to use what?

This section will be expanded once Diffusers has multiple quantization backends. Currently, we only support `bitsandbytes`. [This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
Diffusers currently supports the following quantization methods.
- `bitsandbytes`
- `gguf`

[This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
4 changes: 2 additions & 2 deletions src/diffusers/__init__.py
@@ -31,7 +31,7 @@
"loaders": ["FromOriginalModelMixin"],
"models": [],
"pipelines": [],
"quantizers.quantization_config": ["BitsAndBytesConfig"],
"quantizers.quantization_config": ["BitsAndBytesConfig", "GGUFQuantizationConfig"],
"schedulers": [],
"utils": [
"OptionalDependencyNotAvailable",
@@ -553,7 +553,7 @@

if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .configuration_utils import ConfigMixin
from .quantizers.quantization_config import BitsAndBytesConfig
from .quantizers.quantization_config import BitsAndBytesConfig, GGUFQuantizationConfig

try:
if not is_onnx_available():
46 changes: 44 additions & 2 deletions src/diffusers/loaders/single_file_model.py
@@ -17,8 +17,10 @@
from contextlib import nullcontext
from typing import Optional

import torch
from huggingface_hub.utils import validate_hf_hub_args

from ..quantizers import DiffusersAutoQuantizer
from ..utils import deprecate, is_accelerate_available, logging
from .single_file_utils import (
SingleFileComponentError,
@@ -202,6 +204,8 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
subfolder = kwargs.pop("subfolder", None)
revision = kwargs.pop("revision", None)
torch_dtype = kwargs.pop("torch_dtype", None)
quantization_config = kwargs.pop("quantization_config", None)
device = kwargs.pop("device", None)

if isinstance(pretrained_model_link_or_path_or_dict, dict):
checkpoint = pretrained_model_link_or_path_or_dict
@@ -215,6 +219,12 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
local_files_only=local_files_only,
revision=revision,
)
if quantization_config is not None:
hf_quantizer = DiffusersAutoQuantizer.from_config(quantization_config)
hf_quantizer.validate_environment()

else:
hf_quantizer = None
Comment on lines +234 to +239
Member:

For GGUF files, I'm thinking if it would be nice to allow the user to load the model without having necessarily to specify quantization_config=GGUFQuantizationConfig(compute_dtype=xxx). If we detect that this is a gguf, we can set by default quantization_config = GGUFQuantizationConfig(compute_dtype=torch.float32).
I'm suggesting this because usually, when you pass a quantization_config, it means either that the model is not quantized (bnb) or that the model is quantized (there is a quantization_config in the config.json) but we want to change a few arguments.

Also, what happens when the user passes a gguf without specifying the quantization_config?
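
To make the suggestion concrete, here is a rough, hypothetical sketch of what such a default could look like (this is not what the PR implements; the `.gguf` extension check and the `float32` default are taken from the comment above as assumptions):

```python
import torch
from diffusers import GGUFQuantizationConfig

def default_quantization_config(checkpoint_path: str, quantization_config=None):
    # Hypothetical helper (not part of this PR): if the checkpoint looks like a GGUF
    # file and no config was passed, fall back to a float32 compute dtype.
    if quantization_config is None and checkpoint_path.endswith(".gguf"):
        quantization_config = GGUFQuantizationConfig(compute_dtype=torch.float32)
    return quantization_config
```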

Member:

Yeah this is a good point! I think for most users, the entrypoint for GGUF files is going to be through from_single_file() and I agree with the logic you mentioned.

Collaborator Author (@DN6, Dec 3, 2024):

I agree that this is a nice convenience. GGUF does have all the information we need to auto fetch the config (honestly it's possible to skip the config altogether), but it would mean that loading semantics would be different for GGUF vs other quant types, e.g.

GGUF

model = FluxTransformer2DModel.from_single_file("<>.gguf")

BnB and TorchAO (assuming these can be supported):

model = FluxTransformer2DModel.from_single_file("<path>", quantization_config=BnBConfig)
model = FluxTransformer2DModel.from_single_file("<path>", quantization_config=TorchAOConfig)

GGUF can also be used through from_pretrained (assuming quants of diffusers format checkpoints show up at some point) and we would have to pass a quant config in that case. I understand it's not ideal, but I feel it's better to preserve consistency across the different quant loading methods.

@SunMarc if the config isn't passed you get shape mismatch errors when you hit load_model_dict_into_meta since the quant shapes are different from the expected shapes.

Collaborator:

> I'm suggesting this because usually, when you pass a quantization_config, it means either that the model is not quantized (bnb) or that the model is quantized (there is a quantization_config in the config.json) but we want to change a few arguments.

yeah I thought about that too, but I think the API for from_single_file and from_pretrained might just have to be different. It is a bit confusing but I'm not sure if there is a way to make it consistent between from_single_file and from_pretrained, if we also want to make sure the same API is consistent across different quant types

GGUF is a special case here because it has built-in config. Normally, for single-file it is just a checkpoint without config, so you will always have to pass a config (at least I think so, is it? @DN6 ). So for loading a regular quantized model (e.g. BNB) we can load it with from_pretrained without passing a config, but for from_single_file, we will have to manually pass a config

so I agree with @DN6 here: I think it's more important to make the same API (from_pretrained or from_single_file) consistent across different quant types, if we have to choose one

but if there is a way to make it consistent between from_pretrained and from_single_file and across all quant types, it will be great!

Collaborator:

also, want to know this: do we plan to support quantizing a model in from_single_file? @DN6

Member:

> GGUF is a special case here because it has built-in config. Normally, for single-file it is just a checkpoint without config, so you will always have to pass a config (at least I think so, is it? @DN6 ). So for loading a regular quantized model (e.g. BNB) we can load it with from_pretrained without passing a config, but for from_single_file, we will have to manually pass a config

Would it make sense to at least make the user aware when the passed config and the determined config mismatch and if that could lead to unintentional consequences?

> also, want to know this: do we plan to support quantizing a model in from_single_file? @DN6

Supporting quantizing in the GGUF format (regardless of from_pretrained() or from_single_file()) would be reallllly nice.

Collaborator Author (@DN6, Dec 4, 2024):

@yiyixuxu Yeah we can definitely support quantizing a model via single file. For GGUF I can look into it in a follow-up because we would have to port the quantize functions to torch (the gguf library uses numpy). We could use the gguf library internally to quantize but it's quite slow since we would have to move tensors off GPU, convert to numpy and then quantize.

I think with torch AO I'm pretty sure it would work just out of the box.

You would have to save it with save_pretrained though since we don't support serializing single file checkpoints.
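
For context, a minimal sketch of the slow path described above (move the tensor off the GPU, convert to numpy, then block-wise quantize with the gguf library). It assumes the installed `gguf` version ships `gguf.quants.quantize` and `GGMLQuantizationType`, so treat it as illustrative rather than a recommended implementation:

```python
import numpy as np
import torch
from gguf import GGMLQuantizationType
from gguf.quants import quantize

def quantize_weight_q8_0(weight: torch.Tensor) -> np.ndarray:
    # gguf's reference quantizers operate on float32 numpy arrays on the CPU,
    # which is why this round trip is slow for large diffusion transformers.
    data = weight.detach().to("cpu", dtype=torch.float32).numpy()
    return quantize(data, GGMLQuantizationType.Q8_0)
```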

Member:

So, what I am hearing is saving a GGUF quantized model would be added in a follow-up PR? That is also okay but it could be quite an enabling factor for the community.

> For GGUF I can look into it in a follow-up because we would have to port the quantize functions to torch (the gguf library uses numpy). We could use the gguf library internally to quantize but it's quite slow since we would have to move tensors off GPU, convert to numpy and then quantize.

I think the porting option is preferable.

> I think with torch AO I'm pretty sure it would work just out of the box.

You mean serializing with torchao but with quantization configs similar to the ones provided in GGUF?


mapping_functions = SINGLE_FILE_LOADABLE_CLASSES[mapping_class_name]

@@ -296,8 +306,36 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
with ctx():
model = cls.from_config(diffusers_model_config)

# Check if `_keep_in_fp32_modules` is not None
use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and (
(torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules")
)
if use_keep_in_fp32_modules:
keep_in_fp32_modules = cls._keep_in_fp32_modules
if not isinstance(keep_in_fp32_modules, list):
keep_in_fp32_modules = [keep_in_fp32_modules]

else:
keep_in_fp32_modules = []

if hf_quantizer is not None:
hf_quantizer.preprocess_model(
model=model,
device_map=None,
state_dict=diffusers_format_checkpoint,
keep_in_fp32_modules=keep_in_fp32_modules,
)

if is_accelerate_available():
unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
param_device = torch.device(device) if device else torch.device("cpu")
unexpected_keys = load_model_dict_into_meta(
model,
diffusers_format_checkpoint,
dtype=torch_dtype,
device=param_device,
hf_quantizer=hf_quantizer,
keep_in_fp32_modules=keep_in_fp32_modules,
)

else:
_, unexpected_keys = model.load_state_dict(diffusers_format_checkpoint, strict=False)
@@ -311,7 +349,11 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
f"Some weights of the model checkpoint were not used when initializing {cls.__name__}: \n {[', '.join(unexpected_keys)]}"
)

if torch_dtype is not None:
if hf_quantizer is not None:
hf_quantizer.postprocess_model(model)
model.hf_quantizer = hf_quantizer

if torch_dtype is not None and hf_quantizer is None:
model.to(torch_dtype)

model.eval()
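For reference, the new `quantization_config` and `device` arguments handled above can be exercised together like this (a sketch using the same Flux GGUF checkpoint as the docs; the CUDA device is an assumption):

```python
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
    device="cuda",  # loads parameters directly onto this device instead of CPU
)
```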
25 changes: 19 additions & 6 deletions src/diffusers/loaders/single_file_utils.py
@@ -81,8 +81,14 @@
"open_clip_sd3": "text_encoders.clip_g.transformer.text_model.embeddings.position_embedding.weight",
"stable_cascade_stage_b": "down_blocks.1.0.channelwise.0.weight",
"stable_cascade_stage_c": "clip_txt_mapper.weight",
"sd3": "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
"sd35_large": "model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
"sd3": [
Collaborator Author:

Need to make this change because SD3/3.5 GGUF single file checkpoints use different keys than the original model from SAI..

Member:

Anything special for Flux?

"joint_blocks.0.context_block.adaLN_modulation.1.bias",
"model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
],
"sd35_large": [
"joint_blocks.37.x_block.mlp.fc1.weight",
"model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
],
"animatediff": "down_blocks.0.motion_modules.0.temporal_transformer.transformer_blocks.0.attention_blocks.0.pos_encoder.pe",
"animatediff_v2": "mid_block.motion_modules.0.temporal_transformer.norm.bias",
"animatediff_sdxl_beta": "up_blocks.2.motion_modules.0.temporal_transformer.norm.weight",
@@ -529,13 +535,20 @@ def infer_diffusers_model_type(checkpoint):
):
model_type = "stable_cascade_stage_b"

elif CHECKPOINT_KEY_NAMES["sd3"] in checkpoint and checkpoint[CHECKPOINT_KEY_NAMES["sd3"]].shape[-1] == 9216:
if checkpoint["model.diffusion_model.pos_embed"].shape[1] == 36864:
elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd3"]) and any(
checkpoint[key].shape[-1] == 9216 if key in checkpoint else False for key in CHECKPOINT_KEY_NAMES["sd3"]
):
if "model.diffusion_model.pos_embed" in checkpoint:
key = "model.diffusion_model.pos_embed"
else:
key = "pos_embed"

if checkpoint[key].shape[1] == 36864:
model_type = "sd3"
elif checkpoint["model.diffusion_model.pos_embed"].shape[1] == 147456:
elif checkpoint[key].shape[1] == 147456:
model_type = "sd35_medium"

elif CHECKPOINT_KEY_NAMES["sd35_large"] in checkpoint:
elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd35_large"]):
model_type = "sd35_large"

elif CHECKPOINT_KEY_NAMES["animatediff"] in checkpoint: