
Commit 2d61c1b

Merge branch 'main' into handle-unload-lora-control
2 parents 90469b3 + 41ba8c0 commit 2d61c1b

77 files changed: +6210 additions, -341 deletions


docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -238,6 +238,8 @@
       title: Textual Inversion
     - local: api/loaders/unet
       title: UNet
+    - local: api/loaders/transformer_sd3
+      title: SD3Transformer2D
     - local: api/loaders/peft
       title: PEFT
     title: Loaders
@@ -400,6 +402,8 @@
       title: DiT
     - local: api/pipelines/flux
       title: Flux
+    - local: api/pipelines/control_flux_inpaint
+      title: FluxControlInpaint
     - local: api/pipelines/hunyuandit
       title: Hunyuan-DiT
     - local: api/pipelines/hunyuan_video

docs/source/en/api/attnprocessor.md

Lines changed: 2 additions & 0 deletions
@@ -86,6 +86,8 @@ An attention processor is a class for applying different types of attention mech
 
 [[autodoc]] models.attention_processor.IPAdapterAttnProcessor2_0
 
+[[autodoc]] models.attention_processor.SD3IPAdapterJointAttnProcessor2_0
+
 ## JointAttnProcessor2_0
 
 [[autodoc]] models.attention_processor.JointAttnProcessor2_0

docs/source/en/api/loaders/ip_adapter.md

Lines changed: 6 additions & 0 deletions
@@ -24,6 +24,12 @@ Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading]
 
 [[autodoc]] loaders.ip_adapter.IPAdapterMixin
 
+## SD3IPAdapterMixin
+
+[[autodoc]] loaders.ip_adapter.SD3IPAdapterMixin
+  - all
+  - is_ip_adapter_active
+
 ## IPAdapterMaskProcessor
 
 [[autodoc]] image_processor.IPAdapterMaskProcessor

docs/source/en/api/loaders/transformer_sd3.md (new file)

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# SD3Transformer2D

This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder, or into both a text encoder and the SD3Transformer2DModel, use the [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.

The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.

<Tip>

To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.

</Tip>

## SD3Transformer2DLoadersMixin

[[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin
  - all
  - _load_ip_adapter_weights
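
For orientation, a minimal sketch of how these weights typically get loaded in practice (not part of this diff): the pipeline-level [`~SD3IPAdapterMixin.load_ip_adapter`] call relies on this mixin to place the IP-Adapter weights into the transformer. The model and checkpoint IDs below are taken from the SD3 documentation added elsewhere in this commit.

```python
import torch
from diffusers import StableDiffusion3Pipeline
from transformers import SiglipImageProcessor, SiglipVisionModel

image_encoder_id = "google/siglip-so400m-patch14-384"

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    feature_extractor=SiglipImageProcessor.from_pretrained(image_encoder_id),
    image_encoder=SiglipVisionModel.from_pretrained(image_encoder_id, torch_dtype=torch.float16),
).to("cuda")

# load_ip_adapter() places the IP-Adapter weights into the SD3 transformer
pipe.load_ip_adapter("InstantX/SD3.5-Large-IP-Adapter")
pipe.set_ip_adapter_scale(0.5)
```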

docs/source/en/api/pipelines/control_flux_inpaint.md (new file)

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
<!--Copyright 2024 The HuggingFace Team, The Black Forest Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# FluxControlInpaint

FluxControlInpaintPipeline is an implementation of inpainting for the FLUX.1 Depth/Canny models. The pipeline takes an image and a mask as input and returns the inpainted image.

FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**.

| Control type | Developer | Link |
| ------------ | --------- | ---- |
| Depth | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) |
| Canny | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) |

<Tip>

Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c).

</Tip>

```python
import torch
from diffusers import FluxControlInpaintPipeline
from diffusers.models.transformers import FluxTransformer2DModel
from transformers import T5EncoderModel
from diffusers.utils import load_image, make_image_grid
from image_gen_aux import DepthPreprocessor  # https://github.com/huggingface/image_gen_aux
from PIL import Image
import numpy as np

pipe = FluxControlInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev",
    torch_dtype=torch.bfloat16,
)
# use the following lines if you have GPU constraints
# ---------------------------------------------------------------
transformer = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="transformer", torch_dtype=torch.bfloat16
)
text_encoder_2 = T5EncoderModel.from_pretrained(
    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="text_encoder_2", torch_dtype=torch.bfloat16
)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2
pipe.enable_model_cpu_offload()
# ---------------------------------------------------------------
pipe.to("cuda")

prompt = "a blue robot singing opera with human-like expressions"
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

head_mask = np.zeros_like(image)
head_mask[65:580, 300:642] = 255
mask_image = Image.fromarray(head_mask)

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(image)[0].convert("RGB")

output = pipe(
    prompt=prompt,
    image=image,
    control_image=control_image,
    mask_image=mask_image,
    num_inference_steps=30,
    strength=0.9,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
make_image_grid([image, control_image, mask_image, output.resize(image.size)], rows=1, cols=4).save("output.png")
```

## FluxControlInpaintPipeline

[[autodoc]] FluxControlInpaintPipeline
  - all
  - __call__

## FluxPipelineOutput

[[autodoc]] pipelines.flux.pipeline_output.FluxPipelineOutput

docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

Lines changed: 68 additions & 1 deletion
@@ -59,9 +59,76 @@ image.save("sd3_hello_world.png")
 - [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large)
 - [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large-turbo)
 
+## Image Prompting with IP-Adapters
+
+An IP-Adapter lets you prompt SD3 with images in addition to the text prompt. This is especially useful for complex concepts that are difficult to articulate through text alone and for which you have reference images. To load and use an IP-Adapter, you need:
+
+- `image_encoder`: a pre-trained vision model used to obtain image features, usually a CLIP image encoder.
+- `feature_extractor`: an image processor that prepares the input image for the chosen `image_encoder`.
+- `ip_adapter_id`: a checkpoint containing the parameters of the image cross-attention layers and image projection.
+
+IP-Adapters are trained for a specific model architecture, so they also work in fine-tuned variations of the base model. You can use the [`~SD3IPAdapterMixin.set_ip_adapter_scale`] function to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers both the text and image prompts equally.
+
+```python
+import torch
+from PIL import Image
+
+from diffusers import StableDiffusion3Pipeline
+from transformers import SiglipVisionModel, SiglipImageProcessor
+
+image_encoder_id = "google/siglip-so400m-patch14-384"
+ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"
+
+feature_extractor = SiglipImageProcessor.from_pretrained(
+    image_encoder_id,
+    torch_dtype=torch.float16
+)
+image_encoder = SiglipVisionModel.from_pretrained(
+    image_encoder_id,
+    torch_dtype=torch.float16
+).to("cuda")
+
+pipe = StableDiffusion3Pipeline.from_pretrained(
+    "stabilityai/stable-diffusion-3.5-large",
+    torch_dtype=torch.float16,
+    feature_extractor=feature_extractor,
+    image_encoder=image_encoder,
+).to("cuda")
+
+pipe.load_ip_adapter(ip_adapter_id)
+pipe.set_ip_adapter_scale(0.6)
+
+ref_img = Image.open("image.jpg").convert('RGB')
+
+image = pipe(
+    width=1024,
+    height=1024,
+    prompt="a cat",
+    negative_prompt="lowres, low quality, worst quality",
+    num_inference_steps=24,
+    guidance_scale=5.0,
+    ip_adapter_image=ref_img
+).images[0]
+
+image.save("result.jpg")
+```
+
+<div class="justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd3_ip_adapter_example.png"/>
+    <figcaption class="mt-2 text-sm text-center text-gray-500">IP-Adapter examples with prompt "a cat"</figcaption>
+</div>
+
+<Tip>
+
+Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.
+
+</Tip>
+
 ## Memory Optimisations for SD3
 
-SD3 uses three text encoders, one if which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
+SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
 
 ### Running Inference with Model Offloading
 

docs/source/en/quantization/gguf.md

Lines changed: 2 additions & 2 deletions
@@ -25,9 +25,9 @@ pip install -U gguf
 
 Since GGUF is a single file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].
 
-When using GGUF checkpoints, the quantized weights remain in a low memory `dtype`(typically `torch.unint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.
+When using GGUF checkpoints, the quantized weights remain in a low memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.
 
-The functions used for dynamic dequantizatation are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the Pytorch ports of the original (`numpy`)[https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py] implementation by [compilade](https://github.com/compilade).
+The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).
 
 ```python
 import torch
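
For context, a minimal sketch of the loading pattern this page describes (not part of this commit's diff); the GGUF checkpoint URL below is an assumption for illustration:

```python
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Assumed example checkpoint; any single-file GGUF checkpoint follows the same pattern.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    # weights stay quantized in memory and are dequantized to the compute_dtype per forward pass
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```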

docs/source/en/quantization/overview.md

Lines changed: 3 additions & 3 deletions
@@ -33,8 +33,8 @@ If you are new to the quantization field, we recommend you to check out these be
 ## When to use what?
 
 Diffusers currently supports the following quantization methods.
-- [BitsandBytes]()
-- [TorchAO]()
-- [GGUF]()
+- [BitsandBytes](./bitsandbytes.md)
+- [TorchAO](./torchao.md)
+- [GGUF](./gguf.md)
 
 [This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.

docs/source/en/tutorials/using_peft_for_inference.md

Lines changed: 15 additions & 6 deletions
@@ -56,7 +56,7 @@ image
 
 With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images and call it `"pixel"`.
 
-The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method:
+The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`~PeftAdapterMixin.set_adapters`] method:
 
 ```python
 pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
@@ -85,7 +85,7 @@ By default, if the most up-to-date versions of PEFT and Transformers are detecte
 
 You can also merge different adapter checkpoints for inference to blend their styles together.
 
-Once again, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged.
+Once again, use the [`~PeftAdapterMixin.set_adapters`] method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged.
 
 ```python
 pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])
@@ -114,7 +114,7 @@ Impressive! As you can see, the model generated an image that mixed the characte
 > [!TIP]
 > Through its PEFT integration, Diffusers also offers more efficient merging methods which you can learn about in the [Merge LoRAs](../using-diffusers/merge_loras) guide!
 
-To return to only using one adapter, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `"toy"` adapter:
+To return to only using one adapter, use the [`~PeftAdapterMixin.set_adapters`] method to activate the `"toy"` adapter:
 
 ```python
 pipe.set_adapters("toy")
@@ -127,7 +127,7 @@ image = pipe(
 image
 ```
 
-Or to disable all adapters entirely, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.disable_lora`] method to return the base model.
+Or to disable all adapters entirely, use the [`~PeftAdapterMixin.disable_lora`] method to return the base model.
 
 ```python
 pipe.disable_lora()
@@ -140,7 +140,8 @@ image
 ![no-lora](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_20_1.png)
 
 ### Customize adapters strength
-For even more customization, you can control how strongly the adapter affects each part of the pipeline. For this, pass a dictionary with the control strengths (called "scales") to [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`].
+
+For even more customization, you can control how strongly the adapter affects each part of the pipeline. For this, pass a dictionary with the control strengths (called "scales") to [`~PeftAdapterMixin.set_adapters`].
 
 For example, here's how you can turn on the adapter for the `down` parts, but turn it off for the `mid` and `up` parts:
 ```python
@@ -195,7 +196,7 @@ image
 
 ![block-lora-mixed](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_block_mixed.png)
 
-## Manage active adapters
+## Manage adapters
 
 You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_active_adapters`] method to check the list of active adapters:
 
@@ -212,3 +213,11 @@ list_adapters_component_wise = pipe.get_list_adapters()
 list_adapters_component_wise
 {"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]}
 ```
+
+The [`~PeftAdapterMixin.delete_adapters`] function completely removes an adapter and its LoRA layers from a model.
+
+```py
+pipe.delete_adapters("toy")
+pipe.get_active_adapters()
+["pixel"]
+```
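
For context, the "scales" dictionary mentioned in the "Customize adapters strength" hunk above is nested by pipeline component and UNet part; a minimal sketch with assumed values (not part of this diff), matching the down/mid/up example in the surrounding text:

```python
# Assumed illustrative values: keep the "pixel" adapter active in the UNet down-blocks only.
adapter_weight_scales = {"unet": {"down": 1.0, "mid": 0.0, "up": 0.0}}
pipe.set_adapters("pixel", adapter_weight_scales)
```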

examples/community/pipeline_hunyuandit_differential_img2img.py

Lines changed: 2 additions & 0 deletions
@@ -1008,6 +1008,8 @@ def __call__(
             self.transformer.inner_dim // self.transformer.num_heads,
             grid_crops_coords,
             (grid_height, grid_width),
+            device=device,
+            output_type="pt",
         )
 
         style = torch.tensor([0], device=device)
