
Commit cac3f58

Merge branch 'main' into missing-no-split-modules
2 parents 050885b + 3191248 commit cac3f58

File tree

23 files changed: +1116 −43 lines

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -238,6 +238,8 @@
       title: Textual Inversion
     - local: api/loaders/unet
       title: UNet
+    - local: api/loaders/transformer_sd3
+      title: SD3Transformer2D
     - local: api/loaders/peft
       title: PEFT
   title: Loaders
```

docs/source/en/api/attnprocessor.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -86,6 +86,8 @@ An attention processor is a class for applying different types of attention mech
 
 [[autodoc]] models.attention_processor.IPAdapterAttnProcessor2_0
 
+[[autodoc]] models.attention_processor.SD3IPAdapterJointAttnProcessor2_0
+
 ## JointAttnProcessor2_0
 
 [[autodoc]] models.attention_processor.JointAttnProcessor2_0
```

docs/source/en/api/loaders/ip_adapter.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -24,6 +24,12 @@ Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading]
 
 [[autodoc]] loaders.ip_adapter.IPAdapterMixin
 
+## SD3IPAdapterMixin
+
+[[autodoc]] loaders.ip_adapter.SD3IPAdapterMixin
+  - all
+  - is_ip_adapter_active
+
 ## IPAdapterMaskProcessor
 
 [[autodoc]] image_processor.IPAdapterMaskProcessor
```
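The `SD3IPAdapterMixin` documented in this diff exposes an `is_ip_adapter_active` property alongside its loading and scaling methods. As a rough illustration of how such a property can gate the adapter path, here is a minimal self-contained sketch; the class and attribute names are hypothetical and this is not the diffusers implementation:

```python
# Hypothetical sketch of an IP-Adapter mixin exposing an
# `is_ip_adapter_active`-style property; not the diffusers code.

class IPAdapterMixinSketch:
    def __init__(self):
        self._image_proj_loaded = False
        self._scale = 0.0

    def load_ip_adapter(self):
        # Real code would download and install the image-projection and
        # image cross-attention weights here; we only record the state.
        self._image_proj_loaded = True
        self._scale = 0.5  # a typical default balance of text vs. image

    def set_ip_adapter_scale(self, scale: float):
        self._scale = scale

    @property
    def is_ip_adapter_active(self) -> bool:
        # The adapter only contributes when weights are loaded AND the
        # scale is positive; scale 0 effectively disables it.
        return self._image_proj_loaded and self._scale > 0


pipe = IPAdapterMixinSketch()
assert not pipe.is_ip_adapter_active   # nothing loaded yet
pipe.load_ip_adapter()
pipe.set_ip_adapter_scale(0.6)
assert pipe.is_ip_adapter_active       # loaded and scale > 0
```

This mirrors the documented semantics only in outline; the real property additionally inspects the pipeline's attention processors.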
docs/source/en/api/loaders/transformer_sd3.md

Lines changed: 29 additions & 0 deletions

```diff
@@ -0,0 +1,29 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# SD3Transformer2D
+
+This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and SD3Transformer2DModel, check the [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.
+
+The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.
+
+<Tip>
+
+To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
+
+</Tip>
+
+## SD3Transformer2DLoadersMixin
+
+[[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin
+  - all
+  - _load_ip_adapter_weights
```
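The `SD3Transformer2DLoadersMixin` above adds weight-loading behavior to the transformer model through inheritance. A minimal self-contained sketch of that mixin pattern, with hypothetical names (this is not the diffusers implementation):

```python
# Illustrative sketch of the loaders-mixin pattern: a model class gains a
# `_load_ip_adapter_weights`-style method by inheriting a small mixin,
# the way SD3Transformer2DModel gains it from SD3Transformer2DLoadersMixin.
# All names below are hypothetical.

class TransformerLoadersMixinSketch:
    """Mixin that knows how to install adapter weights on the host model."""

    def _load_ip_adapter_weights(self, state_dict: dict) -> None:
        # Real code would build image-projection and image cross-attention
        # modules from the checkpoint; here we only record the tensors.
        self.ip_adapter_weights = dict(state_dict)


class TransformerModelSketch(TransformerLoadersMixinSketch):
    """Stand-in for a transformer model that mixes in the loader."""

    def __init__(self):
        self.ip_adapter_weights = None


model = TransformerModelSketch()
model._load_ip_adapter_weights({"proj.weight": [0.1, 0.2]})
assert model.ip_adapter_weights == {"proj.weight": [0.1, 0.2]}
```

Keeping the loading logic in a mixin lets the same model class later pick up more loaders (e.g. LoRA) without touching its forward pass, which matches the intent stated in the new doc page.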

docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

Lines changed: 68 additions & 1 deletion

````diff
@@ -59,9 +59,76 @@ image.save("sd3_hello_world.png")
 - [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
 - [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo)
 
+## Image Prompting with IP-Adapters
+
+An IP-Adapter lets you prompt SD3 with images, in addition to the text prompt. This is especially useful when describing complex concepts that are difficult to articulate through text alone and you have reference images. To load and use an IP-Adapter, you need:
+
+- `image_encoder`: a pre-trained vision model used to obtain image features, usually a CLIP image encoder.
+- `feature_extractor`: an image processor that prepares the input image for the chosen `image_encoder`.
+- `ip_adapter_id`: a checkpoint containing the parameters of the image cross-attention layers and image projection.
+
+IP-Adapters are trained for a specific model architecture, so they also work in finetuned variants of the base model. You can use the [`~SD3IPAdapterMixin.set_ip_adapter_scale`] method to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers the text and image prompts equally.
+
+```python
+import torch
+from PIL import Image
+
+from diffusers import StableDiffusion3Pipeline
+from transformers import SiglipVisionModel, SiglipImageProcessor
+
+image_encoder_id = "google/siglip-so400m-patch14-384"
+ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"
+
+feature_extractor = SiglipImageProcessor.from_pretrained(
+    image_encoder_id,
+    torch_dtype=torch.float16
+)
+image_encoder = SiglipVisionModel.from_pretrained(
+    image_encoder_id,
+    torch_dtype=torch.float16
+).to("cuda")
+
+pipe = StableDiffusion3Pipeline.from_pretrained(
+    "stabilityai/stable-diffusion-3.5-large",
+    torch_dtype=torch.float16,
+    feature_extractor=feature_extractor,
+    image_encoder=image_encoder,
+).to("cuda")
+
+pipe.load_ip_adapter(ip_adapter_id)
+pipe.set_ip_adapter_scale(0.6)
+
+ref_img = Image.open("image.jpg").convert("RGB")
+
+image = pipe(
+    width=1024,
+    height=1024,
+    prompt="a cat",
+    negative_prompt="lowres, low quality, worst quality",
+    num_inference_steps=24,
+    guidance_scale=5.0,
+    ip_adapter_image=ref_img,
+).images[0]
+
+image.save("result.jpg")
+```
+
+<div class="justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd3_ip_adapter_example.png"/>
+    <figcaption class="mt-2 text-sm text-center text-gray-500">IP-Adapter examples with prompt "a cat"</figcaption>
+</div>
+
+
+<Tip>
+
+Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.
+
+</Tip>
+
+
 ## Memory Optimisations for SD3
 
-SD3 uses three text encoders, one if which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
+SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
 
 ### Running Inference with Model Offloading
````

src/diffusers/loaders/__init__.py

Lines changed: 10 additions & 2 deletions

```diff
@@ -56,6 +56,7 @@ def text_encoder_attn_modules(text_encoder):
 if is_torch_available():
     _import_structure["single_file_model"] = ["FromOriginalModelMixin"]
 
+    _import_structure["transformer_sd3"] = ["SD3Transformer2DLoadersMixin"]
     _import_structure["unet"] = ["UNet2DConditionLoadersMixin"]
     _import_structure["utils"] = ["AttnProcsLayers"]
     if is_transformers_available():
@@ -74,19 +75,26 @@ def text_encoder_attn_modules(text_encoder):
             "SanaLoraLoaderMixin",
         ]
         _import_structure["textual_inversion"] = ["TextualInversionLoaderMixin"]
-        _import_structure["ip_adapter"] = ["IPAdapterMixin"]
+        _import_structure["ip_adapter"] = [
+            "IPAdapterMixin",
+            "SD3IPAdapterMixin",
+        ]
 
 _import_structure["peft"] = ["PeftAdapterMixin"]
 
 
 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
     if is_torch_available():
         from .single_file_model import FromOriginalModelMixin
+        from .transformer_sd3 import SD3Transformer2DLoadersMixin
         from .unet import UNet2DConditionLoadersMixin
         from .utils import AttnProcsLayers
 
         if is_transformers_available():
-            from .ip_adapter import IPAdapterMixin
+            from .ip_adapter import (
+                IPAdapterMixin,
+                SD3IPAdapterMixin,
+            )
             from .lora_pipeline import (
                 AmusedLoraLoaderMixin,
                 CogVideoXLoraLoaderMixin,
```
