Commit ecc16ed

Merge branch 'main' into flux-control-lora-training-script

2 parents: 2610e6a + 26e80e0

19 files changed: +6382 −86 lines

docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions

@@ -252,6 +252,8 @@
     title: SD3ControlNetModel
   - local: api/models/controlnet_sparsectrl
     title: SparseControlNetModel
+  - local: api/models/controlnet_union
+    title: ControlNetUnionModel
   title: ControlNets
 - sections:
   - local: api/models/allegro_transformer3d
@@ -368,6 +370,8 @@
     title: ControlNet-XS
   - local: api/pipelines/controlnetxs_sdxl
     title: ControlNet-XS with Stable Diffusion XL
+  - local: api/pipelines/controlnet_union
+    title: ControlNetUnion
   - local: api/pipelines/dance_diffusion
     title: Dance Diffusion
   - local: api/pipelines/ddim

docs/source/en/api/models/autoencoder_dc.md

Lines changed: 20 additions & 0 deletions

@@ -37,6 +37,26 @@ from diffusers import AutoencoderDC
 ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda")
 ```

+## Load a model in Diffusers via `from_single_file`
+
+```python
+from diffusers import AutoencoderDC
+
+ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0/blob/main/model.safetensors"
+model = AutoencoderDC.from_single_file(ckpt_path)
+```
+
+The `AutoencoderDC` model has `in` and `mix` single file checkpoint variants with matching checkpoint keys but different scaling factors, so Diffusers cannot infer the correct config from the checkpoint alone and defaults to the `mix` variant config. To load an `in` variant checkpoint, override the automatically determined config with the `config` argument.
+
+```python
+from diffusers import AutoencoderDC
+
+ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0/blob/main/model.safetensors"
+model = AutoencoderDC.from_single_file(ckpt_path, config="mit-han-lab/dc-ae-f128c512-in-1.0-diffusers")
+```
+
 ## AutoencoderDC

 [[autodoc]] AutoencoderDC
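Side note (not part of the diff above): once loaded, the model can be exercised end to end. A minimal sketch, assuming the usual diffusers output conventions (`.latent` on the encoder output and `.sample` on the decoder output are assumptions, not confirmed by this doc):

```python
import torch
from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained(
    "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32
).to("cuda")

# Round-trip a dummy image: with a spatial compression factor of 32 (f32) and
# 32 latent channels (c32), a 512x512 input should map to a 16x16x32 latent.
x = torch.randn(1, 3, 512, 512, device="cuda")
with torch.no_grad():
    latent = ae.encode(x).latent      # assumed output attribute; expected shape (1, 32, 16, 16)
    recon = ae.decode(latent).sample  # assumed output attribute; expected shape (1, 3, 512, 512)
print(latent.shape, recon.shape)
```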
docs/source/en/api/models/controlnet_union.md (new file)

Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ControlNetUnionModel

ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL.

The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation.

*We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.*

## Loading

By default the [`ControlNetUnionModel`] should be loaded with [`~ModelMixin.from_pretrained`].

```py
from diffusers import StableDiffusionXLControlNetUnionPipeline, ControlNetUnionModel

controlnet = ControlNetUnionModel.from_pretrained("xinsir/controlnet-union-sdxl-1.0")
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet)
```

## ControlNetUnionModel

[[autodoc]] ControlNetUnionModel
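Side note (not part of the new doc): the loading example above uses default precision. A hedged variant of the same snippet with half precision and device placement, relying only on standard `from_pretrained`/`to` arguments:

```py
import torch
from diffusers import StableDiffusionXLControlNetUnionPipeline, ControlNetUnionModel

# Same checkpoints as the doc example, loaded in fp16 and moved to the GPU.
controlnet = ControlNetUnionModel.from_pretrained(
    "xinsir/controlnet-union-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
```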
docs/source/en/api/pipelines/controlnet_union.md (new file)

Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ControlNetUnion

ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL.

The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation.

*We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.*

## StableDiffusionXLControlNetUnionPipeline
[[autodoc]] StableDiffusionXLControlNetUnionPipeline
  - all
  - __call__

## StableDiffusionXLControlNetUnionImg2ImgPipeline
[[autodoc]] StableDiffusionXLControlNetUnionImg2ImgPipeline
  - all
  - __call__

## StableDiffusionXLControlNetUnionInpaintPipeline
[[autodoc]] StableDiffusionXLControlNetUnionInpaintPipeline
  - all
  - __call__

examples/dreambooth/train_dreambooth.py

Lines changed: 7 additions & 6 deletions

@@ -1300,16 +1300,17 @@ def compute_text_embeddings(prompt):
                     # Since we predict the noise instead of x_0, the original formulation is slightly changed.
                     # This is discussed in Section 4.2 of the same paper.
                     snr = compute_snr(noise_scheduler, timesteps)
-                    base_weight = (
-                        torch.stack([snr, args.snr_gamma * torch.ones_like(timesteps)], dim=1).min(dim=1)[0] / snr
-                    )

                     if noise_scheduler.config.prediction_type == "v_prediction":
                         # Velocity objective needs to be floored to an SNR weight of one.
-                        mse_loss_weights = base_weight + 1
+                        divisor = snr + 1
                     else:
-                        # Epsilon and sample both use the same loss weights.
-                        mse_loss_weights = base_weight
+                        divisor = snr
+
+                    mse_loss_weights = (
+                        torch.stack([snr, args.snr_gamma * torch.ones_like(timesteps)], dim=1).min(dim=1)[0] / divisor
+                    )
+
                     loss = F.mse_loss(model_pred.float(), target.float(), reduction="none")
                     loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
                     loss = loss.mean()
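To see what this rewrite changes numerically, here is a small self-contained sketch (the SNR values are illustrative, not taken from the script) comparing the weights implied by the old and new code above:

```python
import torch

snr = torch.tensor([0.5, 1.0, 5.0, 20.0])
snr_gamma = 5.0
clamped = torch.stack([snr, snr_gamma * torch.ones_like(snr)], dim=1).min(dim=1)[0]

old_v = clamped / snr + 1    # previous v-prediction weight: min(SNR, gamma) / SNR + 1
new_v = clamped / (snr + 1)  # fixed v-prediction weight:    min(SNR, gamma) / (SNR + 1)
eps = clamped / snr          # epsilon / sample prediction is unchanged: min(SNR, gamma) / SNR

print(old_v)  # tensor([2.0000, 2.0000, 2.0000, 1.2500])
print(new_v)  # tensor([0.3333, 0.5000, 0.8333, 0.2381])
print(eps)    # tensor([1.0000, 1.0000, 1.0000, 0.2500])
```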

src/diffusers/__init__.py

Lines changed: 8 additions & 0 deletions

@@ -92,6 +92,7 @@
        "CogView3PlusTransformer2DModel",
        "ConsistencyDecoderVAE",
        "ControlNetModel",
+       "ControlNetUnionModel",
        "ControlNetXSAdapter",
        "DiTTransformer2DModel",
        "FluxControlNetModel",
@@ -378,6 +379,9 @@
        "StableDiffusionXLControlNetPAGImg2ImgPipeline",
        "StableDiffusionXLControlNetPAGPipeline",
        "StableDiffusionXLControlNetPipeline",
+       "StableDiffusionXLControlNetUnionImg2ImgPipeline",
+       "StableDiffusionXLControlNetUnionInpaintPipeline",
+       "StableDiffusionXLControlNetUnionPipeline",
        "StableDiffusionXLControlNetXSPipeline",
        "StableDiffusionXLImg2ImgPipeline",
        "StableDiffusionXLInpaintPipeline",
@@ -586,6 +590,7 @@
        CogView3PlusTransformer2DModel,
        ConsistencyDecoderVAE,
        ControlNetModel,
+       ControlNetUnionModel,
        ControlNetXSAdapter,
        DiTTransformer2DModel,
        FluxControlNetModel,
@@ -850,6 +855,9 @@
        StableDiffusionXLControlNetPAGImg2ImgPipeline,
        StableDiffusionXLControlNetPAGPipeline,
        StableDiffusionXLControlNetPipeline,
+       StableDiffusionXLControlNetUnionImg2ImgPipeline,
+       StableDiffusionXLControlNetUnionInpaintPipeline,
+       StableDiffusionXLControlNetUnionPipeline,
        StableDiffusionXLControlNetXSPipeline,
        StableDiffusionXLImg2ImgPipeline,
        StableDiffusionXLInpaintPipeline,
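With these export lists updated, the new classes resolve from the package root. A quick smoke test of the additions above:

```python
from diffusers import (
    ControlNetUnionModel,
    StableDiffusionXLControlNetUnionImg2ImgPipeline,
    StableDiffusionXLControlNetUnionInpaintPipeline,
    StableDiffusionXLControlNetUnionPipeline,
)

print(ControlNetUnionModel.__name__)  # confirms the lazy import resolves
```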

src/diffusers/loaders/single_file_model.py

Lines changed: 2 additions & 0 deletions

@@ -23,6 +23,7 @@
 from .single_file_utils import (
     SingleFileComponentError,
     convert_animatediff_checkpoint_to_diffusers,
+    convert_autoencoder_dc_checkpoint_to_diffusers,
     convert_controlnet_checkpoint,
     convert_flux_transformer_checkpoint_to_diffusers,
     convert_ldm_unet_checkpoint,
@@ -82,6 +83,7 @@
         "checkpoint_mapping_fn": convert_flux_transformer_checkpoint_to_diffusers,
         "default_subfolder": "transformer",
     },
+    "AutoencoderDC": {"checkpoint_mapping_fn": convert_autoencoder_dc_checkpoint_to_diffusers},
 }

src/diffusers/loaders/single_file_utils.py

Lines changed: 95 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,8 @@
9292
"double_blocks.0.img_attn.norm.key_norm.scale",
9393
"model.diffusion_model.double_blocks.0.img_attn.norm.key_norm.scale",
9494
],
95+
"autoencoder-dc": "decoder.stages.1.op_list.0.main.conv.conv.bias",
96+
"autoencoder-dc-sana": "encoder.project_in.conv.bias",
9597
}
9698

9799
DIFFUSERS_DEFAULT_PIPELINE_PATHS = {
@@ -138,6 +140,10 @@
138140
"animatediff_rgb": {"pretrained_model_name_or_path": "guoyww/animatediff-sparsectrl-rgb"},
139141
"flux-dev": {"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev"},
140142
"flux-schnell": {"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell"},
143+
"autoencoder-dc-f128c512": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers"},
144+
"autoencoder-dc-f64c128": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers"},
145+
"autoencoder-dc-f32c32": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers"},
146+
"autoencoder-dc-f32c32-sana": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers"},
141147
}
142148

143149
# Use to configure model sample size when original config is provided
@@ -564,6 +570,23 @@ def infer_diffusers_model_type(checkpoint):
             model_type = "flux-dev"
         else:
             model_type = "flux-schnell"
+
+    elif CHECKPOINT_KEY_NAMES["autoencoder-dc"] in checkpoint:
+        encoder_key = "encoder.project_in.conv.conv.bias"
+        decoder_key = "decoder.project_in.main.conv.weight"
+
+        if CHECKPOINT_KEY_NAMES["autoencoder-dc-sana"] in checkpoint:
+            model_type = "autoencoder-dc-f32c32-sana"
+
+        elif checkpoint[encoder_key].shape[-1] == 64 and checkpoint[decoder_key].shape[1] == 32:
+            model_type = "autoencoder-dc-f32c32"
+
+        elif checkpoint[encoder_key].shape[-1] == 64 and checkpoint[decoder_key].shape[1] == 128:
+            model_type = "autoencoder-dc-f64c128"
+
+        else:
+            model_type = "autoencoder-dc-f128c512"
+
     else:
         model_type = "v1"
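A minimal sketch of the shape-based branch above, rewritten as a standalone helper. The key names are the ones the heuristic actually checks; the tensors in the sample dict are hypothetical stand-ins for a real checkpoint:

```python
import torch

def guess_dc_ae_variant(checkpoint):
    # Mirrors the new elif branch in infer_diffusers_model_type (simplified sketch).
    if "encoder.project_in.conv.bias" in checkpoint:  # the "autoencoder-dc-sana" marker key
        return "autoencoder-dc-f32c32-sana"
    enc_bias = checkpoint["encoder.project_in.conv.conv.bias"]
    dec_weight = checkpoint["decoder.project_in.main.conv.weight"]
    if enc_bias.shape[-1] == 64 and dec_weight.shape[1] == 32:
        return "autoencoder-dc-f32c32"
    if enc_bias.shape[-1] == 64 and dec_weight.shape[1] == 128:
        return "autoencoder-dc-f64c128"
    return "autoencoder-dc-f128c512"

fake_checkpoint = {
    "encoder.project_in.conv.conv.bias": torch.zeros(64),
    "decoder.project_in.main.conv.weight": torch.zeros(512, 32, 3, 3),  # hypothetical shape
}
print(guess_dc_ae_variant(fake_checkpoint))  # autoencoder-dc-f32c32
```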

@@ -2198,3 +2221,75 @@ def swap_scale_shift(weight):
     )

     return converted_state_dict
+
+
+def convert_autoencoder_dc_checkpoint_to_diffusers(checkpoint, **kwargs):
+    converted_state_dict = {key: checkpoint.pop(key) for key in list(checkpoint.keys())}
+
+    def remap_qkv_(key: str, state_dict):
+        qkv = state_dict.pop(key)
+        q, k, v = torch.chunk(qkv, 3, dim=0)
+        parent_module, _, _ = key.rpartition(".qkv.conv.weight")
+        state_dict[f"{parent_module}.to_q.weight"] = q.squeeze()
+        state_dict[f"{parent_module}.to_k.weight"] = k.squeeze()
+        state_dict[f"{parent_module}.to_v.weight"] = v.squeeze()
+
+    def remap_proj_conv_(key: str, state_dict):
+        parent_module, _, _ = key.rpartition(".proj.conv.weight")
+        state_dict[f"{parent_module}.to_out.weight"] = state_dict.pop(key).squeeze()
+
+    AE_KEYS_RENAME_DICT = {
+        # common
+        "main.": "",
+        "op_list.": "",
+        "context_module": "attn",
+        "local_module": "conv_out",
+        # NOTE: The below two lines work because scales in the available configs only have a tuple length of 1
+        # If there were more scales, there would be more layers, so a loop would be better to handle this
+        "aggreg.0.0": "to_qkv_multiscale.0.proj_in",
+        "aggreg.0.1": "to_qkv_multiscale.0.proj_out",
+        "depth_conv.conv": "conv_depth",
+        "inverted_conv.conv": "conv_inverted",
+        "point_conv.conv": "conv_point",
+        "point_conv.norm": "norm",
+        "conv.conv.": "conv.",
+        "conv1.conv": "conv1",
+        "conv2.conv": "conv2",
+        "conv2.norm": "norm",
+        "proj.norm": "norm_out",
+        # encoder
+        "encoder.project_in.conv": "encoder.conv_in",
+        "encoder.project_out.0.conv": "encoder.conv_out",
+        "encoder.stages": "encoder.down_blocks",
+        # decoder
+        "decoder.project_in.conv": "decoder.conv_in",
+        "decoder.project_out.0": "decoder.norm_out",
+        "decoder.project_out.2.conv": "decoder.conv_out",
+        "decoder.stages": "decoder.up_blocks",
+    }
+
+    AE_F32C32_F64C128_F128C512_KEYS = {
+        "encoder.project_in.conv": "encoder.conv_in.conv",
+        "decoder.project_out.2.conv": "decoder.conv_out.conv",
+    }
+
+    AE_SPECIAL_KEYS_REMAP = {
+        "qkv.conv.weight": remap_qkv_,
+        "proj.conv.weight": remap_proj_conv_,
+    }
+    if "encoder.project_in.conv.bias" not in converted_state_dict:
+        AE_KEYS_RENAME_DICT.update(AE_F32C32_F64C128_F128C512_KEYS)
+
+    for key in list(converted_state_dict.keys()):
+        new_key = key[:]
+        for replace_key, rename_key in AE_KEYS_RENAME_DICT.items():
+            new_key = new_key.replace(replace_key, rename_key)
+        converted_state_dict[new_key] = converted_state_dict.pop(key)
+
+    for key in list(converted_state_dict.keys()):
+        for special_key, handler_fn_inplace in AE_SPECIAL_KEYS_REMAP.items():
+            if special_key not in key:
+                continue
+            handler_fn_inplace(key, converted_state_dict)
+
+    return converted_state_dict
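The most interesting part of the converter is `remap_qkv_`, which splits a fused 1x1-conv qkv weight into separate attention projections. A toy illustration of that step (the dimensions here are made up):

```python
import torch

dim = 8
qkv = torch.randn(3 * dim, dim, 1, 1)  # fused qkv weight of a 1x1 conv

q, k, v = torch.chunk(qkv, 3, dim=0)
# squeeze() drops the trailing 1x1 spatial dims, yielding linear-style (dim, dim) weights
print(q.squeeze().shape, k.squeeze().shape, v.squeeze().shape)  # three torch.Size([8, 8])
```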

src/diffusers/models/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -45,6 +45,7 @@
     ]
     _import_structure["controlnets.controlnet_sd3"] = ["SD3ControlNetModel", "SD3MultiControlNetModel"]
     _import_structure["controlnets.controlnet_sparsectrl"] = ["SparseControlNetModel"]
+    _import_structure["controlnets.controlnet_union"] = ["ControlNetUnionModel"]
     _import_structure["controlnets.controlnet_xs"] = ["ControlNetXSAdapter", "UNetControlNetXSModel"]
     _import_structure["controlnets.multicontrolnet"] = ["MultiControlNetModel"]
     _import_structure["embeddings"] = ["ImageProjection"]
@@ -102,6 +103,7 @@
     )
     from .controlnets import (
         ControlNetModel,
+        ControlNetUnionModel,
         ControlNetXSAdapter,
         FluxControlNetModel,
         FluxMultiControlNetModel,

src/diffusers/models/controlnets/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -15,6 +15,7 @@
     SparseControlNetModel,
     SparseControlNetOutput,
 )
+from .controlnet_union import ControlNetUnionInput, ControlNetUnionInputProMax, ControlNetUnionModel
 from .controlnet_xs import ControlNetXSAdapter, ControlNetXSOutput, UNetControlNetXSModel
 from .multicontrolnet import MultiControlNetModel
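With this import in place, the subpackage re-exports the new classes named on the added line, e.g.:

```python
from diffusers.models.controlnets import (
    ControlNetUnionInput,
    ControlNetUnionInputProMax,
    ControlNetUnionModel,
)
```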
