
Commit 8a99701

Merge branch 'main' into bnb-follow-up

2 parents 3dbe41f + b0ffe92

File tree: 75 files changed, +706, -198 lines


docs/source/en/api/pipelines/controlnet_flux.md

Lines changed: 9 additions & 1 deletion

```diff
@@ -1,4 +1,4 @@
-<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.
+<!--Copyright 2024 The HuggingFace Team, The InstantX Team, and the XLabs Team. All rights reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -31,6 +31,14 @@ This controlnet code is implemented by [The InstantX Team](https://huggingface.c
 | Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth) |
 | Union | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Union) |
 
+XLabs ControlNets are also supported; these were contributed by the [XLabs team](https://huggingface.co/XLabs-AI).
+
+| ControlNet type | Developer | Link |
+| -------- | ---------- | ---- |
+| Canny | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-canny-diffusers) |
+| Depth | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-depth-diffusers) |
+| HED | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-hed-diffusers) |
+
 
 <Tip>
```
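Loading one of the XLabs checkpoints listed above follows the usual Flux ControlNet flow. A minimal, hedged sketch; the prompt, control-image URL, and sampler settings are illustrative and not part of this commit:

```python
# Sketch: load an XLabs FLUX ControlNet from the table above.
# The control-image URL and generation settings are hypothetical.
import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.utils import load_image

controlnet = FluxControlNetModel.from_pretrained(
    "XLabs-AI/flux-controlnet-canny-diffusers", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")

canny_edges = load_image("https://example.com/canny_edges.png")  # hypothetical input
image = pipe(
    "a futuristic city at dusk",
    control_image=canny_edges,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=28,
).images[0]
image.save("flux_controlnet_canny.png")
```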

docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

Lines changed: 5 additions & 0 deletions

````diff
@@ -54,6 +54,11 @@ image = pipe(
 image.save("sd3_hello_world.png")
 ```
 
+**Note:** Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all of the optimizations and techniques mentioned here apply to it as well. In total, there are three official models in the SD3 family:
+- [`stabilityai/stable-diffusion-3-medium-diffusers`](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers)
+- [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large)
+- [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large-turbo)
+
 ## Memory Optimisations for SD3
 
 SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low-resource hardware.
````
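Since the new note says SD 3.5 runs through the same pipeline, here is a minimal sketch of loading one of the newly listed checkpoints; the prompt and sampler settings are illustrative:

```python
# Sketch: run Stable Diffusion 3.5 through the same StableDiffusion3Pipeline
# used for SD3 medium. Prompt and settings are illustrative.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a photo of a cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_5_hello_world.png")
```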

examples/community/README.md

Lines changed: 5 additions & 6 deletions

````diff
@@ -4336,19 +4336,19 @@ The Abstract of the paper:
 
 **64x64**
 :-------------------------:
-| <img src="https://github.com/user-attachments/assets/9e7bb2cd-45a0-4bd1-adb8-23e283baed39" width="222" height="222" alt="bird_64"> |
+| <img src="https://github.com/user-attachments/assets/032738eb-c6cd-4fd9-b4d7-a7317b4b6528" width="222" height="222" alt="bird_64_64"> |
 
 - `256×256, nesting_level=1`: 1.776 GiB. With `150` DDIM inference steps:
 
 **64x64** | **256x256**
 :-------------------------:|:-------------------------:
-| <img src="https://github.com/user-attachments/assets/6b724c2e-5e6a-4b63-9b65-c1182cbb67e0" width="222" height="222" alt="64x64"> | <img src="https://github.com/user-attachments/assets/7dbab2ad-bf40-4a73-ab04-f178347cb7d5" width="222" height="222" alt="256x256"> |
+| <img src="https://github.com/user-attachments/assets/21b9ad8b-eea6-4603-80a2-31180f391589" width="222" height="222" alt="bird_256_64"> | <img src="https://github.com/user-attachments/assets/fc411682-8a36-422c-9488-395b77d4406e" width="222" height="222" alt="bird_256_256"> |
 
-- `1024×1024, nesting_level=2`: 1.792 GiB. As one can realize the cost of adding another layer is really negligible. With `250` DDIM inference steps:
+- `1024×1024, nesting_level=2`: 1.792 GiB. As one can see, the cost of adding another layer is negligible in this context! With `250` DDIM inference steps:
 
 **64x64** | **256x256** | **1024x1024**
 :-------------------------:|:-------------------------:|:-------------------------:
-| <img src="https://github.com/user-attachments/assets/4a9454e4-e20a-4736-a196-270e2ae796c0" width="222" height="222" alt="64x64"> | <img src="https://github.com/user-attachments/assets/4a96555d-0fda-4303-82b1-a4d886f770b9" width="222" height="222" alt="256x256"> | <img src="https://github.com/user-attachments/assets/e0239b7a-ab73-4d45-8f3e-b4e6b4b50abe" width="222" height="222" alt="1024x1024"> |
+| <img src="https://github.com/user-attachments/assets/febf4b98-3dee-4a8e-9946-fd42e1f232e6" width="222" height="222" alt="bird_1024_64"> | <img src="https://github.com/user-attachments/assets/c5f85b40-5d6d-4267-a92a-c89dff015b9b" width="222" height="222" alt="bird_1024_256"> | <img src="https://github.com/user-attachments/assets/ad66b913-4367-4cb9-889e-bc06f4d96148" width="222" height="222" alt="bird_1024_1024"> |
 
 ```py
 from diffusers import DiffusionPipeline
@@ -4362,8 +4362,7 @@ pipe = DiffusionPipeline.from_pretrained("tolgacangoz/matryoshka-diffusion-model
 
 prompt0 = "a blue jay stops on the top of a helmet of Japanese samurai, background with sakura tree"
 prompt = f"breathtaking {prompt0}. award-winning, professional, highly detailed"
-negative_prompt = "deformed, mutated, ugly, disfigured, blur, blurry, noise, noisy"
-image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=50).images
+image = pipe(prompt, num_inference_steps=50).images
 make_image_grid(image, rows=1, cols=len(image))
 
 # pipe.change_nesting_level(<int>)  # 0, 1, or 2
````
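For context, a self-contained sketch of the updated usage. It combines the README snippet above with the loading convention shown in the matryoshka.py docstring below; note the docstring shows `trust_remote_code=False`, and it is flipped to `True` here only so the remote custom-pipeline code is actually permitted to run. Whether any further `from_pretrained` arguments are needed is not visible in the hunks:

```python
# Sketch of the updated Matryoshka usage; nesting_level selects the
# resolution stack (0 -> 64x64; 1 -> 256x256; 2 -> 1024x1024).
from diffusers import DiffusionPipeline
from diffusers.utils import make_image_grid

pipe = DiffusionPipeline.from_pretrained(
    "tolgacangoz/matryoshka-diffusion-models",
    nesting_level=0,
    trust_remote_code=True,  # grant permission for the remote pipeline code to run
).to("cuda")

prompt0 = "a blue jay stops on the top of a helmet of Japanese samurai, background with sakura tree"
prompt = f"breathtaking {prompt0}. award-winning, professional, highly detailed"
image = pipe(prompt, num_inference_steps=50).images
make_image_grid(image, rows=1, cols=len(image))
```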

examples/community/matryoshka.py

Lines changed: 16 additions & 12 deletions

````diff
@@ -107,15 +107,16 @@
 
 >>> # nesting_level=0 -> 64x64; nesting_level=1 -> 256x256 - 64x64; nesting_level=2 -> 1024x1024 - 256x256 - 64x64
 >>> pipe = DiffusionPipeline.from_pretrained("tolgacangoz/matryoshka-diffusion-models",
->>>                                          custom_pipeline="matryoshka").to("cuda")
+...                                          nesting_level=0,
+...                                          trust_remote_code=False,  # One needs to give permission for this code to run
+...                                          ).to("cuda")
 
 >>> prompt0 = "a blue jay stops on the top of a helmet of Japanese samurai, background with sakura tree"
 >>> prompt = f"breathtaking {prompt0}. award-winning, professional, highly detailed"
->>> negative_prompt = "deformed, mutated, ugly, disfigured, blur, blurry, noise, noisy"
->>> image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=50).images
+>>> image = pipe(prompt, num_inference_steps=50).images
 >>> make_image_grid(image, rows=1, cols=len(image))
 
->>> pipe.change_nesting_level(<int>)  # 0, 1, or 2
+>>> # pipe.change_nesting_level(<int>)  # 0, 1, or 2
 >>> # 50+, 100+, and 250+ num_inference_steps are recommended for nesting levels 0, 1, and 2 respectively.
 ```
 """
@@ -420,6 +421,7 @@ def __init__(
         self.timesteps = torch.from_numpy(np.arange(0, num_train_timesteps)[::-1].copy().astype(np.int64))
 
         self.scales = None
+        self.schedule_shifted_power = 1.0
 
     def scale_model_input(self, sample: torch.Tensor, timestep: Optional[int] = None) -> torch.Tensor:
         """
@@ -532,6 +534,7 @@ def set_timesteps(self, num_inference_steps: int, device: Union[str, torch.devic
 
     def get_schedule_shifted(self, alpha_prod, scale_factor=None):
         if (scale_factor is not None) and (scale_factor > 1):  # rescale noise schedule
+            scale_factor = scale_factor**self.schedule_shifted_power
             snr = alpha_prod / (1 - alpha_prod)
             scaled_snr = snr / scale_factor
             alpha_prod = 1 / (1 + 1 / scaled_snr)
@@ -639,17 +642,14 @@ def step(
         # 4. Clip or threshold "predicted x_0"
         if self.config.thresholding:
             if len(model_output) > 1:
-                pred_original_sample = [
-                    self._threshold_sample(p_o_s * scale) / scale
-                    for p_o_s, scale in zip(pred_original_sample, self.scales)
-                ]
+                pred_original_sample = [self._threshold_sample(p_o_s) for p_o_s in pred_original_sample]
             else:
                 pred_original_sample = self._threshold_sample(pred_original_sample)
         elif self.config.clip_sample:
             if len(model_output) > 1:
                 pred_original_sample = [
-                    (p_o_s * scale).clamp(-self.config.clip_sample_range, self.config.clip_sample_range) / scale
-                    for p_o_s, scale in zip(pred_original_sample, self.scales)
+                    p_o_s.clamp(-self.config.clip_sample_range, self.config.clip_sample_range)
+                    for p_o_s in pred_original_sample
                 ]
             else:
                 pred_original_sample = pred_original_sample.clamp(
@@ -3816,6 +3816,8 @@ def __init__(
 
         if hasattr(unet, "nest_ratio"):
             scheduler.scales = unet.nest_ratio + [1]
+            if nesting_level == 2:
+                scheduler.schedule_shifted_power = 2.0
 
         self.register_modules(
             text_encoder=text_encoder,
@@ -3842,12 +3844,14 @@ def change_nesting_level(self, nesting_level: int):
             ).to(self.device)
             self.config.nesting_level = 1
             self.scheduler.scales = self.unet.nest_ratio + [1]
+            self.scheduler.schedule_shifted_power = 1.0
         elif nesting_level == 2:
             self.unet = NestedUNet2DConditionModel.from_pretrained(
                 "tolgacangoz/matryoshka-diffusion-models", subfolder="unet/nesting_level_2"
             ).to(self.device)
             self.config.nesting_level = 2
             self.scheduler.scales = self.unet.nest_ratio + [1]
+            self.scheduler.schedule_shifted_power = 2.0
         else:
             raise ValueError("Currently, nesting levels 0, 1, and 2 are supported.")
 
@@ -4627,8 +4631,8 @@ def __call__(
         image = latents
 
         if self.scheduler.scales is not None:
-            for i, (img, scale) in enumerate(zip(image, self.scheduler.scales)):
-                image[i] = self.image_processor.postprocess(img * scale, output_type=output_type)[0]
+            for i, img in enumerate(image):
+                image[i] = self.image_processor.postprocess(img, output_type=output_type)[0]
         else:
             image = self.image_processor.postprocess(image, output_type=output_type)
 
````
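The new `schedule_shifted_power` threads through the scheduler and pipeline above. As a standalone illustration, the function below mirrors `get_schedule_shifted` line for line outside its class: the base schedule's SNR is divided by `scale_factor**power`, so `power=2.0` (used at `nesting_level=2`) shifts the schedule more aggressively toward noise. The sample values are illustrative:

```python
# Standalone sketch of the shift applied in get_schedule_shifted above.
import torch

def schedule_shifted_alpha_prod(alpha_prod, scale_factor=None, power=1.0):
    if (scale_factor is not None) and (scale_factor > 1):  # rescale noise schedule
        scale_factor = scale_factor**power
        snr = alpha_prod / (1 - alpha_prod)
        scaled_snr = snr / scale_factor
        alpha_prod = 1 / (1 + 1 / scaled_snr)
    return alpha_prod

alphas = torch.linspace(0.99, 0.01, steps=5)  # illustrative alpha_prod values
print(schedule_shifted_alpha_prod(alphas, scale_factor=4.0, power=1.0))
print(schedule_shifted_alpha_prod(alphas, scale_factor=4.0, power=2.0))  # stronger shift
```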

examples/controlnet/README_sd3.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -104,7 +104,7 @@ from diffusers.utils import load_image
 import torch
 
 base_model_path = "stabilityai/stable-diffusion-3-medium-diffusers"
-controlnet_path = "sd3-controlnet-out/checkpoint-6500/controlnet"
+controlnet_path = "DavyMorgan/sd3-controlnet-out"
 
 controlnet = SD3ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
 pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
```
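The hunk cuts off mid-snippet; a hedged completion of the inference example with the new Hub path follows. The conditioning-image file and prompt are hypothetical:

```python
# Sketch: complete SD3 ControlNet inference with the updated checkpoint path.
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
from diffusers.utils import load_image

base_model_path = "stabilityai/stable-diffusion-3-medium-diffusers"
controlnet_path = "DavyMorgan/sd3-controlnet-out"

controlnet = SD3ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    base_model_path, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = load_image("./conditioning_image.png")  # hypothetical local file
image = pipe(
    "a red circle on a white background",
    control_image=control_image,
    num_inference_steps=28,
).images[0]
image.save("sd3_controlnet_output.png")
```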

examples/controlnet/train_controlnet.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -1048,7 +1048,9 @@ def load_model_hook(models, input_dir):
 
                 # Add noise to the latents according to the noise magnitude at each timestep
                 # (this is the forward diffusion process)
-                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
+                noisy_latents = noise_scheduler.add_noise(latents.float(), noise.float(), timesteps).to(
+                    dtype=weight_dtype
+                )
 
                 # Get the text embedding for conditioning
                 encoder_hidden_states = text_encoder(batch["input_ids"], return_dict=False)[0]
```
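The change upcasts latents and noise to float32 before `add_noise` and casts the result back to the training dtype, so the forward-diffusion arithmetic is not done in half precision. A self-contained sketch of the pattern; shapes and scheduler choice are illustrative:

```python
# Sketch of the mixed-precision noising pattern introduced above: compute
# add_noise in float32, then cast back to the half-precision training dtype.
import torch
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
weight_dtype = torch.float16

latents = torch.randn(2, 4, 64, 64, dtype=weight_dtype)  # illustrative shape
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))

noisy_latents = noise_scheduler.add_noise(latents.float(), noise.float(), timesteps).to(
    dtype=weight_dtype
)
print(noisy_latents.dtype)  # torch.float16
```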

examples/controlnet/train_controlnet_sd3.py

Lines changed: 2 additions & 13 deletions

```diff
@@ -50,7 +50,7 @@
 )
 from diffusers.optimization import get_scheduler
 from diffusers.training_utils import compute_density_for_timestep_sampling, compute_loss_weighting_for_sd3, free_memory
-from diffusers.utils import check_min_version, is_wandb_available
+from diffusers.utils import check_min_version, is_wandb_available, make_image_grid
 from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
 from diffusers.utils.torch_utils import is_compiled_module
 
@@ -64,17 +64,6 @@
 logger = get_logger(__name__)
 
 
-def image_grid(imgs, rows, cols):
-    assert len(imgs) == rows * cols
-
-    w, h = imgs[0].size
-    grid = Image.new("RGB", size=(cols * w, rows * h))
-
-    for i, img in enumerate(imgs):
-        grid.paste(img, box=(i % cols * w, i // cols * h))
-    return grid
-
-
 def log_validation(controlnet, args, accelerator, weight_dtype, step, is_final_validation=False):
     logger.info("Running validation... ")
 
@@ -224,7 +213,7 @@ def save_model_card(repo_id: str, image_logs=None, base_model=str, repo_folder=N
         validation_image.save(os.path.join(repo_folder, "image_control.png"))
         img_str += f"prompt: {validation_prompt}\n"
         images = [validation_image] + images
-        image_grid(images, 1, len(images)).save(os.path.join(repo_folder, f"images_{i}.png"))
+        make_image_grid(images, 1, len(images)).save(os.path.join(repo_folder, f"images_{i}.png"))
         img_str += f"![images_{i})](./images_{i}.png)\n"
 
     model_description = f"""
```
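The hand-rolled helper is replaced by the library's `make_image_grid`, which does the same paste-into-grid work. A quick usage sketch with placeholder images:

```python
# Sketch: diffusers.utils.make_image_grid pastes PIL images into a
# rows x cols grid, replacing the image_grid helper deleted above.
from PIL import Image
from diffusers.utils import make_image_grid

images = [Image.new("RGB", (64, 64), color=c) for c in ("red", "green", "blue")]
grid = make_image_grid(images, rows=1, cols=len(images))
grid.save("grid.png")
```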

examples/controlnet/train_controlnet_sdxl.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -1210,7 +1210,9 @@ def compute_embeddings(batch, proportion_empty_prompts, text_encoders, tokenizer
 
                 # Add noise to the latents according to the noise magnitude at each timestep
                 # (this is the forward diffusion process)
-                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
+                noisy_latents = noise_scheduler.add_noise(latents.float(), noise.float(), timesteps).to(
+                    dtype=weight_dtype
+                )
 
                 # ControlNet conditioning.
                 controlnet_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
```
