Commit 4dc10fb

Merge branch 'main' into safety_checker
2 parents 53e5995 + 9a92b81 commit 4dc10fb

40 files changed: +6307 additions, -15 deletions

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
@@ -252,6 +252,8 @@
      title: SparseControlNetModel
    title: ControlNets
  - sections:
+   - local: api/models/allegro_transformer3d
+     title: AllegroTransformer3DModel
    - local: api/models/aura_flow_transformer2d
      title: AuraFlowTransformer2DModel
    - local: api/models/cogvideox_transformer3d
@@ -300,6 +302,8 @@
  - sections:
    - local: api/models/autoencoderkl
      title: AutoencoderKL
+   - local: api/models/autoencoderkl_allegro
+     title: AutoencoderKLAllegro
    - local: api/models/autoencoderkl_cogvideox
      title: AutoencoderKLCogVideoX
    - local: api/models/asymmetricautoencoderkl
@@ -318,6 +322,8 @@
  sections:
  - local: api/pipelines/overview
    title: Overview
+ - local: api/pipelines/allegro
+   title: Allegro
  - local: api/pipelines/amused
    title: aMUSEd
  - local: api/pipelines/animatediff
docs/source/en/api/models/allegro_transformer3d.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AllegroTransformer3DModel

A Diffusion Transformer model for 3D data from [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import AllegroTransformer3DModel

transformer = AllegroTransformer3DModel.from_pretrained("rhymes-ai/Allegro", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```
## AllegroTransformer3DModel

[[autodoc]] AllegroTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
docs/source/en/api/models/autoencoderkl_allegro.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLAllegro

The 3D variational autoencoder (VAE) model with KL loss used in [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import AutoencoderKLAllegro

vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```
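Beyond loading, the following is a rough sketch of round-tripping a short clip through the VAE; the tensor shape, the tiling call, and the output handling are assumptions for illustration, not values documented for the Allegro checkpoint.

```python
import torch

from diffusers import AutoencoderKLAllegro

vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32).to("cuda")
vae.enable_tiling()  # tiled encode/decode keeps memory usage manageable for video

video = torch.randn(1, 3, 24, 256, 256, device="cuda")  # (batch, channels, frames, height, width), assumed shape
with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # AutoencoderKLOutput -> sampled latents
    reconstruction = vae.decode(latents).sample       # DecoderOutput -> pixel-space video
print(latents.shape, reconstruction.shape)
```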
## AutoencoderKLAllegro

[[autodoc]] AutoencoderKLAllegro
  - decode
  - encode
  - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
docs/source/en/api/pipelines/allegro.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# Allegro

[Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) from RhymesAI, by Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang.

The abstract from the paper is:

*Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: https://github.com/rhymes-ai/Allegro , Model: https://huggingface.co/rhymes-ai/Allegro , Gallery: https://rhymes.ai/allegro_gallery .*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
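A minimal text-to-video sketch is shown below; the prompt, guidance scale, step count, and the tiling call are illustrative assumptions rather than recommended settings.

```python
import torch

from diffusers import AllegroPipeline
from diffusers.utils import export_to_video

pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", torch_dtype=torch.bfloat16).to("cuda")
pipe.vae.enable_tiling()  # the VAE decode is memory-hungry; tiling helps on a single GPU

prompt = "A seaside town at sunset, waves rolling onto the shore, cinematic lighting."  # illustrative prompt
video = pipe(prompt=prompt, guidance_scale=7.5, num_inference_steps=100).frames[0]
export_to_video(video, "allegro_output.mp4", fps=15)
```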
## AllegroPipeline

[[autodoc]] AllegroPipeline
  - all
  - __call__

## AllegroPipelineOutput

[[autodoc]] pipelines.allegro.pipeline_output.AllegroPipelineOutput

examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py

Lines changed: 2 additions & 0 deletions
@@ -1650,6 +1650,8 @@ def save_model_hook(models, weights, output_dir):
        elif isinstance(model, type(unwrap_model(text_encoder_one))):
            if args.train_text_encoder: # when --train_text_encoder_ti we don't save the layers
                text_encoder_one_lora_layers_to_save = get_peft_model_state_dict(model)
+       elif isinstance(model, type(unwrap_model(text_encoder_two))):
+           pass # when --train_text_encoder_ti and --enable_t5_ti we don't save the layers
        else:
            raise ValueError(f"unexpected save model: {model.__class__}")

examples/community/README.md

Lines changed: 92 additions & 0 deletions
@@ -73,6 +73,7 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
| Stable Diffusion BoxDiff Pipeline | Training-free controlled generation with bounding boxes using [BoxDiff](https://github.com/showlab/BoxDiff) | [Stable Diffusion BoxDiff Pipeline](#stable-diffusion-boxdiff) | - | [Jingyang Zhang](https://github.com/zjysteven/) |
| FRESCO V2V Pipeline | Implementation of [[CVPR 2024] FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation](https://arxiv.org/abs/2403.12962) | [FRESCO V2V Pipeline](#fresco) | - | [Yifan Zhou](https://github.com/SingleZombie) |
| AnimateDiff IPEX Pipeline | Accelerate AnimateDiff inference pipeline with BF16/FP32 precision on Intel Xeon CPUs with [IPEX](https://github.com/intel/intel-extension-for-pytorch) | [AnimateDiff on IPEX](#animatediff-on-ipex) | - | [Dan Li](https://github.com/ustcuna/) |
+ | PIXART-α ControlNet Pipeline | Implementation of the ControlNet model for PixArt-α and its Diffusers pipeline | [PIXART-α Controlnet pipeline](#pixart-α-controlnet-pipeline) | - | [Raul Ciotescu](https://github.com/raulc0399/) |
| HunyuanDiT Differential Diffusion Pipeline | Applies [Differential Diffusion](https://github.com/exx8/differential-diffusion) to [HunyuanDiT](https://github.com/huggingface/diffusers/pull/8240). | [HunyuanDiT with Differential Diffusion](#hunyuandit-with-differential-diffusion) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1v44a5fpzyr4Ffr4v2XBQ7BajzG874N4P?usp=sharing) | [Monjoy Choudhury](https://github.com/MnCSSJ4x) |
| [🪆Matryoshka Diffusion Models](https://huggingface.co/papers/2310.15111) | A diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small scale inputs are nested within those of the large scales. See [original codebase](https://github.com/apple/ml-mdm). | [🪆Matryoshka Diffusion Models](#matryoshka-diffusion-models) | [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/pcuenq/mdm) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/tolgacangoz/1f54875fc7aeaabcf284ebde64820966/matryoshka_hf.ipynb) | [M. Tolga Cangöz](https://github.com/tolgacangoz) |

@@ -4445,3 +4446,94 @@ grid_image.save(grid_dir + "sample.png")
`pag_scale` : guidance scale of PAG (ex: 5.0)

`pag_applied_layers_index` : index of the layer to apply perturbation (ex: ['m0'])

# PIXART-α Controlnet pipeline

[Project](https://pixart-alpha.github.io/) / [GitHub](https://github.com/PixArt-alpha/PixArt-alpha/blob/master/asset/docs/pixart_controlnet.md)

This is the implementation of the ControlNet model and pipeline for the PixArt-α model, adapted to use the Hugging Face Diffusers library.

## Example Usage

This example uses the PixArt HED ControlNet model, converted from the ControlNet model trained by the authors of the paper.

```py
import sys
import os
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

from pipeline_pixart_alpha_controlnet import PixArtAlphaControlnetPipeline
from diffusers.utils import load_image

from diffusers.image_processor import PixArtImageProcessor

from controlnet_aux import HEDdetector

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from pixart.controlnet_pixart_alpha import PixArtControlNetAdapterModel

controlnet_repo_id = "raulc0399/pixart-alpha-hed-controlnet"

weight_dtype = torch.float16
image_size = 1024

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)

# load controlnet
controlnet = PixArtControlNetAdapterModel.from_pretrained(
    controlnet_repo_id,
    torch_dtype=weight_dtype,
    use_safetensors=True,
).to(device)

pipe = PixArtAlphaControlnetPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    controlnet=controlnet,
    torch_dtype=weight_dtype,
    use_safetensors=True,
).to(device)

images_path = "images"
control_image_file = "0_7.jpg"

prompt = "battleship in space, galaxy in background"

control_image_name = control_image_file.split('.')[0]

control_image = load_image(f"{images_path}/{control_image_file}")
print(control_image.size)
width, height = control_image.size  # PIL's Image.size is (width, height)

hed = HEDdetector.from_pretrained("lllyasviel/Annotators")

condition_transform = T.Compose([
    T.Lambda(lambda img: img.convert('RGB')),
    T.CenterCrop([image_size, image_size]),
])

control_image = condition_transform(control_image)
# compute the HED edge map used as the conditioning image
hed_edge = hed(control_image, detect_resolution=image_size, image_resolution=image_size)

hed_edge.save(f"{images_path}/{control_image_name}_hed.jpg")

# run pipeline
with torch.no_grad():
    out = pipe(
        prompt=prompt,
        image=hed_edge,
        num_inference_steps=14,
        guidance_scale=4.5,
        height=image_size,
        width=image_size,
    )

out.images[0].save(f"{images_path}/{control_image_name}_output.jpg")
```

In the folder `examples/pixart` there is also a script that can be used to train new models.
Please check the script `train_controlnet_hf_diffusers.sh` to see how to start the training.

examples/dreambooth/README_flux.md

Lines changed: 15 additions & 0 deletions
@@ -170,6 +170,21 @@ accelerate launch train_dreambooth_lora_flux.py \
  --push_to_hub
```

### Target Modules

When LoRA was first adapted from language models to diffusion models, it was applied to the cross-attention layers in the UNet that relate the image representations with the prompts that describe them.
More recently, SOTA text-to-image diffusion models replaced the UNet with a diffusion Transformer (DiT). With this change, we may also want to explore
applying LoRA training to different types of layers and blocks. To allow more flexibility and control over the targeted modules, we added `--lora_layers`, in which you can specify, as a comma-separated string,
the exact modules for LoRA training (a sketch of how this string is turned into LoRA target modules follows the notes below). Here are some examples of target modules you can provide:
- for attention-only layers: `--lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"`
- to train the same modules as in the fal trainer: `--lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0,attn.add_k_proj,attn.add_q_proj,attn.add_v_proj,attn.to_add_out,ff.net.0.proj,ff.net.2,ff_context.net.0.proj,ff_context.net.2"`
- to train the same modules as in the ostris ai-toolkit / replicate trainer: `--lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0,attn.add_k_proj,attn.add_q_proj,attn.add_v_proj,attn.to_add_out,ff.net.0.proj,ff.net.2,ff_context.net.0.proj,ff_context.net.2,norm1_context.linear,norm1.linear,norm.linear,proj_mlp,proj_out"`

> [!NOTE]
> `--lora_layers` can also be used to specify which **blocks** to apply LoRA training to. To do so, simply add a block prefix to each layer in the comma-separated string:
> **single DiT blocks**: to target the ith single transformer block, add the prefix `single_transformer_blocks.i`, e.g. `single_transformer_blocks.i.attn.to_k`
> **MMDiT blocks**: to target the ith MMDiT block, add the prefix `transformer_blocks.i`, e.g. `transformer_blocks.i.attn.to_k`

> [!NOTE]
> Keep in mind that while training more layers can improve quality and expressiveness, it also increases the size of the output LoRA weights.
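To make the mapping concrete, here is a rough sketch of how such a comma-separated string could be turned into LoRA target modules with `peft`; the rank and variable names are assumptions for illustration, not the training script's exact code:

```python
# Sketch: turning a `--lora_layers`-style string into a peft LoraConfig.
# Illustrative only; the actual training script may differ in details.
from peft import LoraConfig

lora_layers = "attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"  # value passed on the command line
target_modules = [layer.strip() for layer in lora_layers.split(",")]

transformer_lora_config = LoraConfig(
    r=16,                           # LoRA rank (assumed value)
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=target_modules,
)
# transformer.add_adapter(transformer_lora_config)  # applied to the Flux transformer during training
```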
### Text Encoder Training

Alongside the transformer, fine-tuning of the CLIP text encoder is also supported.

examples/dreambooth/README_sd3.md

Lines changed: 34 additions & 0 deletions
@@ -147,6 +147,40 @@ accelerate launch train_dreambooth_lora_sd3.py \
  --push_to_hub
```

### Targeting Specific Blocks & Layers

As image generation models get bigger & more powerful, more fine-tuners find that training only part of the
transformer blocks (sometimes as few as two) can be enough to get great results.
In some cases, it can be even better to keep some of the blocks/layers frozen.

For **SD3.5-Large** specifically, you may find this information useful (taken from the [Stable Diffusion 3.5 Large Fine-tuning Tutorial](https://stabilityai.notion.site/Stable-Diffusion-3-5-Large-Fine-tuning-Tutorial-11a61cdcd1968027a15bdbd7c40be8c6#12461cdcd19680788a23c650dab26b93)):
> [!NOTE]
> A commonly believed heuristic that we verified once again during the construction of the SD3.5 family of models is that later/higher layers (i.e. `30 - 37`)* impact tertiary details more heavily. Conversely, earlier layers (i.e. `12 - 24`)* influence the overall composition/primary form more.
> So, freezing other layers/targeting specific layers is a viable approach.
> `*`These suggested layers are speculative and not 100% guaranteed. The tips here are more or less a general idea for next steps.
>
> **Photorealism**
> In preliminary testing, we observed that freezing the last few layers of the architecture significantly improved model training when using a photorealistic dataset, preventing the detail degradation that a small dataset can introduce.
>
> **Anatomy preservation**
> To dampen any possible degradation of anatomy, training only the attention layers and **not** the adaptive linear layers could help. For reference, below is one of the transformer blocks.

We've added `--lora_layers` and `--lora_blocks` to make the LoRA training modules configurable.
- with `--lora_blocks` you can specify the block numbers for training. E.g. passing
```diff
--lora_blocks "12,13,14,15,16,17,18,19,20,21,22,23,24,30,31,32,33,34,35,36,37"
```
will trigger LoRA training of transformer blocks 12-24 and 30-37. By default, all blocks are trained.
- with `--lora_layers` you can specify the types of layers you wish to train.
By default, the trained layers are
`attn.add_k_proj,attn.add_q_proj,attn.add_v_proj,attn.to_add_out,attn.to_k,attn.to_out.0,attn.to_q,attn.to_v`
If you wish to have a leaner LoRA / train more blocks over layers you could pass
```diff
+ --lora_layers attn.to_k,attn.to_q,attn.to_v,attn.to_out.0
```
This will reduce the LoRA size by roughly 50% for the same rank compared to the default.
However, if you're after compact LoRAs, it's our impression that maintaining the default setting for `--lora_layers` and
freezing some of the blocks is usually better.
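As a rough illustration of how block indices and layer names could combine into fully qualified target modules (the values and logic below are assumptions for the sketch, not the script's exact code):

```python
# Sketch: combining `--lora_blocks` and `--lora_layers` into target module names.
# Illustrative only; the training script's actual logic may differ.
lora_blocks = "12,13,14"                                      # block indices from the CLI
lora_layers = "attn.to_k,attn.to_q,attn.to_v,attn.to_out.0"   # layer names from the CLI

blocks = [int(block.strip()) for block in lora_blocks.split(",")]
layers = [layer.strip() for layer in lora_layers.split(",")]

# e.g. "transformer_blocks.12.attn.to_k", "transformer_blocks.12.attn.to_q", ...
target_modules = [f"transformer_blocks.{i}.{layer}" for i in blocks for layer in layers]
print(target_modules[:4])
```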
### Text Encoder Training

Alongside the transformer, LoRA fine-tuning of the CLIP text encoders is now also supported.
To do so, just specify `--train_text_encoder` while launching training. Please keep the following points in mind:
