
Commit d824451

Merge branch 'main' into sd3.5_IPAdapter
2 parents 68169f8 + 5ed761a commit d824451

File tree: 73 files changed (+8118 / -142 lines)


.github/workflows/push_tests.yml

Lines changed: 2 additions & 1 deletion

@@ -165,7 +165,8 @@ jobs:
    group: gcp-ct5lp-hightpu-8t
  container:
    image: diffusers/diffusers-flax-tpu
-   options: --shm-size "16gb" --ipc host --privileged ${{ vars.V5_LITEPOD_8_ENV}} -v /mnt/hf_cache:/mnt/hf_cache defaults:
+   options: --shm-size "16gb" --ipc host --privileged ${{ vars.V5_LITEPOD_8_ENV}} -v /mnt/hf_cache:/mnt/hf_cache
+ defaults:
    run:
      shell: bash
  steps:

docs/source/en/_toctree.yml

Lines changed: 10 additions & 0 deletions

@@ -270,6 +270,8 @@
    title: FluxTransformer2DModel
  - local: api/models/hunyuan_transformer2d
    title: HunyuanDiT2DModel
+ - local: api/models/hunyuan_video_transformer_3d
+   title: HunyuanVideoTransformer3DModel
  - local: api/models/latte_transformer3d
    title: LatteTransformer3DModel
  - local: api/models/lumina_nextdit2d

@@ -284,6 +286,8 @@
    title: PriorTransformer
  - local: api/models/sd3_transformer2d
    title: SD3Transformer2DModel
+ - local: api/models/sana_transformer2d
+   title: SanaTransformer2DModel
  - local: api/models/stable_audio_transformer
    title: StableAudioDiTModel
  - local: api/models/transformer2d

@@ -314,6 +318,8 @@
    title: AutoencoderKLAllegro
  - local: api/models/autoencoderkl_cogvideox
    title: AutoencoderKLCogVideoX
+ - local: api/models/autoencoder_kl_hunyuan_video
+   title: AutoencoderKLHunyuanVideo
  - local: api/models/autoencoderkl_ltx_video
    title: AutoencoderKLLTXVideo
  - local: api/models/autoencoderkl_mochi

@@ -392,6 +398,8 @@
    title: Flux
  - local: api/pipelines/hunyuandit
    title: Hunyuan-DiT
+ - local: api/pipelines/hunyuan_video
+   title: HunyuanVideo
  - local: api/pipelines/i2vgenxl
    title: I2VGen-XL
  - local: api/pipelines/pix2pix

@@ -434,6 +442,8 @@
    title: PixArt-α
  - local: api/pipelines/pixart_sigma
    title: PixArt-Σ
+ - local: api/pipelines/sana
+   title: Sana
  - local: api/pipelines/self_attention_guidance
    title: Self-Attention Guidance
  - local: api/pipelines/semantic_stable_diffusion

docs/source/en/api/loaders/lora.md

Lines changed: 15 additions & 0 deletions

@@ -17,6 +17,9 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
  - [`StableDiffusionLoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
  - [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`StableDiffusionLoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.
  - [`SD3LoraLoaderMixin`] provides similar functions for [Stable Diffusion 3](https://huggingface.co/blog/sd3).
+ - [`FluxLoraLoaderMixin`] provides similar functions for [Flux](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux).
+ - [`CogVideoXLoraLoaderMixin`] provides similar functions for [CogVideoX](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox).
+ - [`Mochi1LoraLoaderMixin`] provides similar functions for [Mochi](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi).
  - [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`].
  - [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, unload LoRAs, and more.

@@ -38,6 +41,18 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse

  [[autodoc]] loaders.lora_pipeline.SD3LoraLoaderMixin

+ ## FluxLoraLoaderMixin
+
+ [[autodoc]] loaders.lora_pipeline.FluxLoraLoaderMixin
+
+ ## CogVideoXLoraLoaderMixin
+
+ [[autodoc]] loaders.lora_pipeline.CogVideoXLoraLoaderMixin
+
+ ## Mochi1LoraLoaderMixin
+
+ [[autodoc]] loaders.lora_pipeline.Mochi1LoraLoaderMixin
+
  ## AmusedLoraLoaderMixin

  [[autodoc]] loaders.lora_pipeline.AmusedLoraLoaderMixin
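The mixins documented above are consumed through the pipeline-level LoRA API rather than instantiated directly. A minimal sketch for Flux; the LoRA repository id and adapter name are placeholders, not part of this commit:

```python
import torch
from diffusers import FluxPipeline

# FluxPipeline inherits from FluxLoraLoaderMixin, so the usual LoRA entry points
# (load_lora_weights, fuse_lora, unload_lora_weights, ...) are available on it.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Placeholder LoRA repository id; substitute any Flux-compatible LoRA checkpoint.
pipe.load_lora_weights("some-user/some-flux-lora", adapter_name="example")

# Optionally merge the LoRA into the base weights for inference, then undo it.
pipe.fuse_lora()
pipe.unfuse_lora()
pipe.unload_lora_weights()
```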
docs/source/en/api/models/autoencoder_kl_hunyuan_video.md (new file)

Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
+ <!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations under the License. -->
+
+ # AutoencoderKLHunyuanVideo
+
+ The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo](https://github.com/Tencent/HunyuanVideo/), introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.
+
+ The model can be loaded with the following code snippet.
+
+ ```python
+ import torch
+ from diffusers import AutoencoderKLHunyuanVideo
+
+ vae = AutoencoderKLHunyuanVideo.from_pretrained("tencent/HunyuanVideo", torch_dtype=torch.float16)
+ ```
+
+ ## AutoencoderKLHunyuanVideo
+
+ [[autodoc]] AutoencoderKLHunyuanVideo
+   - decode
+   - all
+
+ ## DecoderOutput
+
+ [[autodoc]] models.autoencoders.vae.DecoderOutput
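Beyond loading, the `decode` entry listed in the autodoc block is typically exercised together with `encode`. A rough sketch under assumed tensor shapes (batch, channels, frames, height, width); the dimensions are illustrative, not prescribed by this commit:

```python
import torch
from diffusers import AutoencoderKLHunyuanVideo

vae = AutoencoderKLHunyuanVideo.from_pretrained(
    "tencent/HunyuanVideo", torch_dtype=torch.float16
).to("cuda")

# Dummy video batch: (batch, channels, frames, height, width); 17 frames = 4 * 4 + 1.
video = torch.randn(1, 3, 17, 256, 256, dtype=torch.float16, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()
    reconstruction = vae.decode(latents).sample
```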
docs/source/en/api/models/hunyuan_video_transformer_3d.md (new file)

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@
+ <!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations under the License. -->
+
+ # HunyuanVideoTransformer3DModel
+
+ A Diffusion Transformer model for 3D video-like data, introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.
+
+ The model can be loaded with the following code snippet.
+
+ ```python
+ import torch
+ from diffusers import HunyuanVideoTransformer3DModel
+
+ transformer = HunyuanVideoTransformer3DModel.from_pretrained("tencent/HunyuanVideo", torch_dtype=torch.bfloat16)
+ ```
+
+ ## HunyuanVideoTransformer3DModel
+
+ [[autodoc]] HunyuanVideoTransformer3DModel
+
+ ## Transformer2DModelOutput
+
+ [[autodoc]] models.modeling_outputs.Transformer2DModelOutput
docs/source/en/api/models/sana_transformer2d.md (new file)

Lines changed: 34 additions & 0 deletions

@@ -0,0 +1,34 @@
+ <!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations under the License. -->
+
+ # SanaTransformer2DModel
+
+ A Diffusion Transformer model for 2D data, introduced in [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) by NVIDIA and MIT HAN Lab (Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han).
+
+ The abstract from the paper is:
+
+ *We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*
+
+ The model can be loaded with the following code snippet.
+
+ ```python
+ import torch
+ from diffusers import SanaTransformer2DModel
+
+ transformer = SanaTransformer2DModel.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="transformer", torch_dtype=torch.float16)
+ ```
+
+ ## SanaTransformer2DModel
+
+ [[autodoc]] SanaTransformer2DModel
+
+ ## Transformer2DModelOutput
+
+ [[autodoc]] models.modeling_outputs.Transformer2DModelOutput
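The transformer above is normally driven through the Sana pipeline added elsewhere in this commit (see the `api/pipelines/sana` toctree entry). A hedged end-to-end sketch; the dtype choice, prompt, step count, and output filename are illustrative assumptions rather than settings taken from the diff:

```python
import torch
from diffusers import SanaPipeline

# Load the full pipeline; the repo id mirrors the transformer snippet above.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    prompt="a tiny astronaut hatching from an egg on the moon",
    num_inference_steps=20,
).images[0]
image.save("sana_example.png")
```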
docs/source/en/api/pipelines/hunyuan_video.md (new file)

Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@
+ <!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License. -->
+
+ # HunyuanVideo
+
+ [HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.
+
+ *Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/Tencent/HunyuanVideo).*
+
+ <Tip>
+
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+ </Tip>
+
+ Recommendations for inference:
+ - Both text encoders should be in `torch.float16`.
+ - The transformer should be in `torch.bfloat16`.
+ - The VAE should be in `torch.float16`.
+ - `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
+ - For smaller resolution images, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution images, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
+ - For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
+
+ ## HunyuanVideoPipeline
+
+ [[autodoc]] HunyuanVideoPipeline
+   - all
+   - __call__
+
+ ## HunyuanVideoPipelineOutput
+
+ [[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput
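A text-to-video sketch following the dtype, `num_frames`, and `shift` recommendations above. The checkpoint id mirrors the snippets in this commit; the prompt, resolution, frame/step counts, and fps are illustrative, and the exact repository layout (for example whether the transformer needs a `subfolder` argument) may differ:

```python
import torch
from diffusers import (
    FlowMatchEulerDiscreteScheduler,
    HunyuanVideoPipeline,
    HunyuanVideoTransformer3DModel,
)
from diffusers.utils import export_to_video

model_id = "tencent/HunyuanVideo"

# Transformer in bfloat16, remaining components (text encoders, VAE) in float16.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)

# Optionally pick a shift suited to the target resolution
# (lower for smaller resolutions, higher for larger ones).
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=7.0
)
pipe.to("cuda")

# num_frames follows the 4 * k + 1 rule (k = 15 -> 61 frames).
video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
).frames[0]

export_to_video(video, "output.mp4", fps=15)
```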

docs/source/en/api/pipelines/ltx_video.md

Lines changed: 18 additions & 6 deletions

@@ -31,26 +31,38 @@ import torch
  from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel

  single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
- transformer = LTXVideoTransformer3DModel.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
+ transformer = LTXVideoTransformer3DModel.from_single_file(
+     single_file_url, torch_dtype=torch.bfloat16
+ )
  vae = AutoencoderKLLTXVideo.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
- pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
+ pipe = LTXImageToVideoPipeline.from_pretrained(
+     "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16
+ )

  # ... inference code ...
  ```

- Alternatively, the pipeline can be used to load the weights with [~FromSingleFileMixin.from_single_file`].
+ Alternatively, the pipeline can be used to load the weights with [`~FromSingleFileMixin.from_single_file`].

  ```python
  import torch
  from diffusers import LTXImageToVideoPipeline
  from transformers import T5EncoderModel, T5Tokenizer

  single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
- text_encoder = T5EncoderModel.from_pretrained("Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16)
- tokenizer = T5Tokenizer.from_pretrained("Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16)
- pipe = LTXImageToVideoPipeline.from_single_file(single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16)
+ text_encoder = T5EncoderModel.from_pretrained(
+     "Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
+ )
+ tokenizer = T5Tokenizer.from_pretrained(
+     "Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16
+ )
+ pipe = LTXImageToVideoPipeline.from_single_file(
+     single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16
+ )
  ```

+ Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
+
  ## LTXPipeline

  [[autodoc]] LTXPipeline
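The memory-consumption pointer added above links to the CogVideoX docs; the same generic diffusers switches apply to the LTX pipelines. A brief sketch (these are standard pipeline and VAE methods, not steps taken from this commit):

```python
import torch
from diffusers import LTXImageToVideoPipeline

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
)

# Keep only the currently active sub-module on the GPU.
pipe.enable_model_cpu_offload()

# Decode latents tile by tile to lower peak VAE memory.
pipe.vae.enable_tiling()
```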
