Commit 155a846

Merge remote-tracking branch 'refs/remotes/origin/main' into sd3-xformers
# Conflicts:
#	src/diffusers/models/attention_processor.py

2 parents: e97bc2a + bbd2f9d

File tree: 85 files changed (+13661, −87 lines)


.gitignore

Lines changed: 1 addition & 1 deletion

@@ -175,4 +175,4 @@ tags
 .ruff_cache

 # wandb
-wandb
+wandb

docs/source/en/_toctree.yml

Lines changed: 14 additions & 0 deletions

@@ -249,6 +249,12 @@
   title: DiTTransformer2DModel
 - local: api/models/hunyuan_transformer2d
   title: HunyuanDiT2DModel
+- local: api/models/aura_flow_transformer2d
+  title: AuraFlowTransformer2DModel
+- local: api/models/latte_transformer3d
+  title: LatteTransformer3DModel
+- local: api/models/lumina_nextdit2d
+  title: LuminaNextDiT2DModel
 - local: api/models/transformer_temporal
   title: TransformerTemporalModel
 - local: api/models/sd3_transformer2d
@@ -276,6 +282,8 @@
   title: AudioLDM
 - local: api/pipelines/audioldm2
   title: AudioLDM 2
+- local: api/pipelines/aura_flow
+  title: AuraFlow
 - local: api/pipelines/auto_pipeline
   title: AutoPipeline
 - local: api/pipelines/blip_diffusion
@@ -318,12 +326,16 @@
   title: Kandinsky 2.2
 - local: api/pipelines/kandinsky3
   title: Kandinsky 3
+- local: api/pipelines/kolors
+  title: Kolors
 - local: api/pipelines/latent_consistency_models
   title: Latent Consistency Models
 - local: api/pipelines/latent_diffusion
   title: Latent Diffusion
 - local: api/pipelines/ledits_pp
   title: LEDITS++
+- local: api/pipelines/lumina
+  title: Lumina-T2X
 - local: api/pipelines/marigold
   title: Marigold
 - local: api/pipelines/panorama
@@ -435,6 +447,8 @@
   title: EulerDiscreteScheduler
 - local: api/schedulers/flow_match_euler_discrete
   title: FlowMatchEulerDiscreteScheduler
+- local: api/schedulers/flow_match_heun_discrete
+  title: FlowMatchHeunDiscreteScheduler
 - local: api/schedulers/heun
   title: HeunDiscreteScheduler
 - local: api/schedulers/ipndm
docs/source/en/api/models/aura_flow_transformer2d.md

Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AuraFlowTransformer2DModel

A Transformer model for image-like data from [AuraFlow](https://blog.fal.ai/auraflow/).
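A minimal loading sketch is shown below; the `fal/AuraFlow` repository id and the `transformer` subfolder layout are assumptions about the released diffusers-format checkpoint, so substitute the repository you actually use:

```python
import torch

from diffusers import AuraFlowTransformer2DModel

# Load only the transformer component from a full AuraFlow pipeline repository.
# "fal/AuraFlow" is an assumed Hub repository id.
transformer = AuraFlowTransformer2DModel.from_pretrained(
    "fal/AuraFlow", subfolder="transformer", torch_dtype=torch.float16
)
```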
## AuraFlowTransformer2DModel

[[autodoc]] AuraFlowTransformer2DModel
docs/source/en/api/models/latte_transformer3d.md

Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LatteTransformer3DModel

A Diffusion Transformer model for 3D data from [Latte](https://github.com/Vchitect/Latte).
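A loading sketch for the standalone transformer; the `maxin-cn/Latte-1` repository id and the `transformer` subfolder layout are assumptions about a diffusers-format Latte checkpoint, so point these at the checkpoint you actually use:

```python
import torch

from diffusers import LatteTransformer3DModel

# Repository id and subfolder are assumptions; adjust them for your checkpoint.
transformer = LatteTransformer3DModel.from_pretrained(
    "maxin-cn/Latte-1", subfolder="transformer", torch_dtype=torch.float16
)
```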
## LatteTransformer3DModel

[[autodoc]] LatteTransformer3DModel
docs/source/en/api/models/lumina_nextdit2d.md

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LuminaNextDiT2DModel

A next-generation Diffusion Transformer model for 2D data from [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X).
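A loading sketch using the `Alpha-VLLM/Lumina-Next-SFT-diffusers` checkpoint referenced in the Lumina pipeline documentation; the `transformer` subfolder layout is an assumption:

```python
import torch

from diffusers import LuminaNextDiT2DModel

# Load only the transformer component of the Lumina-Next pipeline repository.
transformer = LuminaNextDiT2DModel.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
```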
## LuminaNextDiT2DModel

[[autodoc]] LuminaNextDiT2DModel

docs/source/en/api/pipelines/animatediff.md

Lines changed: 14 additions & 0 deletions

@@ -560,6 +560,20 @@ export_to_gif(frames, "animatelcm-motion-lora.gif")
 </table>

+## Using `from_single_file` with the MotionAdapter
+
+`diffusers>=0.30.0` supports loading AnimateDiff checkpoints into the `MotionAdapter` in their original format via `from_single_file`:
+
+```python
+import torch
+
+from diffusers import AnimateDiffPipeline, MotionAdapter
+
+# load the original-format motion module directly from the Hub
+ckpt_path = "https://huggingface.co/Lightricks/LongAnimateDiff/blob/main/lt_long_mm_32_frames.ckpt"
+
+adapter = MotionAdapter.from_single_file(ckpt_path, torch_dtype=torch.float16)
+pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
+```
+
 ## AnimateDiffPipeline

 [[autodoc]] AnimateDiffPipeline
docs/source/en/api/pipelines/aura_flow.md

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AuraFlow

AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3.md) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. It achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.

It was developed by the Fal team, and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/).

<Tip>

AuraFlow can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details.

</Tip>
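A minimal text-to-image sketch; the `fal/AuraFlow` repository id and the generation settings below are assumptions, so adjust them for your setup:

```python
import torch

from diffusers import AuraFlowPipeline

# "fal/AuraFlow" is an assumed Hub repository id for the released weights.
pipeline = AuraFlowPipeline.from_pretrained("fal/AuraFlow", torch_dtype=torch.float16).to("cuda")

image = pipeline(
    prompt="a photo of a corgi astronaut planting a flag on the moon",
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("auraflow.png")
```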
## AuraFlowPipeline

[[autodoc]] AuraFlowPipeline
- all
- __call__
docs/source/en/api/pipelines/kolors.md

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png)
Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by the Kuaishou Kolors team. Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).
The abstract from the technical report is:

*We present Kolors, a latent diffusion model for text-to-image synthesis, characterized by its profound understanding of both English and Chinese, as well as an impressive degree of photorealism. There are three key insights contributing to the development of Kolors. Firstly, unlike large language model T5 used in Imagen and Stable Diffusion 3, Kolors is built upon the General Language Model (GLM), which enhances its comprehension capabilities in both English and Chinese. Moreover, we employ a multimodal large language model to recaption the extensive training dataset for fine-grained text understanding. These strategies significantly improve Kolors’ ability to comprehend intricate semantics, particularly those involving multiple entities, and enable its advanced text rendering capabilities. Secondly, we divide the training of Kolors into two phases: the concept learning phase with broad knowledge and the quality improvement phase with specifically curated high-aesthetic data. Furthermore, we investigate the critical role of the noise schedule and introduce a novel schedule to optimize high-resolution image generation. These strategies collectively enhance the visual appeal of the generated high-resolution images. Lastly, we propose a category-balanced benchmark KolorsPrompts, which serves as a guide for the training and evaluation of Kolors. Consequently, even when employing the commonly used U-Net backbone, Kolors has demonstrated remarkable performance in human evaluations, surpassing the existing open-source models and achieving Midjourney-v6 level performance, especially in terms of visual appeal. We will release the code and weights of Kolors at <https://github.com/Kwai-Kolors/Kolors>, and hope that it will benefit future research and applications in the visual generation community.*
## Usage Example

```python
import torch

from diffusers import DPMSolverMultistepScheduler, KolorsPipeline

pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

image = pipe(
    # "A photo of a ladybug, macro, zoom, high quality, cinematic, holding a sign that reads '可图' (Kolors)"
    prompt='一张瓢虫的照片,微距,变焦,高质量,电影,拿着一个牌子,写着"可图"',
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
).images[0]

image.save("kolors_sample.png")
```
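If the full fp16 pipeline does not fit in GPU memory, the generic `DiffusionPipeline` CPU-offloading API is one option. A minimal sketch, assuming you replace `pipe.to("cuda")` with the offload call:

```python
import torch

from diffusers import KolorsPipeline

pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16")
# Keep sub-models on CPU and move each one to the GPU only while it runs.
pipe.enable_model_cpu_offload()

image = pipe(prompt="A ladybug holding a sign that reads 'Kolors'", num_inference_steps=25).images[0]
```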
## KolorsPipeline

[[autodoc]] KolorsPipeline
- all
- __call__
docs/source/en/api/pipelines/lumina.md

Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Lumina-T2X

![concepts](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a)

[Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/assets/lumina-next.pdf) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory.

The abstract from the paper is:

*Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.*

**Highlights**: Lumina-Next is a next-generation Diffusion Transformer that significantly enhances text-to-image generation, multilingual generation, and multitask performance by introducing the Next-DiT architecture, 3D RoPE, and frequency- and time-aware RoPE, among other improvements.
Lumina-Next has the following components:
* It improves sampling efficiency with fewer and faster steps.
* It uses Next-DiT as the transformer backbone, with sandwich normalization, 3D RoPE, and Grouped-Query Attention.
* It uses Frequency- and Time-Aware Scaled RoPE.
---

[Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](https://arxiv.org/abs/2405.05945) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory.

The abstract from the paper is:

*Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.*

You can find the original codebase at [Alpha-VLLM](https://github.com/Alpha-VLLM/Lumina-T2X) and all the available checkpoints at [Alpha-VLLM Lumina Family](https://huggingface.co/collections/Alpha-VLLM/lumina-family-66423205bedb81171fd0644b).

**Highlights**: Lumina-T2X supports any modality, resolution, and duration.

Lumina-T2X has the following components:
* It uses a Flow-based Large Diffusion Transformer as the backbone.
* It supports any modality with one backbone and the corresponding encoder and decoder.
<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

### Inference (Text-to-Image)

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch

from diffusers import LuminaText2ImgPipeline

pipeline = LuminaText2ImgPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)

image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures").images[0]
```
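If the pipeline does not fit into GPU memory, model offloading is an alternative to placing everything on the device up front. This is a minimal sketch using the generic `DiffusionPipeline` offloading API rather than anything Lumina-specific:

```python
import torch

from diffusers import LuminaText2ImgPipeline

pipeline = LuminaText2ImgPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
)
# Move each sub-model to the GPU only while it is needed, then back to CPU.
pipeline.enable_model_cpu_offload()

image = pipeline(prompt="A photo of a red panda reading a book in a library").images[0]
```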
## LuminaText2ImgPipeline

[[autodoc]] LuminaText2ImgPipeline
- all
- __call__

docs/source/en/api/pipelines/pag.md

Lines changed: 5 additions & 0 deletions

@@ -25,6 +25,11 @@ The abstract from the paper is:
 - all
 - __call__

+## StableDiffusionControlNetPAGPipeline
+[[autodoc]] StableDiffusionControlNetPAGPipeline
+- all
+- __call__
+
 ## StableDiffusionXLPAGPipeline
 [[autodoc]] StableDiffusionXLPAGPipeline
 - all
