
Commit e2b5723

Merge branch 'main' into handle-missing-keys-lora
2 parents dfd9865 + 22ed39f commit e2b5723

28 files changed: +7836 / -22 lines changed

docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -242,6 +242,8 @@
     title: AuraFlowTransformer2DModel
   - local: api/models/cogvideox_transformer3d
     title: CogVideoXTransformer3DModel
+  - local: api/models/cogview3plus_transformer2d
+    title: CogView3PlusTransformer2DModel
   - local: api/models/dit_transformer2d
     title: DiTTransformer2DModel
   - local: api/models/flux_transformer
@@ -320,6 +322,8 @@
     title: BLIP-Diffusion
   - local: api/pipelines/cogvideox
     title: CogVideoX
+  - local: api/pipelines/cogview3
+    title: CogView3
   - local: api/pipelines/consistency_models
     title: Consistency Models
   - local: api/pipelines/controlnet

docs/source/en/api/models/cogview3plus_transformer2d.md

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# CogView3PlusTransformer2DModel

A Diffusion Transformer model for 2D data from [CogView3Plus](https://github.com/THUDM/CogView3) was introduced in [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import CogView3PlusTransformer2DModel

transformer = CogView3PlusTransformer2DModel.from_pretrained("THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```
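
To use the transformer for generation rather than on its own, one option is to pass it to the text-to-image pipeline as a component override. The sketch below is only illustrative and assumes the same `THUDM/CogView3Plus-3b` repository referenced above also hosts the full pipeline.

```python
import torch

from diffusers import CogView3PlusPipeline, CogView3PlusTransformer2DModel

# Load the transformer separately, e.g. to control its dtype.
transformer = CogView3PlusTransformer2DModel.from_pretrained(
    "THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Hand the preloaded transformer to the pipeline; the remaining components are loaded as usual.
pipe = CogView3PlusPipeline.from_pretrained(
    "THUDM/CogView3Plus-3b", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
```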

## CogView3PlusTransformer2DModel

[[autodoc]] CogView3PlusTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

docs/source/en/api/pipelines/cogview3.md

Lines changed: 40 additions & 0 deletions

@@ -0,0 +1,40 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# CogView3Plus

[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang.

The abstract from the paper is:

*Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found at [THUDM/CogView3](https://github.com/THUDM/CogView3), and the original weights are available under [hf.co/THUDM](https://huggingface.co/THUDM).
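
As a rough end-to-end sketch, the pipeline can be loaded and run like other diffusers text-to-image pipelines; the `THUDM/CogView3Plus-3b` checkpoint id is assumed from the transformer documentation above, and the prompt and inference settings are only illustrative.

```python
import torch

from diffusers import CogView3PlusPipeline

# Load the full text-to-image pipeline in bfloat16 and move it to the GPU.
pipe = CogView3PlusPipeline.from_pretrained(
    "THUDM/CogView3Plus-3b", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a photograph of a red panda sitting on a wooden bench, autumn leaves in the background"
image = pipe(prompt=prompt, num_inference_steps=50, guidance_scale=7.0).images[0]
image.save("cogview3_output.png")
```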

## CogView3PlusPipeline

[[autodoc]] CogView3PlusPipeline
  - all
  - __call__

## CogView3PipelineOutput

[[autodoc]] pipelines.cogview3.pipeline_output.CogView3PipelineOutput

examples/community/README.md

Lines changed: 56 additions & 10 deletions
@@ -73,7 +73,8 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif
 | Stable Diffusion BoxDiff Pipeline | Training-free controlled generation with bounding boxes using [BoxDiff](https://github.com/showlab/BoxDiff) | [Stable Diffusion BoxDiff Pipeline](#stable-diffusion-boxdiff) | - | [Jingyang Zhang](https://github.com/zjysteven/) |
 | FRESCO V2V Pipeline | Implementation of [[CVPR 2024] FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation](https://arxiv.org/abs/2403.12962) | [FRESCO V2V Pipeline](#fresco) | - | [Yifan Zhou](https://github.com/SingleZombie) |
 | AnimateDiff IPEX Pipeline | Accelerate AnimateDiff inference pipeline with BF16/FP32 precision on Intel Xeon CPUs with [IPEX](https://github.com/intel/intel-extension-for-pytorch) | [AnimateDiff on IPEX](#animatediff-on-ipex) | - | [Dan Li](https://github.com/ustcuna/) |
-| HunyuanDiT Differential Diffusion Pipeline | Applies [Differential Diffsuion](https://github.com/exx8/differential-diffusion) to [HunyuanDiT](https://github.com/huggingface/diffusers/pull/8240). | [HunyuanDiT with Differential Diffusion](#hunyuandit-with-differential-diffusion) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1v44a5fpzyr4Ffr4v2XBQ7BajzG874N4P?usp=sharing) | [Monjoy Choudhury](https://github.com/MnCSSJ4x) |
+| HunyuanDiT Differential Diffusion Pipeline | Applies [Differential Diffusion](https://github.com/exx8/differential-diffusion) to [HunyuanDiT](https://github.com/huggingface/diffusers/pull/8240). | [HunyuanDiT with Differential Diffusion](#hunyuandit-with-differential-diffusion) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1v44a5fpzyr4Ffr4v2XBQ7BajzG874N4P?usp=sharing) | [Monjoy Choudhury](https://github.com/MnCSSJ4x) |
+| [🪆Matryoshka Diffusion Models](https://huggingface.co/papers/2310.15111) | A diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small scale inputs are nested within those of the large scales. See [original codebase](https://github.com/apple/ml-mdm). | [🪆Matryoshka Diffusion Models](#matryoshka-diffusion-models) | [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/pcuenq/mdm) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/tolgacangoz/1f54875fc7aeaabcf284ebde64820966/matryoshka_hf.ipynb) | [M. Tolga Cangöz](https://github.com/tolgacangoz) |
 
 To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly.
 
@@ -85,28 +86,28 @@ pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion
 
 ### Flux with CFG
 
-Know more about Flux [here](https://blackforestlabs.ai/announcing-black-forest-labs/). Since Flux doesn't use CFG, this implementation provides one, inspired by the [PuLID Flux adaptation](https://github.com/ToTheBeginning/PuLID/blob/main/docs/pulid_for_flux.md). 
+Know more about Flux [here](https://blackforestlabs.ai/announcing-black-forest-labs/). Since Flux doesn't use CFG, this implementation provides one, inspired by the [PuLID Flux adaptation](https://github.com/ToTheBeginning/PuLID/blob/main/docs/pulid_for_flux.md).
 
 Example usage:
 
 ```py
 from diffusers import DiffusionPipeline
-import torch 
+import torch
 
 pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev", 
-    torch_dtype=torch.bfloat16, 
+    "black-forest-labs/FLUX.1-dev",
+    torch_dtype=torch.bfloat16,
     custom_pipeline="pipeline_flux_with_cfg"
 )
 pipeline.enable_model_cpu_offload()
 prompt = "a watercolor painting of a unicorn"
 negative_prompt = "pink"
 
 img = pipeline(
-    prompt=prompt, 
-    negative_prompt=negative_prompt, 
-    true_cfg=1.5, 
-    guidance_scale=3.5, 
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    true_cfg=1.5,
+    guidance_scale=3.5,
     num_images_per_prompt=1,
     generator=torch.manual_seed(0)
 ).images[0]
@@ -2656,7 +2657,7 @@ image with mask mech_painted.png
 
 <img src=https://github.com/noskill/diffusers/assets/733626/c334466a-67fe-4377-9ff7-f46021b9c224 width="25%" >
 
-result: 
+result:
 
 <img src=https://github.com/noskill/diffusers/assets/733626/5043fb57-a785-4606-a5ba-a36704f7cb42 width="25%" >
 
@@ -4324,6 +4325,51 @@ image = pipe(
 
 A colab notebook demonstrating all results can be found [here](https://colab.research.google.com/drive/1v44a5fpzyr4Ffr4v2XBQ7BajzG874N4P?usp=sharing). Depth Maps have also been added in the same colab.
 
+### 🪆Matryoshka Diffusion Models
+
+![🪆Matryoshka Diffusion Models](https://github.com/user-attachments/assets/bf90b53b-48c3-4769-a805-d9dfe4a7c572)
+
+The abstract of the paper:
+>Diffusion models are the _de-facto_ approach for generating high-quality images and videos but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space, or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion (MDM), **a novel framework for high-resolution image and video synthesis**. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a **NestedUNet** architecture where features and parameters for small scale inputs are nested within those of the large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a **_single pixel-space model_ at resolutions of up to 1024 × 1024 pixels**, demonstrating strong zero shot generalization using the **CC12M dataset, which contains only 12 million images**. Code and pre-trained checkpoints are released at https://github.com/apple/ml-mdm.
+
+- `64×64, nesting_level=0`: 1.719 GiB. With `50` DDIM inference steps:
+
+**64x64**
+:-------------------------:
+| <img src="https://github.com/user-attachments/assets/9e7bb2cd-45a0-4bd1-adb8-23e283baed39" width="222" height="222" alt="bird_64"> |
+
+- `256×256, nesting_level=1`: 1.776 GiB. With `150` DDIM inference steps:
+
+**64x64** | **256x256**
+:-------------------------:|:-------------------------:
+| <img src="https://github.com/user-attachments/assets/6b724c2e-5e6a-4b63-9b65-c1182cbb67e0" width="222" height="222" alt="64x64"> | <img src="https://github.com/user-attachments/assets/7dbab2ad-bf40-4a73-ab04-f178347cb7d5" width="222" height="222" alt="256x256"> |
+
+- `1024×1024, nesting_level=2`: 1.792 GiB. As one can see, the cost of adding another nesting level is negligible. With `250` DDIM inference steps:
+
+**64x64** | **256x256** | **1024x1024**
+:-------------------------:|:-------------------------:|:-------------------------:
+| <img src="https://github.com/user-attachments/assets/4a9454e4-e20a-4736-a196-270e2ae796c0" width="222" height="222" alt="64x64"> | <img src="https://github.com/user-attachments/assets/4a96555d-0fda-4303-82b1-a4d886f770b9" width="222" height="222" alt="256x256"> | <img src="https://github.com/user-attachments/assets/e0239b7a-ab73-4d45-8f3e-b4e6b4b50abe" width="222" height="222" alt="1024x1024"> |
+
+```py
+from diffusers import DiffusionPipeline
+from diffusers.utils import make_image_grid
+
+# nesting_level=0 -> 64x64; nesting_level=1 -> 256x256 - 64x64; nesting_level=2 -> 1024x1024 - 256x256 - 64x64
+pipe = DiffusionPipeline.from_pretrained("tolgacangoz/matryoshka-diffusion-models",
+    nesting_level=0,
+    trust_remote_code=False,  # One needs to give permission for this code to run
+).to("cuda")
+
+prompt0 = "a blue jay stops on the top of a helmet of Japanese samurai, background with sakura tree"
+prompt = f"breathtaking {prompt0}. award-winning, professional, highly detailed"
+negative_prompt = "deformed, mutated, ugly, disfigured, blur, blurry, noise, noisy"
+image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=50).images
+make_image_grid(image, rows=1, cols=len(image))
+
+# pipe.change_nesting_level(<int>) # 0, 1, or 2
+# 50+, 100+, and 250+ num_inference_steps are recommended for nesting levels 0, 1, and 2 respectively.
+```
+
 
 # Perturbed-Attention Guidance
 
 [Project](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) / [arXiv](https://arxiv.org/abs/2403.17377) / [GitHub](https://github.com/KU-CVLAB/Perturbed-Attention-Guidance)
