
Commit 0f6bec9

Merge branch 'main' into fixes-issue-10872
2 parents 56a7718 + cc22058 commit 0f6bec9

112 files changed: +6169 -1378 lines changed


docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
@@ -290,6 +290,8 @@
     title: CogView4Transformer2DModel
   - local: api/models/dit_transformer2d
     title: DiTTransformer2DModel
+  - local: api/models/easyanimate_transformer3d
+    title: EasyAnimateTransformer3DModel
   - local: api/models/flux_transformer
     title: FluxTransformer2DModel
   - local: api/models/hunyuan_transformer2d
@@ -352,6 +354,8 @@
     title: AutoencoderKLHunyuanVideo
   - local: api/models/autoencoderkl_ltx_video
     title: AutoencoderKLLTXVideo
+  - local: api/models/autoencoderkl_magvit
+    title: AutoencoderKLMagvit
   - local: api/models/autoencoderkl_mochi
     title: AutoencoderKLMochi
   - local: api/models/autoencoder_kl_wan
@@ -430,6 +434,8 @@
     title: DiffEdit
   - local: api/pipelines/dit
     title: DiT
+  - local: api/pipelines/easyanimate
+    title: EasyAnimate
   - local: api/pipelines/flux
     title: Flux
   - local: api/pipelines/control_flux_inpaint
docs/source/en/api/models/autoencoderkl_magvit.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLMagvit

The 3D variational autoencoder (VAE) model with KL loss used in [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLMagvit

vae = AutoencoderKLMagvit.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```

## AutoencoderKLMagvit

[[autodoc]] AutoencoderKLMagvit
  - decode
  - encode
  - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
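
For illustration, here is a minimal sketch of round-tripping a video tensor through this VAE. It assumes the class follows the standard diffusers autoencoder interface listed above (`encode` returning a latent distribution, `decode` returning a `DecoderOutput`) and accepts inputs shaped `[batch, channels, frames, height, width]`; the tensor shape is arbitrary and chosen only to keep the example small.

```python
import torch
from diffusers import AutoencoderKLMagvit

vae = AutoencoderKLMagvit.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy video batch: [batch, channels, frames, height, width] (assumed layout, for illustration only).
video = torch.randn(1, 3, 9, 64, 64, dtype=torch.float16, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # compressed spatio-temporal latents
    reconstruction = vae.decode(latents).sample       # decoded back to pixel space

print(latents.shape, reconstruction.shape)
```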
docs/source/en/api/models/easyanimate_transformer3d.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# EasyAnimateTransformer3DModel

A Diffusion Transformer model for 3D data from [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```

## EasyAnimateTransformer3DModel

[[autodoc]] EasyAnimateTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
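
As a follow-up to the loading snippet above, a short sketch of handing a separately loaded transformer back to `EasyAnimatePipeline`, with the remaining components pulled from the same repository. This mirrors the quantization example on the EasyAnimate pipeline page further down and is only a sketch.

```python
import torch
from diffusers import EasyAnimatePipeline, EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16
)

# Pass the pre-loaded transformer so the pipeline reuses it instead of loading its own copy.
pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", transformer=transformer, torch_dtype=torch.float16
)
```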
docs/source/en/api/pipelines/easyanimate.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# EasyAnimate

[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.

The description from its GitHub page:
*EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.*

This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://github.com/aigc-apps/EasyAnimate). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There are two official EasyAnimate checkpoints available for control-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 |

For the EasyAnimateV5.1 series:
- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions; width and height can vary from 256 to 1024.
- Both the T2V and I2V models support generation with 1 to 49 frames and work best at 49 frames. Exporting videos at 8 FPS is recommended (see the text-to-video sketch after this list).
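
To make the resolution and frame-count guidance above concrete, here is a minimal text-to-video sketch. It assumes the pipeline accepts explicit `height`/`width` arguments and that CPU offloading is enough to fit the 12B checkpoint; see the quantization example below for a lower-memory variant.

```py
import torch
from diffusers import EasyAnimatePipeline
from diffusers.utils import export_to_video

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()  # keeps peak GPU memory manageable for the 12B model

video = pipeline(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="bad detailed",
    height=512,   # height/width assumed supported; the recommended range above is 256-1024
    width=512,
    num_frames=49,          # the recommended maximum for EasyAnimateV5.1
    num_inference_steps=30,
).frames[0]
export_to_video(video, "cat_t2v.mp4", fps=8)  # 8 FPS export, as recommended above
```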
## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`EasyAnimatePipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
```

## EasyAnimatePipeline

[[autodoc]] EasyAnimatePipeline
  - all
  - __call__

## EasyAnimatePipelineOutput

[[autodoc]] pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput

docs/source/en/conceptual/evaluation.md

Lines changed: 5 additions & 0 deletions
@@ -16,6 +16,11 @@ specific language governing permissions and limitations under the License.
 <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
 </a>

+> [!TIP]
+> This document has become outdated now that dedicated evaluation frameworks for image-generation diffusion models exist. Please check
+> out works such as [HEIM](https://crfm.stanford.edu/helm/heim/latest/), [T2I-Compbench](https://arxiv.org/abs/2307.06350), and
+> [GenEval](https://arxiv.org/abs/2310.11513).
+
 Evaluation of generative models like [Stable Diffusion](https://huggingface.co/docs/diffusers/stable_diffusion) is subjective in nature. But as practitioners and researchers, we often have to make careful choices amongst many different possibilities. So, when working with different generative models (like GANs, Diffusion, etc.), how do we choose one over the other?

 Qualitative evaluation of such models can be error-prone and might incorrectly influence a decision.

docs/source/en/using-diffusers/callback.md

Lines changed: 78 additions & 0 deletions
@@ -157,6 +157,84 @@ pipeline(
 )
 ```

+## IP Adapter Cutoff
+
+IP Adapter is an image prompt adapter that can be used with diffusion models without any changes to the underlying model. The IP Adapter cutoff callback disables the IP Adapter after a certain number of steps. To set up the callback, specify the denoising step after which it takes effect using one of these two arguments:
+
+- `cutoff_step_ratio`: Float with the cutoff step expressed as a ratio of the total number of steps.
+- `cutoff_step_index`: Integer with the exact index of the cutoff step.
+
+First, download the diffusion model and load its IP Adapter:
+
+```py
+from diffusers import AutoPipelineForText2Image
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+pipeline.set_ip_adapter_scale(0.6)
+```
+
+The setup for the callback should look something like this:
+
+```py
+from diffusers import AutoPipelineForText2Image
+from diffusers.callbacks import IPAdapterScaleCutoffCallback
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter(
+    "h94/IP-Adapter",
+    subfolder="sdxl_models",
+    weight_name="ip-adapter_sdxl.bin"
+)
+
+pipeline.set_ip_adapter_scale(0.6)
+
+callback = IPAdapterScaleCutoffCallback(
+    cutoff_step_ratio=None,
+    cutoff_step_index=5
+)
+
+image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png"
+)
+
+generator = torch.Generator(device="cuda").manual_seed(2628670641)
+
+images = pipeline(
+    prompt="a tiger sitting in a chair drinking orange juice",
+    ip_adapter_image=image,
+    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
+    generator=generator,
+    num_inference_steps=50,
+    callback_on_step_end=callback,
+).images
+
+images[0].save("custom_callback_img.png")
+```
+
+<div class="flex gap-4">
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/without_callback.png" alt="generated image of a tiger sitting in a chair drinking orange juice" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">without IPAdapterScaleCutoffCallback</figcaption>
+  </div>
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/with_callback2.png" alt="generated image of a tiger sitting in a chair drinking orange juice with ip adapter callback" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">with IPAdapterScaleCutoffCallback</figcaption>
+  </div>
+</div>
+
 ## Display image after each generation step

 > [!TIP]
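
A small aside on the callback added above: the same cutoff can be expressed as a fraction of the run instead of an absolute step. A minimal sketch, assuming the same pipeline setup as in that example (with 50 inference steps, a ratio of 0.1 disables the IP Adapter after roughly 5 steps):

```py
from diffusers.callbacks import IPAdapterScaleCutoffCallback

# Ratio-based variant: only one of the two arguments should be set.
callback = IPAdapterScaleCutoffCallback(cutoff_step_ratio=0.1, cutoff_step_index=None)
```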

src/diffusers/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -94,6 +94,7 @@
         "AutoencoderKLCogVideoX",
         "AutoencoderKLHunyuanVideo",
         "AutoencoderKLLTXVideo",
+        "AutoencoderKLMagvit",
         "AutoencoderKLMochi",
         "AutoencoderKLTemporalDecoder",
         "AutoencoderKLWan",
@@ -109,6 +110,7 @@
         "ControlNetUnionModel",
         "ControlNetXSAdapter",
         "DiTTransformer2DModel",
+        "EasyAnimateTransformer3DModel",
         "FluxControlNetModel",
         "FluxMultiControlNetModel",
         "FluxTransformer2DModel",
@@ -293,6 +295,9 @@
         "CogView4Pipeline",
         "ConsisIDPipeline",
         "CycleDiffusionPipeline",
+        "EasyAnimateControlPipeline",
+        "EasyAnimateInpaintPipeline",
+        "EasyAnimatePipeline",
         "FluxControlImg2ImgPipeline",
         "FluxControlInpaintPipeline",
         "FluxControlNetImg2ImgPipeline",
@@ -620,6 +625,7 @@
         AutoencoderKLCogVideoX,
         AutoencoderKLHunyuanVideo,
         AutoencoderKLLTXVideo,
+        AutoencoderKLMagvit,
         AutoencoderKLMochi,
         AutoencoderKLTemporalDecoder,
         AutoencoderKLWan,
@@ -635,6 +641,7 @@
         ControlNetUnionModel,
         ControlNetXSAdapter,
         DiTTransformer2DModel,
+        EasyAnimateTransformer3DModel,
         FluxControlNetModel,
         FluxMultiControlNetModel,
         FluxTransformer2DModel,
@@ -798,6 +805,9 @@
         CogView4Pipeline,
         ConsisIDPipeline,
         CycleDiffusionPipeline,
+        EasyAnimateControlPipeline,
+        EasyAnimateInpaintPipeline,
+        EasyAnimatePipeline,
         FluxControlImg2ImgPipeline,
         FluxControlInpaintPipeline,
         FluxControlNetImg2ImgPipeline,
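
With these registrations in place, the new EasyAnimate classes are importable from the package root; a quick sketch using only the names added above:

```python
from diffusers import (
    AutoencoderKLMagvit,
    EasyAnimateControlPipeline,
    EasyAnimateInpaintPipeline,
    EasyAnimatePipeline,
    EasyAnimateTransformer3DModel,
)
```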

src/diffusers/loaders/ip_adapter.py

Lines changed: 7 additions & 5 deletions
@@ -215,7 +215,8 @@ def load_ip_adapter(
                     low_cpu_mem_usage=low_cpu_mem_usage,
                     cache_dir=cache_dir,
                     local_files_only=local_files_only,
-                ).to(self.device, dtype=self.dtype)
+                    torch_dtype=self.dtype,
+                ).to(self.device)
                 self.register_modules(image_encoder=image_encoder)
             else:
                 raise ValueError(
@@ -526,8 +527,9 @@ def load_ip_adapter(
                         low_cpu_mem_usage=low_cpu_mem_usage,
                         cache_dir=cache_dir,
                         local_files_only=local_files_only,
+                        dtype=image_encoder_dtype,
                     )
-                    .to(self.device, dtype=image_encoder_dtype)
+                    .to(self.device)
                     .eval()
                 )
                 self.register_modules(image_encoder=image_encoder)
@@ -805,9 +807,9 @@ def load_ip_adapter(
                     feature_extractor=SiglipImageProcessor.from_pretrained(image_encoder_subfolder, **kwargs).to(
                         self.device, dtype=self.dtype
                     ),
-                    image_encoder=SiglipVisionModel.from_pretrained(image_encoder_subfolder, **kwargs).to(
-                        self.device, dtype=self.dtype
-                    ),
+                    image_encoder=SiglipVisionModel.from_pretrained(
+                        image_encoder_subfolder, torch_dtype=self.dtype, **kwargs
+                    ).to(self.device),
                 )
             else:
                 raise ValueError(
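
The change above moves dtype selection from a post-hoc `.to(device, dtype=...)` cast to the load call itself, so weights are materialized in the target precision directly. A standalone sketch of the same pattern with a public IP-Adapter image encoder (repository and subfolder are illustrative, not taken from this diff):

```python
import torch
from transformers import CLIPVisionModelWithProjection

# Request half precision at load time instead of casting afterwards,
# then only move the module to the target device.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
).to("cuda")
```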

src/diffusers/loaders/single_file_utils.py

Lines changed: 2 additions & 2 deletions
@@ -1458,8 +1458,8 @@ def convert_open_clip_checkpoint(

     if text_proj_key in checkpoint:
         text_proj_dim = int(checkpoint[text_proj_key].shape[0])
-    elif hasattr(text_model.config, "projection_dim"):
-        text_proj_dim = text_model.config.projection_dim
+    elif hasattr(text_model.config, "hidden_size"):
+        text_proj_dim = text_model.config.hidden_size
     else:
         text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM
