
Commit d62ecb9

Merge branch 'huggingface:main' into dev
2 parents: 85e325d + 97fda1b

File tree

143 files changed (+11719, −1405 lines)


.github/workflows/pr_style_bot.yml

Lines changed: 22 additions & 1 deletion
@@ -9,12 +9,33 @@ permissions:
   pull-requests: write
 
 jobs:
-  run-style-bot:
+  check-permissions:
     if: >
       contains(github.event.comment.body, '@bot /style') &&
       github.event.issue.pull_request != null
     runs-on: ubuntu-latest
+    outputs:
+      is_authorized: ${{ steps.check_user_permission.outputs.has_permission }}
+    steps:
+      - name: Check user permission
+        id: check_user_permission
+        uses: actions/github-script@v6
+        with:
+          script: |
+            const comment_user = context.payload.comment.user.login;
+            const { data: permission } = await github.rest.repos.getCollaboratorPermissionLevel({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              username: comment_user
+            });
+            const authorized = permission.permission === 'admin';
+            console.log(`User ${comment_user} has permission level: ${permission.permission}, authorized: ${authorized} (only admins allowed)`);
+            core.setOutput('has_permission', authorized);
 
+  run-style-bot:
+    needs: check-permissions
+    if: needs.check-permissions.outputs.is_authorized == 'true'
+    runs-on: ubuntu-latest
     steps:
       - name: Extract PR details
         id: pr_info
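
The gate works in two stages: `check-permissions` looks up the commenting user's collaborator permission level and exposes an `is_authorized` output, and `run-style-bot` only runs when that output is `'true'`. For illustration, the same lookup can be reproduced outside of Actions with the REST endpoint the `github-script` step calls; the owner, repository, username, and token below are placeholders rather than values from this workflow.

```python
# Illustrative sketch only: reproduce the admin-only check with the same REST
# endpoint the workflow uses, GET /repos/{owner}/{repo}/collaborators/{username}/permission.
import os

import requests


def is_authorized(owner: str, repo: str, username: str) -> bool:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/collaborators/{username}/permission",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",  # placeholder token variable
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    level = resp.json()["permission"]  # "admin", "write", "read", or "none"
    return level == "admin"  # mirrors `permission.permission === 'admin'` in the workflow


if __name__ == "__main__":
    print(is_authorized("some-org", "some-repo", "some-user"))
```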

docs/source/en/_toctree.yml

Lines changed: 20 additions & 0 deletions
@@ -76,6 +76,14 @@
   - local: advanced_inference/outpaint
     title: Outpainting
   title: Advanced inference
+- sections:
+  - local: hybrid_inference/overview
+    title: Overview
+  - local: hybrid_inference/vae_decode
+    title: VAE Decode
+  - local: hybrid_inference/api_reference
+    title: API Reference
+  title: Hybrid Inference
 - sections:
   - local: using-diffusers/cogvideox
     title: CogVideoX
@@ -282,6 +290,8 @@
     title: CogView4Transformer2DModel
   - local: api/models/dit_transformer2d
     title: DiTTransformer2DModel
+  - local: api/models/easyanimate_transformer3d
+    title: EasyAnimateTransformer3DModel
   - local: api/models/flux_transformer
     title: FluxTransformer2DModel
   - local: api/models/hunyuan_transformer2d
@@ -314,6 +324,8 @@
     title: Transformer2DModel
   - local: api/models/transformer_temporal
     title: TransformerTemporalModel
+  - local: api/models/wan_transformer_3d
+    title: WanTransformer3DModel
   title: Transformers
 - sections:
   - local: api/models/stable_cascade_unet
@@ -342,8 +354,12 @@
     title: AutoencoderKLHunyuanVideo
   - local: api/models/autoencoderkl_ltx_video
     title: AutoencoderKLLTXVideo
+  - local: api/models/autoencoderkl_magvit
+    title: AutoencoderKLMagvit
   - local: api/models/autoencoderkl_mochi
     title: AutoencoderKLMochi
+  - local: api/models/autoencoder_kl_wan
+    title: AutoencoderKLWan
   - local: api/models/asymmetricautoencoderkl
     title: AsymmetricAutoencoderKL
   - local: api/models/autoencoder_dc
@@ -418,6 +434,8 @@
     title: DiffEdit
   - local: api/pipelines/dit
     title: DiT
+  - local: api/pipelines/easyanimate
+    title: EasyAnimate
   - local: api/pipelines/flux
     title: Flux
   - local: api/pipelines/control_flux_inpaint
@@ -534,6 +552,8 @@
     title: UniDiffuser
   - local: api/pipelines/value_guided_sampling
     title: Value-guided sampling
+  - local: api/pipelines/wan
+    title: Wan
   - local: api/pipelines/wuerstchen
     title: Wuerstchen
   title: Pipelines
docs/source/en/api/models/autoencoder_kl_wan.md (new file)

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@

<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLWan

The 3D variational autoencoder (VAE) model with KL loss used in [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLWan

vae = AutoencoderKLWan.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
```

## AutoencoderKLWan

[[autodoc]] AutoencoderKLWan
  - decode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
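
As a usage sketch, the float32 VAE can be handed straight to [`WanPipeline`], matching the decoding-quality recommendation on the Wan pipeline page; the `bfloat16` dtype for the rest of the pipeline and the CUDA device are assumptions, not requirements.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# Load the VAE in float32 for better decoding quality, then reuse it in the pipeline.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

print(pipe.vae.dtype)  # torch.float32, while the other components stay in bfloat16
```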
docs/source/en/api/models/autoencoderkl_magvit.md (new file)

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLMagvit

The 3D variational autoencoder (VAE) model with KL loss used in [EasyAnimate](https://github.com/aigc-apps/EasyAnimate), which was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLMagvit

vae = AutoencoderKLMagvit.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```

## AutoencoderKLMagvit

[[autodoc]] AutoencoderKLMagvit
  - decode
  - encode
  - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
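
As a usage sketch, the explicitly loaded VAE can be reused when assembling the full [`EasyAnimatePipeline`]; CPU offload is optional and assumed only to keep peak GPU memory manageable for the 12B checkpoint.

```python
import torch
from diffusers import AutoencoderKLMagvit, EasyAnimatePipeline

model_id = "alibaba-pai/EasyAnimateV5.1-12b-zh"
vae = AutoencoderKLMagvit.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float16)
# Pass the explicitly loaded VAE; the remaining components come from the same checkpoint.
pipe = EasyAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # optional: trades speed for lower peak GPU memory
```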
docs/source/en/api/models/easyanimate_transformer3d.md (new file)

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# EasyAnimateTransformer3DModel

A Diffusion Transformer model for 3D data from [EasyAnimate](https://github.com/aigc-apps/EasyAnimate), which was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```

## EasyAnimateTransformer3DModel

[[autodoc]] EasyAnimateTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
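
As a small illustrative sketch, the loaded module exposes its saved configuration and parameter count, which can help estimate memory requirements before assembling a full pipeline; the printout format below is arbitrary.

```python
import torch
from diffusers import EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16
)

# The config mirrors the hyperparameters stored with the checkpoint.
print(dict(transformer.config))

# Parameter count in billions; at float16 this is roughly 2 bytes per parameter.
num_params = sum(p.numel() for p in transformer.parameters())
print(f"{num_params / 1e9:.2f}B parameters")
```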
docs/source/en/api/models/wan_transformer_3d.md (new file)

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@

<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# WanTransformer3DModel

A Diffusion Transformer model for 3D video-like data was introduced in [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## WanTransformer3DModel

[[autodoc]] WanTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
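
As a usage sketch, the explicitly loaded transformer can be passed into [`WanPipeline`]; enabling CPU offload is an assumption about available GPU memory rather than a requirement.

```python
import torch
from diffusers import WanPipeline, WanTransformer3DModel

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
transformer = WanTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
# Reuse the explicitly loaded transformer when assembling the pipeline.
pipe = WanPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # optional: lowers peak GPU memory at some speed cost
```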
docs/source/en/api/pipelines/easyanimate.md (new file)

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@

<!--Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# EasyAnimate

[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.

The description from its GitHub page:
*EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.*

This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://github.com/aigc-apps/EasyAnimate), and the original weights are available under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There are two official EasyAnimate checkpoints available for control-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 |

For the EasyAnimateV5.1 series:
- Text-to-video (T2V) and image-to-video (I2V) generation work at multiple resolutions; width and height can each vary from 256 to 1024.
- Both the T2V and I2V models support generation with 1 to 49 frames and work best at 49 frames. Exporting videos at 8 FPS is recommended (see the basic usage sketch below).
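
A minimal text-to-video sketch that follows these notes; the prompt, negative prompt, and output path are placeholders, and CPU offload is assumed only to keep the 12B model within typical GPU memory.

```python
import torch
from diffusers import EasyAnimatePipeline
from diffusers.utils import export_to_video

pipe = EasyAnimatePipeline.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="bad detailed",
    num_frames=49,           # 1 to 49 frames are supported; 49 works best
    num_inference_steps=30,
).frames[0]
export_to_video(video, "easyanimate_cat.mp4", fps=8)  # exporting at 8 FPS is recommended
```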
## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower-precision data type. However, quantization may have a varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends and how to choose one for your use case. The example below demonstrates how to load a quantized [`EasyAnimatePipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
```

## EasyAnimatePipeline

[[autodoc]] EasyAnimatePipeline
  - all
  - __call__

## EasyAnimatePipelineOutput

[[autodoc]] pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput
docs/source/en/api/pipelines/wan.md (new file)

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@

<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# Wan

[Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.

<!-- TODO(aryan): update abstract once paper is out -->

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

Recommendations for inference:
- Keep the VAE in `torch.float32` for better decoding quality.
- `num_frames` should be of the form `4 * k + 1`, for example `49` or `81`.
- For smaller-resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger-resolution videos, try higher values (between `7.0` and `12.0`). The default value for Wan is `3.0` (see the sketch below).
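
A minimal text-to-video sketch applying these recommendations; the prompt, output path, and export frame rate are placeholders.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# Keep the VAE in float32 for better decoding quality.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    num_frames=81,  # num_frames of the form 4 * k + 1
).frames[0]
export_to_video(video, "wan_cat.mp4", fps=16)
```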
### Using a custom scheduler

Wan can be used with many different schedulers, each with its own trade-off between speed and generation quality. By default, Wan uses the `UniPCMultistepScheduler(prediction_type="flow_prediction", use_flow_sigmas=True, flow_shift=3.0)` scheduler. You can use a different scheduler as follows:

```python
from diffusers import FlowMatchEulerDiscreteScheduler, UniPCMultistepScheduler, WanPipeline

scheduler_a = FlowMatchEulerDiscreteScheduler(shift=5.0)
scheduler_b = UniPCMultistepScheduler(prediction_type="flow_prediction", use_flow_sigmas=True, flow_shift=4.0)

pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", scheduler=<CUSTOM_SCHEDULER_HERE>)

# or,
pipe.scheduler = <CUSTOM_SCHEDULER_HERE>
```

## WanPipeline

[[autodoc]] WanPipeline
  - all
  - __call__

## WanImageToVideoPipeline

[[autodoc]] WanImageToVideoPipeline
  - all
  - __call__

## WanPipelineOutput

[[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput

docs/source/en/conceptual/evaluation.md

Lines changed: 5 additions & 0 deletions
@@ -16,6 +16,11 @@ specific language governing permissions and limitations under the License.
   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
 </a>
 
+> [!TIP]
+> This document has grown outdated given the emergence of dedicated evaluation frameworks for diffusion-based image generation. Please check
+> out works like [HEIM](https://crfm.stanford.edu/helm/heim/latest/), [T2I-Compbench](https://arxiv.org/abs/2307.06350), and
+> [GenEval](https://arxiv.org/abs/2310.11513).
+
 Evaluation of generative models like [Stable Diffusion](https://huggingface.co/docs/diffusers/stable_diffusion) is subjective in nature. But as practitioners and researchers, we often have to make careful choices amongst many different possibilities. So, when working with different generative models (like GANs, Diffusion, etc.), how do we choose one over the other?
 
 Qualitative evaluation of such models can be error-prone and might incorrectly influence a decision.
