
Commit a7a14cc

Merge branch 'main' into flux-control-lora-bnb-8bit
2 parents 8229ef9 + 6c7fad7 commit a7a14cc

55 files changed: +2303 additions, -94 deletions


docs/source/en/api/pipelines/wan.md

Lines changed: 104 additions & 4 deletions
@@ -22,17 +22,30 @@

# Wan2.1

-[Wan2.1](https://files.alicdn.com/tpsservice/5c9de1c74de03972b7aa657e5a54756b.pdf) is a series of large diffusion transformer available in two versions, a high-performance 14B parameter model and a more accessible 1.3B version. Trained on billions of images and videos, it supports tasks like text-to-video (T2V) and image-to-video (I2V) while enabling features such as camera control and stylistic diversity. The Wan-VAE features better image data compression and a feature cache mechanism that encodes and decodes a video in chunks. To maintain continuity, features from previous chunks are cached and reused for processing subsequent chunks. This improves inference efficiency by reducing memory usage. Wan2.1 also uses a multilingual text encoder and the diffusion transformer models space and time relationships and text conditions with each time step to capture more complex video dynamics.
+[Wan-2.1](https://huggingface.co/papers/2503.20314) by the Wan Team.
+
+*This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at [this https URL](https://github.com/Wan-Video/Wan2.1).*

You can find all the original Wan2.1 checkpoints under the [Wan-AI](https://huggingface.co/Wan-AI) organization.

+The following Wan models are supported in Diffusers:
+- [Wan 2.1 T2V 1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers)
+- [Wan 2.1 T2V 14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers)
+- [Wan 2.1 I2V 14B - 480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P-Diffusers)
+- [Wan 2.1 I2V 14B - 720P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P-Diffusers)
+- [Wan 2.1 FLF2V 14B - 720P](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers)
+- [Wan 2.1 VACE 1.3B](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B-diffusers)
+- [Wan 2.1 VACE 14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B-diffusers)
+
> [!TIP]
> Click on the Wan2.1 models in the right sidebar for more examples of video generation.

+### Text-to-Video Generation
+
The example below demonstrates how to generate a video from text optimized for memory or inference speed.

-<hfoptions id="usage">
-<hfoption id="memory">
+<hfoptions id="T2V usage">
+<hfoption id="T2V memory">

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
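
The memory-optimized T2V snippet itself is collapsed in this diff view. As a rough illustration of the kind of technique the guide above covers, a minimal sketch using model CPU offloading with the 1.3B checkpoint might look like the following; the checkpoint, prompt, and offloading strategy here are assumptions, not the doc's exact code:

```python
# Minimal sketch only: text-to-video with Wan2.1-T2V-1.3B plus model CPU offloading
# to reduce peak VRAM usage. Not the exact snippet hidden in this diff.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# Keep only the currently active sub-model on the GPU; the rest waits in CPU RAM.
pipe.enable_model_cpu_offload()

prompt = "A cat walks on the grass, realistic"
output = pipe(prompt=prompt, num_frames=81, guidance_scale=5.0).frames[0]
export_to_video(output, "output.mp4", fps=16)
```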

@@ -100,7 +113,7 @@ export_to_video(output, "output.mp4", fps=16)
```

</hfoption>
-<hfoption id="inference speed">
+<hfoption id="T2V inference speed">

[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster.
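
The compiled-inference snippet is likewise collapsed here. As a hedged sketch of the torch.compile pattern the sentence above refers to (checkpoint and compile settings are assumptions, not the doc's exact code):

```python
# Rough sketch: compile the Wan transformer once; later pipeline calls reuse the
# compiled graph, so only the first call pays the compilation cost.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

# Compile the diffusion transformer (the heaviest component in the denoising loop).
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

output = pipe(prompt="A cat walks on the grass, realistic", num_frames=81).frames[0]
export_to_video(output, "output.mp4", fps=16)
```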

@@ -157,6 +170,81 @@ export_to_video(output, "output.mp4", fps=16)
</hfoption>
</hfoptions>

+### First-Last-Frame-to-Video Generation
+
+The example below demonstrates how to use the image-to-video pipeline to generate a video using a text description, a starting frame, and an ending frame.
+
+<hfoptions id="FLF2V usage">
+<hfoption id="usage">
+
+```python
+import numpy as np
+import torch
+import torchvision.transforms.functional as TF
+from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
+from diffusers.utils import export_to_video, load_image
+from transformers import CLIPVisionModel
+
+model_id = "Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers"
+image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
+vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+pipe = WanImageToVideoPipeline.from_pretrained(
+    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
+)
+pipe.to("cuda")
+
+first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
+last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")
+
+def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
+    aspect_ratio = image.height / image.width
+    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
+    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
+    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
+    image = image.resize((width, height))
+    return image, height, width
+
+def center_crop_resize(image, height, width):
+    # Calculate resize ratio to match first frame dimensions
+    resize_ratio = max(width / image.width, height / image.height)
+
+    # Resize the image
+    width = round(image.width * resize_ratio)
+    height = round(image.height * resize_ratio)
+    size = [width, height]
+    image = TF.center_crop(image, size)
+
+    return image, height, width
+
+first_frame, height, width = aspect_ratio_resize(first_frame, pipe)
+if last_frame.size != first_frame.size:
+    last_frame, _, _ = center_crop_resize(last_frame, height, width)
+
+prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
+
+output = pipe(
+    image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.5
+).frames[0]
+export_to_video(output, "output.mp4", fps=16)
+```
+
+</hfoption>
+</hfoptions>
+
+### Any-to-Video Controllable Generation
+
+Wan VACE supports various generation techniques that achieve controllable video generation. Some of the capabilities include:
+- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: [huggingface/controlnet_aux](https://github.com/huggingface/controlnet_aux)
+- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
+- Inpainting and Outpainting
+- Subject to Video (faces, objects, characters, etc.)
+- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)
+
+The code snippets in [this](https://github.com/huggingface/diffusers/pull/11582) pull request demonstrate how videos can be generated with these controllability signals.
+
+The general rule of thumb when preparing inputs for the VACE pipeline is that any input image, or video frame, that you want to use for conditioning should have a corresponding black mask. A black mask tells the model not to generate new content for that area and to use it only to condition the generation; parts and frames the model should generate must have a white mask.
+
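As an illustrative sketch of this mask convention (an assumption-heavy example, not one of the PR's verified snippets: the `video`/`mask` call signature, checkpoint, resolution, and placeholder frames are all assumptions), conditioning on a single first frame and letting the model generate the rest might look like:

```python
# Hedged sketch of the black/white mask rule: black mask = keep/condition, white mask = generate.
import PIL.Image
import torch
from diffusers import AutoencoderKLWan, WanVACEPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.1-VACE-1.3B-diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

height, width, num_frames = 480, 832, 81
first_frame = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
).resize((width, height))

black = PIL.Image.new("RGB", (width, height), (0, 0, 0))        # condition on this frame
white = PIL.Image.new("RGB", (width, height), (255, 255, 255))  # generate this frame
gray = PIL.Image.new("RGB", (width, height), (128, 128, 128))   # placeholder pixels for generated frames (assumption)

video = [first_frame] + [gray] * (num_frames - 1)
mask = [black] + [white] * (num_frames - 1)

output = pipe(
    video=video,
    mask=mask,
    prompt="a small blue bird takes off from the ground, flapping its wings",
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```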

## Notes

- Wan2.1 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].
@@ -251,6 +339,18 @@ export_to_video(output, "output.mp4", fps=16)
  - all
  - __call__

+## WanVACEPipeline
+
+[[autodoc]] WanVACEPipeline
+  - all
+  - __call__
+
+## WanVideoToVideoPipeline
+
+[[autodoc]] WanVideoToVideoPipeline
+  - all
+  - __call__
+
## WanPipelineOutput

[[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput
Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
# Copyright Philip Brown, ppbrown@github
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

###########################################################################
# This pipeline attempts to use a model that has SDXL vae, T5 text encoder,
# and SDXL unet.
# At the present time, there are no pretrained models that give pleasing
# output. So as yet, (2025/06/10) this pipeline is somewhat of a tech
# demo proving that the pieces can at least be put together.
# Hopefully, it will encourage someone with the hardware available to
# throw enough resources into training one up.


from typing import Optional

import torch.nn as nn
from transformers import (
    CLIPImageProcessor,
    CLIPTokenizer,
    CLIPVisionModelWithProjection,
    T5EncoderModel,
)

from diffusers import DiffusionPipeline, StableDiffusionXLPipeline
from diffusers.image_processor import VaeImageProcessor
from diffusers.models import AutoencoderKL, UNet2DConditionModel
from diffusers.schedulers import KarrasDiffusionSchedulers


# Note: At this time, the intent is to use the T5 encoder mentioned
# below, with zero changes.
# Therefore, the model deliberately does not store the T5 encoder model bytes,
# (Since they are not unique!)
# but instead takes advantage of huggingface hub cache loading

T5_NAME = "mcmonkey/google_t5-v1_1-xxl_encoderonly"

# Caller is expected to load this, or equivalent, as model name for now
# eg: pipe = StableDiffusionXL_T5Pipeline(SDXL_NAME)
SDXL_NAME = "stabilityai/stable-diffusion-xl-base-1.0"


class LinearWithDtype(nn.Linear):
    @property
    def dtype(self):
        return self.weight.dtype


class StableDiffusionXL_T5Pipeline(StableDiffusionXLPipeline):
    _expected_modules = [
        "vae",
        "unet",
        "scheduler",
        "tokenizer",
        "image_encoder",
        "feature_extractor",
        "t5_encoder",
        "t5_projection",
        "t5_pooled_projection",
    ]

    _optional_components = [
        "image_encoder",
        "feature_extractor",
        "t5_encoder",
        "t5_projection",
        "t5_pooled_projection",
    ]

    def __init__(
        self,
        vae: AutoencoderKL,
        unet: UNet2DConditionModel,
        scheduler: KarrasDiffusionSchedulers,
        tokenizer: CLIPTokenizer,
        t5_encoder=None,
        t5_projection=None,
        t5_pooled_projection=None,
        image_encoder: CLIPVisionModelWithProjection = None,
        feature_extractor: CLIPImageProcessor = None,
        force_zeros_for_empty_prompt: bool = True,
        add_watermarker: Optional[bool] = None,
    ):
        DiffusionPipeline.__init__(self)

        if t5_encoder is None:
            self.t5_encoder = T5EncoderModel.from_pretrained(T5_NAME, torch_dtype=unet.dtype)
        else:
            self.t5_encoder = t5_encoder

        # ----- build T5 4096 => 2048 dim projection -----
        if t5_projection is None:
            self.t5_projection = LinearWithDtype(4096, 2048)  # trainable
        else:
            self.t5_projection = t5_projection
        self.t5_projection.to(dtype=unet.dtype)
        # ----- build T5 4096 => 1280 dim projection -----
        if t5_pooled_projection is None:
            self.t5_pooled_projection = LinearWithDtype(4096, 1280)  # trainable
        else:
            self.t5_pooled_projection = t5_pooled_projection
        self.t5_pooled_projection.to(dtype=unet.dtype)

        print("dtype of Linear is ", self.t5_projection.dtype)

        self.register_modules(
            vae=vae,
            unet=unet,
            scheduler=scheduler,
            tokenizer=tokenizer,
            t5_encoder=self.t5_encoder,
            t5_projection=self.t5_projection,
            t5_pooled_projection=self.t5_pooled_projection,
            image_encoder=image_encoder,
            feature_extractor=feature_extractor,
        )
        self.register_to_config(force_zeros_for_empty_prompt=force_zeros_for_empty_prompt)
        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)

        self.default_sample_size = (
            self.unet.config.sample_size
            if hasattr(self, "unet") and self.unet is not None and hasattr(self.unet.config, "sample_size")
            else 128
        )

        self.watermark = None

        # Parts of original SDXL class complain if these attributes are not
        # at least PRESENT
        self.text_encoder = self.text_encoder_2 = None

    # ------------------------------------------------------------------
    # Encode a text prompt (T5-XXL + 4096→2048 projection)
    # Returns exactly four tensors in the order SDXL’s __call__ expects.
    # ------------------------------------------------------------------
    def encode_prompt(
        self,
        prompt,
        num_images_per_prompt: int = 1,
        do_classifier_free_guidance: bool = True,
        negative_prompt: str | None = None,
        **_,
    ):
        """
        Returns
        -------
        prompt_embeds                : Tensor [B, T, 2048]
        negative_prompt_embeds       : Tensor [B, T, 2048] | None
        pooled_prompt_embeds         : Tensor [B, 1280]
        negative_pooled_prompt_embeds: Tensor [B, 1280] | None
        where B = batch * num_images_per_prompt
        """

        # --- helper to tokenize on the pipeline’s device ----------------
        def _tok(text: str):
            tok_out = self.tokenizer(
                text,
                return_tensors="pt",
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
            ).to(self.device)
            return tok_out.input_ids, tok_out.attention_mask

        # ---------- positive stream -------------------------------------
        ids, mask = _tok(prompt)
        h_pos = self.t5_encoder(ids, attention_mask=mask).last_hidden_state  # [b, T, 4096]
        tok_pos = self.t5_projection(h_pos)  # [b, T, 2048]
        pool_pos = self.t5_pooled_projection(h_pos.mean(dim=1))  # [b, 1280]

        # expand for multiple images per prompt
        tok_pos = tok_pos.repeat_interleave(num_images_per_prompt, 0)
        pool_pos = pool_pos.repeat_interleave(num_images_per_prompt, 0)

        # ---------- negative / CFG stream --------------------------------
        if do_classifier_free_guidance:
            neg_text = "" if negative_prompt is None else negative_prompt
            ids_n, mask_n = _tok(neg_text)
            h_neg = self.t5_encoder(ids_n, attention_mask=mask_n).last_hidden_state
            tok_neg = self.t5_projection(h_neg)
            pool_neg = self.t5_pooled_projection(h_neg.mean(dim=1))

            tok_neg = tok_neg.repeat_interleave(num_images_per_prompt, 0)
            pool_neg = pool_neg.repeat_interleave(num_images_per_prompt, 0)
        else:
            tok_neg = pool_neg = None

        # ----------------- final ordered return --------------------------
        # 1) positive token embeddings
        # 2) negative token embeddings (or None)
        # 3) positive pooled embeddings
        # 4) negative pooled embeddings (or None)
        return tok_pos, tok_neg, pool_pos, pool_neg
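
For reference, a minimal usage sketch under stated assumptions: the SDXL base components are loaded individually, the pipeline creates the T5 encoder and the (randomly initialized, untrained) projection layers itself, and `StableDiffusionXL_T5Pipeline` from the file above is in scope. As the header comment notes, output quality will be poor until those projections are trained.

```python
# Hedged usage sketch, not part of this commit.
import torch
from diffusers import AutoencoderKL, EulerDiscreteScheduler, UNet2DConditionModel
from transformers import CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae", torch_dtype=torch.float16)
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet", torch_dtype=torch.float16)
scheduler = EulerDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")

# t5_encoder / t5_projection / t5_pooled_projection are left as None, so the
# pipeline downloads the T5 encoder and builds fresh, untrained projection layers.
pipe = StableDiffusionXL_T5Pipeline(vae=vae, unet=unet, scheduler=scheduler, tokenizer=tokenizer)
pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
image.save("sdxl_t5_demo.png")
```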
