Skip to content

Conversation

@yiyixuxu
Copy link
Collaborator

@yiyixuxu yiyixuxu commented Jul 29, 2025

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan, ModularPipeline
from diffusers.utils import export_to_video


model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
dtype = torch.bfloat16
device = "cuda:2"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)

# use default wan image processor to resize and crop the image
image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    max_area=1280*704, output="processed_image")

height, width = image.height, image.width
print(f"height: {height}, width: {width}")
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "yiyi_test_6_ti2v_5b_output.mp4", fps=24)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu yiyixuxu requested a review from a-r-r-o-w July 29, 2025 08:43
@nitinmukesh
Copy link

nitinmukesh commented Jul 29, 2025

Thank you @yiyixuxu .

1 query, is this fixed value or need to be reversed as per input image,
max_area=1280*704

portrait=1280*704 
landscape=704*1280

Copy link
Contributor

@a-r-r-o-w a-r-r-o-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yiyixuxu
Copy link
Collaborator Author

@nitinmukesh
it's atuallyint input so it's fixed value

@yiyixuxu yiyixuxu mentioned this pull request Jul 30, 2025
6 tasks
@yiyixuxu yiyixuxu merged commit d8854b8 into main Jul 30, 2025
13 of 15 checks passed
@yiyixuxu yiyixuxu deleted the wan5bi2v branch July 30, 2025 03:34
@zhaoyun0071
Copy link

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan, ModularPipeline
from diffusers.utils import export_to_video


model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
dtype = torch.bfloat16
device = "cuda:2"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)

# use default wan image processor to resize and crop the image
image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    max_area=1280*704, output="processed_image")

height, width = image.height, image.width
print(f"height: {height}, width: {width}")
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "yiyi_test_6_ti2v_5b_output.mp4", fps=24)

进度跑满后,最后关头占用显存直接飙升到30G左右,导致非常慢,我是3090显卡

@JoeGaffney
Copy link

Hey,

With Wan2.1 we was able to pass just a RGB PIl image. With 2.2 i get

def _conv_forward(self, input: Tensor, weight: Tensor, bias: Optional[Tensor]):
        if self.padding_mode != "zeros":
            return F.conv3d(
                F.pad(
                    input, self._reversed_padding_repeated_twice, mode=self.padding_mode
                ),
                weight,
                bias,
                self.stride,
                _triple(0),
                self.dilation,
                self.groups,
            )
>       return F.conv3d(
            input, weight, bias, self.stride, self.padding, self.dilation, self.groups
        )
E       RuntimeError: Given groups=1, weight of size [160, 12, 3, 3, 3], expected input[1, 3, 3, 258, 258] to have 12 channels, but got 3 channels instead

/opt/conda/lib/python3.11/site-packages/torch/nn/modules/conv.py:720: RuntimeError

Cheers,
Joe

@yiyixuxu
Copy link
Collaborator Author

yiyixuxu commented Aug 1, 2025

hi @JoeGaffney
there is no code attached so i'm not 100% sure I understand the issue here, but the error might be caused by the fact that with wan 2.2 there is a patchify/unpatchify step

x = patchify(x, patch_size=self.config.patch_size)

@chensongkui
Copy link

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan, ModularPipeline
from diffusers.utils import export_to_video


model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
dtype = torch.bfloat16
device = "cuda:2"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)

# use default wan image processor to resize and crop the image
image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    max_area=1280*704, output="processed_image")

height, width = image.height, image.width
print(f"height: {height}, width: {width}")
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "yiyi_test_6_ti2v_5b_output.mp4", fps=24)
image

do you run this code successfully? I run this code fail. I met some problems, like lacking image processor and image encoder, i have solved these problems. now i met a new problem, as the picture showed above. I think this phenomenon is unreasonable. Theoretically, this code should be run successfully.

@JoeGaffney
Copy link

hi @JoeGaffney there is no code attached so i'm not 100% sure I understand the issue here, but the error might be caused by the fact that with wan 2.2 there is a patchify/unpatchify step

x = patchify(x, patch_size=self.config.patch_size)

Hey @yiyixuxu it was resolved in the other ticket it was vae.enable_tiling()

@nitinmukesh
Copy link

@JoeGaffney

Which ticket fixed enable_tiling. I'm getting OOM even after installing diffusers from source.

Beinsezii pushed a commit to Beinsezii/diffusers that referenced this pull request Aug 7, 2025
* add 5b ti2v

* remove a copy

* Update src/diffusers/pipelines/wan/pipeline_wan_i2v.py

Co-authored-by: Aryan <[email protected]>

* Apply suggestions from code review

---------

Co-authored-by: Aryan <[email protected]>
@ares89
Copy link

ares89 commented Aug 19, 2025

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan, ModularPipeline
from diffusers.utils import export_to_video


model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
dtype = torch.bfloat16
device = "cuda:2"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)

# use default wan image processor to resize and crop the image
image_processor = ModularPipeline.from_pretrained("YiYiXu/WanImageProcessor", trust_remote_code=True)
image = image_processor(
    image="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG",
    max_area=1280*704, output="processed_image")

height, width = image.height, image.width
print(f"height: {height}, width: {width}")
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]
export_to_video(output, "yiyi_test_6_ti2v_5b_output.mp4", fps=24)

进度跑满后,最后关头占用显存直接飙升到30G左右,导致非常慢,我是3090显卡

have you solved this problem?你解决这个问题了吗?怎么解决的?

@jerry2102
Copy link

I encounter this error when I use (width,height)=(1280, 720) calling generate pipe. the diffusers is version 0.35.1.
anyone can help with me?

image

from user code:
File "/usr/local/lib/python3.10/site-packages/diffusers/models/transformers/transformer_wan.py", line 478, in forward
norm_hidden_states = (self.norm1(hidden_states.float()) * (1 + scale_msa) + shift_msa).type_as(hidden_states)

@dg845
Copy link
Collaborator

dg845 commented Oct 11, 2025

For the issue where certain resolutions such as $1280 \times 720$ lead to an error, see #12348 for more details - a mitigation is to ensure that both height and width are multiples of $32$, which is why the $1280 \times 704$ resolution works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants