Conversation

@DefTruth
Contributor

@DefTruth DefTruth commented Oct 16, 2025

What does this PR do?

Fixes #12499

Related issue: vipshop/cache-dit#291

Fix the Wan I2V condition shape. I think we should use num_latent_frames, not the original num_frames, after this line:

num_latent_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
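As a quick sanity check of the formula above, here is a minimal sketch in plain Python. The helper name `latent_frame_count` is hypothetical, and the temporal scale factor of 4 is an assumption inferred from the shapes in the logs (81 frames map to 21 latent frames), not something read from the pipeline itself.

```python
def latent_frame_count(num_frames: int, temporal_scale: int) -> int:
    """Same arithmetic as the pipeline line quoted above."""
    return (num_frames - 1) // temporal_scale + 1


num_frames = 81
temporal_scale = 4  # assumption: stands in for self.vae_scale_factor_temporal

num_latent_frames = latent_frame_count(num_frames, temporal_scale)
print(num_frames, "->", num_latent_frames)  # 81 -> 21
```

With these values the condition should carry 21 frames along the temporal axis, which is the size the mask already has in the logs.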

Without this fix:

(cu129)➜ pipeline git:(dev) ✗ python3 run_wan_2.2_i2v.py
torch.Size([1, 1, 4, 90, 68])
torch.Size([1, 1, 84, 90, 68]) 4
torch.Size([1, 21, 4, 90, 68])
torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 81, 90, 68])
Traceback (most recent call last):
  File "/workspace/dev/vipshop/cache-dit/examples/pipeline/run_wan_2.2_i2v.py", line 154, in <module>
    video = run_pipe()
            ^^^^^^^^^^
  File "/workspace/dev/vipshop/cache-dit/examples/pipeline/run_wan_2.2_i2v.py", line 130, in run_pipe
    video = pipe(
            ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/dev/vipshop/diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 705, in __call__
    latents_outputs = self.prepare_latents(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/dev/vipshop/diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 487, in prepare_latents
    return latents, torch.concat([mask_lat_size, latent_condition], dim=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 81 for tensor number 1 in the list.
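The mismatch in the traceback above can be checked from the shapes alone. The sketch below uses plain Python tuples and a hypothetical helper, `concat_shape`, that only mimics the shape rule of concatenation (all axes except the concatenation axis must match). It is not part of torch or diffusers.

```python
def concat_shape(shapes, dim):
    """Return the shape a concatenation along `dim` would produce.

    Raises ValueError when any axis other than `dim` differs,
    which is the rule the RuntimeError above reports.
    """
    result = list(shapes[0])
    for idx, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(result):
            raise ValueError(f"tensor {idx} has a different number of dimensions")
        for axis, (expected, got) in enumerate(zip(shapes[0], shape)):
            if axis != dim and expected != got:
                raise ValueError(
                    f"axis {axis}: expected size {expected} but got {got} "
                    f"for tensor number {idx}"
                )
        result[dim] += shape[dim]
    return tuple(result)


# Shapes copied from the logs in this PR.
mask_lat_size = (1, 4, 21, 90, 68)
condition_without_fix = (1, 16, 81, 90, 68)  # temporal axis still uses num_frames
condition_with_fix = (1, 16, 21, 90, 68)     # temporal axis uses num_latent_frames

print(concat_shape([mask_lat_size, condition_with_fix], dim=1))

try:
    concat_shape([mask_lat_size, condition_without_fix], dim=1)
except ValueError as err:
    print("mismatch:", err)
```

Only the temporal axis (index 2) differs between the two condition shapes, so aligning it with the mask is enough for the concatenation along dim=1 to go through.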

With this fix:

(cu129)➜ pipeline git:(dev) ✗ python3 run_wan_2.2_i2v.py
video_condition shape: torch.Size([1, 3, 21, 720, 544])
torch.Size([1, 1, 81, 90, 68])
torch.Size([1, 1, 1, 90, 68])
torch.Size([1, 1, 4, 90, 68])
torch.Size([1, 1, 84, 90, 68]) 4
torch.Size([1, 21, 4, 90, 68])
torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 21, 90, 68])
torch.Size([1, 16, 21, 90, 68]) torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 21, 90, 68])
 42%|████████████████████                                                 | 17/40 [10:17<13:55, 36.33s/it]

Reproduce

from: https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"
dtype = torch.bfloat16
device = "cuda"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.to(device)


image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)
max_area = 480 * 832
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
generator = torch.Generator(device=device).manual_seed(0)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=3.5,
    num_inference_steps=40,
    generator=generator,
).frames[0]
export_to_video(output, "i2v_output.mp4", fps=16)

@DefTruth
Contributor Author

@yiyixuxu hi~ can you take a look at this PR? I found that you are the author of:

@DefTruth DefTruth closed this Oct 20, 2025
Development

Successfully merging this pull request may close these issues.

Wan 2.2 I2V condition shape mismatch