Fix wan i2v condition shape mismatch #12496

DefTruth · 2025-10-16T10:56:57Z

What does this PR do?

Fix the wan i2v condition shape mismatch. I think we should use num_latent_frames, not the original num_frames, once it has been computed as:

num_latent_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
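As a quick sanity check on the formula above, here is a minimal sketch of the arithmetic. The temporal scale factor of 4 is an assumption inferred from the logged shapes in this PR (81 input frames against 21 latent frames); the helper name is hypothetical and is not part of the pipeline.

```python
def latent_frame_count(num_frames: int, temporal_scale: int) -> int:
    """Mirror the formula from the PR: (num_frames - 1) // scale + 1."""
    return (num_frames - 1) // temporal_scale + 1


# Assumed values: 81 frames as in the reproduce script, temporal scale 4.
print(latent_frame_count(81, 4))  # 21
```

With these values the latent sequence has 21 frames, which is the size the mask tensor carries in the logs.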

w/o this fix:

(cu129)➜ pipeline git:(dev) ✗ python3 run_wan_2.2_i2v.py
torch.Size([1, 1, 4, 90, 68])
torch.Size([1, 1, 84, 90, 68]) 4
torch.Size([1, 21, 4, 90, 68])
torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 81, 90, 68])
Traceback (most recent call last):
  File "/workspace/dev/vipshop/cache-dit/examples/pipeline/run_wan_2.2_i2v.py", line 154, in <module>
    video = run_pipe()
            ^^^^^^^^^^
  File "/workspace/dev/vipshop/cache-dit/examples/pipeline/run_wan_2.2_i2v.py", line 130, in run_pipe
    video = pipe(
            ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/dev/vipshop/diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 705, in __call__
    latents_outputs = self.prepare_latents(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/dev/vipshop/diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 487, in prepare_latents
    return latents, torch.concat([mask_lat_size, latent_condition], dim=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 81 for tensor number 1 in the list.
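The error above comes down to a shape rule: tensors concatenated along dim=1 must agree on every other axis. The sketch below replays that rule on plain tuples using the shapes printed in the logs. `concat_shape` is a hypothetical helper written for illustration, not a torch or diffusers function.

```python
def concat_shape(shapes, dim):
    """Return the shape a concat along `dim` would produce, or raise ValueError."""
    first = shapes[0]
    for idx, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(first):
            raise ValueError(f"rank mismatch for tensor number {idx}")
        for axis, (a, b) in enumerate(zip(first, shape)):
            # Only the concat axis is allowed to differ.
            if axis != dim and a != b:
                raise ValueError(
                    f"Expected size {a} but got size {b} "
                    f"for tensor number {idx} (axis {axis})"
                )
    out = list(first)
    out[dim] = sum(shape[dim] for shape in shapes)
    return tuple(out)


MASK = (1, 4, 21, 90, 68)          # mask shape from the logs
COND_BROKEN = (1, 16, 81, 90, 68)  # condition shape without the fix
COND_FIXED = (1, 16, 21, 90, 68)   # condition shape with the fix

print(concat_shape([MASK, COND_FIXED], dim=1))  # (1, 20, 21, 90, 68)
try:
    concat_shape([MASK, COND_BROKEN], dim=1)
except ValueError as err:
    print(err)
```

The mismatch is on the frame axis: the mask has 21 entries there while the condition has 81.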

w/ this fix:

(cu129)➜ pipeline git:(dev) ✗ python3 run_wan_2.2_i2v.py
video_condition shape: torch.Size([1, 3, 21, 720, 544])
torch.Size([1, 1, 81, 90, 68])
torch.Size([1, 1, 1, 90, 68])
torch.Size([1, 1, 4, 90, 68])
torch.Size([1, 1, 84, 90, 68]) 4
torch.Size([1, 21, 4, 90, 68])
torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 21, 90, 68])
torch.Size([1, 16, 21, 90, 68]) torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 21, 90, 68])
 42%|████████████████████                                                 | 17/40 [10:17<13:55, 36.33s/it]
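The mask shapes printed above can be followed step by step. The sketch below reproduces only the shape arithmetic; the operations named in the comments (slice, repeat, concat, reshape, transpose) are inferred from the logged shapes rather than quoted from the pipeline source, and `mask_shape_trace` is a hypothetical helper.

```python
def mask_shape_trace(batch, num_frames, temporal_scale, height, width):
    """List the mask shapes in the order they appear in the log."""
    padded = temporal_scale + (num_frames - 1)
    if padded % temporal_scale != 0:
        raise ValueError("padded frame count must divide by the temporal scale")
    groups = padded // temporal_scale
    return [
        (batch, 1, num_frames, height, width),       # full-length mask
        (batch, 1, 1, height, width),                # first frame sliced out
        (batch, 1, temporal_scale, height, width),   # first frame repeated
        (batch, 1, padded, height, width),           # repeated frame + the rest
        (batch, groups, temporal_scale, height, width),  # reshaped into groups
        (batch, temporal_scale, groups, height, width),  # axes 1 and 2 swapped
    ]


# Assumed values taken from the log: 81 frames, scale 4, 90x68 latents.
for shape in mask_shape_trace(1, 81, 4, 90, 68):
    print(shape)
```

The last shape, (1, 4, 21, 90, 68), is the one that has to line up with the condition tensor on the frame axis.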

Reproduce

from: https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"
dtype = torch.bfloat16
device = "cuda"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.to(device)


image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)
max_area = 480 * 832
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走"
generator = torch.Generator(device=device).manual_seed(0)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=3.5,
    num_inference_steps=40,
    generator=generator,
).frames[0]
export_to_video(output, "i2v_output.mp4", fps=16)

DefTruth · 2025-10-17T00:56:08Z

@yiyixuxu hi~ could you take a look at this PR? I found that you are the author of:

[wan2.2] follow-up #12024

DefTruth added 2 commits October 16, 2025 10:48

bugfix: fix wan-i2v pipeline condition shape mismatch

6cf9280

bugfix: fix wan-i2v pipeline condition shape mismatch

c417330

DefTruth changed the title ~~Fix wan i2v condition shape~~ Fix wan i2v condition shape mismatch Oct 16, 2025

This was referenced Oct 16, 2025

Wan 2.2 I2V shape error in prepare latents vipshop/cache-dit#294

Open

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 81 for tensor number 1 in the list. vipshop/cache-dit#291

Closed

Merge branch 'main' into fix-wan-i2v-condition-shape

ed8fdfa

DefTruth mentioned this pull request Oct 17, 2025

Wan 2.2 I2V condition shape mismatch #12499

Closed

Merge branch 'main' into fix-wan-i2v-condition-shape

778badb

DefTruth closed this Oct 20, 2025
