WAN 2.1 VACE 1.3B - V2V: Out-of-Order Mask Sequence Loading Bug? #856

@MrSnichovitch

I've managed to get WAN 2.1 VACE working on my rig with ROCm (for reference: RX 7600 XT (16 GB), 48 GB RAM, Manjaro Linux, kernel 6.16.4, ROCm 6.4.3), but attempting to use reference video frames as a video control source isn't working correctly. The control frames are loaded in what appears to be a random order, so the resulting AVI output has "scrambled" motion.

The problem could be due to some command option not being set that I'm unaware of.

Here's what I'm doing and what I've tried, tested with both master-301-fd693ac and master-306-2abe945 (latest):

  1. Use WAN 2.1 VACE to generate a 7-second AVI at 12 FPS (half-size; 416x240 is all my system can realistically handle).
./sd -M vid_gen \
--diffusion-model ./models/checkpoints/wan2.1_vace_1.3B_fp16.safetensors \
--vae ./models/vae/wan_2.1_vae.safetensors \
--t5xxl ./models/text_encoders/umt5_xxl_fp16.safetensors \
--lora-model-dir ./models/loras \
--embd-dir ./models/embeddings \
-s -1 \
--cfg-scale 5.5 \
--steps 50 \
--sampling-method euler \
-W 416 -H 240 \
--fps 12 \
--video-frames 85 \
--offload-to-cpu \
--clip-on-cpu \
-v \
-p "medium shot, a lovely cat in a carpeted room, turning to face the camera and walking toward it" \
-n "anime, cartoon, drawing, 3d render, cgi render, ai generated, ugly face" \
-o /run/media/Storage/TestingWeirdness/WAN2.1-VACE_09-23_02h.avi \
-i ./References/CatRef.png
  2. Use FFmpeg and "Video-Depthify" (https://github.com/jankais3r/Video-Depthify) to break that AVI into sequential depth mask images, named in numerical frame order 01-37, 0001-0037, etc. (I've tried different zero left-pad combinations to no avail.) I can confirm the resulting depth mask images are in the correct sequence using Viewnior, which steps through the images in a given directory in name order when the SPC/BKSP or UP/DOWN arrow keys are pressed.
ffmpeg -i /run/media/Storage/TestingWeirdness/WAN2.1-VACE_09-23_02h.avi -qmin 1 -qscale:v 1 ./rgb/%04d.jpg
python3 depth.py
  3. Use a single reference image to change the visual subject, and the depth map sequence to replicate the movement, to create a new AVI.
./sd -M vid_gen \
--diffusion-model ./models/checkpoints/wan2.1_vace_1.3B_fp16.safetensors \
--vae ./models/vae/wan_2.1_vae.safetensors \
--t5xxl ./models/text_encoders/umt5_xxl_fp16.safetensors \
--lora-model-dir ./models/loras \
--embd-dir ./models/embeddings \
-s -1 \
--cfg-scale 5.5 \
--steps 50 \
--sampling-method euler \
-W 416 -H 240 \
--fps 12 \
--video-frames 85 \
--offload-to-cpu \
--clip-on-cpu \
-v \
-p "medium shot, a lovely ferret in a carpeted room, turning to face the camera and walking toward it" \
-n "anime, cartoon, drawing, 3d render, cgi render, ai generated, ugly face" \
-o /run/media/Storage/TestingWeirdness/WAN2.1-VACE_09-23_03a.avi \
-i ./References/FerretRef.png \
--control-video ./WAN-Tests_Archive/vidref/crap/Video-Depthify/depth
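
To rule out gaps in the depth frames themselves before suspecting the loader, a quick check along these lines may help. This is a minimal Python sketch and not part of sd or Video-Depthify: `frame_numbers` and `missing_frames` are hypothetical helper names, and the sketch assumes each file name ends in its frame number and that numbering starts at 1 (the default for FFmpeg's `%04d` image output).

```python
import re
from pathlib import Path


def frame_numbers(names):
    """Extract the trailing integer from each file stem, e.g. 'frame_0012.png' -> 12."""
    numbers = []
    for name in names:
        match = re.search(r"(\d+)$", Path(name).stem)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_frames(names, expected_count, first=1):
    """Return the frame numbers absent from first .. first + expected_count - 1."""
    present = set(frame_numbers(names))
    return [n for n in range(first, first + expected_count) if n not in present]
```

Calling `missing_frames(os.listdir(depth_dir), 85)` (after `import os`, with `depth_dir` set to the folder passed to `--control-video`) would return an empty list if all 85 frames are present, which would point the finger at load order rather than at the files.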

The end result is always the same. SD appears to be loading the depth map frame images out of sequence...

load image 0 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/20.jpg'
load image 1 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/02.jpg'
load image 2 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/63.jpg'
load image 3 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/10.jpg'
load image 4 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/61.jpg'
load image 5 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/79.jpg'
load image 6 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/08.jpg'
load image 7 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/21.jpg'
load image 8 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/43.jpg'
load image 9 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/24.jpg'
load image 10 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/32.jpg'
....
load image 0 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0079.jpg'
load image 1 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0070.jpg'
load image 2 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0048.jpg'
load image 3 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0059.jpg'
load image 4 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0029.jpg'
load image 5 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0083.jpg'
load image 6 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0034.jpg'
load image 7 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0011.jpg'
load image 8 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0036.jpg'
load image 9 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0065.jpg'
load image 10 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0060.jpg'
....
load image 0 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0040.png'
load image 1 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0063.png'
load image 2 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0025.png'
load image 3 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0038.png'
load image 4 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0023.png'
load image 5 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0065.png'
load image 6 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0049.png'
load image 7 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0035.png'
load image 8 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0070.png'
load image 9 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0016.png'
load image 10 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0009.png'
....
etc.

...so the resulting video's motion is constantly screwed up.

It doesn't appear to matter if the depth map frames are named "frame_0n.jpg", "frames_0n.jpg", "0n.jpg", or any combination of the same with different numbers of leading zeros for the frame number. They're still getting loaded out of order. I've also tried using PNG files instead of JPEGs and keep getting scrambled results.
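
One explanation consistent with padding making no difference is that the frames are enumerated in whatever order the filesystem returns them, with no sort applied at all. That is an assumption about sd's loader, not something confirmed here. The Python sketch below only illustrates the effect: `natural_key` and `compare_orders` are hypothetical helper names, and `os.listdir` stands in for any unsorted directory enumeration.

```python
import os
import re


def natural_key(name):
    """Split a file name into text and integer chunks so that '10' sorts after '2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def compare_orders(directory):
    """Return (raw, ordered): the OS enumeration order and the natural sort order."""
    raw = os.listdir(directory)  # Python documents this order as arbitrary
    ordered = sorted(raw, key=natural_key)
    return raw, ordered
```

If `raw` and `ordered` differ for the depth folder, the files are fine and any consumer that relies on enumeration order will see them shuffled regardless of how they are named.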

The good news is that both the depth map sequence and the single reference image ARE getting used; it's just that the action sequence in the resulting AVI is jumbled to hell and back. Am I doing something wrong here that's not obvious to me, or is the out-of-sequence image loading a bug?

I've attached MP4s of the source, mask, and resulting AVIs, plus the I2V inputs just to be thorough.

Sources & Gens
https://github.com/user-attachments/assets/6a22b62a-2493-49e3-b4ed-843711c32528
https://github.com/user-attachments/assets/8dc15c62-fa13-4fd6-b95b-500f6963bc8c
https://github.com/user-attachments/assets/cee5a507-b2b5-424a-a32e-f3218a2775da
https://github.com/user-attachments/assets/7c322ac1-c62f-411d-b479-02a041afe4c7
https://github.com/user-attachments/assets/8ce3be79-fe46-451a-83b6-9e36e86393a4
Subject Ref Images (two images attached to the original issue)
