WAN 2.1 VACE 1.3B - V2V: Out-of-Order Mask Sequence Loading Bug?

So I've managed to get WAN 2.1 VACE working on my rig with ROCm (reference: RX 7600 XT (16GB), 48 GB RAM, Manjaro Linux Kernel v. 6.16.4, ROCm v.6.4.3), but attempting to use reference video frames as a video control source isn't working correctly. The control frames are being loaded in what appears to be a random order, so the resulting AVI output has "scrambled" motion.

The problem could be due to some command option not being set that I'm unaware of.

Here's what I'm doing and what I've tried, tested with both master-301-fd693ac and master-306-2abe945 (latest):

1. Use WAN 2.1 VACE to generate a 7 second AVI at 12 FPS (half-size, 416x240 is all my system can realistically handle).
```shell
./sd -M vid_gen \
--diffusion-model ./models/checkpoints/wan2.1_vace_1.3B_fp16.safetensors \
--vae ./models/vae/wan_2.1_vae.safetensors \
--t5xxl ./models/text_encoders/umt5_xxl_fp16.safetensors \
--lora-model-dir ./models/loras \
--embd-dir ./models/embeddings \
-s -1 \
--cfg-scale 5.5 \
--steps 50 \
--sampling-method euler \
-W 416 -H 240 \
--fps 12 \
--video-frames 85 \
--offload-to-cpu \
--clip-on-cpu \
-v \
-p "medium shot, a lovely cat in a carpeted room, turning to face the camera and walking toward it" \
-n "anime, cartoon, drawing, 3d render, cgi render, ai generated, ugly face" \
-o /run/media/Storage/TestingWeirdness/WAN2.1-VACE_09-23_02h.avi \
-i ./References/CatRef.png
``` 
2. Use FFmpeg and "Video-Depthify" (https://github.com/jankais3r/Video-Depthify) to break that AVI into sequential depth mask images, named in numerical frame order 01-37, 0001-0037, etc. (I've tried different zero left-pad combinations to no avail.) I can confirm the resulting depth mask images are in the correct sequence using Viewnior, which steps through the images in a given directory in name order when the SPC/BKSP or UP/DOWN arrow keys are pressed.
```shell
ffmpeg -i /run/media/Storage/TestingWeirdness/WAN2.1-VACE_09-23_02h.avi -qmin 1 -qscale:v 1 ./rgb/%04d.jpg
python3 depth.py
``` 
3. Use a single reference image to change the visual subject, and the depth map sequence to replicate movement to create a new AVI.
```shell
./sd -M vid_gen \
--diffusion-model ./models/checkpoints/wan2.1_vace_1.3B_fp16.safetensors \
--vae ./models/vae/wan_2.1_vae.safetensors \
--t5xxl ./models/text_encoders/umt5_xxl_fp16.safetensors \
--lora-model-dir ./models/loras \
--embd-dir ./models/embeddings \
-s -1 \
--cfg-scale 5.5 \
--steps 50 \
--sampling-method euler \
-W 416 -H 240 \
--fps 12 \
--video-frames 85 \
--offload-to-cpu \
--clip-on-cpu \
-v \
-p "medium shot, a lovely ferret in a carpeted room, turning to face the camera and walking toward it" \
-n "anime, cartoon, drawing, 3d render, cgi render, ai generated, ugly face" \
-o /run/media/Storage/TestingWeirdness/WAN2.1-VACE_09-23_03a.avi \
-i ./References/FerretRef.png \
--control-video ./WAN-Tests_Archive/vidref/crap/Video-Depthify/depth
``` 
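
To rule out a problem with the depth frames themselves (step 2), the numbering can be checked for gaps from the shell rather than by eye. This is only a rough sketch: `check_frames` is a hypothetical helper name, and it assumes frames are numbered from 1 with the digits sitting right before the file extension (e.g. `0001.jpg` or `frame_0001.png`) at a consistent zero-padded width.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report gaps in a zero-padded numeric frame sequence.
# Assumes numbering starts at 1 and the digits come right before the extension.
check_frames() {
    local dir="$1" expected=1 missing=0 f n
    # The shell sorts glob results, so equal-width zero-padded names
    # are visited in frame order.
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        n="${f##*/}"        # strip the directory part
        n="${n%.*}"         # strip the extension
        n="${n##*[!0-9]}"   # keep only the trailing digits
        [ -n "$n" ] || continue
        n=$((10#$n))        # force base 10 so 0008 is not parsed as octal
        while [ "$expected" -lt "$n" ]; do
            echo "missing frame $expected"
            missing=$((missing + 1))
            expected=$((expected + 1))
        done
        expected=$((n + 1))
    done
    echo "last frame: $((expected - 1)), missing: $missing"
}

# Example: check_frames ./WAN-Tests_Archive/vidref/crap/Video-Depthify/depth
```

If this reports no missing frames, the input sequence is intact and any scrambling has to happen at load time.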

The end result is always the same. SD appears to be loading the depth map frame images out of sequence...
```shell
load image 0 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/20.jpg'
load image 1 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/02.jpg'
load image 2 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/63.jpg'
load image 3 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/10.jpg'
load image 4 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/61.jpg'
load image 5 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/79.jpg'
load image 6 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/08.jpg'
load image 7 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/21.jpg'
load image 8 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/43.jpg'
load image 9 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/24.jpg'
load image 10 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/32.jpg'
....
load image 0 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0079.jpg'
load image 1 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0070.jpg'
load image 2 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0048.jpg'
load image 3 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0059.jpg'
load image 4 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0029.jpg'
load image 5 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0083.jpg'
load image 6 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0034.jpg'
load image 7 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0011.jpg'
load image 8 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0036.jpg'
load image 9 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0065.jpg'
load image 10 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/0060.jpg'
....
load image 0 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0040.png'
load image 1 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0063.png'
load image 2 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0025.png'
load image 3 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0038.png'
load image 4 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0023.png'
load image 5 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0065.png'
load image 6 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0049.png'
load image 7 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0035.png'
load image 8 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0070.png'
load image 9 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0016.png'
load image 10 from './WAN-Tests_Archive/vidref/crap/Video-Depthify/depth/frame_0009.png'
....
etc.
```
...so the resulting video's motion is constantly screwed up.
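
One possible explanation (a guess on my part, not something I've confirmed in the sd source) is that the frames are read in whatever order the filesystem returns directory entries, with no sort applied afterwards. That order is not guaranteed to follow file names. The sketch below compares the raw directory order with a sorted listing; the function names are hypothetical and it assumes GNU `find` (for `-printf`) and GNU `sort`.

```shell
#!/usr/bin/env bash
# Print frame file names in byte-wise lexicographic order.
sorted_frames() {
    find "$1" -maxdepth 1 -type f -printf '%f\n' | LC_ALL=C sort
}

# Print frame file names in the order the filesystem returns them (no sorting).
raw_frames() {
    find "$1" -maxdepth 1 -type f -printf '%f\n'
}

# Report whether the unsorted directory order already matches the sorted order.
compare_order() {
    if [ "$(raw_frames "$1")" = "$(sorted_frames "$1")" ]; then
        echo "directory order matches sorted order"
    else
        echo "directory order differs from sorted order"
    fi
}

# Example:
#   compare_order ./WAN-Tests_Archive/vidref/crap/Video-Depthify/depth
#   raw_frames ./WAN-Tests_Archive/vidref/crap/Video-Depthify/depth | head
```

If `raw_frames` prints the same shuffled order as the `load image` lines above, that would point at an unsorted directory read rather than at anything in the file naming.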

It doesn't appear to matter if the depth map frames are named "frame_0n.jpg", "frames_0n.jpg", "0n.jpg", or any combination of the same with different numbers of leading zeros for the frame number. They're still getting loaded out of order. I've also tried using PNG files instead of JPEGs and keep getting scrambled results.
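
For context on why the padding was worth trying: zero-padding only helps when the consumer sorts names as plain strings. A small sketch of that effect (`lex_sort` is a hypothetical helper, and the file names are made up for illustration):

```shell
#!/usr/bin/env bash
# Sort names byte-wise, the way a plain string sort would.
lex_sort() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

lex_sort 1.jpg 2.jpg 10.jpg      # unpadded: 1.jpg, 10.jpg, 2.jpg
lex_sort 01.jpg 02.jpg 10.jpg    # zero-padded: 01.jpg, 02.jpg, 10.jpg
```

The load order in the log above (20, 02, 63, 10, ...) is neither numeric nor lexicographic, which is consistent with no sorting happening at all, so no amount of padding would be expected to fix it.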

The good news is that both the depth map sequence and the single reference image ARE getting used; it's just that the action sequence in the resulting AVI is jumbled to hell and back. Am I doing something wrong here that's not obvious to me, or is the out-of-sequence image loading a bug?

I've attached MP4s of the source, mask, and resulting AVIs, plus the I2V inputs just to be thorough.

| Sources & Gens |
|--------|
| https://github.com/user-attachments/assets/6a22b62a-2493-49e3-b4ed-843711c32528 |
| https://github.com/user-attachments/assets/8dc15c62-fa13-4fd6-b95b-500f6963bc8c |
| https://github.com/user-attachments/assets/cee5a507-b2b5-424a-a32e-f3218a2775da |
| https://github.com/user-attachments/assets/7c322ac1-c62f-411d-b479-02a041afe4c7 |
| https://github.com/user-attachments/assets/8ce3be79-fe46-451a-83b6-9e36e86393a4 | 

| Subject Ref Images |
|--------|
| <img width="416" height="240" alt="Image" src="https://github.com/user-attachments/assets/1fb47438-3b50-4fa6-b0aa-5bdd0703c477" /> |
| <img width="416" height="240" alt="Image" src="https://github.com/user-attachments/assets/23878dbd-6ee2-43f8-88a5-ba764d302368" /> |

WAN 2.1 VACE 1.3B - V2V: Out-of-Order Mask Sequence Loading Bug? #856