@alexheretic alexheretic commented Oct 6, 2025

I experience slow VAE performance on my AMD RX 7900 GRE GPU and can usually improve it by opting for the tiled VAE nodes. However, WanImageToVideo performs VAE encoding internally and is currently not configurable, which makes Wan workflows slow for me; see the benchmarks below.

I propose adding an optional vae_tile_size argument to WanImageToVideo (and similar nodes). By default it will be 0, meaning untiled, i.e. behaving exactly as before. If set, the value is used as both the x and y tile size. This gives users like me a way to work around poor untiled Wan VAE encode performance.

As the default behaviour is unchanged, this should be backward compatible.

Alternatives

  • Add new "tiled" variant nodes for wan, e.g. TiledWanImageToVideo.
  • Automatically pick tiled encoding for certain GPUs, e.g. my GPU -> 256x256 tiled encoding.
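The second alternative could be as simple as a lookup from detected GPU architecture to a tile size; a hedged sketch, where the `gfx1100 -> 256` mapping is an assumption drawn from the benchmarks below and `AUTO_TILE_SIZES` is a hypothetical name:

```python
# Hypothetical arch -> tile-size table; gfx1100 (RDNA3, e.g. RX 7900 GRE)
# is the only entry supported by the benchmarks in this PR.
AUTO_TILE_SIZES = {"gfx1100": 256}


def auto_vae_tile_size(arch, default=0):
    """Return a tile size for a known-slow arch, else 0 (untiled)."""
    return AUTO_TILE_SIZES.get(arch, default)
```

The downside of this approach is maintaining the table; the explicit vae_tile_size argument sidesteps that.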

Wan 2.1 VAE benchmarks (480x832 * 81 frames)

System info

MIOPEN_FIND_MODE=FAST

Total VRAM 16368 MB, total RAM 64217 MB
pytorch version: 2.9.0.dev20250827+rocm6.4
AMD arch: gfx1100
ROCm version: (6, 4)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 GRE : native
Using Flash Attention
Python version: 3.12.11 (main, Jun  4 2025, 10:32:37) [GCC 15.1.1 20250425]
ComfyUI version: 0.3.62
ComfyUI frontend version: 1.27.7
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16

VAE Encode

Benches show a significant improvement using tiled VAE encoding. On my setup, 256x256 performed best: 589s -> 25s.

Untiled vs 512 vs 384 vs 256 vs 128

2 runs each.
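For anyone reproducing these numbers outside of ComfyUI's own node-timing logs, a minimal wall-clock helper that emits the same `[label]: Ns` format (the helper itself is not part of this PR):

```python
import time


def bench(label, fn, runs=2):
    """Run fn `runs` times, printing per-run wall-clock seconds."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - t0
        times.append(elapsed)
        print(f"[{label}]: {elapsed:.2f}s")
    return times
```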

untiled

Yes really 10 minutes 😞

[WanImageToVideo]: 608.79s
[WanImageToVideo]: 588.72s

tiled 512,512,32,256,8

[WanImageToVideo]: 41.86s
[WanImageToVideo]: 43.68s

tiled 384,384,32,256,8

[WanImageToVideo]: 30.41s
[WanImageToVideo]: 28.89s

tiled 256,256,32,256,8

[WanImageToVideo]: 25.00s
[WanImageToVideo]: 25.35s

tiled 128,128,32,256,8

[WanImageToVideo]: 45.57s
[WanImageToVideo]: 45.31s

VAE Decode

Benches also show a significant improvement using tiled VAE decoding. On my setup, 256x256 performed best.
Note: decoding is already handled by a separate node, so no code changes are required; this is just related and perhaps interesting.

Untiled vs 512 vs 384 vs 256 vs 128

4 runs each (where possible).

untiled

OOM 😢

tiled 512,512,32,124,8

OOM 😢

tiled 384,384,32,124,8

[VAEDecodeTiled]: 73.94s
[VAEDecodeTiled]: 99.03s
[VAEDecodeTiled]: 62.71s
[VAEDecodeTiled]: 66.34s

tiled 256,256,32,124,8

[VAEDecodeTiled]: 60.79s
[VAEDecodeTiled]: 61.21s
[VAEDecodeTiled]: 54.53s
[VAEDecodeTiled]: 47.72s

tiled 128,128,32,124,8

[VAEDecodeTiled]: 72.18s
[VAEDecodeTiled]: 71.70s
[VAEDecodeTiled]: 71.47s
[VAEDecodeTiled]: 71.29s
