
Conversation

leejet (Owner) commented Aug 29, 2025

Feature:

  • Wan2.1 T2V 1.3B
  • Wan2.1 T2V 14B
  • Wan2.1 I2V 14B
  • Wan2.2 T2V A14B
  • Wan2.2 I2V A14B
  • Wan2.2 TI2V 5B
  • Wan2.1 FLF2V 14B
  • Wan2.2 FLF2V 14B

TODO:

  • VACE
  • Fun control
  • Reduce the memory usage of WAN VAE

Warning: Currently, only the CUDA and CPU backends support WAN VAE. If you are using another backend, try using --vae-on-cpu to run the WAN VAE on the CPU, although this will be very slow.
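For example, the 1.3B T2V example below could be run on such a backend like this (a sketch reusing the same placeholder model paths as the examples):

.\bin\Release\sd.exe -M vid_gen --diffusion-model ..\..\ComfyUI\models\diffusion_models\wan2.1_t2v_1.3B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -W 832 -H 480 --video-frames 33 --vae-on-cpu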

Examples

Since GitHub does not support AVI files, the files I uploaded were converted from AVI to MP4.

Wan2.1 T2V 1.3B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1_t2v_1.3B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 832 -H 480 --diffusion-fa --video-frames 33
Wan2.1_1.3B_t2v.mp4
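(For reference: -M vid_gen selects the video generation mode, --diffusion-fa enables flash attention in the diffusion model, and --video-frames 33 sets the clip length. The long -n argument is the commonly used Chinese Wan negative prompt; roughly: gaudy colors, overexposed, static, blurry details, subtitles, style, artwork, painting, still frame, overall gray, worst quality, low quality, JPEG artifacts, ugly, mutilated, extra fingers, badly drawn hands, badly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, crowded background, walking backwards.)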

Wan2.1 T2V 14B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1-t2v-14b-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 832 -H 480 --diffusion-fa  --offload-to-cpu --video-frames 33
Wan2.1_14B_t2v.mp4

Wan2.1 I2V 14B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1-i2v-14b-480p-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf --clip_vision ..\..\ComfyUI\models\clip_vision\clip_vision_h.safetensors -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --video-frames 33 --offload-to-cpu -i ..\assets\cat_with_sd_cpp_42.png
Wan2.1_14B_i2v.mp4

Wan2.2 T2V A14B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu --video-frames 33
Wan2.2_14B_t2v.mp4
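Note the two-expert setup here: Wan2.2 A14B splits denoising across two models, so sampling starts on the model passed via --high-noise-diffusion-model for the early, high-noise part of the schedule and then hands off to the low-noise model passed via --diffusion-model, which is why each stage has its own cfg-scale, sampler, and step count.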

Wan2.2 I2V A14B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu --video-frames 33 -i ..\assets\cat_with_sd_cpp_42.png
Wan2.2_14B_i2v.mp4

Wan2.2 T2V A14B T2I

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu
Wan2.2_14B_t2i

Wan2.2 T2V A14B with LoRA

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-T2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat<lora:wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise:1><lora:|high_noise|wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise:1>" --cfg-scale 3.5 --sampling-method euler --steps 4 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 4 -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 832 -H 480 --diffusion-fa --offload-to-cpu --lora-model-dir ..\..\ComfyUI\models\loras --video-frames 33
Wan2.2_14B_t2v_lora.mp4
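As far as I can tell, the |high_noise| prefix inside the second LoRA tag routes that LoRA to the high-noise expert, while the unprefixed tag applies to the low-noise model; both files are resolved under --lora-model-dir.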

Wan2.2 TI2V 5B

T2V

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.2_ti2v_5B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan2.2_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33
Wan2.2_5B_t2v.mp4

I2V

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.2_ti2v_5B_fp16.safetensors --vae ..\..\ComfyUI\models\vae\wan2.2_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf  -p "a lovely cat" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --offload-to-cpu --video-frames 33 -i ..\assets\cat_with_sd_cpp_42.png
Wan2.2_5B_i2v.mp4

Wan2.1 FLF2V 14B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\wan2.1-flf2v-14b-720p-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf --clip_vision ..\..\ComfyUI\models\clip_vision\clip_vision_h.safetensors -p "glass flower blossom" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --video-frames 33 --offload-to-cpu --init-img ..\..\ComfyUI\input\start_image.png --end-img ..\..\ComfyUI\input\end_image.png
Wan2.1_14B_flf2v.mp4

Wan2.2 FLF2V 14B

.\bin\Release\sd.exe -M vid_gen --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-LowNoise-Q8_0.gguf --high-noise-diffusion-model  ..\..\ComfyUI\models\diffusion_models\Wan2.2-I2V-A14B-HighNoise-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\wan_2.1_vae.safetensors --t5xxl ..\..\ComfyUI\models\text_encoders\umt5-xxl-encoder-Q8_0.gguf --cfg-scale 3.5 --sampling-method euler --steps 10 --high-noise-cfg-scale 3.5 --high-noise-sampling-method euler --high-noise-steps 8 -v -p "glass flower blossom" -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 480 -H 832 --diffusion-fa --video-frames 33 --offload-to-cpu --init-img ..\..\ComfyUI\input\start_image.png --end-img ..\..\ComfyUI\input\end_image.png
Wan2.2_14B_flf2v.mp4
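Note that this example reuses the Wan2.2 I2V A14B models rather than a dedicated FLF2V checkpoint; the first and last frames are supplied via --init-img and --end-img.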

leejet (Owner, Author) commented Aug 29, 2025

Finally, support for Wan has been added. This took me a long time. Once this PR is merged, I will try to add support for Qwen Image.

Green-Sky (Contributor) commented

Great job @leejet, very nice. I can't wait to try it later.

I see you went for MJPEG+AVI; is there also an option to output a PNG image sequence?

leejet (Owner, Author) commented Aug 29, 2025

> Great job @leejet, very nice. I can't wait to try it later.
>
> I see you went for MJPEG+AVI; is there also an option to output a PNG image sequence?

I will add command-line parameters to control it, but the priority is not very high.

chaserhkj commented

@Green-Sky It might not be a problem with the implementation but with the smaller model itself. I think that smaller model's distillation is not done very well; I had a lot of trouble getting consistent results in ComfyUI using that smaller model as well. I had far better chances using a quantized version of the full model.

Green-Sky (Contributor) commented Sep 2, 2025

> @Green-Sky It might not be a problem with the implementation but with the smaller model itself. I think that smaller model's distillation is not done very well; I had a lot of trouble getting consistent results in ComfyUI using that smaller model as well. I had far better chances using a quantized version of the full model.

Hmm. I don't think you can call Wan2.2 TI2V 5B a distilled model. It has its own VAE, which has far more compression than the other VAE.

> Wan2.2 open-sources a 5B model built with our advanced Wan2.2-VAE that achieves a compression ratio of 16×16×4.
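As a quick sanity check of that ratio: a 480×832 clip with 33 frames becomes a 30×52 latent with (33 − 1)/4 + 1 = 9 temporal slices under the 16×16×4 Wan2.2-VAE, whereas the 8×8×4 Wan2.1 VAE would give a 60×104 spatial latent for the same input.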

Also, the same model behaves just fine with text-only input.

stduhpf (Contributor) commented Sep 2, 2025

> Wan2.2 TI2V 5B with image input still seems to be somewhat broken.

I think Wan2.2 TI2V kind of sucks in I2V mode in ComfyUI too.

Edit: I tried to match the settings as well as I could in ComfyUI; it's definitely not as bad, but maybe it's just a lucky seed.
(attached image: ComfyUI_01039_)

Edit 2: No, something's definitely wrong with this PR's implementation; the cat keeps sneezing no matter the seed, and this doesn't happen at all in ComfyUI.

Edit 3: I was using --sed instead of --seed (thank god #767 is merged in master now)

(attached outputs: seed 42 and seed 0)

chaserhkj commented Sep 2, 2025 via email

tyllmoritz commented

When I tried this with the Vulkan backend, I had problems with im2col_3d.

For a quick and dirty test, I just reverted the commit "cuda/cpu: add im2col_3d support" in https://github.com/leejet/ggml/tree/wan.

There are two better solutions (already implemented by @leejet and @jeffbolznv, thanks for your work):

leejet (Owner, Author) commented Sep 6, 2025

Since ggml-org/ggml has already synchronized the PR I made to add the WAN-related operations, I have decided to merge this PR first; it already contains too many changes. Support for VACE and Fun control will come in a separate pull request.

leejet merged commit cb1d975 into master on Sep 6, 2025 (8 checks passed).
Green-Sky (Contributor) left a review:

Sorry for the late review.

Green-Sky (Contributor) commented

--offload-to-cpu is missing from the help output as well.

leejet (Owner, Author) commented Sep 6, 2025

All of these have been fixed. Thank you for your review comments.

Amin456789 commented

@LostRuins please add this to your GUI if possible. It would be great if you could add support for LoRA too.

Thank you all for making this, thanks leejet and everyone else.

LostRuins (Contributor) commented Sep 28, 2025

Hello @leejet, I noticed that sd_vid_gen_params_t doesn't contain any parameters for toggling VAE tiling. Does VAE tiling currently work for WAN videos, and is it possible to enable it? Thanks!

Edit: The reason I ask is that without VAE tiling, it currently tries to allocate a massive buffer on Vulkan and goes OOM.

LostRuins (Contributor) commented

Also, can someone help me understand how flow shift works? Is that what's causing these abrupt transitions, and how can I avoid it?
(attached video: cat)

LostRuins (Contributor) commented Sep 30, 2025

wtf

Still getting really weird results in most generations.

@wbruna any ideas?

Final edit: All resolved by switching to wan2.2-rapid-mega-aio-v3

leejet (Owner, Author) commented Oct 11, 2025

> Hello @leejet, I noticed that sd_vid_gen_params_t doesn't contain any parameters for toggling VAE tiling. Does VAE tiling currently work for WAN videos, and is it possible to enable it? Thanks!
>
> Edit: The reason I ask is that without VAE tiling, it currently tries to allocate a massive buffer on Vulkan and goes OOM.

Currently, WAN VAE does not support video tiling, and I haven’t tested the feasibility of video tiling yet.

leejet (Owner, Author) commented Oct 11, 2025

> Also, can someone help me understand how flow shift works? Is that what's causing these abrupt transitions, and how can I avoid it?

Try lower shift values (2.0 to 5.0) for lower-resolution videos and higher shift values (7.0 to 12.0) for higher-resolution videos. https://huggingface.co/docs/diffusers/en/api/pipelines/wan#notes
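For intuition, here is a minimal sketch of the usual flow-matching timestep shift used by SD3/Flux-style flow models (my own illustration of the standard formula, not code from this repo). A larger shift keeps more of the step budget at the high-noise end of the schedule, which is why higher resolutions tend to want higher shift values.

#include <cstdio>

// Flow-matching timestep shift: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
// shift = 1.0 leaves the schedule unchanged; larger values push intermediate
// steps toward high noise.
double shift_sigma(double sigma, double shift) {
    return shift * sigma / (1.0 + (shift - 1.0) * sigma);
}

int main() {
    const double shift = 5.0;  // example value for a ~480p video
    const int steps    = 10;
    for (int i = 0; i <= steps; i++) {
        double sigma = 1.0 - (double)i / steps;  // plain linear schedule from 1 to 0
        printf("sigma %.2f -> %.3f\n", sigma, shift_sigma(sigma, shift));
    }
    return 0;
}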

LostRuins (Contributor) commented

> Currently, WAN VAE does not support video tiling, and I haven't tested the feasibility of video tiling yet.

Would it be possible to simply do the VAE per-frame (the entire frame at once)? I confess I don't know how it works, but the memory usage for a single-frame image is perfectly OK. The problem only comes when doing longer videos with many frames.

leejet (Owner, Author) commented Oct 12, 2025

> > Currently, WAN VAE does not support video tiling, and I haven't tested the feasibility of video tiling yet.
>
> Would it be possible to simply do the VAE per-frame (the entire frame at once)? I confess I don't know how it works, but the memory usage for a single-frame image is perfectly OK. The problem only comes when doing longer videos with many frames.

        struct ggml_tensor* decode(struct ggml_context* ctx,
                                   struct ggml_tensor* z,
                                   int64_t b = 1) {
            // z: [b*c, t, h, w]
            GGML_ASSERT(b == 1);

            clear_cache();

            auto decoder = std::dynamic_pointer_cast<Decoder3d>(blocks["decoder"]);
            auto conv2   = std::dynamic_pointer_cast<CausalConv3d>(blocks["conv2"]);

            int64_t iter_ = z->ne[2];
            auto x        = conv2->forward(ctx, z);
            struct ggml_tensor* out;
            // Decode one temporal latent slice per iteration; the causal conv
            // feature cache (_feat_map) carries context over from the previous slice.
            for (int64_t i = 0; i < iter_; i++) {
                _conv_idx = 0;
                if (i == 0) {
                    auto in = ggml_slice(ctx, x, 2, i, i + 1);  // [b*c, 1, h, w]
                    out     = decoder->forward(ctx, in, b, _feat_map, _conv_idx, i);
                } else {
                    auto in   = ggml_slice(ctx, x, 2, i, i + 1);  // [b*c, 1, h, w]
                    auto out_ = decoder->forward(ctx, in, b, _feat_map, _conv_idx, i);
                    out       = ggml_concat(ctx, out, out_, 2);  // append along the time axis
                }
            }
            if (wan2_2) {
                // The Wan2.2 VAE packs 2x2 pixel patches into channels; undo that here.
                out = unpatchify(ctx, out, 2, b);
            }
            clear_cache();
            return out;
        }

Currently, decoding is done frame by frame, and the compute buffer size used is the same for both 33 frames and 81 frames.

LostRuins (Contributor) commented

Oh, then why is it smaller for something like 1 frame or 5 frames?

leejet (Owner, Author) commented Oct 12, 2025

Starting from chunk 1, each chunk depends on data from the previous chunk, so the computation graph is different, causing the compute buffer to grow. In theory, after chunk 1, the compute buffer shouldn't grow anymore, but in practice, it actually stops growing after chunk 2. I tried creating a separate computation graph for each chunk, and indeed, the buffer no longer grows after chunk 1. However, the results for chunk 1 were a bit odd, so I disabled the related code; you can check the code around build_graph_partial.

By the way, for Wan VAE, the decoding rule for chunks is: chunk 0 corresponds to 1 frame, and starting from chunk 1, each chunk corresponds to 4 frames.
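So the frame count maps directly onto the number of decode chunks; here is a small sketch of that bookkeeping (my own illustration of the rule above, not repo code):

#include <cstdio>

// Wan VAE temporal rule: chunk 0 decodes 1 pixel frame, every later chunk decodes 4.
// A clip with `frames` pixel frames therefore uses 1 + (frames - 1) / 4 chunks,
// which matches the temporal size of the latent (valid when frames % 4 == 1).
int num_chunks(int frames) {
    return 1 + (frames - 1) / 4;
}

int main() {
    const int frame_counts[] = {1, 5, 33, 81};
    for (int frames : frame_counts) {
        printf("%2d frames -> %2d decode chunks / latent frames\n",
               frames, num_chunks(frames));
    }
    return 0;
}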

henk717 commented Oct 12, 2025

As a data point: for 80 frames (on the KoboldCpp side), I am measuring 75 GB of VRAM used during generation with the 14B 2.2 model. If I use only 10 frames, it fits on my 3090 fine. So something is ballooning the VRAM usage at higher frame counts.

LostRuins (Contributor) commented Oct 13, 2025

What resolution were you generating at?

Also, if this logic is correct, then 10 frames should take the same amount of memory as 80 frames, but it seems higher.

leejet (Owner, Author) commented Oct 13, 2025

Have you used --diffusion-fa? This option can significantly reduce the VRAM usage.
