feat: add wan2.1/2.2 support #778
Conversation
Finally, Wan support has been added. This took me a long time. Once this PR is merged, I will try to add support for Qwen Image. |
Great job @leejet , very nice. I can't wait to try it later. I see you went for MJPEG+AVI; is there also an option to output it as a PNG image sequence? |
I will add command-line parameters to control it, but the priority is not very high. |
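For reference, dumping decoded frames as a numbered PNG sequence is fairly simple. Below is a minimal sketch (not the PR's code), assuming the frames are packed 8-bit RGB and that stb_image_write.h, which is commonly vendored alongside stb_image for PNG output, is available in the include path.

// Minimal sketch (not the PR's code): write each decoded frame as
// frame_00000.png, frame_00001.png, ... Assumes packed 8-bit RGB frames
// and that stb_image_write.h is available.
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#include <cstdint>
#include <cstdio>
#include <vector>

bool write_png_sequence(const std::vector<const uint8_t*>& frames,
                        int width, int height) {
    char name[64];
    for (size_t i = 0; i < frames.size(); ++i) {
        snprintf(name, sizeof(name), "frame_%05zu.png", i);
        // 3 channels (RGB), row stride = width * 3 bytes
        if (!stbi_write_png(name, width, height, 3, frames[i], width * 3)) {
            return false;
        }
    }
    return true;
}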
@Green-Sky It might not be a problem with the implementation but with the smaller model itself. I think that smaller model's distillation was not done very well; I had a lot of trouble getting consistent results in ComfyUI using that smaller model as well. I had far better chances using a quantized version of the full model. |
Hmm. I don't think you can call Wan2.2 TI2V 5B a distilled model. It has its own VAE, which has way more compression than the other VAEs.
Also, the same model behaves just fine with text-only input. |
I think Wan2.2 TI2V kind of sucks in I2V mode in ComfyUI too. Edit: I tried to match the settings as well as I could in Comfy; it's definitely not as bad, but maybe it's just a lucky seed. Edit 2: No, something's definitely wrong with this PR's implementation: the cat keeps sneezing no matter the seed, and this doesn't happen at all in ComfyUI. Edit 3: I was using
|
You're correct, but maybe this just means it's not very well trained anyway. I think efforts around quantizing the 14B model still make far more sense for lower-end devices. The VAE is the problem there, though: in my use case in ComfyUI I was constantly hit with VRAM OOMs on a 16GB GPU during VAE processing and had to do a lot of offloading.
"Wan2.2 open-sources a 5B model built with our advanced Wan2.2-VAE that achieves a compression ratio of 16×16×4."
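For a concrete sense of what a 16×16×4 (width × height × time) compression ratio implies, here is a small illustrative calculation. The input size is a made-up example, and the temporal rounding (first frame kept whole) is a guess based on the chunk rule discussed later in this thread, not something stated here.

// Illustrative only: rough latent-grid arithmetic for a VAE with 16x16x4
// compression. Input size and rounding are assumptions for illustration.
#include <cstdio>

int main() {
    int W = 1280, H = 704, T = 121;  // example video: width, height, frames
    int sw = 16, sh = 16, st = 4;    // spatial and temporal compression factors

    int lw = W / sw;                 // 80 latent columns
    int lh = H / sh;                 // 44 latent rows
    int lt = (T - 1) / st + 1;       // 31 latent frames (first frame kept whole)

    printf("latent grid: %d x %d x %d\n", lw, lh, lt);
    return 0;
}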
|
When I tried this with the Vulkan backend, I had problems with im2col_3d. For a quick and dirty test, I just reverted the commit "cuda/cpu: add im2col_3d support" in https://github.com/leejet/ggml/tree/wan. There are two better solutions (already implemented by @leejet and @jeffbolznv, thanks for your work):
|
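For readers unfamiliar with the op mentioned above: im2col_3d unrolls the receptive field of a 3D convolution into columns so the convolution reduces to a matrix multiply. The sketch below is a naive reference of the general idea only; the loop order and memory layout are illustrative assumptions and do not reflect ggml's actual implementation.

// Naive reference of the im2col_3d idea (not ggml's layout): for every
// output voxel, gather the IC*KD*KH*KW input values the 3D kernel touches.
#include <cstdint>
#include <vector>

std::vector<float> im2col_3d_ref(const std::vector<float>& src,
                                 int IC, int D, int H, int W,
                                 int KD, int KH, int KW,
                                 int stride, int pad) {
    int OD = (D + 2 * pad - KD) / stride + 1;
    int OH = (H + 2 * pad - KH) / stride + 1;
    int OW = (W + 2 * pad - KW) / stride + 1;
    std::vector<float> cols((size_t)OD * OH * OW * IC * KD * KH * KW, 0.0f);
    size_t col = 0;
    for (int od = 0; od < OD; ++od)
      for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow)
          for (int c = 0; c < IC; ++c)
            for (int kd = 0; kd < KD; ++kd)
              for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw) {
                    int d = od * stride - pad + kd;
                    int h = oh * stride - pad + kh;
                    int w = ow * stride - pad + kw;
                    float v = 0.0f;  // zero padding outside the volume
                    if (d >= 0 && d < D && h >= 0 && h < H && w >= 0 && w < W)
                        v = src[(((size_t)c * D + d) * H + h) * W + w];
                    cols[col++] = v;
                }
    return cols;
}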
Since ggml-org/ggml has already synchronized the PR that I made for adding WAN-related operations, I have decided to merge this PR first. This PR already contains too many changes. Support for VACE and FUN will be in a separate pull request. |
Sorry for the late review.
|
All of these have been fixed. Thank you for your review comments. |
@LostRuins please add this to your GUI if possible. It would be great if you added support for LoRA too. Thank you guys for making this, thanks leejet and others. |
Hello @leejet , I noticed that the Edit: the reason is that, without VAE tiling, it currently tries to allocate a massive buffer on Vulkan and goes OOM. |
Still getting really weird results in most gens. @wbruna, any ideas? Final edit: all resolved by switching to wan2.2-rapid-mega-aio-v3. |
Currently, WAN VAE does not support video tiling, and I haven’t tested the feasibility of video tiling yet. |
Try lower shift values (2.0 to 5.0) for lower resolution videos and higher shift values (7.0 to 12.0) for higher resolution images. https://huggingface.co/docs/diffusers/en/api/pipelines/wan#notes |
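As a rough illustration of that guidance, here is a tiny helper sketch. The 720p cutoff and the 3.0/7.0 values are assumptions for illustration only, not values from this PR or from diffusers.

// Rough heuristic following the diffusers note quoted above: smaller flow
// shift for low resolutions, larger shift for high resolutions. The 720p
// cutoff and the 3.0 / 7.0 values are illustrative assumptions.
float pick_flow_shift(int width, int height) {
    const bool high_res = (width * height) >= (1280 * 720);
    return high_res ? 7.0f : 3.0f;
}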
Would it be possible to simply do the VAE per-frame (the entire frame at once). I confess I don't know how it works, but the memory usage for a single frame image is perfectly ok. The problem only comes when doing longer videos with many frames. |
struct ggml_tensor* decode(struct ggml_context* ctx,
                           struct ggml_tensor* z,
                           int64_t b = 1) {
    // z: [b*c, t, h, w]
    GGML_ASSERT(b == 1);
    clear_cache();
    auto decoder  = std::dynamic_pointer_cast<Decoder3d>(blocks["decoder"]);
    auto conv2    = std::dynamic_pointer_cast<CausalConv3d>(blocks["conv2"]);
    int64_t iter_ = z->ne[2];
    auto x        = conv2->forward(ctx, z);
    struct ggml_tensor* out;
    for (int64_t i = 0; i < iter_; i++) {
        _conv_idx = 0;
        if (i == 0) {
            auto in = ggml_slice(ctx, x, 2, i, i + 1);  // [b*c, 1, h, w]
            out     = decoder->forward(ctx, in, b, _feat_map, _conv_idx, i);
        } else {
            auto in   = ggml_slice(ctx, x, 2, i, i + 1);  // [b*c, 1, h, w]
            auto out_ = decoder->forward(ctx, in, b, _feat_map, _conv_idx, i);
            out       = ggml_concat(ctx, out, out_, 2);
        }
    }
    if (wan2_2) {
        out = unpatchify(ctx, out, 2, b);
    }
    clear_cache();
    return out;
}

Currently, decoding is done frame by frame, and the compute buffer size used is the same for both 33 frames and 81 frames. |
Oh, then why is it smaller for something like 1 frame or 5 frames? |
Starting from chunk 1, each chunk depends on data from the previous chunk, so the computation graph is different, causing the compute buffer to grow. In theory, the compute buffer shouldn't grow anymore after chunk 1, but in practice it actually stops growing after chunk 2. I tried creating a separate computation graph for each chunk, and indeed the buffer no longer grows after chunk 1. However, the results for chunk 1 were a bit odd, so I disabled the related code; you can check the code around
By the way, for the Wan VAE, the decoding rule for chunks is: chunk 0 corresponds to 1 frame, and starting from chunk 1, each chunk corresponds to 4 frames. |
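As a small illustration of that chunk rule, here is a sketch that maps an output frame count to the number of decode chunks. It is based only on the description above, not on the PR's code.

// Sketch of the chunk rule described above: chunk 0 decodes 1 frame,
// every later chunk decodes 4 frames. Not taken from the PR's code.
#include <cstdio>

int chunks_for_frames(int frames) {
    if (frames <= 1) return 1;
    return 1 + (frames - 1 + 3) / 4;  // 1 + ceil((frames - 1) / 4)
}

int main() {
    for (int frames : {1, 5, 33, 81}) {
        printf("%2d frames -> %d chunks\n", frames, chunks_for_frames(frames));
    }
    return 0;
}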
As a measure, in our mode, for 80 frames (KoboldCpp side), I am measuring 75 GB of VRAM used during generation on the 14B 2.2. If I use only 10 frames, I can fit it on my 3090 fine. So something is ballooning the VRAM usage at higher frame counts. |
What resolution were you generating at? Also, if this logic is correct, then 10 frames should take the same amount of memory as 80 frames, but it seems higher. |
Have you used --diffusion-fa? This option can significantly reduce the VRAM usage. |
Feature:
TODO:
Warning: Currently, only the CUDA and CPU backends support the WAN VAE. If you are using another backend, try using --vae-on-cpu to run the WAN VAE on the CPU, although this will be very slow.

Examples
Since GitHub does not support AVI files, the file I uploaded was converted from AVI to MP4.
Wan2.1 T2V 1.3B
Wan2.1_1.3B_t2v.mp4
Wan2.1 T2V 14B
Wan2.1_14B_t2v.mp4
Wan2.1 I2V 14B
Wan2.1_14B_i2v.mp4
Wan2.2 T2V A14B
Wan2.2_14B_t2v.mp4
Wan2.2 I2V A14B
Wan2.2_14B_i2v.mp4
Wan2.2 I2V A14B T2I
Wan2.2 T2V 14B with Lora
Wan2.2_14B_t2v_lora.mp4
Wan2.2 TI2V 5B
T2V
Wan2.2_5B_t2v.mp4
I2V
Wan2.2_5B_i2v.mp4
Wan2.1 FLF2V 14B
Wan2.1_14B_flf2v.mp4
Wan2.2 FLF2V 14B
Wan2.2_14B_flf2v.mp4