
Conversation

@leejet (Contributor) commented Aug 29, 2025

This PR adds some necessary ops for the WAN video model and fixes some issues. See: leejet/stable-diffusion.cpp#778

github-actions bot added the Nvidia GPU, ggml, and Ascend NPU labels (Aug 29, 2025)
github-actions bot added the testing label (Aug 29, 2025)
@jeffbolznv (Collaborator) commented:

This looks really cool. I can add the new op support to Vulkan. Do you happen to have any backend test cases? If not, I can probably add some.

@leejet (Contributor, author) commented Aug 30, 2025

I developed it based on the ggml repository, and I implemented https://github.com/leejet/ggml/blob/wan/tests/test-conv3d.cpp. However, I'm not sure whether it should be synchronized.

@jeffbolznv (Collaborator) commented:

It would be helpful to have test cases in test-backend-ops, particularly for whatever was broken with getrows and scale.

@leejet (Contributor, author) commented Aug 30, 2025

The issue with get_rows/scale is caused by the fact that CUDA's gridDim.y and gridDim.z must be <= 65,535. I'm not sure whether Vulkan has a similar limitation.
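(For context, a minimal sketch of the standard workaround for this limit, assuming a simple row-wise kernel; this is illustrative, not the PR's actual code: clamp the launch grid and let each block stride over the remaining rows.)

```cuda
#include <cstdint>

// Each block strides over rows, so gridDim.y never needs to exceed
// CUDA's 65,535 limit no matter how many rows the tensor has.
__global__ void scale_rows(float * dst, const float * src, const float scale,
                           const int64_t ncols, const int64_t nrows) {
    const int64_t col = blockIdx.x * (int64_t) blockDim.x + threadIdx.x;
    for (int64_t row = blockIdx.y; row < nrows; row += gridDim.y) {
        if (col < ncols) {
            dst[row*ncols + col] = src[row*ncols + col] * scale;
        }
    }
}

// launch sketch: grid.y is clamped, the loop above covers the rest
//   dim3 grid((ncols + 255)/256, (unsigned) std::min<int64_t>(nrows, 65535));
//   scale_rows<<<grid, 256>>>(dst, src, s, ncols, nrows);
```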

Review thread on ggml.h:

```c
        int                   d1);  // dilation dimension 1

    GGML_API struct ggml_tensor * ggml_conv_3d(
    GGML_API struct ggml_tensor * ggml_conv_3d_direct(
```
Collaborator:

I think ideally there should be only one conv_3d operation, and the backend can decide on an implementation (direct, im2col, tiled im2col, or choose one depending on inputs & hardware).

The reason there are two versions of conv_2d is that, historically, the im2col version was there first, and it coded im2col into the graph. This means the backend cannot optimize it or choose a better implementation, and the huge memory requirements of full-tensor im2col are baked into the graph. This was difficult to change without breaking backends, hence the "direct" workaround. But IMO we should avoid this situation for newly introduced conv ops.

Contributor (author):

Yes, this is the ultimate goal. However, support across the backends is not yet complete, and conv_2d_direct cannot fully replace im2col + gemm at present. In some backends, such as CUDA, conv_2d_direct is much slower than im2col + gemm.

Collaborator:

I agree there's no solution for conv_2d right away. My point is that this PR now creates the same problem for 3D. I don't think new code should be introduced that repeats the issue and bakes im2col in as the default way to do conv_3d.

My suggestion would be to either implement im2col + mul_mat behind OP_CONV_3D in the CUDA backend, or to only expose im2col_3d in the API and move the im2col + mul_mat calls into the application (simpler, but not as nice).
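(For reference, ggml's existing ggml_conv_2d is itself built this way; below is a simplified sketch of that im2col + mul_mat composition, with the final reshape/permute elided, under the assumption that a conv_3d op could construct the same subgraph inside the backend.)

```c
#include "ggml.h"

// Sketch: convolution as one im2col followed by one big matmul.
// kernel: [KW, KH, IC, OC], input: [W, H, IC, N]
static struct ggml_tensor * conv_2d_via_im2col(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,
        struct ggml_tensor  * input,
        int s0, int s1, int p0, int p1, int d0, int d1) {
    // unfold input patches into matrix rows: ne[0] == IC*KH*KW
    struct ggml_tensor * cols = ggml_im2col(ctx, kernel, input,
            s0, s1, p0, p1, d0, d1, /*is_2D=*/true, GGML_TYPE_F32);

    // flatten the kernel to [IC*KH*KW, OC] and multiply
    struct ggml_tensor * k2d = ggml_reshape_2d(ctx, kernel,
            kernel->ne[0]*kernel->ne[1]*kernel->ne[2], kernel->ne[3]);
    struct ggml_tensor * c2d = ggml_reshape_2d(ctx, cols,
            cols->ne[0], cols->ne[1]*cols->ne[2]*cols->ne[3]);

    // result is [OC, OW*OH*N]; a reshape/permute back to
    // [OW, OH, OC, N] is still needed (omitted here)
    return ggml_mul_mat(ctx, k2d, c2d);
}
```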

Collaborator:

For CUDA one of my long-term goals is to write convolution kernels, particularly ones that can make use of quantized data. But I should stress that I have so many other things I want to do that realistically I would start working on it in a year at the absolute earliest.

@JohannesGaessler (Collaborator) left a comment:

Please extend test-backend-ops with test cases for all of the newly added code.

Collaborator:

I forgot: why is the GGUF header being changed?

Contributor (author):

Because the WAN GGUF files contain 5D tensors, for example: https://huggingface.co/QuantStack/Wan2.2-T2V-A14B-GGUF/tree/main/LowNoise?show_file_info=LowNoise%2FWan2.2-T2V-A14B-LowNoise-Q2_K.gguf. Therefore, I added the tensor_shape_read_cb_t callback, allowing users to control how a multi-dimensional (more than 4D) tensor is mapped to a 4D format.
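(A hypothetical sketch of such a callback; the PR's actual tensor_shape_read_cb_t signature was not kept after the later revert, so the names and arguments here are made up for illustration: fold every dimension beyond the fourth into the last axis, since GGML_MAX_DIMS is 4.)

```c
#include <stdint.h>

// Hypothetical callback: map an n_dims > 4 file shape onto 4 dims by
// collapsing the trailing dimensions into ne[3]. A 5D tensor
// [a, b, c, d, e] would be loaded as [a, b, c, d*e].
static void shape_read_cb(const int64_t * ne_file, uint32_t n_dims_file,
                          int64_t ne[4]) {
    for (int i = 0; i < 4; i++) {
        ne[i] = i < (int) n_dims_file ? ne_file[i] : 1;
    }
    for (uint32_t i = 4; i < n_dims_file; i++) {
        ne[3] *= ne_file[i];  // fold the extra dims into the last axis
    }
}
```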

@leejet (Contributor, author) commented Aug 30, 2025

I have added test_im2col_3d to the test-backend-ops.cpp file.
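(For readers unfamiliar with the harness: test cases in test-backend-ops.cpp derive from test_case, build a small graph containing the op, and the harness compares every backend's output against the CPU reference. Below is a minimal sketch of that pattern using ggml_scale as a stand-in op; the actual test_im2col_3d additionally parameterizes strides, padding, dilation, and types.)

```cpp
// Minimal sketch of the test-backend-ops test_case pattern.
struct test_example : public test_case {
    const ggml_type type;
    const std::array<int64_t, 4> ne;

    test_example(ggml_type type = GGML_TYPE_F32,
                 std::array<int64_t, 4> ne = {10, 5, 4, 3})
        : type(type), ne(ne) {}

    std::string vars() override {
        return VARS_TO_STR2(type, ne);  // printed when the test fails
    }

    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * a   = ggml_new_tensor(ctx, type, 4, ne.data());
        ggml_tensor * out = ggml_scale(ctx, a, 2.0f);  // op under test
        return out;
    }
};

// registered alongside the other cases in make_test_cases_eval():
//   test_cases.emplace_back(new test_example());
```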

@leejet (Contributor, author) commented Sep 2, 2025

The changes related to gguf have been reverted.

@CISC (Collaborator) commented Sep 2, 2025

And so it begins... :)
#15742

@JohannesGaessler (Collaborator) commented:

As far as I'm concerned, that's the fault of whoever wrote a 5D tensor to a GGUF file. If user code wants to encode some 5D structure, it should do so by writing a 1D tensor to the file together with KV data for the 5D shape.
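(A minimal sketch of that suggested encoding on the writer side; the tensor and KV key names are made up for illustration, and the include paths assume the current ggml layout where the GGUF API lives in gguf.h.)

```c
#include <string.h>
#include "ggml.h"
#include "gguf.h"

// Sketch: store the 5D data as a flat 1D tensor and record the logical
// shape as KV metadata next to it. Names here are hypothetical.
static void add_5d_as_1d(struct gguf_context * gctx, struct ggml_context * ctx,
                         const float * data, const int64_t shape[5]) {
    int64_t n = 1;
    for (int i = 0; i < 5; i++) { n *= shape[i]; }

    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(t->data, data, n * sizeof(float));
    ggml_set_name(t, "blk.0.conv3d.weight");

    gguf_add_tensor(gctx, t);
    gguf_set_arr_data(gctx, "blk.0.conv3d.weight.shape",
                      GGUF_TYPE_INT64, shape, 5);
}
```

A reader would then restore the 5D view at load time from the same KV key.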

@pwilkin (Collaborator) commented Sep 2, 2025

Waiting for the inevitable "why can't I run Qwen Image on llama.cpp?!"

@leejet (Contributor, author) commented Sep 2, 2025

@JohannesGaessler Is there anything else I need to do to get this PR merged?

@JohannesGaessler (Collaborator) commented:

Please adapt the supports_op function for Vulkan to return false for the new padding mode.

leejet requested a review from 0cc4m as a code owner (Sep 2, 2025, 16:12)
github-actions bot added the Vulkan label (Sep 2, 2025)
@0cc4m (Collaborator) commented Sep 2, 2025

The Vulkan change looks fine.

github-actions bot added the SYCL, Apple Metal, and OpenCL labels (Sep 2, 2025)
@leejet (Contributor, author) commented Sep 2, 2025

I also modified the supports_op function for Metal/SYCL/OpenCL to return false for the new padding mode.

@leejet (Contributor, author) commented Sep 2, 2025

I made some adjustments. ggml_pad now calls ggml_pad_ext internally, so supports_op only needs to check op_params 0, 2, 4, and 6 (the left-padding values).
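(A sketch of the resulting supports_op pattern, paraphrased rather than copied from any one backend: ggml_pad_ext stores padding as left/right pairs per dimension, so a backend that only implements the old right-padding behavior rejects any non-zero left pad.)

```c
case GGML_OP_PAD:
    // op_params are (lp0, rp0, lp1, rp1, lp2, rp2, lp3, rp3);
    // indices 0/2/4/6 are the new left pads, unsupported here
    return ggml_get_op_params_i32(op, 0) == 0 &&
           ggml_get_op_params_i32(op, 2) == 0 &&
           ggml_get_op_params_i32(op, 4) == 0 &&
           ggml_get_op_params_i32(op, 6) == 0;
```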

@JohannesGaessler JohannesGaessler merged commit 0a1b398 into ggml-org:master Sep 4, 2025
49 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 4, 2025
…upport

* origin/master: (72 commits)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
vulkan : update ggml_vk_instance_validation_ext_available (ggml-org#15666)
ggml vulkan: add hardsigmoid and hardswish operations (ggml-org#15762)
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (ggml-org#15715)
model-conversion : fix pyright errors (ggml-org#15770)
sampling : optimize dist sampler (ggml-org#15704)
llama : fix incorrect model type for Gemma 270M (ggml-org#15764)
model-conversion : remove hardcoded /bin/bash shebangs [no ci] (ggml-org#15765)
CANN: Add RoPE contiguous check for 310I DUP device (ggml-org#15735)
ggml-cpu : optimize RVV kernels (ggml-org#15720)
...
@bssrdf bssrdf mentioned this pull request Sep 4, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 5, 2025
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
...
@leejet (Contributor, author) commented Sep 6, 2025

Thank you!

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on MacOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d support non continuous src

Co-authored-by: Jeff Bolz <[email protected]>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025