
Conversation

@leejet (Contributor) commented Aug 29, 2025

This PR adds some necessary ops for the WAN video model and fixes some issues. See: leejet/stable-diffusion.cpp#778

github-actions bot added the Nvidia GPU, ggml, and Ascend NPU labels (Aug 29, 2025)
github-actions bot added the testing label (Aug 29, 2025)
@jeffbolznv (Collaborator) commented:

This looks really cool. I can add the new op support to Vulkan. Do you happen to have any backend test cases? If not, I can probably add some.

@leejet (Contributor, author) commented Aug 30, 2025

I developed it based on the ggml repository, and I implemented https://github.com/leejet/ggml/blob/wan/tests/test-conv3d.cpp. However, I'm not sure whether it should be synchronized.

@jeffbolznv (Collaborator) commented:

It would be helpful to have test cases in test-backend-ops, particularly for whatever was broken with getrows and scale.

@leejet (Contributor, author) commented Aug 30, 2025

The issue with get_rows/scale is caused by the fact that CUDA's gridDim.y and gridDim.z must be <= 65,535. I'm not sure whether Vulkan has a similar limitation.
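(For context, a minimal sketch of the standard workaround for this limit, assuming a simple row-wise kernel; this is illustrative, not the PR's actual code: clamp the launch grid and let each block stride over the remaining rows.)

```cuda
#include <cstdint>

// Each block strides over rows, so gridDim.y never needs to exceed
// CUDA's 65,535 limit no matter how many rows the tensor has.
__global__ void scale_rows(float * dst, const float * src, const float scale,
                           const int64_t ncols, const int64_t nrows) {
    const int64_t col = blockIdx.x * (int64_t) blockDim.x + threadIdx.x;
    for (int64_t row = blockIdx.y; row < nrows; row += gridDim.y) {
        if (col < ncols) {
            dst[row*ncols + col] = src[row*ncols + col] * scale;
        }
    }
}

// launch sketch: grid.y is clamped, the loop above covers the rest
//   dim3 grid((ncols + 255)/256, (unsigned) std::min<int64_t>(nrows, 65535));
//   scale_rows<<<grid, 256>>>(dst, src, s, ncols, nrows);
```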

Review thread on ggml.h:

```c
        int                   d1);  // dilation dimension 1

    GGML_API struct ggml_tensor * ggml_conv_3d(
    GGML_API struct ggml_tensor * ggml_conv_3d_direct(
```
Collaborator:

I think ideally there should be only one conv_3d operation, and the backend can decide on an implementation (direct, im2col, tiled im2col, or choose one depending on inputs & hardware).

The reason there are two versions of conv_2d is that, historically, the im2col version was there first, and it coded im2col into the graph. This means the backend cannot optimize it or choose a better implementation, and the huge memory requirements of full-tensor im2col are baked into the graph. This was difficult to change without breaking backends, hence the "direct" workaround. But IMO we should avoid this situation for newly introduced conv ops.

Contributor (author):

Yes, this is the ultimate goal. However, support across the backends is not yet complete, and conv_2d_direct cannot fully replace im2col + gemm at present. In some backends, such as CUDA, conv_2d_direct is much slower than im2col + gemm.

Collaborator:

I agree there's no solution for conv_2d right away. My point is that this PR now creates the same problem for 3D. I don't think new code should be introduced that repeats the issue and bakes im2col in as the default way to do conv_3d.

My suggestion would be to either implement im2col + mul_mat behind OP_CONV_3D in the CUDA backend, or to only expose im2col_3d in the API and move the im2col + mul_mat calls into the application (simpler, but not as nice).
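(For reference, ggml's existing ggml_conv_2d is itself built this way; below is a simplified sketch of that im2col + mul_mat composition, with the final reshape/permute elided, under the assumption that a conv_3d op could construct the same subgraph inside the backend.)

```c
#include "ggml.h"

// Sketch: convolution as one im2col followed by one big matmul.
// kernel: [KW, KH, IC, OC], input: [W, H, IC, N]
static struct ggml_tensor * conv_2d_via_im2col(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,
        struct ggml_tensor  * input,
        int s0, int s1, int p0, int p1, int d0, int d1) {
    // unfold input patches into matrix rows: ne[0] == IC*KH*KW
    struct ggml_tensor * cols = ggml_im2col(ctx, kernel, input,
            s0, s1, p0, p1, d0, d1, /*is_2D=*/true, GGML_TYPE_F32);

    // flatten the kernel to [IC*KH*KW, OC] and multiply
    struct ggml_tensor * k2d = ggml_reshape_2d(ctx, kernel,
            kernel->ne[0]*kernel->ne[1]*kernel->ne[2], kernel->ne[3]);
    struct ggml_tensor * c2d = ggml_reshape_2d(ctx, cols,
            cols->ne[0], cols->ne[1]*cols->ne[2]*cols->ne[3]);

    // result is [OC, OW*OH*N]; a reshape/permute back to
    // [OW, OH, OC, N] is still needed (omitted here)
    return ggml_mul_mat(ctx, k2d, c2d);
}
```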

Collaborator:

For CUDA one of my long-term goals is to write convolution kernels, particularly ones that can make use of quantized data. But I should stress that I have so many other things I want to do that realistically I would start working on it in a year at the absolute earliest.

@JohannesGaessler (Collaborator) left a comment:

Please extend test-backend-ops with test cases for all of the newly added code.

Collaborator:

I forgot: why is the GGUF header being changed?

Contributor (author):

Because the WAN GGUF files contain 5D tensors, for example: https://huggingface.co/QuantStack/Wan2.2-T2V-A14B-GGUF/tree/main/LowNoise?show_file_info=LowNoise%2FWan2.2-T2V-A14B-LowNoise-Q2_K.gguf. Therefore, I added the tensor_shape_read_cb_t callback, allowing users to control how a multi-dimensional (more than 4D) tensor is mapped to a 4D format.
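(A hypothetical sketch of such a callback; the PR's actual tensor_shape_read_cb_t signature was not kept after the later revert, so the names and arguments here are made up for illustration: fold every dimension beyond the fourth into the last axis, since GGML_MAX_DIMS is 4.)

```c
#include <stdint.h>

// Hypothetical callback: map an n_dims > 4 file shape onto 4 dims by
// collapsing the trailing dimensions into ne[3]. A 5D tensor
// [a, b, c, d, e] would be loaded as [a, b, c, d*e].
static void shape_read_cb(const int64_t * ne_file, uint32_t n_dims_file,
                          int64_t ne[4]) {
    for (int i = 0; i < 4; i++) {
        ne[i] = i < (int) n_dims_file ? ne_file[i] : 1;
    }
    for (uint32_t i = 4; i < n_dims_file; i++) {
        ne[3] *= ne_file[i];  // fold the extra dims into the last axis
    }
}
```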

@leejet (Contributor, author) commented Aug 30, 2025

I have added test_im2col_3d to the test-backend-ops.cpp file.
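(For readers unfamiliar with the harness: test cases in test-backend-ops.cpp derive from test_case, build a small graph containing the op, and the harness compares every backend's output against the CPU reference. Below is a minimal sketch of that pattern using ggml_scale as a stand-in op; the actual test_im2col_3d additionally parameterizes strides, padding, dilation, and types.)

```cpp
// Minimal sketch of the test-backend-ops test_case pattern.
struct test_example : public test_case {
    const ggml_type type;
    const std::array<int64_t, 4> ne;

    test_example(ggml_type type = GGML_TYPE_F32,
                 std::array<int64_t, 4> ne = {10, 5, 4, 3})
        : type(type), ne(ne) {}

    std::string vars() override {
        return VARS_TO_STR2(type, ne);  // printed when the test fails
    }

    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * a   = ggml_new_tensor(ctx, type, 4, ne.data());
        ggml_tensor * out = ggml_scale(ctx, a, 2.0f);  // op under test
        return out;
    }
};

// registered alongside the other cases in make_test_cases_eval():
//   test_cases.emplace_back(new test_example());
```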

@leejet (Contributor, author) commented Sep 2, 2025

The changes related to gguf have been reverted.

@CISC (Collaborator) commented Sep 2, 2025

And so it begins... :)
#15742

@JohannesGaessler (Collaborator) commented:

As far as I'm concerned, that's the fault of whoever wrote a 5D tensor to a GGUF file. If user code wants to encode some 5D structure, it should do so by writing a 1D tensor to the file together with KV data for the 5D shape.
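(A minimal sketch of that suggested encoding on the writer side; the tensor and KV key names are made up for illustration, and the include paths assume the current ggml layout where the GGUF API lives in gguf.h.)

```c
#include <string.h>
#include "ggml.h"
#include "gguf.h"

// Sketch: store the 5D data as a flat 1D tensor and record the logical
// shape as KV metadata next to it. Names here are hypothetical.
static void add_5d_as_1d(struct gguf_context * gctx, struct ggml_context * ctx,
                         const float * data, const int64_t shape[5]) {
    int64_t n = 1;
    for (int i = 0; i < 5; i++) { n *= shape[i]; }

    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(t->data, data, n * sizeof(float));
    ggml_set_name(t, "blk.0.conv3d.weight");

    gguf_add_tensor(gctx, t);
    gguf_set_arr_data(gctx, "blk.0.conv3d.weight.shape",
                      GGUF_TYPE_INT64, shape, 5);
}
```

A reader would then restore the 5D view at load time from the same KV key.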

@pwilkin (Collaborator) commented Sep 2, 2025

Waiting for the inevitable "why can't I run Qwen Image on llama.cpp?!"

@leejet (Contributor, author) commented Sep 2, 2025

@JohannesGaessler Is there anything else I need to do to get this PR merged?

@JohannesGaessler (Collaborator) commented:

Please adapt the supports_op function for Vulkan to return false for the new padding mode.

leejet requested a review from 0cc4m as a code owner (Sep 2, 2025, 16:12)
github-actions bot added the Vulkan label (Sep 2, 2025)
@0cc4m (Collaborator) commented Sep 2, 2025

The Vulkan change looks fine.

github-actions bot added the SYCL, Apple Metal, and OpenCL labels (Sep 2, 2025)
@leejet (Contributor, author) commented Sep 2, 2025

I also modified the supports_op function for Metal/SYCL/OpenCL to return false for the new padding mode.

@leejet (Contributor, author) commented Sep 2, 2025

I made some adjustments. ggml_pad now calls ggml_pad_ext internally, so supports_op only needs to check op_params 0, 2, 4, and 6 (the left-padding values).
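(A sketch of the resulting supports_op pattern, paraphrased rather than copied from any one backend: ggml_pad_ext stores padding as left/right pairs per dimension, so a backend that only implements the old right-padding behavior rejects any non-zero left pad.)

```c
case GGML_OP_PAD:
    // op_params are (lp0, rp0, lp1, rp1, lp2, rp2, lp3, rp3);
    // indices 0/2/4/6 are the new left pads, unsupported here
    return ggml_get_op_params_i32(op, 0) == 0 &&
           ggml_get_op_params_i32(op, 2) == 0 &&
           ggml_get_op_params_i32(op, 4) == 0 &&
           ggml_get_op_params_i32(op, 6) == 0;
```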

@JohannesGaessler JohannesGaessler merged commit 0a1b398 into ggml-org:master Sep 4, 2025
49 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 4, 2025
…upport

* origin/master: (72 commits)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
vulkan : update ggml_vk_instance_validation_ext_available (ggml-org#15666)
ggml vulkan: add hardsigmoid and hardswish operations (ggml-org#15762)
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (ggml-org#15715)
model-conversion : fix pyright errors (ggml-org#15770)
sampling : optimize dist sampler (ggml-org#15704)
llama : fix incorrect model type for Gemma 270M (ggml-org#15764)
model-conversion : remove hardcoded /bin/bash shebangs [no ci] (ggml-org#15765)
CANN: Add RoPE contiguous check for 310I DUP device (ggml-org#15735)
ggml-cpu : optimize RVV kernels (ggml-org#15720)
...
@bssrdf bssrdf mentioned this pull request Sep 4, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 5, 2025
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
...
@leejet (Contributor, author) commented Sep 6, 2025

Thank you!

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on MacOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d support non continuous src

Co-authored-by: Jeff Bolz <[email protected]>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025