# ggml: add ops for WAN video model (cuda && cpu) #15669
## Conversation
This looks really cool. I can add the new op support to Vulkan. Do you happen to have any backend test cases? If not, I can probably add some.
I developed it based on the ggml repository and implemented https://github.com/leejet/ggml/blob/wan/tests/test-conv3d.cpp. However, I'm not sure whether it should be synchronized.
It would be helpful to have test cases in test-backend-ops, particularly for whatever was broken with get_rows and scale.
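For context, cases in test-backend-ops are small subclasses of test_case whose build_graph() constructs the op once; the harness then runs the graph on every backend and compares the results against the CPU reference. The sketch below follows that pattern, but the ggml_im2col_3d signature (per-axis stride/padding/dilation plus an input-channel count and a destination type) and all parameter values are assumptions for illustration, not the exact ones from this PR:

```cpp
// Sketch in the style of test-backend-ops.cpp (relies on that file's
// test_case scaffolding). Shapes are packed into 4D per ggml convention:
// input [W, H, D*IC, N], kernel [KW, KH, KD*IC, OC]. The ggml_im2col_3d
// signature used here is an assumption.
struct test_im2col_3d : public test_case {
    const std::array<int64_t, 4> ne_input;
    const std::array<int64_t, 4> ne_kernel;
    const int64_t IC;

    test_im2col_3d(std::array<int64_t, 4> ne_input  = {10, 10, 12, 2},
                   std::array<int64_t, 4> ne_kernel = { 3,  3, 12, 4},
                   int64_t IC = 4)
        : ne_input(ne_input), ne_kernel(ne_kernel), IC(IC) {}

    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * input  = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne_input.data());
        ggml_tensor * kernel = ggml_new_tensor(ctx, GGML_TYPE_F16, 4, ne_kernel.data());
        // stride 1, padding 0, dilation 1 on all three spatial axes
        return ggml_im2col_3d(ctx, kernel, input, IC,
                              1, 1, 1, 0, 0, 0, 1, 1, 1, GGML_TYPE_F32);
    }
};
```

Registering the case is then a one-liner (e.g. test_cases.emplace_back(new test_im2col_3d());) in the function that assembles the evaluation test list.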
The issue with get_rows/scale is caused by the fact that CUDA's …
```c
int d1); // dilation dimension 1

GGML_API struct ggml_tensor * ggml_conv_3d(
GGML_API struct ggml_tensor * ggml_conv_3d_direct(
```
I think ideally there should be only one conv_3d operation, and the backend can decide on an implementation (direct, im2col, tiled im2col, or choose one depending on inputs & hardware).

The reason there are two versions for conv_2d is historical: the im2col version was there first, and it coded im2col into the graph. This means the backend cannot optimize it or choose a better implementation, and the huge memory requirements of full-tensor im2col are baked into the graph. This was difficult to change without breaking backends, hence the "direct" workaround. But IMO we should avoid this situation for newly introduced conv ops.
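As a sketch of that layering (every helper below is hypothetical, not an existing ggml function), a backend could expose a single conv_3d op and dispatch internally:

```cpp
#include "ggml.h"

// Hypothetical backend-internal helpers; none of these exist in ggml today.
struct backend_ctx;
bool prefer_direct(const ggml_tensor * kernel, const ggml_tensor * input);
void conv_3d_direct(backend_ctx * ctx, ggml_tensor * dst);
void conv_3d_im2col_gemm(backend_ctx * ctx, ggml_tensor * dst);

// One public conv_3d node in the graph; the implementation choice stays
// inside the backend, so it can improve without breaking graphs or formats.
void backend_compute_conv_3d(backend_ctx * ctx, ggml_tensor * dst) {
    const ggml_tensor * kernel = dst->src[0];
    const ggml_tensor * input  = dst->src[1];

    if (prefer_direct(kernel, input)) {
        conv_3d_direct(ctx, dst);       // low memory traffic; wins on small problems
    } else {
        conv_3d_im2col_gemm(ctx, dst);  // tiled im2col + GEMM; wins on large ones
    }
}
```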
Yes, this is the ultimate goal. However, support in each backend is not yet complete, and conv_2d_direct cannot fully replace im2col + gemm at present. In some backends, such as CUDA, conv_2d_direct is much slower than im2col + gemm.
I agree there's no solution for conv_2d right away. My point is that this PR now creates the same problem for 3D. I don't think new code should be introduced that repeats the issue and bakes im2col in as the default way to do conv_3d.

My suggestion would be to either implement im2col + mul_mat behind OP_CONV_3D in the CUDA backend, or to only expose im2col_3d in the API and move the im2col + mul_mat calls into the application (simpler, but not as nice); a sketch of the second option follows below.
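A minimal sketch of that second option, assuming ggml_im2col_3d mirrors the 2D im2col signature (per-axis stride/padding/dilation plus an input-channel count) and that the kernel is packed as [KW, KH, KD*IC, OC]; check ggml.h for the actual parameters:

```cpp
#include "ggml.h"

// conv_3d expressed in application code as im2col_3d + mul_mat.
// The ggml_im2col_3d signature and the tensor packing are assumptions.
static ggml_tensor * conv_3d_via_im2col(
        ggml_context * ctx,
        ggml_tensor  * w,   // kernel, assumed [KW, KH, KD*IC, OC]
        ggml_tensor  * x,   // input,  assumed [W, H, D*IC, N]
        int64_t IC, int s, int p, int d) {
    const int64_t OC = w->ne[3];

    // one column per output voxel: roughly [KW*KH*KD*IC, OW*OH*OD, N]
    ggml_tensor * cols = ggml_im2col_3d(ctx, w, x, IC,
                                        s, s, s, p, p, p, d, d, d, GGML_TYPE_F32);

    // flatten the kernel to [KW*KH*KD*IC, OC] and let the backend pick its GEMM
    ggml_tensor * w2 = ggml_reshape_2d(ctx, w, ggml_nelements(w)/OC, OC);

    // result [OC, OW*OH*OD, N]; reshape back to spatial dims as needed
    return ggml_mul_mat(ctx, w2, cols);
}
```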
For CUDA one of my long-term goals is to write convolution kernels, particularly ones that can make use of quantized data. But I should stress that I have so many other things I want to do that realistically I would start working on it in a year at the absolute earliest.
Please extend test-backend-ops with test cases for all of the newly added code.
I forgot: why is the GGUF header being changed?
Because the GGUF file of WAN contains 5D tensors, for example: https://huggingface.co/QuantStack/Wan2.2-T2V-A14B-GGUF/tree/main/LowNoise?show_file_info=LowNoise%2FWan2.2-T2V-A14B-LowNoise-Q2_K.gguf. Therefore, I added the tensor_shape_read_cb_t callback, allowing users to control how a multi-dimensional (>4D) tensor is mapped to a 4D format.
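Illustratively, the shape mapping such a callback performs could look like the sketch below (the actual callback type, since reverted, differed; this only shows the idea of collapsing a 5D conv kernel [KW, KH, KD, IC, OC] into GGML_MAX_DIMS):

```cpp
#include <cstdint>

// Merge KD and IC when loading a 5D kernel shape into a 4D ggml tensor.
// The element count and memory layout are unchanged; only the logical
// dimension split is lost, which im2col_3d recovers via its IC parameter.
static bool map_5d_to_4d(const int64_t * ne, uint32_t n_dims, int64_t * ne_out) {
    if (n_dims != 5) {
        return false; // let the normal <=4D path handle everything else
    }
    ne_out[0] = ne[0];         // KW
    ne_out[1] = ne[1];         // KH
    ne_out[2] = ne[2] * ne[3]; // KD*IC merged
    ne_out[3] = ne[4];         // OC
    return true;
}
```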
I have added …

The changes related to gguf have been reverted.
And so it begins... :)
As far as I'm concerned, that's the fault of whoever wrote a 5D tensor to a GGUF file. If user code wants to encode some 5D structure, that should have been done by writing a 1D tensor to the file with KV data for the 5D shape.
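A sketch of that convention using the gguf writer API; the "<name>.shape" KV key is a made-up convention for illustration, not an existing standard:

```cpp
#include <cstring>
#include <string>
#include "ggml.h"
#include "gguf.h"

// Store the data as a flat 1D tensor and record the true 5D shape in KV
// metadata; the reader looks up "<name>.shape" and rebuilds its own view.
static void gguf_write_5d(gguf_context * gctx, ggml_context * ctx,
                          const char * name, const float * data,
                          const int64_t ne[5]) {
    int64_t n = 1;
    for (int i = 0; i < 5; i++) {
        n *= ne[i];
    }

    ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(t->data, data, n*sizeof(float));
    ggml_set_name(t, name);
    gguf_add_tensor(gctx, t);

    const std::string key = std::string(name) + ".shape"; // hypothetical key
    gguf_set_arr_data(gctx, key.c_str(), GGUF_TYPE_INT64, ne, 5);
}
```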
Waiting for the inevitable "why can't I run Qwen Image on llama.cpp?!"

@JohannesGaessler Do I still have anything else to do to get this PR merged?
Please adapt the …

The Vulkan change looks fine.
I also modified the supports_op function for Metal/SYCL/OpenCL to return false for the new padding mode.
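The shape of such a guard might be as below; the op_params layout for the extended pad (interleaved left/right pads per dimension) is an assumption here, so treat the indices as illustrative:

```cpp
#include "ggml.h"
#include "ggml-impl.h" // for the internal ggml_get_op_params_i32 helper

// A backend that only implements the original ggml_pad (trailing padding
// only) rejects any nonzero leading pad. Assumed op_params layout:
// {lp0, rp0, lp1, rp1, lp2, rp2, lp3, rp3}.
static bool device_supports_pad(const ggml_tensor * op) {
    for (int i = 0; i < 8; i += 2) {
        if (ggml_get_op_params_i32(op, i) != 0) { // lp0..lp3
            return false;
        }
    }
    return true;
}
```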
I made some adjustments.
Thank you!
* add conv3d support
* add ggml_pad_ext for cpu & cuda backend
* cuda/cpu: add im2col_3d support
* cuda: make im2col a little faster
* fix cuda pad/scale/im2col3d
* make im2col_3d faster
* gguf: support loading tensors which n_dims > GGML_MAX_DIMS
* fix cuda get_rows
* avoid ggml_conv_3d conflict
* correct GGML_OP_COUNT assertion
* avoid build failure
* avoid build failure on MacOS
* cuda: remove unnecessary MIN define
* fix cpu im2col_3d
* adjust the code style
* cuda: use simpler loop in get_rows
* add test_im2col_3d to test-backend-ops
* test-backend-ops.cpp: remove trailing whitespace
* cpu: im2col_3d support non continuous src

  Co-authored-by: Jeff Bolz <[email protected]>
* fix test_im2col_3d
* remove unused variables
* cuda: get_rows: dfloat2 -> float2
* add test_pad_ext to test-backend-ops.cpp
* add gguf_init_from_file_ext impl
* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

  This reverts commit d8377a0.
* Revert "add gguf_init_from_file_ext impl"

  This reverts commit d9f1d13.
* update ggml_backend_vk_device_supports_op
* fix ggml_backend_vk_device_supports_op
* update other backend supports op for ggml_pad_ext
* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <[email protected]>
This PR adds some necessary ops for the WAN video model and fixes some issues. See: leejet/stable-diffusion.cpp#778