vulkan : support ggml_mean #15393
Conversation
    case GGML_OP_SOFT_MAX_BACK:
    case GGML_OP_SUM:
    case GGML_OP_SUM_ROWS:
    case GGML_OP_MEAN:
It's a pre-existing bug, but it looks like the sum/sum_rows shader assumes the source is contiguous. Would be nice to update the check here, or update the shader to handle it (which would be more involved).
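To make the suggestion concrete, here is a minimal sketch of what gating these ops on source contiguity in the backend's supports_op callback could look like. The helper name is hypothetical and the shape of the check is an assumption, not the actual patch; in practice it would be a few lines inside the existing switch statement.

```cpp
#include "ggml.h"

// Hypothetical helper mirroring the suggested supports_op gating.
static bool reduce_op_supported(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_SUM:
        case GGML_OP_SUM_ROWS:
        case GGML_OP_MEAN:
            // skip non-contiguous sources until the shader reads the strides
            return ggml_is_contiguous(op->src[0]);
        default:
            return false;
    }
}
```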
I added support for views and non-contiguous sources. It does affect performance slightly for the tests with small workloads. While testing this I also stumbled upon a bug (I think) where the sub-buffer size doesn't account for misalign offsets. The buffer range passed to the shader ends up being too small and a few elements at the end are cut off. See the last commit for the fix. I'd also like to push a backend test that uses slice/permute, but at least the cuda and sycl backends (and maybe others) would fail it, since they have asserts for a contiguous source. Updated numbers:
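A sketch of the misalignment issue being described, with hypothetical names and under the assumption that the descriptor offset is rounded down to the storage-buffer alignment. This is not the actual fix, just an illustration of why the bound range comes up short.

```cpp
#include <cstdint>
#include <algorithm>

// A tensor view starting at `offset` inside a device buffer: the descriptor
// offset must be rounded down to `alignment` (a power of two in Vulkan), and
// the rounded-off remainder ("misalign") has to be added back to the bound
// range, otherwise the last elements of the view fall outside the range the
// shader can see.
struct subbuffer { uint64_t offset; uint64_t size; };

static subbuffer make_subbuffer(uint64_t offset, uint64_t tensor_size,
                                uint64_t buffer_size, uint64_t alignment) {
    const uint64_t aligned_offset = offset & ~(alignment - 1);
    const uint64_t misalign       = offset - aligned_offset;
    // buggy variant: std::min(tensor_size, buffer_size - aligned_offset)
    const uint64_t size = std::min(tensor_size + misalign, buffer_size - aligned_offset);
    return { aligned_offset, size };
}
```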
    
Thanks, this is a nice improvement. I think you're right about the misalignment bug. If you update the supports_op callback for other backends to check ggml_is_contiguous(src0), it will make them skip the new tests as unsupported. I think your updated shader still requires ggml_is_contiguous_rows(src0) in supports_op.
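For readers unfamiliar with the distinction between the two predicates, a small self-contained example using the public ggml API (the shapes are made up for illustration): a column slice keeps each row contiguous but is not fully contiguous, while a permuted view breaks row contiguity as well.

```cpp
#include "ggml.h"

int main() {
    struct ggml_init_params params = { /*mem_size=*/ 16*1024*1024, /*mem_buffer=*/ NULL, /*no_alloc=*/ false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);

    // first 32 columns of every row: rows stay contiguous, the tensor does not
    struct ggml_tensor * slice = ggml_view_2d(ctx, a, 32, 32, a->nb[1], 0);

    // swap the first two dims: elements within a row are no longer adjacent
    struct ggml_tensor * perm = ggml_permute(ctx, a, 1, 0, 2, 3);

    GGML_ASSERT(!ggml_is_contiguous(slice));
    GGML_ASSERT( ggml_is_contiguous_rows(slice));
    GGML_ASSERT(!ggml_is_contiguous_rows(perm));

    ggml_free(ctx);
    return 0;
}
```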
    
          
Hm, it does respect …
    
I think you're right and I just misread the code.
    
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
    case GGML_OP_POOL_2D:
    case GGML_OP_SUM:
    case GGML_OP_SUM_ROWS:
    case GGML_OP_ARGSORT:
Was it intentional to include argsort? I haven't looked at the code.
It does GGML_ASSERT(ggml_is_contiguous(dst->src[0])) like the others, so I included it since it was in the same place.
        
          
ggml/src/ggml-vulkan/ggml-vulkan.cpp (outdated)
    static void ggml_vk_sum(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src0, ggml_tensor * dst, bool dryrun = false) {
        ggml_vk_op_f32<vk_op_push_constants>(ctx, subctx, src0, nullptr, nullptr, dst, GGML_OP_SUM, { (uint32_t)ggml_nelements(src0), 0, 0.0f, 0.0f }, dryrun);
        vk_op_sum_rows_push_constants p = vk_op_sum_rows_push_constants_init(src0, dst, ggml_nelements(src0));
        p.nb00 = 1; // treat src0 as flattened 1D tensor
Is this necessary? Wouldn't it already be 1 for contiguous rows?
I wrote it with the expectation of making it work with non-contiguous rows. But since I can't easily test it and don't have a use case for it either, I will just add a contiguous-rows requirement and remove p.nb00. Better than code that pretends to work without having been tested.
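For context, a reference-only sketch (plain CPU code, not the shader) of why the flattening works: GGML_OP_SUM presents the whole tensor as a single row of ggml_nelements(src0) columns, and with contiguous rows the element stride that nb00 describes is already 1, so no override is needed. Rows are assumed packed back-to-back here to keep the sketch short.

```cpp
#include <cstdint>

// Reference reduction over rows; nb00 is the element stride within a row.
static void sum_rows_ref(const float * src, float * dst,
                         int64_t ncols, int64_t nrows, int64_t nb00) {
    for (int64_t r = 0; r < nrows; ++r) {
        float acc = 0.0f;
        for (int64_t c = 0; c < ncols; ++c) {
            acc += src[r * ncols * nb00 + c * nb00];
        }
        dst[r] = acc;
    }
}

// GGML_OP_SUM over a contiguous tensor then reduces to a single call:
//   sum_rows_ref(data, &out, /*ncols=*/nelements, /*nrows=*/1, /*nb00=*/1);
```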
    uint get_doffset() { return p.misalign_offsets & 0xFFFF; }

    // see init_fastdiv_values in ggml-vulkan.cpp
    uint fastdiv(uint n, uint mp, uint L) {
I'd like to unify the multiple copies of these functions, but I can do it in a later change.
Yes, it would be good to share this stuff... I wanted to improve it on the host side too (e.g. to make upscale fit better), but I think a separate PR is better at this point.
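For readers who haven't seen the trick: the fastdiv helpers replace an integer division by a runtime-constant divisor with a multiply, an add and a shift (Granlund/Montgomery style). A sketch of the general technique in host C++ follows; it is not copied from ggml-vulkan.cpp, so names and details may differ from the actual init_fastdiv_values.

```cpp
#include <cstdint>

// Precompute the magic multiplier mp and shift L for divisor d.
static void fastdiv_init(uint32_t d, uint32_t & mp, uint32_t & L) {
    L = 0;                                   // L = ceil(log2(d))
    while (L < 32 && (uint32_t(1) << L) < d) {
        L++;
    }
    // mp chosen so that (mulhi(n, mp) + n) >> L == n / d for all 32-bit n
    mp = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
}

// Shader-side equivalent: one widening multiply, an add and a shift instead of
// an integer division (the sum is widened here to stay exact for any 32-bit n).
static uint32_t fastdiv(uint32_t n, uint32_t mp, uint32_t L) {
    const uint32_t hi = uint32_t((uint64_t(n) * mp) >> 32); // mulhi(n, mp)
    return uint32_t((uint64_t(hi) + n) >> L);
}
```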
LGTM, works on my devices.
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
Adds support for GGML_OP_MEAN in the Vulkan backend. It reuses the sum_rows kernel, which also affects sum. There's an additional multiply with a push constant now after the reduction. From what I can see it doesn't noticeably affect the performance of those operations; let me know if there's something else I should check.
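A hypothetical illustration of that push-constant difference between SUM_ROWS and MEAN; the struct and field names below are made up for the sketch and are not copied from the actual vk_op_sum_rows_push_constants.

```cpp
#include <cstdint>

// SUM_ROWS keeps the row sums as-is; MEAN folds the division by the row length
// into a single weight that the shader multiplies in once per row, after the
// reduction.
struct sum_rows_pc {
    uint32_t ne00;    // row length
    uint32_t nrows;   // number of rows to reduce
    float    weight;  // applied to each row result after the reduction
};

static sum_rows_pc make_sum_rows_pc(uint32_t ne00, uint32_t nrows, bool is_mean) {
    return { ne00, nrows, is_mean ? 1.0f / (float) ne00 : 1.0f };
}
```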