
Conversation

jeffbolznv
Collaborator

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations.
(2) A fusion optimization where we detect an add followed by rms_norm, and make the add shader atomically accumulate the squared values into memory. The rms_norm shader can then just load that sum. This allows the rms_norm to be parallelized across multiple workgroups: it becomes a simple per-element multiply.

The fusion optimization is currently only applied when the rms_norm operates on a single vector, which previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work already spreads across SMs, and tracking multiple atomic sums would add some complexity.

Perf results below. As expected, bigger gains on a bigger GPU, because the serial cost of rms_norm is more pronounced.

5090 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        194.72 ± 0.92 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        168.21 ± 6.30 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        110.16 ± 1.59 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        555.06 ± 4.54 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        504.33 ± 4.61 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        293.09 ± 2.12 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        178.77 ± 1.90 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        200.01 ± 3.37 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        231.56 ± 5.79 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        211.15 ± 5.57 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        290.32 ± 2.51 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        223.46 ± 1.36 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        268.62 ± 1.38 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         78.73 ± 1.33 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         41.45 ± 0.09 |

5090 after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        206.97 ± 0.41 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        177.54 ± 1.72 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        116.34 ± 1.40 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        576.28 ± 5.21 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        521.25 ± 3.71 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        309.44 ± 2.29 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        184.65 ± 1.86 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        209.20 ± 2.47 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        236.89 ± 3.15 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        217.14 ± 6.34 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        308.29 ± 2.10 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        236.79 ± 1.78 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        280.35 ± 1.97 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         83.27 ± 0.36 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         43.49 ± 0.18 |

4070 before:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         87.14 ± 0.14 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         68.34 ± 0.98 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         47.22 ± 0.08 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        402.17 ± 1.33 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        371.73 ± 1.93 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        155.94 ± 0.30 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        132.85 ± 3.87 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        102.39 ± 0.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        158.93 ± 7.72 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        118.34 ± 5.01 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        149.11 ± 0.94 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        101.14 ± 0.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        117.59 ± 0.10 |

4070 after:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         88.72 ± 0.13 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         69.35 ± 1.15 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         47.59 ± 0.48 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        403.74 ± 3.95 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        376.70 ± 2.96 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        158.07 ± 0.29 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        135.86 ± 1.19 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        103.56 ± 1.38 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        163.34 ± 0.43 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        120.45 ± 3.62 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        152.27 ± 0.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        102.24 ± 0.30 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        118.71 ± 0.26 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner August 13, 2025 04:23
@github-actions github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Aug 13, 2025
@jeffbolznv jeffbolznv marked this pull request as draft August 13, 2025 04:23
@jeffbolznv
Collaborator Author

Set to draft because there will be an interaction with #15252 when it's merged.

if (p.param3 != 0) {
    sum_sq = subgroupAdd(sum_sq);
    if (sum_sq != 0 && gl_SubgroupInvocationID == 0) {
        atomicAdd(data_atom, sum_sq);
Collaborator Author

Just want to point out that this potentially introduces a bit of nondeterminism due to floating point addition not being associative. I don't expect it to be a problem, just want to mention in case anybody is concerned.

Member

Hm, it's not a good idea to introduce nondeterminism in the computations. Are there alternatives?

Collaborator Author

Second commit changes this to write out a partial sum for each workgroup, and the rms_norm shader adds them up, so it's a deterministic order now.

@jeffbolznv jeffbolznv force-pushed the rms_norm_atomic_add branch 2 times, most recently from 075dac2 to c523636 Compare August 17, 2025 04:56
@jeffbolznv jeffbolznv marked this pull request as ready for review August 17, 2025 04:56
@jeffbolznv
Collaborator Author

I've rebased this on top of the multi_add change that has been merged, and now the multi_add can also accumulate the partial sums for the rms_norm. I increased the max number of descriptors (from 8 to 12) to handle the full sequence of adds I see in the models.

@0cc4m
Collaborator

0cc4m commented Aug 17, 2025

This does not pass validation for me. On Nvidia it runs through and gets correct results anyway, but on AMD and Intel it crashes.

AMD:

test-backend-ops: ../src/amd/vulkan/radv_descriptors.h:79: radv_write_buffer_descriptor_impl: Assertion `buffer->vk.size > 0 && range > 0' failed.

Intel:

test-backend-ops: ../src/intel/isl/isl_surface_state.c:986: isl_gfx125_buffer_fill_state_s: Assertion `num_elements > 0' failed.
Validation issues
pci id for fd 8: 10de:2204, driver (null)
kmsro: driver missing
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Testing 2 devices

VUID-VkDeviceCreateInfo-pNext-pNext(ERROR / SPEC): msgNum: -1876993556 - Validation Error: [ VUID-VkDeviceCreateInfo-pNext-pNext ] | MessageID = 0x901f59ec | vkCreateDevice(): pCreateInfo->pNext chain includes a structure with unknown VkStructureType (1000141000). This error is based on the Valid Usage documentation for version 304 of the Vulkan header.  It is possible that you are using a struct from a private extension or an extension that was added to a later version of the Vulkan header, in which case the use of pCreateInfo->pNext is undefined and may not work correctly with validation enabled.
The Vulkan spec states: Each pNext member of any structure (including this one) in the pNext chain must be either NULL or a pointer to a valid struct for extending VkDeviceCreateInfo (https://docs.vulkan.org/spec/latest/chapters/devsandqueues.html#VUID-VkDeviceCreateInfo-pNext-pNext)
Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24576 MB (24576 MB free)

VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x9fde6b0000000014, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x9fde6b0000000014, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xdd3a8a0000000015, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xdd3a8a0000000015, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xd897d90000000016, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xd897d90000000016, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x84c0580000000017, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x84c0580000000017, type: 23, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=1,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x7323f50000000048, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (64) is greater than or equal to buffer size (64).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x7323f50000000048, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x612f93000000004e, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (144) is greater than or equal to buffer size (144).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x612f93000000004e, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
  10854/10854 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@jeffbolznv
Collaborator Author

Validation errors should be fixed now.

@0cc4m
Collaborator

0cc4m commented Aug 17, 2025

Same thing on Intel as previously with multi_add:

Backend 2/4: Vulkan1
  Device description: Intel(R) Arc(tm) A770 Graphics (DG2)
  Device memory: 16032 MB (16032 MB free)

  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
[ADD] NMSE = 0.078048014 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.066694427 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
[ADD] NMSE = 0.056316214 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.051824146 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
[ADD] NMSE = 0.061151788 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.066473594 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
[ADD] NMSE = 0.198699834 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.204058603 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
  10846/10854 tests passed
  Backend Vulkan1: FAIL

AMD passes now, and no more validation problems.

@jeffbolznv
Collaborator Author

Same thing on Intel as previously with multi_add:

Ugh, OK, disabled for Intel.

How is perf on AMD?

@characharm
Contributor

characharm commented Aug 17, 2025

master + pr:

| model                | size      | params  | backend | ngl | fa |  test |           t/s |
| -------------------- | --------: | ------: | ------- | --: | -: | ----: | ------------: |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  |  99 |  1 | tg128 | 155.35 ± 0.50 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  |  99 |  1 | tg128 | 157.60 ± 3.01 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  |  99 |  1 | tg128 | 156.54 ± 2.45 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan  |  99 |  1 | tg128 | 158.56 ± 1.69 |

master:

| model                | size      | params  | backend    | ngl | fa |  test |           t/s |
| -------------------- | --------: | ------: | ---------- | --: | -: | ----: | ------------: |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan |  99 |  1 | tg128 | 152.35 ± 3.87 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan |  99 |  1 | tg128 | 154.26 ± 0.43 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan |  99 |  1 | tg128 | 154.24 ± 0.63 |

@jeffbolznv jeffbolznv force-pushed the rms_norm_atomic_add branch from 7658305 to cd20ef0 Compare August 21, 2025 17:37
@jeffbolznv jeffbolznv requested a review from 0cc4m August 21, 2025 17:37
@0cc4m
Collaborator

0cc4m commented Aug 23, 2025

> Same thing on Intel as previously with multi_add:
>
> Ugh, OK, disabled for Intel.
>
> How is perf on AMD?

Sorry for the delay, here are results:

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 76.63 ± 0.28 | 78.36 ± 0.49 | +2.3% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 77.81 ± 0.13 | 79.37 ± 0.11 | +2.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 70.04 ± 0.38 | 71.59 ± 0.11 | +2.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 66.97 ± 0.07 | 68.21 ± 0.08 | +1.9% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 61.23 ± 0.16 | 62.39 ± 0.09 | +1.9% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 58.74 ± 0.08 | 59.77 ± 0.11 | +1.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 58.93 ± 0.08 | 59.81 ± 0.08 | +1.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 56.70 ± 0.04 | 56.98 ± 0.30 | +0.5% |

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 141.20 ± 0.16 | 138.68 ± 10.21 | -1.8% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 142.26 ± 0.37 | 145.01 ± 0.40 | +1.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 128.34 ± 0.51 | 130.36 ± 0.34 | +1.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 130.66 ± 0.26 | 132.57 ± 0.12 | +1.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 120.26 ± 0.24 | 124.15 ± 0.79 | +3.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 122.76 ± 0.18 | 125.82 ± 0.93 | +2.5% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 88.45 ± 0.13 | 89.56 ± 0.07 | +1.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 89.81 ± 0.11 | 90.90 ± 0.11 | +1.2% |

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 43.29 ± 0.79 | 42.08 ± 2.30 | -2.8% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 44.20 ± 0.05 | 43.57 ± 0.03 | -1.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 37.58 ± 0.57 | 37.33 ± 0.63 | -0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 28.11 ± 0.02 | 27.98 ± 0.14 | -0.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 33.11 ± 0.02 | 32.64 ± 0.48 | -1.4% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 25.59 ± 0.01 | 25.44 ± 0.01 | -0.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | tg128 | 10.85 ± 0.03 | 10.82 ± 0.03 | -0.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | tg128 | 9.90 ± 0.03 | 9.90 ± 0.02 | +0.0% |

Not sure what is going on with Intel, but the difference is too small to hold up the PR. If you have an idea, let me know. Otherwise you can merge.

…le SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
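The scheme described above can be sketched on the host side: the fused add pass writes one sum-of-squares per fixed slot, and the normalization pass reduces those slots in a fixed order. A simplified Python sketch (function names are hypothetical, not the actual shader code):

```python
import math

def add_with_partial_sums(a, b, workgroup_size=4):
    # fused add pass: compute a+b and, per "workgroup", a partial sum of
    # squares written to a fixed slot (no atomics, so results are deterministic)
    out = [x + y for x, y in zip(a, b)]
    partials = [sum(v * v for v in out[i:i + workgroup_size])
                for i in range(0, len(out), workgroup_size)]
    return out, partials

def rms_norm_from_partials(x, partials, eps=1e-6):
    # rms_norm pass: only needs the precomputed sum, so normalization becomes
    # a per-element multiply that can spread across many workgroups
    mean_sq = sum(partials) / len(x)  # fixed-order reduction
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]
```

In the actual shaders the reduction is a subgroup-wide load plus subgroupAdd, but the data flow is the same: partial sums out of the add shader, one small reduction in rms_norm.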
@jeffbolznv jeffbolznv force-pushed the rms_norm_atomic_add branch from cd20ef0 to e97e226 Compare August 23, 2025 14:47
@jeffbolznv jeffbolznv merged commit 611f419 into ggml-org:master Aug 23, 2025
48 checks passed
FlorianZimmer pushed a commit to FlorianZimmer/llama.cpp that referenced this pull request Aug 25, 2025
…le SMs (ggml-org#15281)

* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.

* complete rebase against fused adds - multi_add shader can also compute partial sums

* fix validation errors

* disable add_rms_fusion for Intel due to possible driver bug

* resolve against ggml-org#15489, sync after clearing partial sums
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
@CISC
Collaborator

CISC commented Aug 26, 2025


Labels

- ggml: changes relating to the ggml tensor library for machine learning
- testing: Everything test related
- Vulkan: Issues specific to the Vulkan backend
