vulkan: optimize rms_norm, and allow the work to spread across multiple SMs #15281

jeffbolznv · 2025-08-13T04:23:36Z

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply.

The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums.

Perf results below. As expected, bigger gains on a bigger GPU, because the serial cost of rms_norm is more pronounced.

5090 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        194.72 ± 0.92 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        168.21 ± 6.30 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        110.16 ± 1.59 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        555.06 ± 4.54 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        504.33 ± 4.61 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        293.09 ± 2.12 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        178.77 ± 1.90 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        200.01 ± 3.37 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        231.56 ± 5.79 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        211.15 ± 5.57 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        290.32 ± 2.51 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        223.46 ± 1.36 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        268.62 ± 1.38 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         78.73 ± 1.33 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         41.45 ± 0.09 |

5090 after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        206.97 ± 0.41 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        177.54 ± 1.72 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        116.34 ± 1.40 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        576.28 ± 5.21 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        521.25 ± 3.71 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        309.44 ± 2.29 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        184.65 ± 1.86 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        209.20 ± 2.47 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        236.89 ± 3.15 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        217.14 ± 6.34 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        308.29 ± 2.10 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        236.79 ± 1.78 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        280.35 ± 1.97 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         83.27 ± 0.36 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         43.49 ± 0.18 |

4070 before:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         87.14 ± 0.14 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         68.34 ± 0.98 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         47.22 ± 0.08 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        402.17 ± 1.33 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        371.73 ± 1.93 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        155.94 ± 0.30 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        132.85 ± 3.87 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        102.39 ± 0.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        158.93 ± 7.72 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        118.34 ± 5.01 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        149.11 ± 0.94 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        101.14 ± 0.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        117.59 ± 0.10 |

4070 after:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         88.72 ± 0.13 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         69.35 ± 1.15 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         47.59 ± 0.48 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        403.74 ± 3.95 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        376.70 ± 2.96 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        158.07 ± 0.29 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        135.86 ± 1.19 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        103.56 ± 1.38 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        163.34 ± 0.43 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        120.45 ± 3.62 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        152.27 ± 0.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        102.24 ± 0.30 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        118.71 ± 0.26 |

jeffbolznv · 2025-08-13T04:24:22Z

Set to draft because there will be an interaction with #15252 when it's merged.

jeffbolznv · 2025-08-13T04:25:51Z

ggml/src/ggml-vulkan/vulkan-shaders/add.comp

Just want to point out that this potentially introduces a bit of nondeterminism due to floating point addition not being associative. I don't expect it to be a problem, just want to mention in case anybody is concerned.

Hm, it's not a good idea to introduce nondeterminism in the computations. Are there alternatives?

Second commit changes this to write out a partial sum for each workgroup, and the rms_norm shader adds them up, so it's a deterministic order now.

jeffbolznv · 2025-08-17T04:59:29Z

I've rebased this on top of the multi_add change that has been merged, and now the multi_add can also accumulate the partial sums for the rms_norm. I increased the max number of descriptors (from 8 to 12) to handle the full sequence of adds I see in the models.

0cc4m · 2025-08-17T09:06:39Z

This does not pass validation for me. On Nvidia it runs through and gets correct results anyways, but on AMD and Intel it crashes.

AMD:

test-backend-ops: ../src/amd/vulkan/radv_descriptors.h:79: radv_write_buffer_descriptor_impl: Assertion `buffer->vk.size > 0 && range > 0' failed.

Intel:

test-backend-ops: ../src/intel/isl/isl_surface_state.c:986: isl_gfx125_buffer_fill_state_s: Assertion `num_elements > 0' failed.

Validation issues

pci id for fd 8: 10de:2204, driver (null)
kmsro: driver missing
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Testing 2 devices

VUID-VkDeviceCreateInfo-pNext-pNext(ERROR / SPEC): msgNum: -1876993556 - Validation Error: [ VUID-VkDeviceCreateInfo-pNext-pNext ] | MessageID = 0x901f59ec | vkCreateDevice(): pCreateInfo->pNext chain includes a structure with unknown VkStructureType (1000141000). This error is based on the Valid Usage documentation for version 304 of the Vulkan header.  It is possible that you are using a struct from a private extension or an extension that was added to a later version of the Vulkan header, in which case the use of pCreateInfo->pNext is undefined and may not work correctly with validation enabled.
The Vulkan spec states: Each pNext member of any structure (including this one) in the pNext chain must be either NULL or a pointer to a valid struct for extending VkDeviceCreateInfo (https://docs.vulkan.org/spec/latest/chapters/devsandqueues.html#VUID-VkDeviceCreateInfo-pNext-pNext)
Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24576 MB (24576 MB free)

VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x9fde6b0000000014, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x9fde6b0000000014, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xdd3a8a0000000015, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xdd3a8a0000000015, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xd897d90000000016, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xd897d90000000016, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x84c0580000000017, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x84c0580000000017, type: 23, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=0,multi_add=0): OK
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x9fde6b0000000014, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x9fde6b0000000014, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xdd3a8a0000000015, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xdd3a8a0000000015, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xd897d90000000016, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xd897d90000000016, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x84c0580000000017, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x84c0580000000017, type: 23, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=1,multi_add=0): OK
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0x9fde6b0000000014, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0x9fde6b0000000014, type: 23, name: NULL
VUID-VkDescriptorBufferInfo-buffer-02998(ERROR / SPEC): msgNum: -1731333669 - Validation Error: [ VUID-VkDescriptorBufferInfo-buffer-02998 ] Object 0: handle = 0xdd3a8a0000000015, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; | MessageID = 0x98cdf1db | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].buffer is VK_NULL_HANDLE.
The Vulkan spec states: If the nullDescriptor feature is not enabled, buffer must not be VK_NULL_HANDLE (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-buffer-02998)
    Objects: 1
        [0] 0xdd3a8a0000000015, type: 23, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=1,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x2723ba0000000037, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (16) is greater than or equal to buffer size (16).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x2723ba0000000037, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x7323f50000000048, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (64) is greater than or equal to buffer size (64).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x7323f50000000048, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x7323f50000000048, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (64) is greater than or equal to buffer size (64).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x7323f50000000048, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x612f93000000004e, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (144) is greater than or equal to buffer size (144).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x612f93000000004e, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
VUID-VkDescriptorBufferInfo-offset-00340(ERROR / SPEC): msgNum: -1036144667 - Validation Error: [ VUID-VkDescriptorBufferInfo-offset-00340 ] Object 0: handle = 0x612f93000000004e, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xc23dafe5 | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[3].offset (144) is greater than or equal to buffer size (144).
The Vulkan spec states: offset must be less than the size of buffer (https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#VUID-VkDescriptorBufferInfo-offset-00340)
    Objects: 1
        [0] 0x612f93000000004e, type: 9, name: NULL
  RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
  10854/10854 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

jeffbolznv · 2025-08-17T15:15:38Z

Validation errors should be fixed now.

0cc4m · 2025-08-17T16:32:49Z

Same thing on Intel as previously with multi_add:

Backend 2/4: Vulkan1
  Device description: Intel(R) Arc(tm) A770 Graphics (DG2)
  Device memory: 16032 MB (16032 MB free)

  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=1,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=0): OK
  RMS_NORM_MUL_ADD(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0,multi_add=1): OK
[ADD] NMSE = 0.078048014 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.066694427 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
[ADD] NMSE = 0.056316214 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.051824146 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
[ADD] NMSE = 0.061151788 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.066473594 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
[ADD] NMSE = 0.198699834 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=0): FAIL
[ADD] NMSE = 0.204058603 > 0.000000100   RMS_NORM_MUL_ADD(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0,multi_add=1): FAIL
  10846/10854 tests passed
  Backend Vulkan1: FAIL

AMD passes now, and no more validation problems.

ggml/src/ggml-vulkan/ggml-vulkan.cpp

jeffbolznv · 2025-08-17T16:57:08Z

Same thing on Intel as previously with multi_add:

Ugh, OK, disabled for Intel.

How is perf on AMD?

characharm · 2025-08-17T19:06:35Z

master + pr:

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	tg128	155.35 ± 0.50

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	tg128	157.60 ± 3.01

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	tg128	156.54 ± 2.45

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	tg128	158.56 ± 1.69

master:

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	RPC,Vulkan	99	1	tg128	152.35 ± 3.87

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	RPC,Vulkan	99	1	tg128	154.26 ± 0.43

model	size	params	backend	ngl	fa	test	t/s
gpt-oss ?B MXFP4 MoE	11.27 GiB	20.91 B	RPC,Vulkan	99	1	tg128	154.24 ± 0.63

0cc4m · 2025-08-23T07:05:09Z

Same thing on Intel as previously with multi_add:

Ugh, OK, disabled for Intel.

How is perf on AMD?

Sorry for the delay, here are results:

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	76.63 ± 0.28	78.36 ± 0.49	+2.3%
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	77.81 ± 0.13	79.37 ± 0.11	+2.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	70.04 ± 0.38	71.59 ± 0.11	+2.2%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	66.97 ± 0.07	68.21 ± 0.08	+1.9%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	61.23 ± 0.16	62.39 ± 0.09	+1.9%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	58.74 ± 0.08	59.77 ± 0.11	+1.8%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	58.93 ± 0.08	59.81 ± 0.08	+1.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	56.70 ± 0.04	56.98 ± 0.30	+0.5%

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	141.20 ± 0.16	138.68 ± 10.21	-1.8%
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	142.26 ± 0.37	145.01 ± 0.40	+1.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	128.34 ± 0.51	130.36 ± 0.34	+1.6%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	130.66 ± 0.26	132.57 ± 0.12	+1.5%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	120.26 ± 0.24	124.15 ± 0.79	+3.2%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	122.76 ± 0.18	125.82 ± 0.93	+2.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	88.45 ± 0.13	89.56 ± 0.07	+1.3%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	89.81 ± 0.11	90.90 ± 0.11	+1.2%

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	43.29 ± 0.79	42.08 ± 2.30	-2.8%
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	44.20 ± 0.05	43.57 ± 0.03	-1.4%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	37.58 ± 0.57	37.33 ± 0.63	-0.7%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	28.11 ± 0.02	27.98 ± 0.14	-0.5%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	33.11 ± 0.02	32.64 ± 0.48	-1.4%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	25.59 ± 0.01	25.44 ± 0.01	-0.6%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	10.85 ± 0.03	10.82 ± 0.03	-0.3%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	9.90 ± 0.03	9.90 ± 0.02	+0.0%

Not sure what is going on with Intel, but the difference is too small to hold up the PR. If you have an idea, let me know. Otherwise you can merge.

…le SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums.

rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up.

…e partial sums

…le SMs (ggml-org#15281) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against ggml-org#15489, sync after clearing partial sums

CISC · 2025-08-26T13:13:27Z

@qnixsynapse @s-Nick @Rbiessy Looks like the new test added here breaks on SYCL:
https://github.com/ggml-org/ci/blob/results/llama.cpp/61/1f419cff11e4952228162a1c44cb35dff2274a/ggml-6-x86-sycl/stdall#L4230

jeffbolznv requested a review from 0cc4m as a code owner August 13, 2025 04:23

github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Aug 13, 2025

jeffbolznv marked this pull request as draft August 13, 2025 04:23

jeffbolznv commented Aug 13, 2025

View reviewed changes

jeffbolznv force-pushed the rms_norm_atomic_add branch 2 times, most recently from 075dac2 to c523636 Compare August 17, 2025 04:56

jeffbolznv marked this pull request as ready for review August 17, 2025 04:56

0cc4m reviewed Aug 17, 2025

View reviewed changes

ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated Show resolved Hide resolved

jeffbolznv force-pushed the rms_norm_atomic_add branch from 7658305 to cd20ef0 Compare August 21, 2025 17:37

jeffbolznv requested a review from 0cc4m August 21, 2025 17:37

0cc4m approved these changes Aug 23, 2025

View reviewed changes

jeffbolznv added 6 commits August 23, 2025 09:16

Change add+rms_norm optimization to write out an array of partial sums

5643b4a

rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up.

complete rebase against fused adds - multi_add shader can also comput…

7856a7a

…e partial sums

fix validation errors

a675d0c

disable add_rms_fusion for Intel due to possible driver bug

8d382bc

resolve against ggml-org#15489, sync after clearing partial sums

e97e226

jeffbolznv force-pushed the rms_norm_atomic_add branch from cd20ef0 to e97e226 Compare August 23, 2025 14:47

jeffbolznv merged commit 611f419 into ggml-org:master Aug 23, 2025
48 checks passed

vulkan: optimize rms_norm, and allow the work to spread across multiple SMs #15281

vulkan: optimize rms_norm, and allow the work to spread across multiple SMs #15281

Uh oh!

Conversation

jeffbolznv commented Aug 13, 2025

Uh oh!

jeffbolznv commented Aug 13, 2025

Uh oh!

jeffbolznv Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

ggerganov Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

jeffbolznv Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

jeffbolznv commented Aug 17, 2025

Uh oh!

0cc4m commented Aug 17, 2025

Uh oh!

jeffbolznv commented Aug 17, 2025

Uh oh!

0cc4m commented Aug 17, 2025

Uh oh!

Uh oh!

jeffbolznv commented Aug 17, 2025

Uh oh!

characharm commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented Aug 23, 2025

Uh oh!

Uh oh!

CISC commented Aug 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

characharm commented Aug 17, 2025 •

edited

Loading