Conversation

@rillomas (Contributor) commented Oct 27, 2025

Overview

We see a crash in ggml_vk_guess_matmul_id_pipeline_align when a vk_matmul_pipeline has all of its pipelines empty.

(screenshot: crash point in ggml_vk_guess_matmul_id_pipeline_align)

This happens because the following logic in ggml_vk_get_mul_mat_mat_id_pipeline checks for a nullptr, a condition that never seems to hold: the vk_matmul_pipeline reference is always present (so the code concludes that fp16acc is supported), but all of its member pipelines can still be empty, which leads to a crash later in ggml_vk_guess_matmul_id_pipeline_align.

    bool support_fp16acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f16acc != nullptr;
    bool support_fp32acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f32acc != nullptr;
(screenshot: mismatch between the support flags and the actual pipeline state)

We originally found this issue when running ggml-org/gpt-oss-20b-GGUF, where src0_type is GGML_TYPE_MXFP4, but it can potentially happen with any src0_type.
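
To make the failure mode concrete, below is a minimal sketch of the data layout, assuming (as noted further down in the review) that vk_matmul_pipeline is a shared handle that is always allocated even when none of the pipelines inside it were ever created. The types and member names are simplified illustrations, not the exact ggml-vulkan definitions.

    // Simplified sketch only; not the actual ggml-vulkan declarations.
    #include <memory>

    struct vk_pipeline_struct { /* Vulkan pipeline handle, layouts, ... */ };
    using vk_pipeline = std::shared_ptr<vk_pipeline_struct>;

    struct vk_matmul_pipeline_struct {
        // Tile-size variants; all of them can stay empty if the
        // device/driver combination never compiles the corresponding shaders.
        vk_pipeline l, m, s;
    };
    using vk_matmul_pipeline = std::shared_ptr<vk_matmul_pipeline_struct>;

    struct vk_matmul_pipeline2 {
        // Both handles are allocated up front, so "f16acc != nullptr"
        // is always true even when no fp16-accumulation shader exists.
        vk_matmul_pipeline f16acc = std::make_shared<vk_matmul_pipeline_struct>();
        vk_matmul_pipeline f32acc = std::make_shared<vk_matmul_pipeline_struct>();
    };

With a layout like this, the support_fp16acc/support_fp32acc checks above always pass, and an empty pipeline is handed to ggml_vk_guess_matmul_id_pipeline_align, where it crashes.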

Reproducing steps

Windows

The crash currently occurs on the following unit test with Windows Lunar Lake driver 32.0.101.5730.

  • build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
λ build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-vulkan.dll
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Arc(TM) 140V GPU (16GB))
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll score: 108
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-haswell.dll score: 44
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-icelake.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sandybridge.dll score: 17
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-skylakex.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sse42.dll score: 5
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-x64.dll score: 1
load_backend: loaded CPU backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) Ultra 7 268V 2.20GHz)
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 16191 MB (17677 MB free)

(crashes here)

Ubuntu

The crash also reproduces on Ubuntu with Mesa 25.0.7-0ubuntu0.24.04.2 drivers on Lunar Lake.

masato@masato-LNL ~/repo/llama.cpp_mine (master)$ dpkg --list | grep mesa
ii  libegl-mesa0:amd64                         25.0.7-0ubuntu0.24.04.2                  amd64        free implementation of the EGL API -- Mesa vendor library
ii  libgl1-mesa-dri:amd64                      25.0.7-0ubuntu0.24.04.2                  amd64        free implementation of the OpenGL API -- DRI modules
ii  libglu1-mesa:amd64                         9.0.2-1.1build1                          amd64        Mesa OpenGL utility library (GLU)
ii  libglx-mesa0:amd64                         25.0.7-0ubuntu0.24.04.2                  amd64        free implementation of the OpenGL API -- GLX vendor library
ii  mesa-libgallium:amd64                      25.0.7-0ubuntu0.24.04.2                  amd64        shared infrastructure for Mesa drivers
ii  mesa-vulkan-drivers:amd64                  25.0.7-0ubuntu0.24.04.2                  amd64        Mesa Vulkan graphics drivers
masato@masato-LNL ~/repo/llama.cpp_mine (master)$ ./build/bin/test-backend-ops -o MUL_MAT_ID
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (LNL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 131072 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Graphics (LNL)
  Device memory: 15857 MB (14271 MB free)

  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=0,m=32,n=1024,k=16,o=1): OK
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=2,n_used=2,b=0,m=32,n=8192,k=64,o=1): OK
...
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=17,k=256,o=1): OK
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256,o=1): OK
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=129,k=256,o=1): OK
  MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256,o=1): OK
Segmentation fault (core dumped)

Test results using 00faab1

Windows

GPU driver 32.0.101.5730

None of the MUL_MAT_ID backend-ops tests crash any more. Several accuracy-related tests still fail, though those should be fixed in more recent drivers. I am currently seeing a hang at MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1), which gets stuck in ggml_vk_wait_for_fence, possibly due to some other issue in the old driver.

λ build\bin\Release\test-backend-ops.exe -o MUL_MAT_ID
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 16191 MB (17677 MB free)

  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=0,m=32,n=1024,k=16,o=1): OK
...
  MUL_MAT_ID(type_a=bf16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256,o=1): OK
  MUL_MAT_ID(type_a=bf16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256,o=1): OK
  527/618 tests passed

Failing tests:
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=0,m=50,n=200,k=64,o=1)
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=1,m=50,n=200,k=64,o=1)
...
  Backend Vulkan0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

GPU driver 32.0.101.8247

All backend-ops tests pass.

λ build\bin\Release\test-backend-ops.exe
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 18457 MB (17689 MB free)

  ABS(type=f16,ne_a=[128,2,2,2],v=0): not supported [Vulkan0]
  ABS(type=f16,ne_a=[5,7,11,13],v=0): not supported [Vulkan0]
...
  TOPK_MOE(ne=[8,22,1,1],n_expert_used=4,with_norm=0,delayed_softmax=1): OK
  TOPK_MOE(ne=[32,22,1,1],n_expert_used=8,with_norm=0,delayed_softmax=1): OK
  13747/13747 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Ubuntu

mesa 25.0.7-0ubuntu0.24.04.2

The MUL_MAT_ID crash is fixed on Ubuntu as well. All backend-ops tests pass.

masato@masato-LNL ~/repo/llama.cpp_mine (fix-accf16-capability-crash)$ ./build/bin/test-backend-ops
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (LNL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 131072 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Graphics (LNL)
  Device memory: 15857 MB (14271 MB free)

  ABS(type=f16,ne_a=[128,2,2,2],v=0): not supported [Vulkan0]
  ABS(type=f16,ne_a=[5,7,11,13],v=0): not supported [Vulkan0]
...
  TOPK_MOE(ne=[8,22,1,1],n_expert_used=4,with_norm=0,delayed_softmax=1): OK
  TOPK_MOE(ne=[32,22,1,1],n_expert_used=8,with_norm=0,delayed_softmax=1): OK
  13747/13747 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

The github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 27, 2025.
@jeffbolznv (Collaborator)

This change doesn't look right when using a path other than coopmat1. What is the actual crash? Is one of the pipelines null? What are src0_type/src1_type?

@rillomas (Contributor, Author)

Thank you for your comment @jeffbolznv. I've updated the first comment with more details. I'm still figuring out the best way to fix this, so any suggestions would be helpful! I'm also trying to figure out how to write a unit test that reproduces this in other environments, since it seems like it should occur anywhere.

@jeffbolznv (Collaborator)

I see, it's a vk_matmul_pipeline so it's never null. I suggest trying to make the logic similar to ggml_vk_get_mul_mat_mat_pipeline.
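
For illustration, a hedged sketch of that direction: check whether the pipelines inside the handle were actually created instead of testing the handle itself. The helper and member names are hypothetical and not the code that was eventually merged.

    // Illustrative only; the merged fix mirrors ggml_vk_get_mul_mat_mat_pipeline
    // and may differ in detail.
    static bool matmul_pipeline_is_usable(const vk_matmul_pipeline & p) {
        // Usable only if at least one shader variant was actually compiled.
        return p != nullptr && (p->l != nullptr || p->m != nullptr || p->s != nullptr);
    }

    // ...inside ggml_vk_get_mul_mat_mat_id_pipeline:
    bool support_fp16acc = matmul_pipeline_is_usable(
        ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f16acc);
    bool support_fp32acc = matmul_pipeline_is_usable(
        ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f32acc);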

@rillomas force-pushed the fix-accf16-capability-crash branch from ea4d4cc to 5abfd16 on October 28, 2025 08:39
@0cc4m (Collaborator) left a review

LGTM, this can be merged once you address the last comment.

@0cc4m (Collaborator) commented Oct 31, 2025

Thank you!

@0cc4m merged commit 2976b03 into ggml-org:master on Oct 31, 2025 (72 checks passed)
@rillomas deleted the fix-accf16-capability-crash branch on October 31, 2025 07:29