Conversation

@rillomas (Contributor) commented Oct 27, 2025

Overview

We see a crash in ggml_vk_guess_matmul_id_pipeline_align when a vk_matmul_pipeline has all of its pipelines empty.

(screenshot: crash point in ggml_vk_guess_matmul_id_pipeline_align)

This happens because the following logic in ggml_vk_get_mul_mat_mat_id_pipeline checks for a nullptr, a condition that never seems to hold: the vk_matmul_pipeline reference is always present (so the code concludes that fp16acc is supported), but all of its member pipelines can still be empty, which leads to a crash later in ggml_vk_guess_matmul_id_pipeline_align.

    bool support_fp16acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f16acc != nullptr;
    bool support_fp32acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f32acc != nullptr;
(screenshot: mismatch between the support flags and the actual pipeline state)

We originally found this issue when running ggml-org/gpt-oss-20b-GGUF, where src0_type is GGML_TYPE_MXFP4, but it can potentially happen with any src0_type.
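
To make the failure mode concrete, below is a minimal sketch of the data layout, assuming (as noted further down in the review) that vk_matmul_pipeline is a shared handle that is always allocated even when none of the pipelines inside it were ever created. The types and member names are simplified illustrations, not the exact ggml-vulkan definitions.

    // Simplified sketch only; not the actual ggml-vulkan declarations.
    #include <memory>

    struct vk_pipeline_struct { /* Vulkan pipeline handle, layouts, ... */ };
    using vk_pipeline = std::shared_ptr<vk_pipeline_struct>;

    struct vk_matmul_pipeline_struct {
        // Tile-size variants; all of them can stay empty if the
        // device/driver combination never compiles the corresponding shaders.
        vk_pipeline l, m, s;
    };
    using vk_matmul_pipeline = std::shared_ptr<vk_matmul_pipeline_struct>;

    struct vk_matmul_pipeline2 {
        // Both handles are allocated up front, so "f16acc != nullptr"
        // is always true even when no fp16-accumulation shader exists.
        vk_matmul_pipeline f16acc = std::make_shared<vk_matmul_pipeline_struct>();
        vk_matmul_pipeline f32acc = std::make_shared<vk_matmul_pipeline_struct>();
    };

With a layout like this, the support_fp16acc/support_fp32acc checks above always pass, and an empty pipeline is handed to ggml_vk_guess_matmul_id_pipeline_align, where it crashes.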

Reproducing steps

Windows

The crash currently occurs on the following unit test with Windows Lunar Lake driver 32.0.101.5730.

  • build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
λ build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-vulkan.dll
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Arc(TM) 140V GPU (16GB))
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll score: 108
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-haswell.dll score: 44
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-icelake.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sandybridge.dll score: 17
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-skylakex.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sse42.dll score: 5
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-x64.dll score: 1
load_backend: loaded CPU backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) Ultra 7 268V 2.20GHz)
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 16191 MB (17677 MB free)

(crashes here)

Ubuntu

The crash also reproduces on Ubuntu with Mesa 25.0.7-0ubuntu0.24.04.2 drivers on Lunar Lake.

masato@masato-LNL ~/repo/llama.cpp_mine (master)$ dpkg --list | grep mesa
ii  libegl-mesa0:amd64                         25.0.7-0ubuntu0.24.04.2                  amd64        free implementation of the EGL API -- Mesa vendor library
ii  libgl1-mesa-dri:amd64                      25.0.7-0ubuntu0.24.04.2                  amd64        free implementation of the OpenGL API -- DRI modules
ii  libglu1-mesa:amd64                         9.0.2-1.1build1                          amd64        Mesa OpenGL utility library (GLU)
ii  libglx-mesa0:amd64                         25.0.7-0ubuntu0.24.04.2                  amd64        free implementation of the OpenGL API -- GLX vendor library
ii  mesa-libgallium:amd64                      25.0.7-0ubuntu0.24.04.2                  amd64        shared infrastructure for Mesa drivers
ii  mesa-vulkan-drivers:amd64                  25.0.7-0ubuntu0.24.04.2                  amd64        Mesa Vulkan graphics drivers
masato@masato-LNL ~/repo/llama.cpp_mine (master)$ ./build/bin/test-backend-ops -o MUL_MAT_ID
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (LNL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 131072 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Graphics (LNL)
  Device memory: 15857 MB (14271 MB free)

  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=0,m=32,n=1024,k=16,o=1): OK
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=2,n_used=2,b=0,m=32,n=8192,k=64,o=1): OK
...
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=17,k=256,o=1): OK
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256,o=1): OK
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=129,k=256,o=1): OK
  MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256,o=1): OK
Segmentation fault (core dumped)

Test results using 00faab1

Windows

GPU driver 32.0.101.5730

None of the MUL_MAT_ID backend-ops tests crash any more. Several accuracy-related tests still fail, though those should be fixed in more recent drivers. I am currently seeing a hang at MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1), which gets stuck in ggml_vk_wait_for_fence, possibly due to some other issue in the old driver.

λ build\bin\Release\test-backend-ops.exe -o MUL_MAT_ID
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 16191 MB (17677 MB free)

  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=0,m=32,n=1024,k=16,o=1): OK
...
  MUL_MAT_ID(type_a=bf16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256,o=1): OK
  MUL_MAT_ID(type_a=bf16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256,o=1): OK
  527/618 tests passed

Failing tests:
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=0,m=50,n=200,k=64,o=1)
  MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=16,n_used=16,b=1,m=50,n=200,k=64,o=1)
...
  Backend Vulkan0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

GPU driver 32.0.101.8247

All backend-ops tests pass.

λ build\bin\Release\test-backend-ops.exe
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 18457 MB (17689 MB free)

  ABS(type=f16,ne_a=[128,2,2,2],v=0): not supported [Vulkan0]
  ABS(type=f16,ne_a=[5,7,11,13],v=0): not supported [Vulkan0]
...
  TOPK_MOE(ne=[8,22,1,1],n_expert_used=4,with_norm=0,delayed_softmax=1): OK
  TOPK_MOE(ne=[32,22,1,1],n_expert_used=8,with_norm=0,delayed_softmax=1): OK
  13747/13747 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Ubuntu

mesa 25.0.7-0ubuntu0.24.04.2

The MUL_MAT_ID crash is fixed on Ubuntu as well. All backend-ops tests pass.

masato@masato-LNL ~/repo/llama.cpp_mine (fix-accf16-capability-crash)$ ./build/bin/test-backend-ops
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (LNL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 131072 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Graphics (LNL)
  Device memory: 15857 MB (14271 MB free)

  ABS(type=f16,ne_a=[128,2,2,2],v=0): not supported [Vulkan0]
  ABS(type=f16,ne_a=[5,7,11,13],v=0): not supported [Vulkan0]
...
  TOPK_MOE(ne=[8,22,1,1],n_expert_used=4,with_norm=0,delayed_softmax=1): OK
  TOPK_MOE(ne=[32,22,1,1],n_expert_used=8,with_norm=0,delayed_softmax=1): OK
  13747/13747 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

The github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 27, 2025.
@jeffbolznv (Collaborator)

This change doesn't look right when using a path other than coopmat1. What is the actual crash? Is one of the pipelines null? What are src0_type/src1_type?

@rillomas (Contributor, Author)

Thank you for your comment @jeffbolznv. I've updated the first comment with more details. I'm still figuring out the best way to fix this, so any suggestions would be helpful! I'm also trying to figure out how to write a unit test that reproduces this in other environments, since it seems like it should occur anywhere.

@jeffbolznv (Collaborator)

I see, it's a vk_matmul_pipeline so it's never null. I suggest trying to make the logic similar to ggml_vk_get_mul_mat_mat_pipeline.
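
For illustration, a hedged sketch of that direction: check whether the pipelines inside the handle were actually created instead of testing the handle itself. The helper and member names are hypothetical and not the code that was eventually merged.

    // Illustrative only; the merged fix mirrors ggml_vk_get_mul_mat_mat_pipeline
    // and may differ in detail.
    static bool matmul_pipeline_is_usable(const vk_matmul_pipeline & p) {
        // Usable only if at least one shader variant was actually compiled.
        return p != nullptr && (p->l != nullptr || p->m != nullptr || p->s != nullptr);
    }

    // ...inside ggml_vk_get_mul_mat_mat_id_pipeline:
    bool support_fp16acc = matmul_pipeline_is_usable(
        ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f16acc);
    bool support_fp32acc = matmul_pipeline_is_usable(
        ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f32acc);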

@rillomas force-pushed the fix-accf16-capability-crash branch from ea4d4cc to 5abfd16 on October 28, 2025 08:39
@0cc4m (Collaborator) left a review

LGTM, this can be merged once you address the last comment.

@0cc4m (Collaborator) commented Oct 31, 2025

Thank you!

@0cc4m merged commit 2976b03 into ggml-org:master on Oct 31, 2025 (72 checks passed)
@rillomas deleted the fix-accf16-capability-crash branch on October 31, 2025 07:29