vulkan: Fix crash when FP16 mul_mat accumulation is not supported #16796
Conversation
This change doesn't look right when using a path other than coopmat1. What is the actual crash? Is one of the pipelines null? What are src0_type/src1_type?

Thank you for your comment @jeffbolznv. I've updated the first comment with more details. I'm still figuring out the best way to fix this, so any suggestions would be helpful! I'm also trying to work out how to write a unit test that reproduces this in other environments, since it seems like it should occur anywhere.

I see, it's a vk_matmul_pipeline so it's never null. I suggest trying to make the logic similar to ggml_vk_get_mul_mat_mat_pipeline.
ea4d4cc to 5abfd16
LGTM, this can be merged once you address the last comment.
Thank you!
Overview
We see a crash in ggml_vk_guess_matmul_id_pipeline_align when a vk_matmul_pipeline has all empty pipelines.

This happens because the logic in ggml_vk_get_mul_mat_mat_id_pipeline checks for a nullptr, which seems to never occur. The reference is non-null (so the code assumes fp16acc is supported), but all of its members are actually empty, which leads to a crash later in ggml_vk_guess_matmul_id_pipeline_align.

We originally found this issue when running ggml-org/gpt-oss-20b-GGUF with src0_type as GGML_TYPE_MXFP4, though it could potentially happen with any src0_type.

Reproducing steps
Windows
The crash currently occurs in the following unit test with Windows LunarLake driver 32.0.101.5730:
build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
Ubuntu
The crash reproduces on Ubuntu with mesa 25.0.7-0ubuntu0.24.04.2 drivers on LunarLake.
Test results using 00faab1
Windows
GPU driver 32.0.101.5730
All MUL_MAT_ID related backend-ops tests no longer crash. Several accuracy-related tests are failing, though these should be fixed in more recent drivers. We currently see a hang at
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)
which gets stuck at ggml_vk_wait_for_fence. Possibly due to some other issue in the old driver?
GPU driver 32.0.101.8247
All backend-ops passing.
Ubuntu
mesa 25.0.7-0ubuntu0.24.04.2
The MUL_MAT_ID crash is fixed on Ubuntu as well. All backend-ops passing.
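
To illustrate the failure mode described in the Overview, here is a minimal, self-contained sketch. It is not the code from this PR and not the actual ggml Vulkan structs: Pipeline, MatmulPipelineSet, MatmulPipelines, select_by_null_check, select_by_contents and first_align are all hypothetical stand-ins, and the "check the contents" selection is only one possible way to avoid the problem. It assumes the pipeline set is held through a pointer that is always allocated, while the individual variants inside it may be left unset.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a compiled pipeline; the real type carries far more state.
struct Pipeline { unsigned align; };
using PipelinePtr = std::shared_ptr<Pipeline>;

// Hypothetical pipeline set: small/medium/large variants, any of which may be unset.
struct MatmulPipelineSet {
    PipelinePtr s, m, l;
    bool empty() const { return !s && !m && !l; }
};
using MatmulPipeline = std::shared_ptr<MatmulPipelineSet>;

// Both accumulator flavours are always allocated, even when no variant gets created.
struct MatmulPipelines {
    MatmulPipeline f16acc = std::make_shared<MatmulPipelineSet>();
    MatmulPipeline f32acc = std::make_shared<MatmulPipelineSet>();
};

// Problematic selection: the outer pointer is never null, so f32acc is never chosen.
MatmulPipeline select_by_null_check(const MatmulPipelines& p) {
    return p.f16acc != nullptr ? p.f16acc : p.f32acc;
}

// Alternative selection: prefer f16acc only if at least one variant actually exists.
MatmulPipeline select_by_contents(const MatmulPipelines& p) {
    return (p.f16acc && !p.f16acc->empty()) ? p.f16acc : p.f32acc;
}

// Mimics a later stage that dereferences a variant; an empty set is the crash case.
unsigned first_align(const MatmulPipeline& mmp) {
    assert(!mmp->empty() && "an empty pipeline set would be dereferenced here");
    const PipelinePtr& p = mmp->s ? mmp->s : (mmp->m ? mmp->m : mmp->l);
    return p->align;
}
```

With only an f32acc variant populated, select_by_null_check still returns the empty f16acc set, which is the situation that crashes later, while select_by_contents falls back to f32acc.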