
Conversation

@jeffbolznv (Collaborator)

Reported at ikawrakow/ik_llama.cpp#608 (comment), but this PR takes a different fix.

I'm still seeing flash attention fail with this model, but I'll look into that separately.

Remove supports_op check for > 4096 (splitting fixes this)
@jeffbolznv requested a review from 0cc4m on July 14, 2025, 21:43
return
tensor->nb[0] == ggml_type_size(tensor->type) &&
tensor->nb[1] == (tensor->nb[0]*tensor->ne[0])/ggml_blck_size(tensor->type) &&
tensor->nb[3] == tensor->nb[2]*tensor->ne[2];
@jeffbolznv (Collaborator, Author)

@0cc4m do you recall why there is a check for dim3 here at all? Based on the function name, it seems like it should only care about dims 0 and 1.
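
For reference, the quoted expression looks like the tail of a dim0/dim1 contiguity helper in the Vulkan backend. A commented reconstruction is sketched below; the function name and signature are assumptions inferred from the discussion, and only the return expression appears verbatim in the quoted diff.

#include "ggml.h"

// Sketch of the helper under discussion; ggml_vk_dim01_contiguous is an
// assumed name, only the return expression is quoted in the review above.
static bool ggml_vk_dim01_contiguous(const struct ggml_tensor * tensor) {
    return
        // dim0: elements (or quant blocks) are tightly packed
        tensor->nb[0] == ggml_type_size(tensor->type) &&
        // dim1: consecutive rows follow each other directly, with
        // ggml_blck_size() accounting for block-quantized types
        tensor->nb[1] == (tensor->nb[0]*tensor->ne[0])/ggml_blck_size(tensor->type) &&
        // dim3: the batch dimension is contiguous over dim2
        // (this is the extra condition questioned above)
        tensor->nb[3] == tensor->nb[2]*tensor->ne[2];
}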

@0cc4m (Collaborator)

Yeah, it should. I'm not 100% sure, but it may have been related to multiple mul_mat calls or broadcasting: when this was written, the mul_mat shader handled only the first two dimensions and was called multiple times to cover the other dimensions.

@jeffbolznv (Collaborator, Author)

If I remove the last part of the check, there are some failures in mul_mat tests. Maybe worth looking into, but I think this change is OK for now.

@0cc4m (Collaborator)

Probably because it falls back to dequant to fp16 + matmul in a few cases due to the third check.

github-actions bot added the Vulkan and ggml labels on Jul 14, 2025
@jeffbolznv (Collaborator, Author)

> I'm still seeing flash attention fail with this model, but I'll look into that separately.

I found that this was hitting the dequant path in mul_mat and was only dequantizing the first batch. The most recent commit fixes this. I can still see some failures in IQ quants if I force this path, but those happen even when the batch dimension is 1.
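
A minimal sketch of the general idea behind that fix (not the actual commit): when taking the dequant fallback, iterate over every batch slice in dims 2 and 3 instead of only the first one. The helper name for_each_batch_slice is made up for illustration; ne/nb follow ggml's size/stride conventions.

#include "ggml.h"
#include <stdio.h>

// Illustrative only: visit every batch slice (dims 2 and 3) of a tensor.
// In the real fix each slice would be dequantized to f16 before the matmul;
// here we only compute the byte offset at which each slice starts.
static void for_each_batch_slice(const struct ggml_tensor * t) {
    for (int64_t i3 = 0; i3 < t->ne[3]; ++i3) {
        for (int64_t i2 = 0; i2 < t->ne[2]; ++i2) {
            const size_t offset = (size_t) i3*t->nb[3] + (size_t) i2*t->nb[2];
            printf("slice (%lld, %lld) starts at byte %zu\n",
                   (long long) i2, (long long) i3, offset);
        }
    }
}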

@0cc4m (Collaborator) left a comment

LGTM

0cc4m merged commit ba1ceb3 into ggml-org:master on Jul 15, 2025
44 of 48 checks passed

Labels

ggml: changes relating to the ggml tensor library for machine learning
Vulkan: issues specific to the Vulkan backend
