vulkan: fix noncontig check for mat_mul_id splitting #14683
Conversation
Remove supports_op check for > 4096 (splitting fixes this)
```cpp
return
    tensor->nb[0] == ggml_type_size(tensor->type) &&
    tensor->nb[1] == (tensor->nb[0]*tensor->ne[0])/ggml_blck_size(tensor->type) &&
    tensor->nb[3] == tensor->nb[2]*tensor->ne[2];
```
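For context, the quoted condition requires dims 0 and 1 to be packed and additionally ties dim 3's stride to dim 2, which is the term questioned in the comments below. As a point of reference, a minimal sketch of a dims-0/1-only variant might look like the following; the function name is assumed for illustration and is not taken from the upstream code.

```cpp
#include "ggml.h"

// Hypothetical dims-0/1-only contiguity check: keeps the packed-row
// requirements on nb[0]/nb[1] and drops the nb[3] == nb[2]*ne[2] term.
static bool vk_dim01_contiguous(const struct ggml_tensor * tensor) {
    return
        tensor->nb[0] == ggml_type_size(tensor->type) &&
        tensor->nb[1] == (tensor->nb[0]*tensor->ne[0])/ggml_blck_size(tensor->type);
}
```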
@0cc4m do you recall why there is a check for dim3 here at all? Based on the function name, it seems like it should only care about dims 0 and 1.
Yeah, it should. I'm not 100% sure, but it was maybe related to multiple mul_mat calls or broadcasting. When this was written, the mul_mat shader handled only the first two dimensions and was called multiple times to cover the remaining ones.
If I remove the last part of the check, there are some failures in mul_mat tests. Maybe worth looking into, but I think this change is OK for now.
Probably because it falls back to dequant to fp16 + matmul in a few cases due to the third check.
I found that this was hitting the dequant path in mul_mat and was only dequantizing the first batch. The most recent commit fixes this. I can still see some failures in IQ quants if I force this path, but those happen even when the batch dimension is 1.
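To illustrate the failure mode, here is a hypothetical sketch (not the actual ggml-vulkan code; the dispatch callback is invented for illustration) of issuing the dequantization once per 2D slice across both batch dimensions, rather than only for the first slice:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical per-slice dequant dispatch: 'dispatch' stands in for whatever
// records one dequantization pass over the 2D slice starting at 'offset'.
static void dequant_all_batches(size_t ne2, size_t ne3,
                                size_t nb2, size_t nb3,
                                const std::function<void(size_t offset)> & dispatch) {
    // Cover every slice in dims 2 and 3, not just (i2 = 0, i3 = 0).
    for (size_t i3 = 0; i3 < ne3; ++i3) {
        for (size_t i2 = 0; i2 < ne2; ++i2) {
            dispatch(i3*nb3 + i2*nb2);
        }
    }
}
```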
LGTM
Reported at ikawrakow/ik_llama.cpp#608 (comment), but with a different fix.
I'm still seeing flash attention fail with this model, but I'll look into that separately.