
Conversation

@0cc4m (Collaborator) commented Oct 12, 2025

This heavily refactors the caching structure of the MMQ shader and also makes it more modular, so that it can work with other kinds of quants.

Basically, instead of turning the quants into 8-bit integers during the load to shared memory, the quant structs now get copied through shared memory into registers and are only reshaped into 8-bit integers directly before the integer dot operation. This saves on shared memory and on registers.
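A minimal GLSL sketch of the idea, using a q4_0-style block for illustration; the struct layout, buffer names, and tile constants here are illustrative, not the PR's actual shader code:

```glsl
#extension GL_EXT_integer_dot_product : require
#extension GL_EXT_shader_explicit_arithmetic_types : require

// Illustrative raw q4_0 block: fp16 scale bits + 32 x 4-bit quants.
struct block_q4_0_packed {
    uint16_t d;
    uint32_t qs[4];
};

const uint BM = 64;        // tile height (illustrative)
const uint BK_BLOCKS = 2;  // q4_0 blocks per k-tile (illustrative)

// Old approach: shared memory held already-unpacked 8-bit values.
// New approach: shared memory holds the raw block structs.
shared block_q4_0_packed buf_a[BM * BK_BLOCKS];

int dot_block(uint tile_row, uint tile_k, int q8_quants[8], int acc) {
    // 1. Copy the raw block from shared memory into registers.
    block_q4_0_packed blk = buf_a[tile_row * BK_BLOCKS + tile_k];

    // 2. Reshape to packed 8-bit lanes only now, in registers, directly
    //    before the integer dot (q4_0 offset/scale handling omitted).
    for (uint i = 0; i < 4; i++) {
        int lo = int( blk.qs[i]       & 0x0F0F0F0F); // 4 low nibbles
        int hi = int((blk.qs[i] >> 4) & 0x0F0F0F0F); // 4 high nibbles
        acc = dotPacked4x8AccSatEXT(lo, q8_quants[2 * i    ], acc);
        acc = dotPacked4x8AccSatEXT(hi, q8_quants[2 * i + 1], acc);
    }
    return acc;
}
```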

TODO:

  • Q2_K
  • Q3_K
  • Q4_K
  • Q5_K
  • Q6_K

Q2_K performance is not that good yet. Mapping the 256-wide quant structure to 32-wide Q8_1 structures is not that easy to do efficiently, so I'm still trying to find the best way to do that. @jeffbolznv Let me know if you see any obvious issues with the implementation.
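For reference, a sketch of the index math involved, based on the ggml block layouts (QK_K = 256; Q2_K stores one 4-bit scale and 4-bit min per 16 values); the helper and its names are hypothetical:

```glsl
// Hypothetical index math for mapping one 256-wide Q2_K block onto the
// 32-wide Q8_1 blocks of the other operand. Each Q2_K block covers 8
// consecutive Q8_1 blocks, and each Q8_1 block spans 2 of the 16-value
// Q2_K scale/min groups.

const uint QK_K  = 256; // values per Q2_K block
const uint QK8_1 = 32;  // values per Q8_1 block

// j = index of a Q8_1 block along the k dimension
void map_q8_1_block(uint j, out uint q2k_block, out uint sub, out uint sc0) {
    q2k_block = j / (QK_K / QK8_1); // which Q2_K block (j / 8)
    sub       = j % (QK_K / QK8_1); // which 32-value slice inside it, 0..7
    sc0       = sub * 2;            // first of its two 16-value scale groups
}
```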

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 12, 2025
@jeffbolznv (Collaborator) commented

Interesting. How is the performance for the legacy quants?

Having the values decoded to 8-bit in shared memory would allow using int8 coopmat, so this change seems to prevent that. But if using coopmat for this isn't planned, then I guess that's fine.
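For illustration, a rough sketch of what the int8 coopmat path would need, under GL_KHR_cooperative_matrix with illustrative tile dimensions and buffer names: coopMatLoad consumes a plain array of element-type values, so the quants would have to sit in shared memory already decoded to 8-bit, which is exactly the step this PR moves into registers.

```glsl
#extension GL_KHR_cooperative_matrix : require
#extension GL_KHR_memory_scope_semantics : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

const uint M = 16, N = 16, K = 16; // illustrative coopmat tile sizes

shared int8_t buf_a_8bit[M * K]; // decoded A tile (quantized operand)
shared int8_t buf_b_8bit[K * N]; // decoded B tile (q8_1 operand)

void mma_tile(inout coopmat<int32_t, gl_ScopeSubgroup, M, N,
                            gl_MatrixUseAccumulator> acc) {
    coopmat<int8_t, gl_ScopeSubgroup, M, K, gl_MatrixUseA> mat_a;
    coopmat<int8_t, gl_ScopeSubgroup, K, N, gl_MatrixUseB> mat_b;

    // coopMatLoad reads contiguous 8-bit values from shared memory, so
    // the unpack cannot be deferred to registers on this path.
    coopMatLoad(mat_a, buf_a_8bit, 0, K, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(mat_b, buf_b_8bit, 0, N, gl_CooperativeMatrixLayoutRowMajor);
    acc = coopMatMulAdd(mat_a, mat_b, acc);
}
```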

@0cc4m (Collaborator, Author) commented Oct 12, 2025

> Interesting. How is the performance for the legacy quants?

It's a ~10% improvement on Intel, and a little less on AMD and Nvidia.

> Having the values decoded to 8-bit in shared memory would allow using int8 coopmat, so this change seems to prevent that. But if using coopmat for this isn't planned, then I guess that's fine.

Yeah, I gave that a try when I first created this shader and didn't find a good way to use coopmat. I plan to take another look, but I guess I'd create a separate shader for it. There wasn't a good way to add k-quants to the structure it had.

@0cc4m (Collaborator, Author) commented Oct 15, 2025

@jeffbolznv I'm trying to investigate the low performance for q2_k with Nvidia Nsight Graphics, but it's giving me some weird results:
This is the q2_k shader:
[Nsight Graphics screenshot: q2_k shader profile]
This is the q4_0 shader:
[Nsight Graphics screenshot: q4_0 shader profile]
One difference I can see is shared memory, but I actually requested less shared memory for q2_k than for q4_0, so I don't know what's going on there.
Also, the instruction count is quite a bit larger for q2_k, which may be related to the third-most common stall being NOINST.

Additionally, I get something like 12.81 TFLOPS on a normal run, but 14.90 TFLOPS if I disable FP16. (The test is MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1))

The hotspots otherwise are the integer dot math, the mul_q8_1 function, and the data load from global to shared memory in block_a_to_shmem:
[Nsight Graphics screenshot: shader hotspots]

Any clue what is going on?

@jeffbolznv (Collaborator) commented

> One difference I can see is shared memory, but I actually requested less shared memory for q2_k than for q4_0, so I don't know what's going on there.

This could be register spilling to shared memory. Might be worth trying a smaller tile size to not be so close to the register limit.

What is the relative performance of Q2_K and Q4_0, in the old and new paths?

@0cc4m (Collaborator, Author) commented Oct 15, 2025

From memory, it's something like 10-14 TFLOPS for the scalar float16 path and around 24 TFLOPS for the q4_0 integer dot one.
