-
Notifications
You must be signed in to change notification settings - Fork 13.4k
vulkan: multi-row k quants #10846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: multi-row k quants #10846
Conversation
Please rebase to reduce the number of commits. |
Multiple rows is a bit slower on RTX 4070, so please change to one row for NVIDIA:
I read through the shader changes and they look good to me. |
Intel is being weird again..
PR:
With
It seems to prefer fewer rows on q2_k to q5_k and more rows on q6_k (but performance is bad there either way). I tested this with models for Q4_K_S and Q6_K and it confirms the findings. |
I can also confirm that 1*rm (fewer rows) is better on Nvidia RTX 3090. The PR looks good, it just needs some changes to the selection logic. It's probably not worth complicating for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel. The merge conflict needs to be fixed, too. Edit: Also looks good on AMD RX 6800 XT. |
Considering there's a 50% difference in Q6_K performance for Intel I've added a separate variable for it, along with Q8_0 which is also a special case. If there are other quants that don't work well with certain GPUs we can also add them to the list. BTW have you checked the assembly dump for Intel? I have a feeling that it doesn't like certain memory access patterns and splits those up into a bunch of small loads. Maybe you could try loading each superblock into shared memory first before doing the actual dequantizing. |
Does that mean it works best with two rows per shader? |
It's a big difference, but performance is marginal either way. I would prefer not making it more complex cause it increases the number of parameters we need to hand-tune. Maybe it's time for an optimizer.
No, I don't have that much time to devote to Intel.
I meant the PR got optimal performance on it already. |
This should be it I think: Default: 2 rows for old quants, 1 row for Q8_0 and K quants |
I see a significant drop in performance on Nvidia RTX 3090 in tg for a Q4_K_S 8B model:
This doesn't match with the results from @jeffbolznv. Any theories? Edit: Some more data:
|
I just reran and am seeing a small improvement from rm_kq=2:
(absolute numbers a bit different from last time because I didn't use |
Was there a driver update recently? That's the only thing I can come up with considering how both of you mentioned earlier that 1 row was faster on Nvidia. Anyways I've updated the code to use |
Just noticed this but the tests look strange:
If m, n, and k are what I think they are (m*n A matrix and n*k B matrix) then this isn't a standard matrix vector multiplication anymore but rather the product of a column vector and a row vector that will spit out a 4096*14336 matrix as the output. Since our The reason why I'm asking this is because it should be possible to calculate a reasonably optimal row count for your GPU depending on the matrix size. Like for my RX 470:
If our A matrix has 4096 rows and we're multiplying it against a B vector of size 4096 we'll only generate 4096 threads if |
The multiply is MxK * KxN -> MxN. These shaders assign one workgroup to each result row (really each result element, because N==1), and that workgroup computes a dot product with K components where each invocation in the workgroup does a subset of the dot product and then they all add up the partial sums at the end. So it's 4096 workgroups in this test, which should be enough to fill the machine. |
Welp I guess I was too tired last night! This makes much more sense.
That may be enough for my machine, but on @0cc4m's 3090 it's a different story.
In this case the GPU is only around 40% utilized in the minimum scenario, and that's assuming that For my 470 I get the fastest speeds with |
This branch: [root@a5c8c7a7147c bin]# ./llama-bench -m ~/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99
And master: ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
Seems to be within margin or error, if not a tad slower on this branch. Is there no speedup to be expected on a 7900 XTX? Using the radv vulkan drivers in case that matters. |
That's not unexpected, yes. The shaders already run well on RDNA. Older AMD GPUs (GCN) were struggling, cause they prefer more work in shader invocations due to more overhead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the delay, I completely missed the last commit. Looks good and thank you for the contribution!
fma(FLOAT_TYPE(b112[l]), FLOAT_TYPE(s4_hi4[3]), sum2)))))))); | ||
[[unroll]] for (uint n = 0; n < num_rows; ++n) { | ||
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row; | ||
f16vec2 d = data_a[ib0 + i].d; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@netrunnereve I completely missed this, and, because we included all arithmetic_type extensions at the top, so did glslc, but you can't use float16 variables anywhere, unless the device supports them. I found this now because I saw validation issues about use of the float16 extension on a device that does not support it. It might be better to include only the arithmetic type extensions that are actually used, then this kind of issue would show up during shader compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure let me fix this as part of #11081.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind you're already handling this in #11161.
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row; | ||
f16vec2 d = data_a[ib0 + i].d; | ||
const FLOAT_TYPE dall = d.x; | ||
const FLOAT_TYPE dmin = d.y; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be FLOAT_TYPE(d.x/y)
for both. In other cases you did it correctly.
* multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default
* multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default
* multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default
This allows our k quant mat vec shaders to process multiple rows at a time just like
mul_mat_vec.comp
. It's way faster now and Q4_K_S is catching up to IQ4_NL and Q4_0 on my RX 470.At this point we might want to consider merging the separate k quant files into
mul_mat_vec.comp
as they're reusing quite a bit of code, and maybe do some templating using ifdefs to choose the correct dequantization function. That's better left to another PR though.PR:
Master:
The number of rows used was chosen for my card and may need tuning for different architectures.