cuda : dynamic MMVQ nwarps for narrow matrices #20831
JoursBleu wants to merge 1 commit into ggml-org:master
Conversation
Thanks, JoursBleu. Can confirm this resolves the issue; it also appears to be a slight improvement: originally 40.2 TPS, now 41.14.
According to the llama.cpp AI usage policy:
Your posts very much read like they are machine-generated. Please clarify the extent to which language models were used.

No, I didn't use AI to write the post. But I do write it in Chinese and use Claude Opus to translate it to English.
Thank you for clarifying. The problem from our end as maintainers is that we get a lot of spam, and the only feasible way for us to sift through it is to ban the use of language models altogether (since it is much easier to determine whether language models were used at all vs. whether someone carefully checked the language model outputs).
Force-pushed from 23bae61 to afb6b78
Hi @IMbackK @JohannesGaessler, would you please review when you have time? Thanks!
I have not forgotten about this PR, but there happen to be other, concurrent PRs that touch the same code. One of them in particular is relevant for getting good scaling with tensor parallelism, so I'm prioritizing it over this PR.
Can you rebase on top of master? I'll then review this PR so that it can be merged.
Force-pushed from afb6b78 to 71a7fb1
@JohannesGaessler rebased
JohannesGaessler left a comment
Please change the name of the function to calc_max_nwarps to better reflect how it is being used. The PR seems to be correct based on static code analysis but I will need to check it for regressions.
Force-pushed from a19a90b to 8cd2a84
Force-pushed from 8cd2a84 to 40dc8a1
Can confirm the updated version still works: without this patch, 36.66 TPS vs 41.22 with it.
Hi @JohannesGaessler,
Fix MMVQ TG regression on MoE models from #19478.

#19478 increased nwarps to 8 on RDNA3/RDNA4 to better utilize memory bandwidth for bs=1 decode. However, nwarps=8 assumes wide weight matrices. MoE expert FFN layers are narrow (512–2048 cols), so most warps have no work but still pay the __syncthreads() and shared-memory reduction overhead, causing a net TG regression. This patch dynamically clamps nwarps based on the actual matrix width to avoid this.

R9700 (gfx1201, RDNA4), ROCm 7.2
MoE (regression fixed):
Dense (no regression):
Full whitelist sweep — llama-2-7b, 1x R9700, tg512, r=5:
W7900 (gfx1100, RDNA3), ROCm 7.1
MoE (regression fixed):
Full whitelist sweep — llama-2-7b, 1x W7900, tg512, r=5:
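The clamping idea described above can be sketched roughly as follows. This is a minimal host-side C++ sketch: the function name calc_max_nwarps comes from the review request in this thread, but the halving strategy and the min_cols_per_warp constant are illustrative assumptions, not the PR's exact code.

```cpp
// Hedged sketch of dynamically clamping nwarps to the weight-matrix width,
// so narrow MoE expert FFN layers don't pay __syncthreads() and
// shared-memory reduction costs for warps that have no work.
static int calc_max_nwarps(const int ncols, const int nwarps_max) {
    // Assumed minimum useful width per warp; 256 is a placeholder,
    // not the PR's actual threshold.
    constexpr int min_cols_per_warp = 256;
    int nwarps = nwarps_max; // e.g. 8 on RDNA3/RDNA4 after #19478
    // Halve the warp count while the matrix is too narrow to give
    // every warp enough columns of work.
    while (nwarps > 1 && ncols < nwarps * min_cols_per_warp) {
        nwarps /= 2;
    }
    return nwarps;
}
```

Under these assumed numbers, a narrow 512-column expert layer would drop from 8 to 2 warps, while a wide 8192-column dense layer keeps all 8, matching the dense no-regression results above.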
Note: PR description translated with AI assistance.