cuda : dynamic MMVQ nwarps for narrow matrices#20831

Open
JoursBleu wants to merge 1 commit into ggml-org:master from JoursBleu:fix/mmvq-dynamic-nwarps-moe

Conversation

@JoursBleu
Contributor

@JoursBleu JoursBleu commented Mar 21, 2026

Fix MMVQ TG regression on MoE models from #19478.

#19478 increased nwarps to 8 on RDNA3/RDNA4 to better utilize memory bandwidth for bs=1 decode. However, nwarps=8 assumes wide weight matrices. MoE expert FFN layers are narrow (512–2048 cols), so most warps have no work but still pay __syncthreads() and shared-memory reduction overhead, causing a net TG regression. This patch dynamically clamps nwarps based on the actual matrix width to avoid this.
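
The clamping idea can be sketched as a small host-side helper (illustrative names and thresholds only; the actual granularity per warp in the patch may differ):

```cpp
#include <algorithm>

// Sketch of width-based nwarps clamping (illustrative thresholds, not
// the exact values from this PR). MMVQ splits the dot-product columns
// across warps and reduces partial sums via shared memory; for narrow
// expert FFN matrices, extra warps have no columns to process but still
// pay __syncthreads() and reduction cost, so we clamp the warp count.
static int calc_max_nwarps(const int ncols, const int nwarps_max) {
    const int cols_per_warp = 256; // assumed per-warp column granularity
    const int nwarps = std::max(1, ncols / cols_per_warp);
    return std::min(nwarps, nwarps_max); // nwarps_max = 8 on RDNA3/RDNA4
}
```

With `nwarps_max = 8`, a 512-column expert FFN would launch 2 warps instead of 8, while wide dense weight matrices still get the full 8.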

R9700 (gfx1201, RDNA4), ROCm 7.2

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
|---|---|---|---|---|
| Qwen3.5-122B-A10B Q4_K_M | 4 | 38.72 | 34.74 (-10.3%) | 39.31 |
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.92 | 74.14 (-3.6%) | 84.62 (+10.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 74.97 | 75.32 (+0.5%) | 81.83 (+9.1%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 75.60 | 77.86 (+3.0%) | 82.67 (+9.4%) |
| Qwen3.5-35B-A3B Q8_0 | 2 | 65.39 | 67.37 (+3.0%) | 71.45 (+9.3%) |

Dense (no regression):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
|---|---|---|---|---|
| Qwen2.5-72B Q4_K_M | 4 | 10.42 | 10.63 | 10.54 |

Full whitelist sweep — llama-2-7b, 1x R9700, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
|---|---|---|---|
| Q4_0 | 92.67 | 101.09 | +9.1% |
| Q4_1 | 88.59 | 95.45 | +7.7% |
| Q5_0 | 81.52 | 87.77 | +7.7% |
| Q5_1 | 78.08 | 83.99 | +7.6% |
| Q8_0 | 59.32 | 63.07 | +6.3% |
| Q2_K | 93.39 | 94.43 | +1.1% |
| Q3_K | 91.63 | 91.67 | +0.0% |
| Q4_K | 91.26 | 95.93 | +5.1% |
| Q5_K | 81.65 | 86.79 | +6.3% |
| Q6_K | 72.37 | 75.25 | +4.0% |
| IQ2_XXS | 89.06 | 89.61 | +0.6% |
| IQ2_XS | 86.99 | 87.34 | +0.4% |
| IQ2_S | 84.29 | 84.14 | -0.2% |
| IQ3_XXS | 82.35 | 82.11 | -0.3% |
| IQ3_S | 80.89 | 81.02 | +0.2% |
| IQ4_NL | 93.72 | 99.02 | +5.7% |
| IQ4_XS | 94.49 | 104.83 | +10.9% |

W7900 (gfx1100, RDNA3), ROCm 7.1

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
|---|---|---|---|---|
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.33 | 69.68 (-8.7%) | 77.10 (+1.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 70.53 | 69.30 (-1.7%) | 73.05 (+3.6%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 72.31 | 70.12 (-3.0%) | 73.28 (+1.3%) |
| Qwen3.5-35B-A3B Q8_0 | 1 | 69.98 | 66.89 (-4.4%) | 72.67 (+3.8%) |

Full whitelist sweep — llama-2-7b, 1x W7900, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
|---|---|---|---|
| Q4_0 | 98.49 | 98.72 | +0.2% |
| Q5_0 | 85.76 | 86.47 | +0.8% |
| Q8_0 | 66.79 | 68.02 | +1.8% |
| Q2_K | 92.12 | 92.44 | +0.3% |
| Q3_K_S | 88.57 | 88.86 | +0.3% |
| Q4_K_S | 82.09 | 84.79 | +3.3% |
| Q5_K_S | 77.39 | 77.11 | -0.4% |
| Q6_K | 73.71 | 76.33 | +3.6% |

Note: PR description translated with AI assistance.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 21, 2026
@meven3000

Thanks, JoursBleu.

Can confirm this resolves the issue; it also appears to be a slight improvement: 40.2 TPS originally, 41.14 now.

@JohannesGaessler
Contributor

According to the llama.cpp AI usage policy:

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

Your posts very much read like they are machine-generated. Please clarify the extent to which language models were used.

@JoursBleu
Contributor Author

> According to the llama.cpp AI usage policy:
>
> > It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).
>
> Your posts very much read like they are machine-generated. Please clarify the extent to which language models were used.

No, I didn't use AI to write the post. But I did write it in Chinese and use Claude Opus to translate it to English.
The model modified some wording, but it didn't change my original meaning, so I don't think it violated the policy...

@JoursBleu JoursBleu marked this pull request as ready for review March 22, 2026 09:37
@JoursBleu JoursBleu requested a review from a team as a code owner March 22, 2026 09:37
@JohannesGaessler
Contributor

Thank you for clarifying. The problem from our end as maintainers is that we get a lot of spam and the only feasible way for us to sift through it is to ban the use of language models altogether (since it is much easier to determine whether language models were used at all vs. whether someone carefully checked the language model outputs).

@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch from 23bae61 to afb6b78 Compare March 23, 2026 09:00
@JoursBleu
Contributor Author

Hi @IMbackK @JohannesGaessler, would you please review when you have time? Thanks!

@JohannesGaessler
Contributor

I have not forgotten about this PR, but there happen to be other, concurrent PRs that touch the same code. One of them in particular is relevant for getting good scaling with tensor parallelism, so I'm prioritizing it over this PR.

@JohannesGaessler
Contributor

Can you rebase on top of master? I'll then review this PR so that it can be merged.

@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch from afb6b78 to 71a7fb1 Compare April 1, 2026 10:58
@JoursBleu
Contributor Author

@JohannesGaessler rebased

Contributor

@JohannesGaessler JohannesGaessler left a comment


Please change the name of the function to calc_max_nwarps to better reflect how it is being used. The PR seems to be correct based on static code analysis but I will need to check it for regressions.

@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch 2 times, most recently from a19a90b to 8cd2a84 Compare April 2, 2026 01:23
@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch from 8cd2a84 to 40dc8a1 Compare April 2, 2026 02:32
@meven3000

Can confirm the updated version still works, without this 36.66 TPS vs 41.22.

@JoursBleu
Contributor Author

Hi @JohannesGaessler,
Done:

  • Renamed the function;
  • Limited the effect to RDNA3/RDNA4 GPUs.
