cuda : dynamic MMVQ nwarps for narrow matrices#20831

Open
JoursBleu wants to merge 1 commit into ggml-org:master from JoursBleu:fix/mmvq-dynamic-nwarps-moe

Conversation

@JoursBleu
Contributor

@JoursBleu JoursBleu commented Mar 21, 2026

Fix MMVQ TG regression on MoE models from #19478.

#19478 increased nwarps to 8 on RDNA3/RDNA4 to better utilize memory bandwidth for bs=1 decode. However, nwarps=8 assumes wide weight matrices. MoE expert FFN layers are narrow (512–2048 cols), so most warps have no work but still pay __syncthreads() and shared-memory reduction overhead, causing a net TG regression. This patch dynamically clamps nwarps based on the actual matrix width to avoid this.
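
The clamping idea can be sketched as a small host-side helper (illustrative names and thresholds only; the actual granularity per warp in the patch may differ):

```cpp
#include <algorithm>

// Sketch of width-based nwarps clamping (illustrative thresholds, not
// the exact values from this PR). MMVQ splits the dot-product columns
// across warps and reduces partial sums via shared memory; for narrow
// expert FFN matrices, extra warps have no columns to process but still
// pay __syncthreads() and reduction cost, so we clamp the warp count.
static int calc_max_nwarps(const int ncols, const int nwarps_max) {
    const int cols_per_warp = 256; // assumed per-warp column granularity
    const int nwarps = std::max(1, ncols / cols_per_warp);
    return std::min(nwarps, nwarps_max); // nwarps_max = 8 on RDNA3/RDNA4
}
```

With `nwarps_max = 8`, a 512-column expert FFN would launch 2 warps instead of 8, while wide dense weight matrices still get the full 8.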

R9700 (gfx1201, RDNA4), ROCm 7.2

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
|---|---|---|---|---|
| Qwen3.5-122B-A10B Q4_K_M | 4 | 38.72 | 34.74 (-10.3%) | 39.31 |
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.92 | 74.14 (-3.6%) | 84.62 (+10.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 74.97 | 75.32 (+0.5%) | 81.83 (+9.1%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 75.60 | 77.86 (+3.0%) | 82.67 (+9.4%) |
| Qwen3.5-35B-A3B Q8_0 | 2 | 65.39 | 67.37 (+3.0%) | 71.45 (+9.3%) |

Dense (no regression):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
|---|---|---|---|---|
| Qwen2.5-72B Q4_K_M | 4 | 10.42 | 10.63 | 10.54 |

Full whitelist sweep — llama-2-7b, 1x R9700, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
|---|---|---|---|
| Q4_0 | 92.67 | 101.09 | +9.1% |
| Q4_1 | 88.59 | 95.45 | +7.7% |
| Q5_0 | 81.52 | 87.77 | +7.7% |
| Q5_1 | 78.08 | 83.99 | +7.6% |
| Q8_0 | 59.32 | 63.07 | +6.3% |
| Q2_K | 93.39 | 94.43 | +1.1% |
| Q3_K | 91.63 | 91.67 | +0.0% |
| Q4_K | 91.26 | 95.93 | +5.1% |
| Q5_K | 81.65 | 86.79 | +6.3% |
| Q6_K | 72.37 | 75.25 | +4.0% |
| IQ2_XXS | 89.06 | 89.61 | +0.6% |
| IQ2_XS | 86.99 | 87.34 | +0.4% |
| IQ2_S | 84.29 | 84.14 | -0.2% |
| IQ3_XXS | 82.35 | 82.11 | -0.3% |
| IQ3_S | 80.89 | 81.02 | +0.2% |
| IQ4_NL | 93.72 | 99.02 | +5.7% |
| IQ4_XS | 94.49 | 104.83 | +10.9% |

W7900 (gfx1100, RDNA3), ROCm 7.1

MoE (regression fixed):

| Model | GPUs | base (t/s) | #19478 (t/s) | this PR (t/s) |
|---|---|---|---|---|
| Qwen3.5-35B-A3B Q4_K_M | 1 | 76.33 | 69.68 (-8.7%) | 77.10 (+1.0%) |
| Qwen3.5-35B-A3B Q5_K_M | 1 | 70.53 | 69.30 (-1.7%) | 73.05 (+3.6%) |
| Qwen3.5-35B-A3B Q6_K | 1 | 72.31 | 70.12 (-3.0%) | 73.28 (+1.3%) |
| Qwen3.5-35B-A3B Q8_0 | 1 | 69.98 | 66.89 (-4.4%) | 72.67 (+3.8%) |

Full whitelist sweep — llama-2-7b, 1x W7900, tg512, r=5:

| Quant | base (t/s) | this PR (t/s) | Change |
|---|---|---|---|
| Q4_0 | 98.49 | 98.72 | +0.2% |
| Q5_0 | 85.76 | 86.47 | +0.8% |
| Q8_0 | 66.79 | 68.02 | +1.8% |
| Q2_K | 92.12 | 92.44 | +0.3% |
| Q3_K_S | 88.57 | 88.86 | +0.3% |
| Q4_K_S | 82.09 | 84.79 | +3.3% |
| Q5_K_S | 77.39 | 77.11 | -0.4% |
| Q6_K | 73.71 | 76.33 | +3.6% |

Note: PR description translated with AI assistance.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 21, 2026
@meven3000

Thanks, JoursBleu.

Can confirm this resolves the issue; it also appears to be a slight improvement: 40.2 TPS originally, 41.14 now.

@JohannesGaessler
Contributor

According to the llama.cpp AI usage policy:

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

Your posts very much read like they are machine-generated. Please clarify the extent to which language models were used.

@JoursBleu
Contributor Author

> According to the llama.cpp AI usage policy:
>
> > It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).
>
> Your posts very much read like they are machine-generated. Please clarify the extent to which language models were used.

No, I didn't use AI to write the post. But I did write it in Chinese and use Claude Opus to translate it to English.
The model modified some wording, but it didn't change my original meaning, so I don't think it violated the policy...

@JoursBleu JoursBleu marked this pull request as ready for review March 22, 2026 09:37
@JoursBleu JoursBleu requested a review from a team as a code owner March 22, 2026 09:37
@JohannesGaessler
Contributor

Thank you for clarifying. The problem from our end as maintainers is that we get a lot of spam and the only feasible way for us to sift through it is to ban the use of language models altogether (since it is much easier to determine whether language models were used at all vs. whether someone carefully checked the language model outputs).

@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch from 23bae61 to afb6b78 Compare March 23, 2026 09:00
@JoursBleu
Contributor Author

Hi @IMbackK @JohannesGaessler, would you please review when you have time? Thanks!

@JohannesGaessler
Contributor

I have not forgotten about this PR, but there happen to be other, concurrent PRs that touch the same code. One of them in particular is relevant for getting good scaling with tensor parallelism, so I'm prioritizing it over this PR.

@JohannesGaessler
Contributor

Can you rebase on top of master? I'll then review this PR so that it can be merged.

@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch from afb6b78 to 71a7fb1 Compare April 1, 2026 10:58
@JoursBleu
Contributor Author

@JohannesGaessler rebased

Contributor

@JohannesGaessler JohannesGaessler left a comment


Please change the name of the function to calc_max_nwarps to better reflect how it is being used. The PR seems to be correct based on static code analysis but I will need to check it for regressions.

@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch 2 times, most recently from a19a90b to 8cd2a84 Compare April 2, 2026 01:23
@JoursBleu JoursBleu force-pushed the fix/mmvq-dynamic-nwarps-moe branch from 8cd2a84 to 40dc8a1 Compare April 2, 2026 02:32
@meven3000

Can confirm the updated version still works, without this 36.66 TPS vs 41.22.

@JoursBleu
Contributor Author

Hi @JohannesGaessler,
Done:

  • Renamed the function;
  • Limited the effect to RDNA3/RDNA4 GPUs.
