
Conversation

@jeffbolznv
Collaborator

I've been seeing significantly worse token generation (tg) performance with flash attention enabled vs. disabled, and it seems to be related to the submit heuristic. Change the heuristic to instead track how many bytes of weight matrices have been used and flush roughly every 100 MB. This seems to resolve the issue, and also increases perf for non-FA a bit.
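
For illustration, here is a minimal sketch of the idea behind the new heuristic. The names and structure are hypothetical, not the actual ggml-vulkan code: the point is simply to accumulate the weight-matrix bytes consumed by each graph node and end/submit the command buffer once roughly 100 MB worth of weights has been recorded.

```cpp
// Hedged sketch of a byte-based submit heuristic (illustrative only).
#include <cstdint>

static constexpr uint64_t SUBMIT_THRESHOLD_BYTES = 100ull * 1024 * 1024; // ~100 MB

struct submit_tracker {
    uint64_t bytes_since_submit = 0;

    // Called once per graph node with the size of the weight matrix it reads
    // (0 for nodes that are not matrix multiplications).
    bool should_submit(uint64_t weight_bytes) {
        bytes_since_submit += weight_bytes;
        if (bytes_since_submit >= SUBMIT_THRESHOLD_BYTES) {
            bytes_since_submit = 0;
            return true; // caller ends and submits the current command buffer
        }
        return false;
    }
};
```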

Perf on RTX 4070:

before:
llama-bench -m  C:\models\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\Moonlight-16B-A3B-Instruct-Q4_K_M.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -m C:\models\Qwen2.5-14B-Instruct-Q4_K_M.gguf -fa 0,1 -p 0 -n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  0 |         tg128 |        100.43 ± 1.66 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |         tg128 |         89.52 ± 1.43 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  0 |         tg128 |         75.03 ± 0.30 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |         tg128 |         68.05 ± 1.27 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  0 |         tg128 |        135.42 ± 1.38 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |         tg128 |        134.46 ± 0.91 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  0 |         tg128 |         37.65 ± 0.18 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |         tg128 |         38.46 ± 0.19 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  0 |         tg128 |        124.24 ± 1.57 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  1 |         tg128 |        122.97 ± 1.18 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  0 |         tg128 |        122.29 ± 1.52 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |         tg128 |        118.14 ± 1.36 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  0 |         tg128 |         40.62 ± 0.29 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |         tg128 |         41.78 ± 0.21 |

after:
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  0 |         tg128 |        103.18 ± 0.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |         tg128 |        103.59 ± 0.72 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  0 |         tg128 |         76.15 ± 0.72 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |         tg128 |         77.32 ± 0.93 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  0 |         tg128 |        140.36 ± 0.49 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |         tg128 |        140.17 ± 0.25 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  0 |         tg128 |         38.88 ± 0.24 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |         tg128 |         39.15 ± 0.05 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  0 |         tg128 |        122.28 ± 0.55 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  1 |         tg128 |        122.18 ± 0.28 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  0 |         tg128 |        124.44 ± 1.25 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |         tg128 |        124.19 ± 0.96 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  0 |         tg128 |         41.70 ± 0.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |         tg128 |         42.09 ± 0.18 |

@jeffbolznv jeffbolznv requested a review from 0cc4m March 16, 2025 04:33
@github-actions github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 16, 2025
@jeffbolznv jeffbolznv force-pushed the matmul_bytes_submit_heuristic branch from c065fc5 to 8bd64be Compare March 17, 2025 02:05
@0cc4m
Collaborator

0cc4m commented Mar 17, 2025

Interesting. I tested this and can reproduce the uplift you report, but also a number of regressions in non-FA cases, especially with smaller models. I'm not sure whether this is an actual problem or just a difference in an extreme case (big GPU, small model) that is unlikely to come up in practice.

RTX 3090:

| model                  |       size |  params | backend | ngl | fa |  test |    t/s Master |        t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | -: | ----: | ------------: | ------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |  98.16 ± 0.29 | 103.04 ± 0.81 |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |  83.99 ± 0.04 |  99.64 ± 1.78 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |  93.40 ± 0.17 |  93.09 ± 0.36 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |  81.45 ± 0.13 |  94.18 ± 0.08 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |  73.90 ± 1.45 |  74.09 ± 0.02 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |  66.83 ± 0.03 |  75.00 ± 0.04 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 192.84 ± 1.38 | 183.44 ± 1.87 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 157.89 ± 0.24 | 196.36 ± 0.44 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 231.79 ± 6.12 | 209.04 ± 2.52 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 186.65 ± 0.44 | 230.23 ± 0.41 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 275.86 ± 0.51 | 215.98 ± 3.09 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 192.67 ± 1.09 | 239.30 ± 0.74 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 290.27 ± 1.26 | 253.95 ± 23.56 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 198.72 ± 1.85 | 258.13 ± 0.92 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 280.01 ± 0.52 | 228.08 ± 10.11 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 195.23 ± 1.44 | 252.30 ± 0.87 |

AMD Radeon Pro VII:

| model                  |       size |  params | backend | ngl |  test |    t/s Master |        t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | ----: | ------------: | ------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 |  61.10 ± 0.39 |  61.21 ± 0.22 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 |  63.13 ± 0.57 |  63.55 ± 0.28 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 | tg128 |  43.50 ± 0.10 |  43.50 ± 0.07 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 | tg128 | 134.36 ± 4.29 | 142.48 ± 0.31 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 | tg128 | 212.14 ± 0.59 | 203.89 ± 0.58 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 | tg128 | 205.39 ± 0.34 | 199.09 ± 0.21 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 | tg128 | 203.12 ± 0.29 | 203.58 ± 0.47 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 | tg128 | 196.22 ± 0.46 | 195.63 ± 0.46 |

Intel A770:

| model                  |       size |  params | backend | ngl |  test |    t/s Master |        t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | ----: | ------------: | ------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 |  36.21 ± 0.05 |  36.39 ± 0.10 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 |  20.07 ± 0.06 |  20.29 ± 0.01 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 | tg128 |  10.69 ± 0.01 |  10.69 ± 0.00 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 | tg128 |  85.02 ± 0.04 |  85.73 ± 0.31 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 | tg128 | 119.54 ± 0.10 | 107.88 ± 0.03 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 | tg128 |  96.97 ± 0.10 |  95.12 ± 0.12 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 | tg128 |  86.62 ± 0.03 |  86.84 ± 0.10 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 | tg128 |  62.76 ± 0.02 |  62.92 ± 0.07 |

@jeffbolznv
Collaborator Author

Interesting, maybe I need to scale down the threshold for smaller models. I'll poke around at it and get back to you.

Commit message of the force-pushed change: I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
@jeffbolznv jeffbolznv force-pushed the matmul_bytes_submit_heuristic branch from 8bd64be to 656c97f Compare March 18, 2025 02:36
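
For context, a hedged sketch of the "ramp up after the first few submits" idea mentioned in the commit message above. The names and the initial budget are illustrative assumptions, not the actual ggml-vulkan code: start with a small per-submit byte budget so the GPU receives work early in the graph, then grow the budget on each submit until the steady-state ~100 MB threshold is reached.

```cpp
// Illustrative sketch only: ramping byte budget for command-buffer submits.
#include <cstdint>
#include <algorithm>

struct ramping_submit_tracker {
    uint64_t threshold          = 16ull  * 1024 * 1024; // small initial budget (assumed value)
    uint64_t max_threshold      = 100ull * 1024 * 1024; // steady-state budget (~100 MB)
    uint64_t bytes_since_submit = 0;

    bool should_submit(uint64_t weight_bytes) {
        bytes_since_submit += weight_bytes;
        if (bytes_since_submit >= threshold) {
            bytes_since_submit = 0;
            // Ramp the budget up after each of the first few submits.
            threshold = std::min(threshold * 2, max_threshold);
            return true;
        }
        return false;
    }
};
```

Starting small keeps CPU recording and GPU execution overlapped at the beginning of the graph, which matters most for small models, while the larger steady-state budget avoids excessive submit overhead on large models.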
@jeffbolznv
Collaborator Author

I think I've found a good scale factor, @0cc4m please try again.

@0cc4m
Collaborator

0cc4m commented Mar 18, 2025

| model                  |       size |  params | backend | ngl | fa |  test |     t/s Master |         t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | -: | ----: | -------------: | -------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |   99.94 ± 1.29 |   98.34 ± 0.20 |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |   85.37 ± 0.09 |   99.29 ± 0.07 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |   93.56 ± 3.02 |   93.59 ± 0.33 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |   83.33 ± 0.12 |   94.80 ± 0.02 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |   75.05 ± 0.92 |   74.34 ± 0.11 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |   67.65 ± 0.07 |   75.25 ± 0.04 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  192.94 ± 1.40 |  185.22 ± 1.85 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  158.86 ± 0.11 |  198.56 ± 0.28 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  234.37 ± 7.87 |  220.30 ± 6.70 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  187.78 ± 0.28 |  243.37 ± 0.41 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 243.43 ± 12.67 | 228.20 ± 10.67 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  192.51 ± 0.40 |  251.42 ± 0.64 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  288.11 ± 0.97 |  286.46 ± 0.89 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  201.18 ± 1.71 |  264.75 ± 5.85 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  280.03 ± 1.76 | 239.34 ± 22.91 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  196.63 ± 1.32 |  255.72 ± 0.82 |

It's a little better, yeah.

@jeffbolznv
Collaborator Author

This is what I had measured for small models on 3090:

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
| model                  |       size |  params | backend | ngl | fa |  test |  t/s (master) | t/s (8bd64be) | t/s (656c97f) |
| ---------------------- | ---------: | ------: | ------- | --: | -: | ----: | ------------: | ------------: | ------------: |
| llama 1B Q2_K - Medium | 459.11 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 237.82 ± 4.73 |        217.84 |        232.07 |
| llama 1B Q2_K - Medium | 459.11 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 193.39 ± 13.20 |       239.67 |        250.41 |
| llama 1B Q3_K - Small  | 475.51 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 241.91 ± 2.07 |        223.60 |        239.38 |
| llama 1B Q3_K - Small  | 475.51 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 198.74 ± 0.76 |        239.96 |        243.39 |
| llama 1B Q4_0          | 606.53 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 284.21 ± 3.17 |        278.67 |        268.17 |
| llama 1B Q4_0          | 606.53 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 228.91 ± 1.40 |        254.55 |        298.63 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 257.43 ± 1.70 |        242.45 |        246.51 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 208.32 ± 2.09 |        255.69 |        264.43 |

@0cc4m
Collaborator

0cc4m commented Mar 18, 2025

The only thing I can think of is that my system uses an AMD EPYC 7302, which has rather low single-core performance. That could mean the CPU needs more time to record and submit command buffers, exaggerating the difference.

I think it's fine to merge regardless, since large models are working well.

@0cc4m
Collaborator

0cc4m left a comment

Any idea when coopmat2 will be in a release driver?

@jeffbolznv
Collaborator Author

It'll be in the 575 release. I can't comment specifically on when that'll be out, but in general it tends to be a few months between major releases.

@0cc4m 0cc4m merged commit c446b2e into ggml-org:master Mar 19, 2025
47 checks passed