
Conversation

@jeffbolznv
Collaborator

I've been seeing significantly worse token generation (tg) performance with flash attention enabled vs. disabled, and it seems to be related to the submit heuristic. Change the heuristic to instead track how many bytes of weight matrices have been used and flush roughly every 100 MB. This seems to resolve the issue, and also increases perf for non-FA a bit.
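
For illustration, here is a minimal sketch of the idea behind the new heuristic. The names and structure are hypothetical, not the actual ggml-vulkan code: the point is simply to accumulate the weight-matrix bytes consumed by each graph node and end/submit the command buffer once roughly 100 MB worth of weights has been recorded.

```cpp
// Hedged sketch of a byte-based submit heuristic (illustrative only).
#include <cstdint>

static constexpr uint64_t SUBMIT_THRESHOLD_BYTES = 100ull * 1024 * 1024; // ~100 MB

struct submit_tracker {
    uint64_t bytes_since_submit = 0;

    // Called once per graph node with the size of the weight matrix it reads
    // (0 for nodes that are not matrix multiplications).
    bool should_submit(uint64_t weight_bytes) {
        bytes_since_submit += weight_bytes;
        if (bytes_since_submit >= SUBMIT_THRESHOLD_BYTES) {
            bytes_since_submit = 0;
            return true; // caller ends and submits the current command buffer
        }
        return false;
    }
};
```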

Perf on RTX 4070:

before:
llama-bench -m  C:\models\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\Moonlight-16B-A3B-Instruct-Q4_K_M.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -m C:\models\Qwen2.5-14B-Instruct-Q4_K_M.gguf -fa 0,1 -p 0 -n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  0 |         tg128 |        100.43 ± 1.66 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |         tg128 |         89.52 ± 1.43 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  0 |         tg128 |         75.03 ± 0.30 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |         tg128 |         68.05 ± 1.27 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  0 |         tg128 |        135.42 ± 1.38 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |         tg128 |        134.46 ± 0.91 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  0 |         tg128 |         37.65 ± 0.18 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |         tg128 |         38.46 ± 0.19 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  0 |         tg128 |        124.24 ± 1.57 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  1 |         tg128 |        122.97 ± 1.18 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  0 |         tg128 |        122.29 ± 1.52 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |         tg128 |        118.14 ± 1.36 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  0 |         tg128 |         40.62 ± 0.29 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |         tg128 |         41.78 ± 0.21 |

after:
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  0 |         tg128 |        103.18 ± 0.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |         tg128 |        103.59 ± 0.72 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  0 |         tg128 |         76.15 ± 0.72 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |         tg128 |         77.32 ± 0.93 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  0 |         tg128 |        140.36 ± 0.49 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |         tg128 |        140.17 ± 0.25 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  0 |         tg128 |         38.88 ± 0.24 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |         tg128 |         39.15 ± 0.05 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  0 |         tg128 |        122.28 ± 0.55 |
| deepseek2 16B Q4_K - Medium    |   9.81 GiB |    15.96 B | Vulkan     |  99 |  1 |         tg128 |        122.18 ± 0.28 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  0 |         tg128 |        124.44 ± 1.25 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |         tg128 |        124.19 ± 0.96 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  0 |         tg128 |         41.70 ± 0.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |         tg128 |         42.09 ± 0.18 |

@jeffbolznv jeffbolznv requested a review from 0cc4m March 16, 2025 04:33
@github-actions github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 16, 2025
@jeffbolznv jeffbolznv force-pushed the matmul_bytes_submit_heuristic branch from c065fc5 to 8bd64be Compare March 17, 2025 02:05
@0cc4m
Collaborator

0cc4m commented Mar 17, 2025

Interesting. I tested this and can reproduce the uplift you report, but also a number of regressions in non-FA cases, especially with smaller models. I'm not sure whether this is an actual problem or just a difference in an extreme case (big GPU, small model) that is unlikely to come up in practice.

RTX 3090:

| model                  |       size |  params | backend | ngl | fa |  test |    t/s Master |        t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | -: | ----: | ------------: | ------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |  98.16 ± 0.29 | 103.04 ± 0.81 |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |  83.99 ± 0.04 |  99.64 ± 1.78 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |  93.40 ± 0.17 |  93.09 ± 0.36 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |  81.45 ± 0.13 |  94.18 ± 0.08 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |  73.90 ± 1.45 |  74.09 ± 0.02 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |  66.83 ± 0.03 |  75.00 ± 0.04 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 192.84 ± 1.38 | 183.44 ± 1.87 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 157.89 ± 0.24 | 196.36 ± 0.44 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 231.79 ± 6.12 | 209.04 ± 2.52 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 186.65 ± 0.44 | 230.23 ± 0.41 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 275.86 ± 0.51 | 215.98 ± 3.09 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 192.67 ± 1.09 | 239.30 ± 0.74 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 290.27 ± 1.26 | 253.95 ± 23.56 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 198.72 ± 1.85 | 258.13 ± 0.92 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 280.01 ± 0.52 | 228.08 ± 10.11 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 195.23 ± 1.44 | 252.30 ± 0.87 |

AMD Radeon Pro VII:

| model                  |       size |  params | backend | ngl |  test |    t/s Master |        t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | ----: | ------------: | ------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 |  61.10 ± 0.39 |  61.21 ± 0.22 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 |  63.13 ± 0.57 |  63.55 ± 0.28 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 | tg128 |  43.50 ± 0.10 |  43.50 ± 0.07 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 | tg128 | 134.36 ± 4.29 | 142.48 ± 0.31 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 | tg128 | 212.14 ± 0.59 | 203.89 ± 0.58 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 | tg128 | 205.39 ± 0.34 | 199.09 ± 0.21 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 | tg128 | 203.12 ± 0.29 | 203.58 ± 0.47 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 | tg128 | 196.22 ± 0.46 | 195.63 ± 0.46 |

Intel A770:

| model                  |       size |  params | backend | ngl |  test |    t/s Master |        t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | ----: | ------------: | ------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 |  36.21 ± 0.05 |  36.39 ± 0.10 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 |  20.07 ± 0.06 |  20.29 ± 0.01 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 | tg128 |  10.69 ± 0.01 |  10.69 ± 0.00 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 | tg128 |  85.02 ± 0.04 |  85.73 ± 0.31 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 | tg128 | 119.54 ± 0.10 | 107.88 ± 0.03 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 | tg128 |  96.97 ± 0.10 |  95.12 ± 0.12 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 | tg128 |  86.62 ± 0.03 |  86.84 ± 0.10 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 | tg128 |  62.76 ± 0.02 |  62.92 ± 0.07 |

@jeffbolznv
Collaborator Author

Interesting, maybe I need to scale down the threshold for smaller models. I'll poke around at it and get back to you.

Commit message of the force-pushed change: I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
@jeffbolznv jeffbolznv force-pushed the matmul_bytes_submit_heuristic branch from 8bd64be to 656c97f Compare March 18, 2025 02:36
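
For context, a hedged sketch of the "ramp up after the first few submits" idea mentioned in the commit message above. The names and the initial budget are illustrative assumptions, not the actual ggml-vulkan code: start with a small per-submit byte budget so the GPU receives work early in the graph, then grow the budget on each submit until the steady-state ~100 MB threshold is reached.

```cpp
// Illustrative sketch only: ramping byte budget for command-buffer submits.
#include <cstdint>
#include <algorithm>

struct ramping_submit_tracker {
    uint64_t threshold          = 16ull  * 1024 * 1024; // small initial budget (assumed value)
    uint64_t max_threshold      = 100ull * 1024 * 1024; // steady-state budget (~100 MB)
    uint64_t bytes_since_submit = 0;

    bool should_submit(uint64_t weight_bytes) {
        bytes_since_submit += weight_bytes;
        if (bytes_since_submit >= threshold) {
            bytes_since_submit = 0;
            // Ramp the budget up after each of the first few submits.
            threshold = std::min(threshold * 2, max_threshold);
            return true;
        }
        return false;
    }
};
```

Starting small keeps CPU recording and GPU execution overlapped at the beginning of the graph, which matters most for small models, while the larger steady-state budget avoids excessive submit overhead on large models.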
@jeffbolznv
Collaborator Author

I think I've found a good scale factor, @0cc4m please try again.

@0cc4m
Collaborator

0cc4m commented Mar 18, 2025

| model                  |       size |  params | backend | ngl | fa |  test |     t/s Master |         t/s PR |
| ---------------------- | ---------: | ------: | ------- | --: | -: | ----: | -------------: | -------------: |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |   99.94 ± 1.29 |   98.34 ± 0.20 |
| llama 8B Q4_0          |   5.61 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |   85.37 ± 0.09 |   99.29 ± 0.07 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |   93.56 ± 3.02 |   93.59 ± 0.33 |
| llama 8B Q4_K - Small  |   4.36 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |   83.33 ± 0.12 |   94.80 ± 0.02 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  0 | tg128 |   75.05 ± 0.92 |   74.34 ± 0.11 |
| llama 8B Q8_0          |   7.95 GiB |  8.03 B | Vulkan  |  99 |  1 | tg128 |   67.65 ± 0.07 |   75.25 ± 0.04 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  192.94 ± 1.40 |  185.22 ± 1.85 |
| llama 1B F16           |   2.05 GiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  158.86 ± 0.11 |  198.56 ± 0.28 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  234.37 ± 7.87 |  220.30 ± 6.70 |
| llama 1B Q2_K - Medium | 411.41 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  187.78 ± 0.28 |  243.37 ± 0.41 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 243.43 ± 12.67 | 228.20 ± 10.67 |
| llama 1B Q3_K - Medium | 523.67 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  192.51 ± 0.40 |  251.42 ± 0.64 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  288.11 ± 0.97 |  286.46 ± 0.89 |
| llama 1B Q5_K - Medium | 745.11 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  201.18 ± 1.71 |  264.75 ± 5.85 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 |  280.03 ± 1.76 | 239.34 ± 22.91 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 |  196.63 ± 1.32 |  255.72 ± 0.82 |

It's a little better, yeah.

@jeffbolznv
Collaborator Author

This is what I had measured for small models on 3090:

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
| model                  |       size |  params | backend | ngl | fa |  test |  t/s (master) | t/s (8bd64be) | t/s (656c97f) |
| ---------------------- | ---------: | ------: | ------- | --: | -: | ----: | ------------: | ------------: | ------------: |
| llama 1B Q2_K - Medium | 459.11 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 237.82 ± 4.73 |        217.84 |        232.07 |
| llama 1B Q2_K - Medium | 459.11 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 193.39 ± 13.20 |       239.67 |        250.41 |
| llama 1B Q3_K - Small  | 475.51 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 241.91 ± 2.07 |        223.60 |        239.38 |
| llama 1B Q3_K - Small  | 475.51 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 198.74 ± 0.76 |        239.96 |        243.39 |
| llama 1B Q4_0          | 606.53 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 284.21 ± 3.17 |        278.67 |        268.17 |
| llama 1B Q4_0          | 606.53 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 228.91 ± 1.40 |        254.55 |        298.63 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  0 | tg128 | 257.43 ± 1.70 |        242.45 |        246.51 |
| llama 1B Q6_K          | 860.86 MiB |  1.10 B | Vulkan  |  99 |  1 | tg128 | 208.32 ± 2.09 |        255.69 |        264.43 |

@0cc4m
Collaborator

0cc4m commented Mar 18, 2025

The only thing I can think of is that my system uses an AMD EPYC 7302, which has rather low single-core performance. That could mean the CPU needs more time to record and submit command buffers, exaggerating the difference.

I think it's fine to merge regardless, since large models are working well.

@0cc4m
Collaborator

0cc4m left a comment

Any idea when coopmat2 will be in a release driver?

@jeffbolznv
Collaborator Author

It'll be in the 575 release. I can't comment specifically on when that'll be out, but in general it tends to be a few months between major releases.

@0cc4m 0cc4m merged commit c446b2e into ggml-org:master Mar 19, 2025
47 checks passed