
Conversation

@jeffbolznv (Collaborator)

Do some of the logic ops in packed u32.
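
For readers unfamiliar with the trick, here is a minimal C sketch of what "logic ops in packed u32" means (illustrative only; the function names, masks, and values are hypothetical, not the actual shader code). Q5_K stores 4-bit quants plus a separate high bit per value, so four byte lanes can share one mask/shift/OR:

```c
#include <stdint.h>
#include <stdio.h>

// Scalar reference: combine one 4-bit quant with its high bit.
static uint8_t dequant_scalar(uint8_t q4, uint8_t hbit) {
    return (uint8_t)((q4 & 0x0F) | (hbit << 4));
}

// Packed version: four quants and four high bits per call. The masks keep
// each byte lane independent, so a single 32-bit AND/shift/OR covers four
// values instead of one.
static uint32_t dequant_packed(uint32_t q4x4, uint32_t hbits) {
    return (q4x4 & 0x0F0F0F0Fu) | ((hbits & 0x01010101u) << 4);
}

int main(void) {
    uint32_t q4x4  = 0x0D0C0B0Au; // four example 4-bit quants, one per byte
    uint32_t hbits = 0x01000100u; // matching high bits in bit 0 of each byte
    printf("packed: 0x%08X\n", dequant_packed(q4x4, hbits)); // 0x1D0C1B0A
    printf("lane 1: 0x%02X\n", dequant_scalar(0x0B, 0x01));  // 0x1B, matches
    return 0;
}
```

The win is simply fewer instructions: one 32-bit logic op replaces four byte-wise ones.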

Perf results on RTX 4070. Note that this "phi3 3B Q4_K" model uses Q5_K for maybe a third of its weights.

before
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    98.65 us/run - 117.44 MFLOP/run -   1.19 TFLOPS
| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        108.54 ± 1.25 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |        112.41 ± 2.25 |

after
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  60492 runs -    82.96 us/run - 117.44 MFLOP/run -   1.42 TFLOPS
| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.39 ± 0.47 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |        117.24 ± 1.19 |

@jeffbolznv jeffbolznv requested a review from 0cc4m November 25, 2024 04:04
@daniandtheweb (Contributor)

These changes make quite a big difference on my Radeon 5700XT.

| model                 |     size |   params | backend | ngl | threads |  test | branch |          t/s |
| --------------------- | -------: | -------: | ------- | --: | ------: | ----: | ------ | -----------: |
| qwen2 7B Q5_K - Small | 4.94 GiB |   7.62 B | Vulkan  |  99 |       4 | tg128 | master | 41.07 ± 0.06 |
| qwen2 7B Q5_K - Small | 4.94 GiB |   7.62 B | Vulkan  |  99 |       4 | tg128 | PR     | 49.23 ± 0.42 |

@netrunnereve (Collaborator)

I haven't tried it with an actual model, but our tests show that it's now 6% faster on an RX 570.

Master:
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -  1195.05 us/run - 117.44 MFLOP/run -  98.27 GFLOPS
PR:
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -  1124.40 us/run - 117.44 MFLOP/run - 104.45 GFLOPS

@0cc4m merged commit 249a790 into ggml-org:master Nov 27, 2024
7 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024