@Djip007
Contributor

Djip007 commented Apr 5, 2025

I wanted to do more, but this patch is kept simple.

On CPUs without AVX512 there are only 16 vector registers. The compiler does not reorder the madd ops in sgemm, so the op order is now chosen at build time.
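To make the register pressure concrete, here is a rough counting sketch. It is a simplified model and not the patch's code: it assumes an `RM x RN` tile keeps one accumulator per output element, preloads the operand vectors for the smaller tile side, and streams one vector from the other side. The function names and tile shapes are hypothetical and only for illustration.

```python
def vector_registers_needed(rm: int, rn: int) -> int:
    """Estimate vector registers used by an rm x rn register tile.

    Assumed model (not taken from the patch):
      - rm * rn accumulators stay live for the whole inner loop,
      - min(rm, rn) operand vectors are preloaded once per step,
      - 1 extra register streams the operand from the other side.
    """
    accumulators = rm * rn
    preloaded = min(rm, rn)
    streamed = 1
    return accumulators + preloaded + streamed


def fits(rm: int, rn: int, available: int = 16) -> bool:
    """True if the tile fits in `available` vector registers under this model."""
    return vector_registers_needed(rm, rn) <= available


# AVX2 exposes 16 vector registers, AVX512 exposes 32.
for rm, rn in [(4, 3), (3, 4), (4, 4), (5, 5)]:
    need = vector_registers_needed(rm, rn)
    print(f"{rm}x{rn}: needs {need}, "
          f"fits in 16: {fits(rm, rn, 16)}, fits in 32: {fits(rm, rn, 32)}")
```

Under this model a tile that exactly fills 16 registers leaves no slack, so an op order that keeps extra operand vectors live would force spills to the stack.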

Prompt processing (pp) is +20% to +50% faster on CPUs with only 16 AVX2 registers, for bf16 and fp16 quantisation.

With the current 0.9.2 release:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - BF16                                 |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 22.56 s    | 1040 / 1040      | 57.58      | 3.35       | 18.08 s    |
|  pp4096+tg256 | 1/1        | 155.84 s   | 4352 / 4352      | 53.39      | 3.24       | 77.03 s    |
|  pp2048+tg256 | 1/1        | 113.79 s   | 2304 / 2304      | 56.33      | 3.31       | 36.66 s    |
|  pp2048+tg768 | 1/1        | 269.31 s   | 2816 / 2816      | 56.58      | 3.29       | 36.50 s    |
| pp1024+tg1024 | 1/1        | 326.04 s   | 2048 / 2048      | 57.29      | 3.32       | 18.17 s    |
| pp1280+tg3072 | 1/1        | 960.03 s   | 4352 / 4352      | 57.91      | 3.28       | 22.40 s    |
|  pp384+tg1152 | 1/1        | 351.37 s   | 1536 / 1536      | 60.50      | 3.34       | 6.64 s     |
|   pp64+tg1024 | 1/1        | 307.51 s   | 1088 / 1088      | 55.26      | 3.34       | 1.45 s     |
|   pp16+tg1536 | 1/1        | 459.47 s   | 1552 / 1552      | 31.54      | 3.35       | 803.25 ms  |
+---------------------------------------------------------------------------------------------------+

with this patch:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - BF16                                 |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 17.56 s    | 1040 / 1040      | 80.14      | 3.34       | 13.08 s    |
|  pp4096+tg256 | 1/1        | 135.69 s   | 4352 / 4352      | 72.53      | 3.23       | 56.78 s    |
|  pp2048+tg256 | 1/1        | 103.93 s   | 2304 / 2304      | 77.48      | 3.30       | 26.74 s    |
|  pp2048+tg768 | 1/1        | 260.07 s   | 2816 / 2816      | 77.00      | 3.29       | 26.90 s    |
| pp1024+tg1024 | 1/1        | 321.62 s   | 2048 / 2048      | 80.11      | 3.32       | 13.08 s    |
| pp1280+tg3072 | 1/1        | 957.08 s   | 4352 / 4352      | 80.13      | 3.26       | 16.28 s    |
|  pp384+tg1152 | 1/1        | 349.37 s   | 1536 / 1536      | 86.59      | 3.34       | 4.73 s     |
|   pp64+tg1024 | 1/1        | 306.06 s   | 1088 / 1088      | 72.06      | 3.36       | 1.18 s     |
|   pp16+tg1536 | 1/1        | 459.80 s   | 1552 / 1552      | 31.69      | 3.34       | 800.94 ms  |
+---------------------------------------------------------------------------------------------------+

And even better with FP16.
With the current 0.9.2 release:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - F16                                  |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 20.79 s    | 1040 / 1040      | 64.02      | 3.33       | 16.30 s    |
|  pp4096+tg256 | 1/1        | 149.69 s   | 4352 / 4352      | 58.62      | 3.21       | 70.19 s    |
|  pp2048+tg256 | 1/1        | 110.74 s   | 2304 / 2304      | 62.32      | 3.29       | 33.17 s    |
|  pp2048+tg768 | 1/1        | 267.39 s   | 2816 / 2816      | 62.35      | 3.27       | 33.15 s    |
| pp1024+tg1024 | 1/1        | 325.56 s   | 2048 / 2048      | 64.08      | 3.31       | 16.28 s    |
| pp1280+tg3072 | 1/1        | 964.08 s   | 4352 / 4352      | 63.84      | 3.25       | 20.35 s    |
|  pp384+tg1152 | 1/1        | 353.07 s   | 1536 / 1536      | 66.41      | 3.32       | 6.08 s     |
|   pp64+tg1024 | 1/1        | 306.73 s   | 1088 / 1088      | 58.19      | 3.35       | 1.40 s     |
|   pp16+tg1536 | 1/1        | 460.77 s   | 1552 / 1552      | 30.87      | 3.34       | 814.15 ms  |
+---------------------------------------------------------------------------------------------------+

with this patch:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - F16                                  |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 14.94 s    | 1040 / 1040      | 100.96     | 3.33       | 10.44 s    |
|  pp4096+tg256 | 1/1        | 125.59 s   | 4352 / 4352      | 89.61      | 3.20       | 46.02 s    |
|  pp2048+tg256 | 1/1        | 99.06 s    | 2304 / 2304      | 97.06      | 3.28       | 21.40 s    |
|  pp2048+tg768 | 1/1        | 255.93 s   | 2816 / 2816      | 96.78      | 3.27       | 21.47 s    |
| pp1024+tg1024 | 1/1        | 319.86 s   | 2048 / 2048      | 100.97     | 3.31       | 10.44 s    |
| pp1280+tg3072 | 1/1        | 957.87 s   | 4352 / 4352      | 99.93      | 3.25       | 13.11 s    |
|  pp384+tg1152 | 1/1        | 349.94 s   | 1536 / 1536      | 106.73     | 3.33       | 3.90 s     |
|   pp64+tg1024 | 1/1        | 306.60 s   | 1088 / 1088      | 80.97      | 3.35       | 1.09 s     |
|   pp16+tg1536 | 1/1        | 462.19 s   | 1552 / 1552      | 30.86      | 3.33       | 817.90 ms  |
+---------------------------------------------------------------------------------------------------+
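The relative pp gain per test can be worked out from the four tables above. The snippet below is a small helper sketch: the `pp t/s` values are copied from the tables (0.9.2 release first, this patch second), while the dictionary layout and the function name `gain_percent` are made up for this example.

```python
# (pp t/s with 0.9.2 release, pp t/s with this patch), copied from the tables.
PP_TOKENS_PER_SECOND = {
    "BF16": {
        "pp1024+tg16": (57.58, 80.14),
        "pp4096+tg256": (53.39, 72.53),
        "pp2048+tg256": (56.33, 77.48),
        "pp2048+tg768": (56.58, 77.00),
        "pp1024+tg1024": (57.29, 80.11),
        "pp1280+tg3072": (57.91, 80.13),
        "pp384+tg1152": (60.50, 86.59),
        "pp64+tg1024": (55.26, 72.06),
        "pp16+tg1536": (31.54, 31.69),
    },
    "F16": {
        "pp1024+tg16": (64.02, 100.96),
        "pp4096+tg256": (58.62, 89.61),
        "pp2048+tg256": (62.32, 97.06),
        "pp2048+tg768": (62.35, 96.78),
        "pp1024+tg1024": (64.08, 100.97),
        "pp1280+tg3072": (63.84, 99.93),
        "pp384+tg1152": (66.41, 106.73),
        "pp64+tg1024": (58.19, 80.97),
        "pp16+tg1536": (30.87, 30.86),
    },
}


def gain_percent(before: float, after: float) -> float:
    """Relative throughput change in percent (positive means faster)."""
    return (after / before - 1.0) * 100.0


for dtype, tests in PP_TOKENS_PER_SECOND.items():
    print(dtype)
    for test, (before, after) in tests.items():
        print(f"  {test:>14}: {gain_percent(before, after):+6.1f}%")
```

Each row is a single run (1/1), so small differences are within noise. The pp16+tg1536 row changes by well under 1% in both cases, and tg t/s is flat throughout.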

Note: LocalScore is really nice. 👍

pp have +20% CPU with only 16 AVX2 register.
@Djip007
Contributor Author

Djip007 commented Apr 5, 2025

I uploaded one result with this patch by mistake:
https://www.localscore.ai/result/302

Maybe we need to find a way to report results from a non-official llamafile release ...

@cjpais
Collaborator

cjpais commented Apr 7, 2025

@jart, I'm curious whether you would mind taking a look at this and whether you have any comments. I will test it and merge it if it looks good to you and you have no major comments.

@reneleonhardt

Is this testable now after 0.9.3 has been released?
