reduce pressure on CPU register #737

Djip007 · 2025-04-05T22:00:52Z

I wanted to do more, but this patch is simple.

On CPUs without AVX512 there are only 16 vector registers. The compiler does not reorder the madd ops in sgemm, so I chose the op order at build time.

Prompt processing (pp) is 20% to 50% faster on a CPU with only 16 AVX2 registers, for bf16 and fp16 weights.
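To illustrate the idea (this is a minimal sketch, not the code of this patch): for an `RM x RN` tile, the smaller operand side can be loaded once per k-step and the other side streamed one value at a time, with the choice made by `if constexpr`. The names `tile_dot` and `live_registers` are hypothetical, `float` stands in for one SIMD vector, and the register count is a simplified model that ignores what the compiler actually allocates.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model (assumption): accumulators + hoisted smaller side + 1 streamed value.
constexpr int live_registers(int rm, int rn) {
    return rm * rn + (rm <= rn ? rm : rn) + 1;
}
static_assert(live_registers(3, 4) == 16, "a 3x4 tile just fits in 16 registers");

// Computes C[j][i] = dot(row i of A, row j of B) over k elements.
// A holds RM rows with stride lda, B holds RN rows with stride ldb.
template <int RM, int RN>
void tile_dot(const float* A, std::size_t lda,
              const float* B, std::size_t ldb,
              std::size_t k, float (&C)[RN][RM]) {
    assert(A != nullptr && B != nullptr);
    float Cv[RN][RM] = {};  // accumulators, one per tile cell
    for (std::size_t l = 0; l < k; ++l) {
        if constexpr (RM <= RN) {
            // Fewer A rows: keep all of A live, stream B one value at a time.
            float Av[RM];
            for (int i = 0; i < RM; ++i) Av[i] = A[lda * i + l];
            for (int j = 0; j < RN; ++j) {
                const float Bv = B[ldb * j + l];
                for (int i = 0; i < RM; ++i) Cv[j][i] += Av[i] * Bv;
            }
        } else {
            // Fewer B rows: keep all of B live, stream A one value at a time.
            float Bv[RN];
            for (int j = 0; j < RN; ++j) Bv[j] = B[ldb * j + l];
            for (int i = 0; i < RM; ++i) {
                const float Av = A[lda * i + l];
                for (int j = 0; j < RN; ++j) Cv[j][i] += Av * Bv[j];
            }
        }
    }
    for (int j = 0; j < RN; ++j)
        for (int i = 0; i < RM; ++i) C[j][i] = Cv[j][i];
}
```

Both branches produce the same result; only the order of loads and multiply-adds differs, which is what decides how many values must stay live at once.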

with the current 0.9.2 release:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - BF16                                 |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 22.56 s    | 1040 / 1040      | 57.58      | 3.35       | 18.08 s    |
|  pp4096+tg256 | 1/1        | 155.84 s   | 4352 / 4352      | 53.39      | 3.24       | 77.03 s    |
|  pp2048+tg256 | 1/1        | 113.79 s   | 2304 / 2304      | 56.33      | 3.31       | 36.66 s    |
|  pp2048+tg768 | 1/1        | 269.31 s   | 2816 / 2816      | 56.58      | 3.29       | 36.50 s    |
| pp1024+tg1024 | 1/1        | 326.04 s   | 2048 / 2048      | 57.29      | 3.32       | 18.17 s    |
| pp1280+tg3072 | 1/1        | 960.03 s   | 4352 / 4352      | 57.91      | 3.28       | 22.40 s    |
|  pp384+tg1152 | 1/1        | 351.37 s   | 1536 / 1536      | 60.50      | 3.34       | 6.64 s     |
|   pp64+tg1024 | 1/1        | 307.51 s   | 1088 / 1088      | 55.26      | 3.34       | 1.45 s     |
|   pp16+tg1536 | 1/1        | 459.47 s   | 1552 / 1552      | 31.54      | 3.35       | 803.25 ms  |
+---------------------------------------------------------------------------------------------------+

with this patch:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - BF16                                 |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 17.56 s    | 1040 / 1040      | 80.14      | 3.34       | 13.08 s    |
|  pp4096+tg256 | 1/1        | 135.69 s   | 4352 / 4352      | 72.53      | 3.23       | 56.78 s    |
|  pp2048+tg256 | 1/1        | 103.93 s   | 2304 / 2304      | 77.48      | 3.30       | 26.74 s    |
|  pp2048+tg768 | 1/1        | 260.07 s   | 2816 / 2816      | 77.00      | 3.29       | 26.90 s    |
| pp1024+tg1024 | 1/1        | 321.62 s   | 2048 / 2048      | 80.11      | 3.32       | 13.08 s    |
| pp1280+tg3072 | 1/1        | 957.08 s   | 4352 / 4352      | 80.13      | 3.26       | 16.28 s    |
|  pp384+tg1152 | 1/1        | 349.37 s   | 1536 / 1536      | 86.59      | 3.34       | 4.73 s     |
|   pp64+tg1024 | 1/1        | 306.06 s   | 1088 / 1088      | 72.06      | 3.36       | 1.18 s     |
|   pp16+tg1536 | 1/1        | 459.80 s   | 1552 / 1552      | 31.69      | 3.34       | 800.94 ms  |
+---------------------------------------------------------------------------------------------------+

and even better with FP16:
with the current 0.9.2 release:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - F16                                  |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 20.79 s    | 1040 / 1040      | 64.02      | 3.33       | 16.30 s    |
|  pp4096+tg256 | 1/1        | 149.69 s   | 4352 / 4352      | 58.62      | 3.21       | 70.19 s    |
|  pp2048+tg256 | 1/1        | 110.74 s   | 2304 / 2304      | 62.32      | 3.29       | 33.17 s    |
|  pp2048+tg768 | 1/1        | 267.39 s   | 2816 / 2816      | 62.35      | 3.27       | 33.15 s    |
| pp1024+tg1024 | 1/1        | 325.56 s   | 2048 / 2048      | 64.08      | 3.31       | 16.28 s    |
| pp1280+tg3072 | 1/1        | 964.08 s   | 4352 / 4352      | 63.84      | 3.25       | 20.35 s    |
|  pp384+tg1152 | 1/1        | 353.07 s   | 1536 / 1536      | 66.41      | 3.32       | 6.08 s     |
|   pp64+tg1024 | 1/1        | 306.73 s   | 1088 / 1088      | 58.19      | 3.35       | 1.40 s     |
|   pp16+tg1536 | 1/1        | 460.77 s   | 1552 / 1552      | 30.87      | 3.34       | 814.15 ms  |
+---------------------------------------------------------------------------------------------------+

with this patch:

+---------------------------------------------------------------------------------------------------+
|                     AMD Ryzen 9 5950X 16-Core Processor (znver3) - 125.7 GiB                      |
|                                 Meta Llama 3.1 8B Instruct - F16                                  |
+---------------------------------------------------------------------------------------------------+
|          test | run number | avg time   | tokens processed | pp t/s     | tg t/s     | ttft       |
| ------------- | ---------- | ---------- | ---------------- | ---------- | ---------- | ---------- |
|   pp1024+tg16 | 1/1        | 14.94 s    | 1040 / 1040      | 100.96     | 3.33       | 10.44 s    |
|  pp4096+tg256 | 1/1        | 125.59 s   | 4352 / 4352      | 89.61      | 3.20       | 46.02 s    |
|  pp2048+tg256 | 1/1        | 99.06 s    | 2304 / 2304      | 97.06      | 3.28       | 21.40 s    |
|  pp2048+tg768 | 1/1        | 255.93 s   | 2816 / 2816      | 96.78      | 3.27       | 21.47 s    |
| pp1024+tg1024 | 1/1        | 319.86 s   | 2048 / 2048      | 100.97     | 3.31       | 10.44 s    |
| pp1280+tg3072 | 1/1        | 957.87 s   | 4352 / 4352      | 99.93      | 3.25       | 13.11 s    |
|  pp384+tg1152 | 1/1        | 349.94 s   | 1536 / 1536      | 106.73     | 3.33       | 3.90 s     |
|   pp64+tg1024 | 1/1        | 306.60 s   | 1088 / 1088      | 80.97      | 3.35       | 1.09 s     |
|   pp16+tg1536 | 1/1        | 462.19 s   | 1552 / 1552      | 30.86      | 3.33       | 817.90 ms  |
+---------------------------------------------------------------------------------------------------+
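For reference, the relative gain can be computed from the `pp t/s` columns of the tables above. The small helper below is only an illustration (`gain_percent` is a hypothetical name); the inputs are the `pp1024+tg16` and `pp16+tg1536` rows copied from the tables.

```cpp
#include <cassert>

// Relative throughput gain in percent between a baseline run and a patched run.
double gain_percent(double before_tps, double after_tps) {
    assert(before_tps > 0.0);
    return (after_tps / before_tps - 1.0) * 100.0;
}

// pp t/s for pp1024+tg16, taken from the tables above.
const double bf16_gain = gain_percent(57.58, 80.14);   // BF16: 0.9.2 vs patch
const double f16_gain  = gain_percent(64.02, 100.96);  // F16:  0.9.2 vs patch

// pp t/s for pp16+tg1536 (BF16): the shortest prompt is essentially unchanged.
const double bf16_pp16_gain = gain_percent(31.54, 31.69);
```

Each table is a single run (1/1), so these figures are indicative rather than statistically averaged.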

Note: LocalScore is really nice. 👍

pp is about 20% faster on a CPU with only 16 AVX2 registers.

Djip007 · 2025-04-05T22:11:50Z

I uploaded one result with this patch by mistake:
https://www.localscore.ai/result/302

Maybe we need a way to mark results that come from a non-official llamafile release ...

cjpais · 2025-04-07T17:04:50Z

@jart curious if you would mind taking a look at this and if you have any comments. I will test it and merge if it looks good to you and you don't have any major comments

reneleonhardt · 2025-05-15T04:44:30Z

Is this testable now after 0.9.3 has been released?

reduce pressure on CPU register

77476da

pp have +20% CPU with only 16 AVX2 register.

github-actions bot added the llamafile label Apr 5, 2025
