
Conversation

@jeffbolznv
Collaborator

Change the code to do 16-bit loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.
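
For illustration, here is a minimal GLSL sketch of the pattern (not the actual shader: the buffer name `QData`/`qs`, the function `dequant_pair_select`, and the `scale`/`minv` math are simplified placeholders; real Q5_K decoding assembles 5-bit values from low nibbles plus a separate high-bit array with per-sub-block scales and mins, which this omits). The point is the shape of the access: one 16-bit load covers two 8-bit quants, both are decoded, and the requested one is selected at the end, so neighboring-element accesses share the load and the pair decode.

```glsl
#version 450
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require

// Hypothetical layout: quantized data viewed as packed 16-bit words.
layout (binding = 0) readonly buffer QData { uint16_t qs[]; };

float dequant_pair_select(uint base, uint idx, float scale, float minv) {
    // One 16-bit load covers two consecutive 8-bit quant values.
    uint pair = uint(qs[base + (idx >> 1)]);

    // Decode both elements of the pair. For neighboring idx values that
    // share a word, the load and this arithmetic are identical, so the
    // compiler can common them across loop iterations.
    vec2 v = vec2(float(pair & 0xFFu), float((pair >> 8) & 0xFFu)) * scale - minv;

    // Select the requested component late.
    return v[idx & 1u];
}
```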

before

| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---: | -----: | ------- | --: | -: | ---: | ------------: |
| llama 3B Q5_K - Medium | 2.16 GiB | 3.21 B | Vulkan | 1000 | 1 | pp512 | 5131.92 ± 192.48 |

after

| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---: | -----: | ------- | --: | -: | ---: | ------------: |
| llama 3B Q5_K - Medium | 2.16 GiB | 3.21 B | Vulkan | 1000 | 1 | pp512 | 5400.29 ± 205.68 |
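
The rows above are llama-bench output: t/s is tokens per second ± standard deviation for prompt processing of 512 tokens (pp512), with all layers offloaded (ngl 1000) and flash attention enabled (fa 1). A representative invocation, with a hypothetical model path, would be `llama-bench -m models/llama-3b-q5_k_m.gguf -ngl 1000 -fa 1 -p 512`. The change yields roughly a 5% pp512 speedup on this model.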

@jeffbolznv requested a review from @0cc4m on December 16, 2024.
The github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 16, 2024.
@0cc4m (Collaborator) left a comment:

I can't test this without a driver change, but the code looks fine.

@0cc4m merged commit a91a413 into ggml-org:master on Dec 21, 2024.
2 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request on Feb 13, 2025.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Feb 26, 2025.
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request on Mar 8, 2025.