Add experimental ggml-hexagon backend for the Hexagon NPU #16547
Conversation
You can do backend-specific reorderings by implementing the `graph_optimize` function.
Yeah, I saw you guys added that function recently. Will check it out.
Are the changes in …
Yep. Only for that. I'm going to play with the …
You can try to reuse the approach in `llama.cpp/ggml/src/ggml-metal/ggml-metal-common.h` (lines 43 to 49 at 7049736). It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.
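For context, a minimal sketch of what such a reorder hook could look like. This assumes the `graph_optimize` entry point in `ggml_backend_i` receives the backend and the compute graph and may permute the nodes in place; the grouping heuristic (stacking `MUL_MAT` nodes that share the same activation `src1`) is illustrative, not the Metal implementation:

```cpp
#include "ggml.h"
#include "ggml-impl.h"      // struct ggml_cgraph (internal header)
#include "ggml-backend.h"

// Move gf->nodes[j] up to position i, shifting [i, j) right by one.
static void reorder_move(struct ggml_cgraph * gf, int j, int i) {
    struct ggml_tensor * t = gf->nodes[j];
    for (int k = j; k > i; --k) {
        gf->nodes[k] = gf->nodes[k - 1];
    }
    gf->nodes[i] = t;
}

// True if all sources of `node` are produced before position `pos`
// (graph inputs never appear as nodes, so they are always ready).
static bool reorder_ready(const struct ggml_cgraph * gf, const struct ggml_tensor * node, int pos) {
    for (int s = 0; s < GGML_MAX_SRC; ++s) {
        const struct ggml_tensor * src = node->src[s];
        if (src == NULL) continue;
        for (int k = pos; k < gf->n_nodes; ++k) {
            if (gf->nodes[k] == src) return false; // produced too late
        }
    }
    return true;
}

// Sketch of a graph_optimize hook: pull MUL_MATs that consume the same
// activation next to each other so the backend can quantize it once and
// reuse it. Signature assumed from ggml_backend_i.graph_optimize.
static void hexagon_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * gf) {
    GGML_UNUSED(backend);
    for (int i = 0; i < gf->n_nodes; ++i) {
        if (gf->nodes[i]->op != GGML_OP_MUL_MAT) continue;
        for (int j = i + 1; j < gf->n_nodes; ++j) {
            struct ggml_tensor * cand = gf->nodes[j];
            if (cand->op == GGML_OP_MUL_MAT &&
                cand->src[1] == gf->nodes[i]->src[1] &&
                reorder_ready(gf, cand, i + 1)) {
                reorder_move(gf, j, ++i); // place it right after nodes[i]
            }
        }
    }
}
```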
(force-pushed from 4018db2 to e8e5407)
It would be preferable to avoid extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types: if the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

There was some discussion about the limits of the Qualcomm NPU memory bandwidth; can you confirm that's the case? If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation and the Hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?
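A minimal sketch of what baking repacking into a backend's default buffer type could look like, assuming the conversion happens at upload time inside the buffer's `set_tensor` callback (the same place the CPU extra buffer types repack); `hexagon_repack_q4_0` is a hypothetical routine, not part of this PR:

```cpp
#include <string.h>

#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical conversion routine: rewrites a Q4_0 weight into the
// NPU-friendly layout. Declared here only to keep the sketch self-contained.
void hexagon_repack_q4_0(void * dst, const void * src, int64_t ne0, int64_t ne1);

// Default-buffer-type set_tensor that repacks quantized weights on upload.
static void hexagon_buffer_set_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                                      const void * data, size_t offset, size_t size) {
    GGML_UNUSED(buffer);
    if (tensor->type == GGML_TYPE_Q4_0 && offset == 0 && size == ggml_nbytes(tensor)) {
        // full upload of a quantized weight: repack straight into device memory
        hexagon_repack_q4_0(tensor->data, data, tensor->ne[0], tensor->ne[1]);
    } else {
        // everything else is stored as-is
        memcpy((char *) tensor->data + offset, data, size);
    }
}
```

The tradeoff is exactly the one raised above: once the default buffer type repacks, its contents are no longer readable by the CPU backend unless both backends agree on the layout.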
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <[email protected]>
Co-Authored-By: Todor Boinovski <[email protected]>
(force-pushed from e8e5407 to 1d600de)
@ggerganov quick update. I can definitely use …
That's what I started with (i.e. no-repack buffers), but that makes it tricky to do partial offloads and introduces lots of copies because the scheduler thinks the buffers are not shareable/reusable.

In the previous SoC generations (Gen3, Gen4, X-Elite) the CPU does technically have more memory BW (a lot more in the X-Elite). In Gen5 and X2-Elite the BW is about the same.

Functionally, it's possible to use Q4_0, MXFP4, etc. as-is with HVX; unfortunately, it's about the worst layout for it :) It might be possible to share the same format if we were to redo ARM64 CPU_REPACK to use something like the above. It'd be absolutely awesome to have a common optimized format for CPU/GPU/NPU, ideally on-disk so that we mmap and don't repack at all like the original GGML, but it's very tricky in practice.
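To make the layout point concrete, this is the standard Q4_0 block as defined in ggml's `ggml-common.h`; the remark about repacked planes is a gloss on the CPU_REPACK scheme, not something specified in this PR:

```cpp
#include <stdint.h>

typedef uint16_t ggml_half;    // fp16 storage type, as in ggml-common.h

#define QK4_0 32

// Standard Q4_0 block (18 bytes): one fp16 scale immediately followed by
// 32 4-bit quants packed two per byte (quant i in the low nibble of byte i,
// quant i+16 in the high nibble).
typedef struct {
    ggml_half d;               // per-block scale
    uint8_t   qs[QK4_0 / 2];   // packed 4-bit quants
} block_q4_0;

// A row is a dense array of these blocks, so scales and quants interleave
// every 18 bytes -- an awkward stride for wide HVX vector loads (128-byte
// vectors). Repacking N blocks into a contiguous scale plane followed by a
// contiguous quant plane (similar in spirit to the CPU_REPACK block_q4_0x4
// layout) gives the vector unit dense, aligned data instead.
```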
(force-pushed from ad1d9b8 to 027399b)
@slaren @bandoti @CISC btw, that …
It is at least ripe for some refactoring.
This PR introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

Note: This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Please see `docs/backend/hexagon/README.md` for build and basic usage info, and `docs/backend/hexagon/developer.md` for notes on buffer management and other internals.
Tested with the following models:

Known issues:
- `HTP-REPACK` buffers (see some notes/questions below)

Future work:
@slaren It'd be good to make `extra_buffers_type` a bit better exposed/integrated. I started thinking that `device_get_buffer_type` should probably just return a list of buffer types. Currently only the CPU backend needed those extra bufs, but now Hexagon needs them too (i.e. all buffers are actually normal host buffers from the CPU perspective, and the extras are needed only to force the REPACK). I added basic support for that in the model loader (separate commit here in the PR) and was going to sprinkle some more, but we should discuss if there is perhaps a better way to handle this.

I included some wrapper scripts (under docs/backend/hexagon) to make running stuff over ADB easier. Let me know if we should put them in some other directory. They are a bit Snapdragon-specific because of the env vars needed to find the Hexagon/HTP libraries (described in the developer.md).
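For reference, the lookup dance that a list-returning `device_get_buffer_type` would replace. This is roughly the pattern the existing model loader uses to discover a device's extra buffer types via the `ggml_backend_dev_get_extra_bufts` proc address; treat the exact spelling as best-effort against the current headers:

```cpp
#include <vector>

#include "ggml-backend.h"
#include "ggml-cpu.h"   // ggml_backend_dev_get_extra_bufts_t

// Collect a device's usable buffer types: any "extra" ones the backend
// registers (e.g. the CPU REPACK types), then the default one as a fallback.
static std::vector<ggml_backend_buffer_type_t> device_buft_list(ggml_backend_dev_t dev) {
    std::vector<ggml_backend_buffer_type_t> bufts;

    ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
    auto get_extra_bufts = (ggml_backend_dev_get_extra_bufts_t)
        ggml_backend_reg_get_proc_address(reg, "ggml_backend_dev_get_extra_bufts");
    if (get_extra_bufts) {
        // NULL-terminated array of extra buffer types
        for (ggml_backend_buffer_type_t * extra = get_extra_bufts(dev); extra && *extra; ++extra) {
            bufts.push_back(*extra);
        }
    }

    bufts.push_back(ggml_backend_dev_buffer_type(dev)); // default buft last
    return bufts;
}
```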
@ggerganov I included a commit that groups the Attention and FFN MATMUL ops. Like this:

This allows us to easily reuse the dynamically quantized `attn_norm-27`. The Hexagon/HTP backend places quantized tensors in VTCM (basically a SW-managed cache) where subsequent ops can reuse them.

This change doesn't seem to have any effect on other backends (I did some basic checks with CPU, OpenCL, CUDA, and Metal). Please let me know what you think. Perhaps you have suggestions for how to do this better.
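Since the graph dump after "Like this:" did not survive extraction, here is a hypothetical before/after of what such grouping looks like for one attention block (node names other than `attn_norm-27` are illustrative):

```cpp
// before: the QKV matmuls are interleaved with other ops
//   RMS_NORM  attn_norm-27
//   MUL_MAT   Qcur = wq x attn_norm-27
//   ROPE      Qcur
//   MUL_MAT   Kcur = wk x attn_norm-27
//   ROPE      Kcur
//   MUL_MAT   Vcur = wv x attn_norm-27
//
// after: the matmuls that consume attn_norm-27 are adjacent, so the backend
// dynamically quantizes attn_norm-27 once, keeps it resident in VTCM, and
// reuses it for all three
//   RMS_NORM  attn_norm-27
//   MUL_MAT   Qcur = wq x attn_norm-27
//   MUL_MAT   Kcur = wk x attn_norm-27
//   MUL_MAT   Vcur = wv x attn_norm-27
//   ROPE      Qcur
//   ROPE      Kcur
```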
Marking as a Draft for now because I'm working on enabling the CI.
Otherwise all required bits and pieces are ready to go.
Some outputs of the Hexagon Backend in action on my Galaxy S25+