
Conversation

@max-krasnyansky (Collaborator) commented Oct 13, 2025

This PR introduces a new experimental backend ggml-hexagon with support for the Hexagon NPU.

Highlights:

  • Supports Hexagon versions: v73, v75, v79, and v81
  • Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  • Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  • Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
  • Minimal build dependencies (just needs Android NDK and Hexagon-SDK Community Edition)

Note: This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Please see docs/backend/hexagon/README.md for build and basic usage info, and docs/backend/hexagon/developer.md for notes on buffer management and other implementation details.

Tested with the following models:

  • Llama-3.2-1B-Instruct-Q4_0.gguf
  • Llama-3.2-3B-Instruct-Q4_0.gguf
  • Llama-3.1-8B-Instruct-Q4_0.gguf
  • Qwen3-4B-Q4_0.gguf
  • Qwen3-8B-128K-Q4_0.gguf
  • Qwen3-14B-128K-Q4_0.gguf
  • LFM2-1.2B-Q4_0.gguf
  • OLMoE-1B-7B-0125-Instruct-Q4_0.gguf
  • gpt-oss-20b-mxfp4.gguf & gpt-oss-20b.mxfp4-q4_0.gguf (requires latest devices with 16+ GB of DDR)

Known issues:

  • test-backend-ops failures for supported ops: there are a few corner cases that we don't handle yet. These do not affect the models listed above; fixes will follow.
  • The tensor-override option needs some updates to work with the HTP-REPACK buffers (see the notes/questions below)
  • Integration (buffer sharing, etc.) with the OpenCL/Adreno backend needs work

Future work:

  • More optimizations (kernels, fusion, etc.), more ops (SHORTCONV, etc.), more data types
  • Better integration with the OpenCL/Adreno backend
  • Support for Windows on Snapdragon devices

@slaren
It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types. Currently only the CPU backend needed those extra buffer types, but now Hexagon needs them too (i.e. all of its buffers are actually normal host buffers from the CPU's perspective, and the extras are needed only to force the REPACK). I added basic support for that in the model loader (separate commit in this PR) and was going to sprinkle in some more, but we should discuss whether there is a better way to handle this.
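To make the idea concrete, here is a rough sketch of what a list-returning device interface could look like. The *_get_buffer_types name and the pick_buft helper below are hypothetical illustrations, not an existing ggml API; only ggml_backend_dev_supports_buft is a real function.

// Hypothetical sketch only: a device exposing an ordered list of buffer types
// (preferred first) instead of a single one. The list-returning function does
// not exist in ggml today; it just illustrates the proposal.
#include "ggml-backend.h"

// proposed: fill `bufts` with up to `max_bufts` buffer types for this device,
// ordered by preference, and return how many were written.
// e.g. Hexagon: { HTP repack buffer type, host buffer type }
//      CPU:     { extra/repack buffer types..., plain CPU buffer type }
size_t ggml_backend_dev_get_buffer_types(ggml_backend_dev_t dev,
                                         ggml_backend_buffer_type_t * bufts,
                                         size_t max_bufts);

// loader-side sketch: pick the first buffer type that works for a given tensor
// (the real suitability check would be per-tensor and backend-specific)
static ggml_backend_buffer_type_t pick_buft(ggml_backend_dev_t dev,
                                            const struct ggml_tensor * tensor) {
    ggml_backend_buffer_type_t bufts[8];
    const size_t n = ggml_backend_dev_get_buffer_types(dev, bufts, 8);
    for (size_t i = 0; i < n; i++) {
        if (ggml_backend_dev_supports_buft(dev, bufts[i])) {
            return bufts[i]; // first (most preferred) supported type wins
        }
    }
    (void) tensor;
    return NULL;
}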

I included some wrapper scripts (under docs/backend/hexagon) to make running things over ADB easier. Let me know if we should put them in some other directory. They are a bit Snapdragon-specific because of the env vars needed to find the Hexagon/HTP libraries (described in developer.md).


@ggerganov
I included a commit that groups the Attention and FFN MATMUL Ops.
Like this:

node #841 ( MUL_MAT):  Qcur-27 (   1M) [ HTP0 ] use=1: blk.27.attn_q.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #842 ( MUL_MAT):  Kcur-27 ( 512K) [ HTP0 ] use=1: blk.27.attn_k.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #843 ( MUL_MAT):  Vcur-27 ( 512K) [ HTP0 ] use=1: blk.27.attn_v.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #845 (    ROPE):  Qcur-27 (   1M) [ HTP0 ] use=1:   Qcur-27 (reshaped) [ HTP0 ] HTP0#leaf_6#0 [ NULL ] 

This allows us to easily reuse the dynamically quantized attn_norm-27. The Hexagon/HTP backend places quantized tensors in VTCM (basically a SW-managed cache) where subsequent ops can reuse them.
This change doesn't seem to have any effect on other backends (I did some basic checks with CPU, OpenCL, CUDA, and Metal). Please let me know what you think. Perhaps, you have suggestions for how to do this better.


Marking as a Draft for now because I'm working on enabling the CI.
Otherwise all required bits and pieces are ready to go.


Some output of the Hexagon backend in action on my Galaxy S25+:
~/src/llama.cpp-hexagon$ M=../gguf/Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 docs/backend/hexagon/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
+ adb shell ...
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
...
ggml-hex: allocating new registry : ndev 1
ggml-hex: HTP arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007974ffcd90
build: 6733 (6a8cf8914) with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
...
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   225.49 MiB
load_tensors:         HTP0 model buffer size =     0.26 MiB
load_tensors:  HTP0-REPACK model buffer size =   504.00 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
llama_kv_cache:       HTP0 KV buffer size =   136.00 MiB
llama_kv_cache: size =  136.00 MiB (  8192 cells,  16 layers,  1/1 seqs), K (q8_0):   68.00 MiB, V (q8_0):   68.00 MiB
llama_context:       HTP0 compute buffer size =    15.00 MiB
llama_context:        CPU compute buffer size =    62.62 MiB
llama_context: graph nodes  = 503
llama_context: graph splits = 41
...
 Chocolate chip cookies are the most popular cookie in the world. According to the International Association of Culinary Professionals, chocolate chip cookies are the most popular cookie in the world.
In the United States, chocolate chip cookies are a beloved favorite, and many bakeries and restaurants offer them as a classic treat. They are often associated with family gatherings, road trips, and comfort food.
In fact, according to a survey conducted by YouGov in 2020, chocolate chip cookies are the most popular cookie in the United States, with 27% of respondents naming them as their favorite. This makes chocolate chip cookies the top choice among Americans, and a close second in other countries. [end of text]


llama_perf_sampler_print:    sampling time =       7.12 ms /   147 runs   (    0.05 ms per token, 20634.48 tokens per second)
llama_perf_context_print:        load time =     623.27 ms
llama_perf_context_print: prompt eval time =      81.60 ms /    11 tokens (    7.42 ms per token,   134.81 tokens per second)
llama_perf_context_print:        eval time =    2428.18 ms /   135 runs   (   17.99 ms per token,    55.60 tokens per second)
llama_perf_context_print:       total time =    2630.13 ms /   146 tokens
llama_perf_context_print:    graphs reused =        134
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  439 =   225 +     136 +      77                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  504 =   504 +       0 +       0                |



~/src/llama.cpp-hexagon$ M=LFM2-1.2B-Q4_0.gguf D=HTP0 docs/backend/hexagon/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
+ adb shell ...
...
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   105.24 MiB
load_tensors:         HTP0 model buffer size =     0.25 MiB
load_tensors:  HTP0-REPACK model buffer size =   555.75 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.25 MiB
llama_kv_cache:       HTP0 KV buffer size =    51.00 MiB
llama_kv_cache: size =   51.00 MiB (  8192 cells,   6 layers,  1/1 seqs), K (q8_0):   25.50 MiB, V (q8_0):   25.50 MiB
llama_memory_recurrent:       HTP0 RS buffer size =     0.16 MiB
llama_memory_recurrent: size =    0.16 MiB (     1 cells,  16 layers,  1 seqs), R (f32):    0.16 MiB, S (f32):    0.00 MiB
llama_context:       HTP0 compute buffer size =    14.00 MiB
llama_context:        CPU compute buffer size =    32.00 MiB
llama_context: graph nodes  = 549
llama_context: graph splits = 49
...
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

...
**(a) Chocolate chip** is the answer, but keep in mind that preferences can vary widely by region and individual taste.

For the most accurate and current data, you might want to look into recent market analyses or consumer surveys. However, chocolate chip remains a strong contender due to its universal appeal and cultural significance. [end of text]


llama_perf_sampler_print:    sampling time =       9.49 ms /   301 runs   (    0.03 ms per token, 31720.94 tokens per second)
llama_perf_context_print:        load time =     694.86 ms
llama_perf_context_print: prompt eval time =      76.28 ms /    11 tokens (    6.93 ms per token,   144.20 tokens per second)
llama_perf_context_print:        eval time =    4698.82 ms /   289 runs   (   16.26 ms per token,    61.50 tokens per second)
llama_perf_context_print:       total time =    4793.75 ms /   300 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  202 =   105 +      51 +      46                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  555 =   555 +       0 +       0                |

github-actions bot added the documentation and ggml labels on Oct 13, 2025
max-krasnyansky requested a review from lhez on October 13, 2025 at 01:25
@jeffbolznv (Collaborator)

I included a commit that groups the Attention and FFN MATMUL Ops.
...
Please let me know what you think. Perhaps, you have suggestions for how to do this better.

You can do backend-specific reorderings by implementing the graph_optimize function.

@max-krasnyansky (Collaborator, Author)

I included a commit that groups the Attention and FFN MATMUL Ops.
...
Please let me know what you think. Perhaps, you have suggestions for how to do this better.

You can do backend-specific reorderings by implementing the graph_optimize function.

Yeah, I saw you guys added that function recently. Will check it out.

@ggerganov (Member)

Are the changes in llama-graph and llama-model (excluding the extra buffer type changes) only needed for the VTCM utilization? If yes, then it's better to avoid these changes and implement the necessary logic in graph_optimize as suggested.

@max-krasnyansky (Collaborator, Author)

Are the changes in llama-graph and llama-model (excluding the extra buffer type changes) only needed for the VTCM utilization? If yes, then it's better to avoid these changes and implement the necessary logic in graph_optimize as suggested.

Yep, only for that. I'm going to play with graph_optimize shortly; it's probably going to be a bit more involved/expensive, i.e. the optimizer will need to re-scan the nodes and figure out the dependencies, whereas the graph builder just needs to call build_forward after the MUL_MATs. But let me try.

@ggerganov (Member)

You can try to reuse the ggml_graph_optimize from the Metal backend:

// reorder the nodes in the graph to improve concurrency, while respecting fusion
//
// note: this implementation is generic and not specific to metal
// if it proves to work well, we can start using it for other backends in the future
void ggml_graph_optimize(struct ggml_cgraph * gf);

It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.
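For illustration, here is a minimal sketch of such a stacking pass (not the actual Metal implementation): it hoists each matrix multiplication upward next to the previous one as long as it doesn't depend on any node it passes. It assumes backend-internal access to struct ggml_cgraph via ggml-impl.h, and the dependency check is simplified (direct src edges plus view_src, ignoring other in-place aliasing).

// Simplified sketch in the spirit of ggml_graph_optimize: stack MUL_MATs
// together while preserving dependencies. Assumes access to the internal
// ggml_cgraph layout (as backend code gets via ggml-impl.h).
#include "ggml-impl.h"

static bool reads_output_of(const struct ggml_tensor * node, const struct ggml_tensor * other) {
    if (node->view_src == other) {
        return true;
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (node->src[i] == other || (node->src[i] && node->src[i]->view_src == other)) {
            return true;
        }
    }
    return false;
}

static void stack_mul_mats(struct ggml_cgraph * gf) {
    int last_mm = -1; // index of the most recently stacked MUL_MAT (-1 if none)

    for (int i = 0; i < gf->n_nodes; i++) {
        const enum ggml_op op = gf->nodes[i]->op;
        if (op != GGML_OP_MUL_MAT && op != GGML_OP_MUL_MAT_ID) {
            continue;
        }
        // hoist the matmul upward, right after the previous matmul, as long as
        // it does not read the output of the node it is about to move past
        int j = i;
        while (j > last_mm + 1 && !reads_output_of(gf->nodes[j], gf->nodes[j - 1])) {
            struct ggml_tensor * tmp = gf->nodes[j];
            gf->nodes[j]     = gf->nodes[j - 1];
            gf->nodes[j - 1] = tmp;
            j--;
        }
        last_mm = j;
    }
}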

@slaren (Member) commented Oct 14, 2025

It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types.

It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

There was some discussion about the limits of the Qualcomm NPU memory bandwidth; can you confirm that's the case? If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation and the Hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?

max-krasnyansky and others added 3 commits October 14, 2025 11:01
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <[email protected]>
Co-Authored-By: Todor Boinovski <[email protected]>
@max-krasnyansky (Collaborator, Author)

You can try to reuse the ggml_graph_optimize from the Metal backend:

// reorder the nodes in the graph to improve concurrency, while respecting fusion
//
// note: this implementation is generic and not specific to metal
// if it proves to work well, we can start using it for other backends in the future
void ggml_graph_optimize(struct ggml_cgraph * gf);

It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.

@ggerganov Quick update: I can definitely use graph_optimize to do the stacking of the matmuls. I was going to add some basic fusions anyway, and that will require graph_optimize. So I went ahead and removed the llama-model and llama-graph commit from this PR; everything is perfectly functional without it.
I'm working on the optimizer and will either include a simple version in this PR or in a follow-up if it takes longer to implement/test.

@max-krasnyansky (Collaborator, Author) commented Oct 14, 2025

@slaren

It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types.

It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

That's what I started with (i.e. no-repack buffers), but that makes it tricky to do partial offloads and introduces lots of copies because the scheduler thinks the buffers are not shareable/reusable.
All FP32 and FP16 (once we add them) ops can and do share the buffers right now. The repack is needed only for the quantized tensors (i.e. very similar to the CPU backend in that sense).

There was some discussion about the limits of the Qualcomm NPU memory bandwidth; can you confirm that's the case?

In the previous SoC generations (Gen3, Gen4, X-Elite) the CPU does technically have more memory BW (a lot more on the X-Elite). In Gen5 and X2-Elite the BW is about the same.
For Gen3/4 it might still make sense to offload to the NPU for power savings and/or simply to free up the CPU for other tasks.
Also, the NPU has a really nice DMA engine (one per HW thread) that can bring in the chunks, optionally bypassing the L2, and do some transformations (alignment, etc.; we're using that in the MUL_MAT ops). So it's more flexible/efficient than the CPU even if the raw BW is lower.

If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation and the Hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?

Functionally, it's possible to use Q4_0, MXFP4, etc. as-is with HVX; unfortunately, it's about the worst layout for it :)
i.e. nothing is aligned, so it's very expensive to load and unpack.
The repack I implemented right now is very HVX-friendly. Basically, it stores all the row quants first, followed by all the row block-scales. The quants are repacked into 32x4x2 blocks (256 elements -> 2x HVX vectors). The DMA is used to properly align each row to 128 bytes as we bring the chunks into VTCM, and then we just do nicely aligned loads of 128 bytes and expand them into 256 INT8 elements. This might have to change as we add more optimizations.
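As a rough illustration of that coarse layout (all of a row's quants first, then all of its per-block scales), here is a simplified sketch for Q4_0. The actual 32x4x2 element shuffle within each 128-byte chunk and the DMA-side row alignment/padding are omitted, and block_q4_0 is redeclared locally just to keep the sketch self-contained.

// Sketch: split a Q4_0 row into [ all quants | all scales ] so that the quant
// stream can be read as whole, aligned 128-byte HVX vectors. The real HTP
// repack additionally shuffles elements into 32x4x2 blocks and pads/aligns
// each row; that part is omitted here.
#include <stdint.h>
#include <string.h>

#define QK4_0 32

typedef uint16_t ggml_half; // fp16 bits, as in ggml

typedef struct {
    ggml_half d;              // per-block scale
    uint8_t   qs[QK4_0 / 2];  // 32 x 4-bit quants (16 bytes)
} block_q4_0;                 // mirrors ggml's Q4_0 block

// dst layout: [ qs of block 0 .. nb-1 ][ d of block 0 .. nb-1 ]
static void repack_row_q4_0_htp(uint8_t * dst, const block_q4_0 * src, int nb) {
    uint8_t   * qs_out = dst;
    ggml_half * d_out  = (ggml_half *) (dst + (size_t) nb * (QK4_0 / 2));

    for (int b = 0; b < nb; b++) {
        memcpy(qs_out + (size_t) b * (QK4_0 / 2), src[b].qs, QK4_0 / 2);
        d_out[b] = src[b].d;
    }
}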

It might be possible to share the same format if we were to redo the ARM64 CPU_REPACK to use something like the above.
Basically, we need the scales to be separate and not mixed in, because they mess up the alignment.
I was planning on spending some time playing around with the ARM64 CPU_REPACK to see if we can make something common, but right now even the CPU backend itself uses different repack layouts depending on the available instructions. So it's difficult even there.

It'd be absolutely awesome to have a common optimized format for CPU/GPU/NPU, ideally on-disk so that we mmap and don't repack at all like the original GGML, but it's very tricky in practice.

github-actions bot added the devops label on Oct 14, 2025
@max-krasnyansky (Collaborator, Author)

@slaren @bandoti @CISC
Basic CI is in. I added an android-ndk-build job that builds vanilla ARM64 CPU and Snapdragon (CPU/GPU/NPU) flavors.
Those builds can simply be pushed to the devices via ADB and run.
I'm going to hook it up to the Qualcomm Device Cloud so that we can run jobs on actual Snapdragon-based devices, but that will need some more work. I'm going to prototype it in a separate repo first.

BTW, that build.yml is getting long. I kind of like how you guys did the linux-cross builds in a separate YAML file; it's collapsible in the UI, etc.
Should we do the same for android, ubuntu, windows, etc.?

@CISC (Collaborator) commented Oct 15, 2025

BTW, that build.yml is getting long. I kind of like how you guys did the linux-cross builds in a separate YAML file; it's collapsible in the UI, etc. Should we do the same for android, ubuntu, windows, etc.?

It is at least ripe for some refactoring.
