
Conversation

@max-krasnyansky (Collaborator) commented Oct 13, 2025

This PR introduces a new experimental backend ggml-hexagon with support for the Hexagon NPU.

Highlights:

  • Supports Hexagon versions: v73, v75, v79, and v81
  • Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  • Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  • Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
  • Minimal build dependencies (just needs Android NDK and Hexagon-SDK Community Edition)

Note: This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Please see docs/backend/hexagon/README.md for build and basic usage info, and docs/backend/hexagon/developer.md for notes on buffer management and other implementation details.

Tested with the following models:

  • Llama-3.2-1B-Instruct-Q4_0.gguf
  • Llama-3.2-3B-Instruct-Q4_0.gguf
  • Llama-3.1-8B-Instruct-Q4_0.gguf
  • Qwen3-4B-Q4_0.gguf
  • Qwen3-8B-128K-Q4_0.gguf
  • Qwen3-14B-128K-Q4_0.gguf
  • LFM2-1.2B-Q4_0.gguf
  • OLMoE-1B-7B-0125-Instruct-Q4_0.gguf
  • gpt-oss-20b-mxfp4.gguf & gpt-oss-20b.mxfp4-q4_0.gguf (requires latest devices with 16+ GB of DDR)

Known issues:

  • test-backend-ops failures for supported ops: there are a few corner cases that we don't handle yet. These do not affect the models listed above; fixes will follow.
  • The tensor-override option needs some updates to work with the HTP-REPACK buffers (see the notes/questions below)
  • Integration (buffer sharing, etc.) with the OpenCL/Adreno backend needs work

Future work:

  • More optimizations (kernels, fusion, etc.), more ops (SHORTCONV, etc.), more data types
  • Better integration with the OpenCL/Adreno backend
  • Support for Windows on Snapdragon devices

@slaren
It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types. Currently only the CPU backend needed those extra buffer types, but now Hexagon needs them too (i.e. all of its buffers are actually normal host buffers from the CPU's perspective, and the extras are needed only to force the REPACK). I added basic support for that in the model loader (separate commit in this PR) and was going to sprinkle in some more, but we should discuss whether there is a better way to handle this.
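To make the idea concrete, here is a rough sketch of what a list-returning device interface could look like. The *_get_buffer_types name and the pick_buft helper below are hypothetical illustrations, not an existing ggml API; only ggml_backend_dev_supports_buft is a real function.

// Hypothetical sketch only: a device exposing an ordered list of buffer types
// (preferred first) instead of a single one. The list-returning function does
// not exist in ggml today; it just illustrates the proposal.
#include "ggml-backend.h"

// proposed: fill `bufts` with up to `max_bufts` buffer types for this device,
// ordered by preference, and return how many were written.
// e.g. Hexagon: { HTP repack buffer type, host buffer type }
//      CPU:     { extra/repack buffer types..., plain CPU buffer type }
size_t ggml_backend_dev_get_buffer_types(ggml_backend_dev_t dev,
                                         ggml_backend_buffer_type_t * bufts,
                                         size_t max_bufts);

// loader-side sketch: pick the first buffer type that works for a given tensor
// (the real suitability check would be per-tensor and backend-specific)
static ggml_backend_buffer_type_t pick_buft(ggml_backend_dev_t dev,
                                            const struct ggml_tensor * tensor) {
    ggml_backend_buffer_type_t bufts[8];
    const size_t n = ggml_backend_dev_get_buffer_types(dev, bufts, 8);
    for (size_t i = 0; i < n; i++) {
        if (ggml_backend_dev_supports_buft(dev, bufts[i])) {
            return bufts[i]; // first (most preferred) supported type wins
        }
    }
    (void) tensor;
    return NULL;
}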

I included some wrapper scripts (under docs/backend/hexagon) to make running things over ADB easier. Let me know if we should put them in some other directory. They are a bit Snapdragon-specific because of the env vars needed to find the Hexagon/HTP libraries (described in developer.md).


@ggerganov
I included a commit that groups the Attention and FFN MATMUL Ops.
Like this:

node #841 ( MUL_MAT):  Qcur-27 (   1M) [ HTP0 ] use=1: blk.27.attn_q.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #842 ( MUL_MAT):  Kcur-27 ( 512K) [ HTP0 ] use=1: blk.27.attn_k.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #843 ( MUL_MAT):  Vcur-27 ( 512K) [ HTP0 ] use=1: blk.27.attn_v.weight [ HTP0 ]  attn_norm-27 [ HTP0 ] 
node #845 (    ROPE):  Qcur-27 (   1M) [ HTP0 ] use=1:   Qcur-27 (reshaped) [ HTP0 ] HTP0#leaf_6#0 [ NULL ] 

This allows us to easily reuse the dynamically quantized attn_norm-27. The Hexagon/HTP backend places quantized tensors in VTCM (basically a SW-managed cache) where subsequent ops can reuse them.
This change doesn't seem to have any effect on other backends (I did some basic checks with CPU, OpenCL, CUDA, and Metal). Please let me know what you think. Perhaps, you have suggestions for how to do this better.


Marking as a Draft for now because I'm working on enabling the CI.
Otherwise all required bits and pieces are ready to go.


Some output of the Hexagon backend in action on my Galaxy S25+:
~/src/llama.cpp-hexagon$ M=../gguf/Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 docs/backend/hexagon/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
+ adb shell ...
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
...
ggml-hex: allocating new registry : ndev 1
ggml-hex: HTP arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007974ffcd90
build: 6733 (6a8cf8914) with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
...
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   225.49 MiB
load_tensors:         HTP0 model buffer size =     0.26 MiB
load_tensors:  HTP0-REPACK model buffer size =   504.00 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
llama_kv_cache:       HTP0 KV buffer size =   136.00 MiB
llama_kv_cache: size =  136.00 MiB (  8192 cells,  16 layers,  1/1 seqs), K (q8_0):   68.00 MiB, V (q8_0):   68.00 MiB
llama_context:       HTP0 compute buffer size =    15.00 MiB
llama_context:        CPU compute buffer size =    62.62 MiB
llama_context: graph nodes  = 503
llama_context: graph splits = 41
...
 Chocolate chip cookies are the most popular cookie in the world. According to the International Association of Culinary Professionals, chocolate chip cookies are the most popular cookie in the world.
In the United States, chocolate chip cookies are a beloved favorite, and many bakeries and restaurants offer them as a classic treat. They are often associated with family gatherings, road trips, and comfort food.
In fact, according to a survey conducted by YouGov in 2020, chocolate chip cookies are the most popular cookie in the United States, with 27% of respondents naming them as their favorite. This makes chocolate chip cookies the top choice among Americans, and a close second in other countries. [end of text]


llama_perf_sampler_print:    sampling time =       7.12 ms /   147 runs   (    0.05 ms per token, 20634.48 tokens per second)
llama_perf_context_print:        load time =     623.27 ms
llama_perf_context_print: prompt eval time =      81.60 ms /    11 tokens (    7.42 ms per token,   134.81 tokens per second)
llama_perf_context_print:        eval time =    2428.18 ms /   135 runs   (   17.99 ms per token,    55.60 tokens per second)
llama_perf_context_print:       total time =    2630.13 ms /   146 tokens
llama_perf_context_print:    graphs reused =        134
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  439 =   225 +     136 +      77                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  504 =   504 +       0 +       0                |



~/src/llama.cpp-hexagon$ M=LFM2-1.2B-Q4_0.gguf D=HTP0 docs/backend/hexagon/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
+ adb shell ...
...
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   105.24 MiB
load_tensors:         HTP0 model buffer size =     0.25 MiB
load_tensors:  HTP0-REPACK model buffer size =   555.75 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.25 MiB
llama_kv_cache:       HTP0 KV buffer size =    51.00 MiB
llama_kv_cache: size =   51.00 MiB (  8192 cells,   6 layers,  1/1 seqs), K (q8_0):   25.50 MiB, V (q8_0):   25.50 MiB
llama_memory_recurrent:       HTP0 RS buffer size =     0.16 MiB
llama_memory_recurrent: size =    0.16 MiB (     1 cells,  16 layers,  1 seqs), R (f32):    0.16 MiB, S (f32):    0.00 MiB
llama_context:       HTP0 compute buffer size =    14.00 MiB
llama_context:        CPU compute buffer size =    32.00 MiB
llama_context: graph nodes  = 549
llama_context: graph splits = 49
...
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

...
**(a) Chocolate chip** is the answer, but keep in mind that preferences can vary widely by region and individual taste.

For the most accurate and current data, you might want to look into recent market analyses or consumer surveys. However, chocolate chip remains a strong contender due to its universal appeal and cultural significance. [end of text]


llama_perf_sampler_print:    sampling time =       9.49 ms /   301 runs   (    0.03 ms per token, 31720.94 tokens per second)
llama_perf_context_print:        load time =     694.86 ms
llama_perf_context_print: prompt eval time =      76.28 ms /    11 tokens (    6.93 ms per token,   144.20 tokens per second)
llama_perf_context_print:        eval time =    4698.82 ms /   289 runs   (   16.26 ms per token,    61.50 tokens per second)
llama_perf_context_print:       total time =    4793.75 ms /   300 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  202 =   105 +      51 +      46                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  555 =   555 +       0 +       0                |

github-actions bot added the documentation and ggml labels on Oct 13, 2025
max-krasnyansky requested a review from lhez on October 13, 2025 at 01:25
@jeffbolznv (Collaborator)

I included a commit that groups the Attention and FFN MATMUL Ops.
...
Please let me know what you think. Perhaps, you have suggestions for how to do this better.

You can do backend-specific reorderings by implementing the graph_optimize function.

@max-krasnyansky (Collaborator, Author)

I included a commit that groups the Attention and FFN MATMUL Ops.
...
Please let me know what you think. Perhaps, you have suggestions for how to do this better.

You can do backend-specific reorderings by implementing the graph_optimize function.

Yeah, I saw you guys added that function recently. Will check it out.

@ggerganov (Member)

Are the changes in llama-graph and llama-model (excluding the extra buffer type changes) only needed for the VTCM utilization? If yes, then it's better to avoid these changes and implement the necessary logic in graph_optimize as suggested.

@max-krasnyansky (Collaborator, Author)

Are the changes in llama-graph and llama-model (excluding the extra buffer type changes) only needed for the VTCM utilization? If yes, then it's better to avoid these changes and implement the necessary logic in graph_optimize as suggested.

Yep, only for that. I'm going to play with graph_optimize shortly; it's probably going to be a bit more involved/expensive, i.e. the optimizer will need to re-scan the nodes and figure out the dependencies, whereas the graph builder just needs to call build_forward after the MUL_MATs. But let me try.

@ggerganov (Member)

You can try to reuse the ggml_graph_optimize from the Metal backend:

// reorder the nodes in the graph to improve concurrency, while respecting fusion
//
// note: this implementation is generic and not specific to metal
// if it proves to work well, we can start using it for other backends in the future
void ggml_graph_optimize(struct ggml_cgraph * gf);

It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.
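For illustration, here is a minimal sketch of such a stacking pass (not the actual Metal implementation): it hoists each matrix multiplication upward next to the previous one as long as it doesn't depend on any node it passes. It assumes backend-internal access to struct ggml_cgraph via ggml-impl.h, and the dependency check is simplified (direct src edges plus view_src, ignoring other in-place aliasing).

// Simplified sketch in the spirit of ggml_graph_optimize: stack MUL_MATs
// together while preserving dependencies. Assumes access to the internal
// ggml_cgraph layout (as backend code gets via ggml-impl.h).
#include "ggml-impl.h"

static bool reads_output_of(const struct ggml_tensor * node, const struct ggml_tensor * other) {
    if (node->view_src == other) {
        return true;
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (node->src[i] == other || (node->src[i] && node->src[i]->view_src == other)) {
            return true;
        }
    }
    return false;
}

static void stack_mul_mats(struct ggml_cgraph * gf) {
    int last_mm = -1; // index of the most recently stacked MUL_MAT (-1 if none)

    for (int i = 0; i < gf->n_nodes; i++) {
        const enum ggml_op op = gf->nodes[i]->op;
        if (op != GGML_OP_MUL_MAT && op != GGML_OP_MUL_MAT_ID) {
            continue;
        }
        // hoist the matmul upward, right after the previous matmul, as long as
        // it does not read the output of the node it is about to move past
        int j = i;
        while (j > last_mm + 1 && !reads_output_of(gf->nodes[j], gf->nodes[j - 1])) {
            struct ggml_tensor * tmp = gf->nodes[j];
            gf->nodes[j]     = gf->nodes[j - 1];
            gf->nodes[j - 1] = tmp;
            j--;
        }
        last_mm = j;
    }
}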

@slaren (Member) commented Oct 14, 2025

It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types.

It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

There was some discussion about the limits of the Qualcomm NPU memory bandwidth; can you confirm that's the case? If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation and the Hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?

max-krasnyansky and others added 3 commits October 14, 2025 11:01
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <[email protected]>
Co-Authored-By: Todor Boinovski <[email protected]>
@max-krasnyansky (Collaborator, Author)

You can try to reuse the ggml_graph_optimize from the Metal backend:

// reorder the nodes in the graph to improve concurrency, while respecting fusion
//
// note: this implementation is generic and not specific to metal
// if it proves to work well, we can start using it for other backends in the future
void ggml_graph_optimize(struct ggml_cgraph * gf);

It reorders the nodes to improve concurrency, which effectively results in stacking the matrix multiplications together.

@ggerganov Quick update: I can definitely use graph_optimize to do the stacking of the matmuls. I was going to add some basic fusions anyway, and that will require graph_optimize. So I went ahead and removed the llama-model and llama-graph commit from this PR; everything is perfectly functional without it.
I'm working on the optimizer and will either include a simple version in this PR or in a follow-up if it takes longer to implement/test.

@max-krasnyansky (Collaborator, Author) commented Oct 14, 2025

@slaren

It'd be good to make extra_buffers_type a bit better exposed/integrated. I started thinking that device_get_buffer_type should probably just return a list of buffer types.

It would be preferable to avoid using extra buffer types unless absolutely necessary, because using them adds a significant amount of complexity to the user code. Repacking the weights does not need to be restricted to extra buffer types. If the backend can only work with repacked tensors, the logic can be baked into the backend's default buffer type.

That's what I started with (i.e. no-repack buffers), but that makes it tricky to do partial offloads and introduces lots of copies because the scheduler thinks the buffers are not shareable/reusable.
All FP32 and FP16 (once we add them) ops can and do share the buffers right now. The repack is needed only for the quantized tensors (i.e. very similar to the CPU backend in that sense).

There was some discussion about the limits of the Qualcomm NPU memory bandwidth; can you confirm that's the case?

In the previous SoC generations (Gen3, Gen4, X-Elite) the CPU does technically have more memory BW (a lot more on the X-Elite). In Gen5 and X2-Elite the BW is about the same.
For Gen3/4 it might still make sense to offload to the NPU for power savings and/or simply to free up the CPU for other tasks.
Also, the NPU has a really nice DMA engine (one per HW thread) that can bring in the chunks, optionally bypassing the L2, and do some transformations (alignment, etc.; we're using that in the MUL_MAT ops). So it's more flexible/efficient than the CPU even if the raw BW is lower.

If the weights were stored in a buffer type that's compatible with the CPU backend, it would be possible to use the CPU backend for generation and the Hexagon backend for batch processing. Repacking will make that difficult, unless it uses the same repacking format as the CPU backend. Would it be possible to use the same repacking as the CPU backend?

Functionally, it's possible to use Q4_0, MXFP4, etc. as-is with HVX; unfortunately, it's about the worst layout for it :)
i.e. nothing is aligned, so it's very expensive to load and unpack.
The repack I implemented right now is very HVX-friendly. Basically, it stores all the row quants first, followed by all the row block-scales. The quants are repacked into 32x4x2 blocks (256 elements -> 2x HVX vectors). The DMA is used to properly align each row to 128 bytes as we bring the chunks into VTCM, and then we just do nicely aligned loads of 128 bytes and expand them into 256 INT8 elements. This might have to change as we add more optimizations.
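As a rough illustration of that coarse layout (all of a row's quants first, then all of its per-block scales), here is a simplified sketch for Q4_0. The actual 32x4x2 element shuffle within each 128-byte chunk and the DMA-side row alignment/padding are omitted, and block_q4_0 is redeclared locally just to keep the sketch self-contained.

// Sketch: split a Q4_0 row into [ all quants | all scales ] so that the quant
// stream can be read as whole, aligned 128-byte HVX vectors. The real HTP
// repack additionally shuffles elements into 32x4x2 blocks and pads/aligns
// each row; that part is omitted here.
#include <stdint.h>
#include <string.h>

#define QK4_0 32

typedef uint16_t ggml_half; // fp16 bits, as in ggml

typedef struct {
    ggml_half d;              // per-block scale
    uint8_t   qs[QK4_0 / 2];  // 32 x 4-bit quants (16 bytes)
} block_q4_0;                 // mirrors ggml's Q4_0 block

// dst layout: [ qs of block 0 .. nb-1 ][ d of block 0 .. nb-1 ]
static void repack_row_q4_0_htp(uint8_t * dst, const block_q4_0 * src, int nb) {
    uint8_t   * qs_out = dst;
    ggml_half * d_out  = (ggml_half *) (dst + (size_t) nb * (QK4_0 / 2));

    for (int b = 0; b < nb; b++) {
        memcpy(qs_out + (size_t) b * (QK4_0 / 2), src[b].qs, QK4_0 / 2);
        d_out[b] = src[b].d;
    }
}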

It might be possible to share the same format if we were to redo the ARM64 CPU_REPACK to use something like the above.
Basically, we need the scales to be separate and not mixed in, because they mess up the alignment.
I was planning on spending some time playing around with the ARM64 CPU_REPACK to see if we can make something common, but right now even the CPU backend itself uses different repack layouts depending on the available instructions. So it's difficult even there.

It'd be absolutely awesome to have a common optimized format for CPU/GPU/NPU, ideally on-disk so that we mmap and don't repack at all like the original GGML, but it's very tricky in practice.

github-actions bot added the devops label on Oct 14, 2025
@max-krasnyansky (Collaborator, Author)

@slaren @bandoti @CISC
Basic CI is in. I added an android-ndk-build job that builds vanilla ARM64 CPU and Snapdragon (CPU/GPU/NPU) flavors.
Those builds can simply be pushed to the devices via ADB and run.
I'm going to hook it up to the Qualcomm Device Cloud so that we can run jobs on actual Snapdragon-based devices, but that will need some more work. I'm going to prototype it in a separate repo first.

BTW, that build.yml is getting long. I kind of like how you guys did the linux-cross builds in a separate YAML file; it's collapsible in the UI, etc.
Should we do the same for android, ubuntu, windows, etc.?

@CISC (Collaborator) commented Oct 15, 2025

BTW, that build.yml is getting long. I kind of like how you guys did the linux-cross builds in a separate YAML file; it's collapsible in the UI, etc. Should we do the same for android, ubuntu, windows, etc.?

It is at least ripe for some refactoring.
