
Commit 63d2fc4

max-krasnyansky, myworldraj, and tboinovski authored
Add experimental ggml-hexagon backend for the Hexagon NPU (ggml-org#16547)
* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

  This commit introduces a new experimental backend, `ggml-hexagon`, with support for the Hexagon NPU.

  Highlights:
  - Supports Hexagon versions: v73, v75, v79, and v81
  - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  - Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

  **Note:** This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

  Co-Authored-By: Rajdeep Ganguly <[email protected]>
  Co-Authored-By: Todor Boinovski <[email protected]>

* hexagon: fix format-checker errors
* hexagon: update readme and cmake presets
* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions
* hexagon: add a simple graph optimizer for stacking MUL_MAT ops with the same input
* hexagon: move ADB helper scripts into scripts/snapdragon/adb
* hexagon: replace all f/printfs with GGML_LOG_...
* readme: add hexagon to the list of supported backends
* hexagon: stack matmuls with quantized inputs only
* hexagon: add TODO for fixing issues in hexagon_graph_optimize
* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC
* scripts: fix lint errors
* scripts: update qdc pytest script to make the linter happy
* hexagon: add reduce-sum in fp32
* hexagon: reduce the number of vector stores in matmul output
* hexagon: remove the need for vdelta in reduce-multiply-x8
* hexagon: consistent use of reduce_sum_fp32 for row_sums
* hexagon: some more matmul optimizations and comments

  Optimize cases where tensor dims are not a multiple of 1024 (e.g. in Qwen models). We had handled those cases already, but at a higher overhead.

* hexagon: update cmake presets
* hexagon: add OPMASK support for the run-bench.sh wrapper
* hexagon: update to use GGML_BACKEND_API
* hexagon: remove unused logic for setting tensor flags for the views
* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

  Same asserts as the CPU backend.

* hexagon: use the cpy_tensor slow path for non-host buffers
* hexagon: error checks in the buffer allocator
* cmake: move include(extProj) under ggml-hexagon
* hexagon: don't forget to delete the backend on free
* hexagon: set/get_tensor size asserts apply only to quantized tensors
* hexagon: reintroduce the HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

  GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need somewhat finer log levels.

* docs: fix typos in hexagon developer docs (libggm-...)
* hexagon: overhaul error handling in the session/device allocation

  This should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors
* hexagon: remove unused time_usec function
* hexagon: don't forget to release buffer contexts
* hexagon: fix indents in hvx-utils (missed clang-format auto-format failure)
* hexagon: remove the custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <[email protected]>
Co-authored-by: Todor Boinovski <[email protected]>
1 parent a2e0088 · commit 63d2fc4


45 files changed: +13530 -0 lines

.github/workflows/build.yml

Lines changed: 75 additions & 0 deletions
@@ -1305,6 +1305,81 @@ jobs:

```yaml
  android-ndk-build:
    runs-on: ubuntu-latest

    env:
      OPENCL_VERSION: 2025.07.22

    strategy:
      matrix:
        include:
          - build: 'arm64-cpu'
            defines: '-D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake -D GGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm -G Ninja -D LLAMA_CURL=OFF -D GGML_OPENMP=OFF'
          - build: 'arm64-snapdragon'
            defines: '--preset arm64-android-snapdragon-release'

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v4

      - name: Install OpenCL Headers and Libs
        id: install_opencl
        if: ${{ matrix.build == 'arm64-snapdragon' }}
        run: |
          mkdir opencl
          curl -L -o opencl/clhpp.tar.gz https://github.com/KhronosGroup/OpenCL-CLHPP/archive/refs/tags/v${OPENCL_VERSION}.tar.gz
          curl -L -o opencl/headers.tar.gz https://github.com/KhronosGroup/OpenCL-Headers/archive/refs/tags/v${OPENCL_VERSION}.tar.gz
          curl -L -o opencl/icd-loader.tar.gz https://github.com/KhronosGroup/OpenCL-ICD-Loader/archive/refs/tags/v${OPENCL_VERSION}.tar.gz
          tar -xaf opencl/headers.tar.gz -C opencl
          tar -xaf opencl/clhpp.tar.gz -C opencl
          tar -xaf opencl/icd-loader.tar.gz -C opencl
          sudo cp -r opencl/OpenCL-Headers-${OPENCL_VERSION}/CL ${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
          sudo cp -r opencl/OpenCL-CLHPP-${OPENCL_VERSION}/include/CL/* ${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/CL
          cd opencl/OpenCL-ICD-Loader-${OPENCL_VERSION}
          cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake -DOPENCL_ICD_LOADER_HEADERS_DIR=${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=31 -DANDROID_STL=c++_shared
          cmake --build build
          sudo cp build/libOpenCL.so ${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
          rm -rf opencl

      - name: Install Hexagon SDK
        id: install_hexsdk
        if: ${{ matrix.build == 'arm64-snapdragon' }}
        env:
          HEXSDK_VER: 6.4.0.2
          HEXTLS_VER: 19.0.04
        run: |
          curl -L -o hex-sdk.tar.gz https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v$HEXSDK_VER/hexagon-sdk-v$HEXSDK_VER-amd64-lnx.tar.xz
          mkdir hex-sdk
          tar -xaf hex-sdk.tar.gz -C hex-sdk
          ls -l hex-sdk
          sudo mv hex-sdk /opt/hexagon
          echo "HEXAGON_SDK_ROOT=/opt/hexagon/$HEXSDK_VER" >> "$GITHUB_ENV"
          echo "HEXAGON_TOOLS_ROOT=/opt/hexagon/$HEXSDK_VER/tools/HEXAGON_Tools/$HEXTLS_VER" >> "$GITHUB_ENV"
          echo "DEFAULT_HLOS_ARCH=64" >> "$GITHUB_ENV"
          echo "DEFAULT_TOOLS_VARIANT=toolv19" >> "$GITHUB_ENV"
          echo "DEFAULT_NO_QURT_INC=0" >> "$GITHUB_ENV"
          echo "DEFAULT_DSP_ARCH=v73" >> "$GITHUB_ENV"

      - name: Update CMake presets
        id: update_presets
        if: ${{ matrix.build == 'arm64-snapdragon' }}
        run: |
          cp docs/backend/hexagon/CMakeUserPresets.json .

      - name: Build
        id: ndk_build
        run: |
          cmake ${{ matrix.defines }} -B build
          cmake --build build
          cmake --install build --prefix pkg-adb/llama.cpp

      - name: Test
        id: cmake_test
        run: |
          echo "FIXME: test on devices"
```
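
For local reproduction, the `arm64-cpu` matrix entry amounts to a plain NDK cross-build. A minimal sketch, assuming `ANDROID_NDK_ROOT` points at an installed NDK (the flags are copied from the matrix entry above, and the `pkg-adb/llama.cpp` prefix mirrors the CI Build step):

```
# Sketch: reproduce the CI 'arm64-cpu' job locally (assumes ANDROID_NDK_ROOT is set).
cmake -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 \
      -D CMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake \
      -D GGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm \
      -G Ninja -D LLAMA_CURL=OFF -D GGML_OPENMP=OFF -B build
cmake --build build
# Stage the binaries the same way the CI job does:
cmake --install build --prefix pkg-adb/llama.cpp
```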

CODEOWNERS

Lines changed: 1 addition & 0 deletions
```diff
@@ -65,6 +65,7 @@
 /ggml/src/ggml-impl.h @ggerganov @slaren
 /ggml/src/ggml-metal/ @ggerganov
 /ggml/src/ggml-opencl/ @lhez @max-krasnyansky
+/ggml/src/ggml-hexagon/ @max-krasnyansky
 /ggml/src/ggml-opt.cpp @JohannesGaessler
 /ggml/src/ggml-quants.* @ggerganov
 /ggml/src/ggml-rpc/ @rgerganov
```

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -280,6 +280,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 | [IBM zDNN](docs/backend/zDNN.md) | IBM Z & LinuxONE |
 | [WebGPU [In Progress]](docs/build.md#webgpu) | All |
 | [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |
+| [Hexagon [In Progress]](docs/backend/hexagon/README.md) | Snapdragon |

 ## Obtaining and quantizing models
```

docs/backend/hexagon/CMakeUserPresets.json

Lines changed: 49 additions & 0 deletions

```json
{
    "version": 4,
    "configurePresets": [
        {
            "name": "arm64-android-snapdragon",
            "hidden": true,
            "architecture": { "value": "arm64", "strategy": "external" },
            "toolset": { "value": "host=x86_64", "strategy": "external" },
            "cacheVariables": {
                "ANDROID_ABI": "arm64-v8a",
                "ANDROID_PLATFORM": "android-31",
                "CMAKE_TOOLCHAIN_FILE": "$env{ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake",
                "CMAKE_C_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
                "CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
                "CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
                "CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG",
                "CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
                "CMAKE_CXX_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
                "HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}",
                "PREBUILT_LIB_DIR": "android_aarch64",
                "GGML_OPENMP": "OFF",
                "GGML_LLAMAFILE": "OFF",
                "GGML_OPENCL": "ON",
                "GGML_HEXAGON": "ON",
                "LLAMA_CURL": "OFF"
            }
        },

        {
            "name": "arm64-windows-snapdragon",
            "inherits": [ "base", "arm64-windows-llvm" ],
            "cacheVariables": {
                "HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}",
                "PREBUILT_LIB_DIR": "windows_aarch64",
                "GGML_OPENMP": "OFF",
                "GGML_LLAMAFILE": "OFF",
                "GGML_OPENCL": "ON",
                "GGML_HEXAGON": "ON",
                "LLAMA_CURL": "OFF"
            }
        },

        { "name": "arm64-android-snapdragon-debug",   "inherits": [ "base", "arm64-android-snapdragon", "debug" ] },
        { "name": "arm64-android-snapdragon-release", "inherits": [ "base", "arm64-android-snapdragon", "release" ] },

        { "name": "arm64-windows-snapdragon-debug",   "inherits": [ "base", "arm64-windows-snapdragon", "debug" ] },
        { "name": "arm64-windows-snapdragon-release", "inherits": [ "base", "arm64-windows-snapdragon", "release" ] }
    ]
}
```
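
These presets reference `base`, `debug`, `release`, and `arm64-windows-llvm`, which are expected to come from llama.cpp's top-level `CMakePresets.json`; the file is meant to be copied into the repo root, where CMake merges user presets with the project ones. A minimal usage sketch (mirroring the README below; the build directory name is arbitrary):

```
cp docs/backend/hexagon/CMakeUserPresets.json .
cmake --preset arm64-android-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon
```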

docs/backend/hexagon/README.md

Lines changed: 239 additions & 0 deletions
# Snapdragon-based Android devices

## How to Build

The easiest way to build llama.cpp for a Snapdragon-based Android device is to use the toolchain Docker image (see github.com/snapdragon-toolchain).
This image includes the Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc.

This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.

```
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3
[d]/> cd /workspace
```

The rest of the Android build process assumes that you're running inside the toolchain container.
Let's build llama.cpp with the CPU, OpenCL, and Hexagon backends via CMake presets:

```
[d]/workspace> cp docs/backend/hexagon/CMakeUserPresets.json .

[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon
Preset CMake variables:
  ANDROID_ABI="arm64-v8a"
  ...
  CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake"
  GGML_HEXAGON="ON"
  GGML_OPENCL="ON"
  GGML_OPENMP="OFF"
  HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
  ...
-- Including OpenCL backend
-- Including Hexagon backend
...
-- Build files have been written to: /workspace/build-snapdragon

[d]/workspace> cmake --build build-snapdragon
...
[144/356] Performing build step for 'htp-v73'
[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h
[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj
[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj
[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj
...
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so
...
```

To generate an installable "package", simply use `cmake --install`:

```
[d]/workspace> cmake --install build-snapdragon --prefix pkg-adb/llama.cpp
-- Install configuration: "Release"
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-cpu.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-opencl.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-hexagon.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v73.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v75.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v79.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v81.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml.so
...
-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-bench
-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-cli
...
```
## How to Install

For this step, your device needs to be configured for on-device development.
Please see https://developer.android.com/studio/debug/dev-options for details.

Once ADB is enabled, use `adb push` to install `pkg-adb/llama.cpp` on the device.
**Note that the toolchain Docker image doesn't have ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**

```
~/src/llama.cpp$ adb push pkg-adb/llama.cpp /data/local/tmp/
pkg-adb/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
pkg-adb/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
pkg-adb/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
```

At this point, you should also install some models:

```
~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
...
2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]

~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
```
## How to Run

The easiest way to run the llama.cpp CLI tools is to use the provided wrapper scripts, which properly set up all required environment variables.

llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (`GPUOpenCL`), and Hexagon NPU (`HTP0-4`).
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.

The Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.
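For example, a minimal invocation that fully offloads a model to the NPU, just as you would with a GPU backend (the wrapper forwards extra arguments such as `-ngl` straight to `llama-cli`; the prompt here is illustrative):

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -ngl 99 -no-cnv -p "hello"
```
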
Here are some examples of running various llama.cpp tools via ADB.

Simple question for Llama-3.2-1B:

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 225.49 MiB
load_tensors: HTP0 model buffer size = 0.26 MiB
load_tensors: HTP0-REPACK model buffer size = 504.00 MiB
...
I hope this helps you understand the world's most popular cookies! [end of text]
...
llama_perf_sampler_print: sampling time = 30.08 ms / 487 runs ( 0.06 ms per token, 16191.77 tokens per second)
llama_perf_context_print: load time = 617.94 ms
llama_perf_context_print: prompt eval time = 80.76 ms / 11 tokens ( 7.34 ms per token, 136.21 tokens per second)
llama_perf_context_print: eval time = 9210.59 ms / 475 runs ( 19.39 ms per token, 51.57 tokens per second)
llama_perf_context_print: total time = 9454.92 ms / 486 tokens
llama_perf_context_print: graphs reused = 473
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 439 = 225 + 136 + 77 |
llama_memory_breakdown_print: | - HTP0-REPACK | 504 = 504 + 0 + 0 |
```

Summary request for OLMoE-1B-7B. This is a large model that requires two HTP sessions/devices:

```
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-cli.sh -f surfing.txt -no-cnv
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 143.86 MiB
load_tensors: HTP1 model buffer size = 0.23 MiB
load_tensors: HTP1-REPACK model buffer size = 1575.00 MiB
load_tensors: HTP0 model buffer size = 0.28 MiB
load_tensors: HTP0-REPACK model buffer size = 2025.00 MiB
...
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache: HTP1 KV buffer size = 238.00 MiB
llama_kv_cache: HTP0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 544.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
llama_context: HTP0 compute buffer size = 15.00 MiB
llama_context: HTP1 compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 24.56 MiB
...
llama_perf_context_print: prompt eval time = 1730.57 ms / 212 tokens ( 8.16 ms per token, 122.50 tokens per second)
llama_perf_context_print: eval time = 5624.75 ms / 257 runs ( 21.89 ms per token, 45.69 tokens per second)
llama_perf_context_print: total time = 7377.33 ms / 469 tokens
llama_perf_context_print: graphs reused = 255
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 742 = 144 + 544 + 54 |
llama_memory_breakdown_print: | - HTP1-REPACK | 1575 = 1575 + 0 + 0 |
llama_memory_breakdown_print: | - HTP0-REPACK | 2025 = 2025 + 0 + 0 |
```

Op test for MUL_MAT, followed by a quick benchmark run:

```
~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
...
Backend 2/3: HTP0
Device description: Hexagon
Device memory: 2048 MB (2048 MB free)
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK

~/src/llama.cpp-hexagon$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090
| model          |       size | params | backend    | ngl | threads | n_batch | mmap |  test |           t/s |
| ---------------| ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: |
| llama 1B Q4_0  | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 | pp128 | 169.42 ± 1.75 |
| llama 1B Q4_0  | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 |  tg64 |  51.54 ± 1.13 |

build: 6a8cf8914 (6733)
```

## Environment variables

- `GGML_HEXAGON_NDEV=1`
  Controls the number of devices/sessions to allocate. The default is 1.
  Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four.
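
  For example, the OLMoE run shown earlier allocates two sessions via the wrapper's `NDEV=` variable (which presumably sets `GGML_HEXAGON_NDEV`):

  ```
  ~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-cli.sh -f surfing.txt -no-cnv
  ```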

- `GGML_HEXAGON_NHVX=0`
  Controls the number of HVX hardware threads to use. The default is to use all of them (the actual number varies with the hardware version).
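
  A hypothetical A/B comparison on the device (whether the adb wrappers forward arbitrary `GGML_HEXAGON_*` variables to the device environment is an assumption, so this sketch invokes the pushed binary directly):

  ```
  # illustrative only: compare full vs. reduced HVX thread counts
  GGML_HEXAGON_NHVX=2 ./llama-bench -m Llama-3.2-1B-Instruct-Q4_0.gguf -p 128 -n 64
  ```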

- `GGML_HEXAGON_HOSTBUF=1`
  Controls whether the Hexagon backend allocates host buffers. By default, all buffers except for REPACK buffers are host buffers.
  This option needs to be disabled for testing Ops that require REPACK buffers (MUL_MAT and MUL_MAT_ID).
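
  The MUL_MAT op test shown earlier does exactly this via the wrapper's `HB=0` shorthand (presumably mapped to this variable by `run-tool.sh`):

  ```
  ~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
  ```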

- `GGML_HEXAGON_VERBOSE=1`
  Enables verbose logging of Ops from the backend. Example output:

  ```
  ggml-hex: HTP0 graph-compute n_nodes 2
  ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1
  ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3
  ggml-hex: HTP0 graph-compute n_nodes 1
  ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0
  ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024
  ```

- `GGML_HEXAGON_PROFILE=1`
  Generates a host-side profile for the ggml-hexagon Ops.

- `GGML_HEXAGON_OPMASK=0x0`
  Allows enabling specific stages of the processing pipeline:

  - `0x1` Enable Op Queue (i.e. queuing Ops into the NPU)
  - `0x2` Enable Dynamic Quantizer (if needed for the Op)
  - `0x4` Enable Op Compute (MUL_MAT, etc.)

  Examples:

  `GGML_HEXAGON_OPMASK=0x1 llama-cli ...` - Ops are enqueued, but NPU-side processing is stubbed out
  `GGML_HEXAGON_OPMASK=0x3 llama-cli ...` - the NPU performs dynamic quantization and skips the rest
  `GGML_HEXAGON_OPMASK=0x7 llama-cli ...` - full queuing and processing of Ops (default)
