# Snapdragon-based Android devices

## How to Build

The easiest way to build llama.cpp for a Snapdragon-based Android device is to use the toolchain Docker image (see github.com/snapdragon-toolchain).
This image includes the Android NDK, OpenCL SDK, Hexagon SDK, CMake, and related tools.

This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.

```
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3
[d]/> cd /workspace
```

The rest of the Android build process assumes that you're running inside the toolchain container.
Let's build llama.cpp with the CPU, OpenCL, and Hexagon backends via CMake presets:

```
[d]/workspace> cp docs/backend/hexagon/CMakeUserPresets.json .

[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon
Preset CMake variables:
  ANDROID_ABI="arm64-v8a"
  ...
  CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake"
  GGML_HEXAGON="ON"
  GGML_OPENCL="ON"
  GGML_OPENMP="OFF"
  HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
...
-- Including OpenCL backend
-- Including Hexagon backend
...
-- Build files have been written to: /workspace/build-snapdragon

[d]/workspace> cmake --build build-snapdragon
...
[144/356] Performing build step for 'htp-v73'
[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h
[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj
[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj
[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj
...
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so
...
```

To generate an installable "package", simply use `cmake --install`:

```
[d]/workspace> cmake --install build-snapdragon --prefix pkg-adb/llama.cpp
-- Install configuration: "Release"
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-cpu.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-opencl.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-hexagon.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v73.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v75.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v79.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v81.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml.so
...
-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-bench
-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-cli
...
```

## How to Install

For this step, your device needs to be configured for on-device development.
Please see https://developer.android.com/studio/debug/dev-options for details.
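
After enabling USB debugging and connecting the device, you can confirm that ADB sees it; the serial number shown below is a placeholder:

```
~/src/llama.cpp$ adb devices
List of devices attached
XXXXXXXX    device
```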

Once ADB is enabled, use `adb push` to install the `pkg-adb/llama.cpp` package on the device.
**Note that the toolchain Docker image doesn't include ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**

```
~/src/llama.cpp$ adb push pkg-adb/llama.cpp /data/local/tmp/
pkg-adb/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
pkg-adb/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
pkg-adb/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
```

At this point, you should also install some models:

```
~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
...
2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]

~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
```

## How to Run

The easiest way to run the llama.cpp CLI tools is via the provided wrapper scripts, which set up all of the required environment variables.
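
For reference, here is a rough sketch of what such a wrapper invocation boils down to. This is an illustration only, assuming the `/data/local/tmp` install and model paths used above; the exact variables and paths the scripts set may differ. `ADSP_LIBRARY_PATH` is the FastRPC search path used to locate the `libggml-htp-*.so` skel libraries on the NPU side.

```
~/src/llama.cpp$ adb shell "cd /data/local/tmp/llama.cpp && \
    LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib \
    ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib \
    ./bin/llama-cli -m /data/local/tmp/gguf/Llama-3.2-1B-Instruct-Q4_0.gguf \
        --device HTP0 -ngl 99 -no-cnv -p 'hello'"
```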

llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (GPUOpenCL), and Hexagon NPU (HTP0-4).
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.

The Hexagon NPU behaves as a "GPU" device for the purposes of `-ngl` and other offload-related options.
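
For example, the same model can be sent to the Adreno GPU or the Hexagon NPU just by changing `D=` (illustrative invocations; the prompt is arbitrary):

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=GPUOpenCL ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "hello"

~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "hello"
```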

Here are some examples of running various llama.cpp tools via ADB.

Simple question for Llama-3.2-1B:

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 225.49 MiB
load_tensors: HTP0 model buffer size = 0.26 MiB
load_tensors: HTP0-REPACK model buffer size = 504.00 MiB
...
I hope this helps you understand the world's most popular cookies! [end of text]
...
llama_perf_sampler_print: sampling time = 30.08 ms / 487 runs ( 0.06 ms per token, 16191.77 tokens per second)
llama_perf_context_print: load time = 617.94 ms
llama_perf_context_print: prompt eval time = 80.76 ms / 11 tokens ( 7.34 ms per token, 136.21 tokens per second)
llama_perf_context_print: eval time = 9210.59 ms / 475 runs ( 19.39 ms per token, 51.57 tokens per second)
llama_perf_context_print: total time = 9454.92 ms / 486 tokens
llama_perf_context_print: graphs reused = 473
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 439 = 225 + 136 + 77 |
llama_memory_breakdown_print: | - HTP0-REPACK | 504 = 504 + 0 + 0 |
```

Summary request for OLMoE-1B-7B. This is a larger model that requires two HTP sessions/devices:

```
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-cli.sh -f surfing.txt -no-cnv
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 143.86 MiB
load_tensors: HTP1 model buffer size = 0.23 MiB
load_tensors: HTP1-REPACK model buffer size = 1575.00 MiB
load_tensors: HTP0 model buffer size = 0.28 MiB
load_tensors: HTP0-REPACK model buffer size = 2025.00 MiB
...
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache: HTP1 KV buffer size = 238.00 MiB
llama_kv_cache: HTP0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 544.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
llama_context: HTP0 compute buffer size = 15.00 MiB
llama_context: HTP1 compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 24.56 MiB
...
llama_perf_context_print: prompt eval time = 1730.57 ms / 212 tokens ( 8.16 ms per token, 122.50 tokens per second)
llama_perf_context_print: eval time = 5624.75 ms / 257 runs ( 21.89 ms per token, 45.69 tokens per second)
llama_perf_context_print: total time = 7377.33 ms / 469 tokens
llama_perf_context_print: graphs reused = 255
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 742 = 144 + 544 + 54 |
llama_memory_breakdown_print: | - HTP1-REPACK | 1575 = 1575 + 0 + 0 |
llama_memory_breakdown_print: | - HTP0-REPACK | 2025 = 2025 + 0 + 0 |
```

Op test for MUL_MAT:

```
~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
...
Backend 2/3: HTP0
Device description: Hexagon
Device memory: 2048 MB (2048 MB free)
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
```

Benchmark for Llama-3.2-1B:

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090
| model | size | params | backend | ngl | threads | n_batch | mmap | test | t/s |
| ---------------| ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP | 99 | 4 | 128 | 0 | pp128 | 169.42 ± 1.75 |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP | 99 | 4 | 128 | 0 | tg64 | 51.54 ± 1.13 |

build: 6a8cf8914 (6733)
```

## Environment variables

- `GGML_HEXAGON_NDEV=1`
  Controls the number of devices/sessions to allocate. The default is 1.
  Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four.
  For setting these variables directly over `adb shell`, see the sketch after this list.

- `GGML_HEXAGON_NHVX=0`
  Controls the number of HVX hardware threads to use. The default is all of them (the actual number varies with the hardware version).

- `GGML_HEXAGON_HOSTBUF=1`
  Controls whether the Hexagon backend allocates host buffers. By default, all buffers except REPACK buffers are host buffers.
  Disabling host buffers (`HB=0` in the wrapper scripts, as in the MUL_MAT test above) is required for testing Ops that need REPACK buffers (MUL_MAT and MUL_MAT_ID).

- `GGML_HEXAGON_VERBOSE=1`
  Enables verbose logging of Ops from the backend. Example output:

  ```
  ggml-hex: HTP0 graph-compute n_nodes 2
  ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1
  ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3
  ggml-hex: HTP0 graph-compute n_nodes 1
  ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0
  ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024
  ```

- `GGML_HEXAGON_PROFILE=1`
  Generates a host-side profile for the ggml-hexagon Ops.

- `GGML_HEXAGON_OPMASK=0x0`
  Allows enabling specific stages of the processing pipeline:

  - `0x1` Enable Op Queue (i.e., queuing Ops into the NPU)
  - `0x2` Enable Dynamic Quantizer (if needed for the Op)
  - `0x4` Enable Op Compute (MUL_MAT, etc.)

  Examples:

  - `GGML_HEXAGON_OPMASK=0x1 llama-cli ...`: Ops are enqueued, but NPU-side processing is stubbed out
  - `GGML_HEXAGON_OPMASK=0x3 llama-cli ...`: the NPU performs dynamic quantization and skips the rest
  - `GGML_HEXAGON_OPMASK=0x7 llama-cli ...`: full queuing and processing of Ops (the default)
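
As referenced in the `GGML_HEXAGON_NDEV` entry above, these variables are read by the backend in the on-device process, so when bypassing the wrapper scripts they must be set inside the `adb shell` environment. A minimal sketch, extending the illustrative direct invocation from the How to Run section (same assumed paths, which may differ from what the wrapper scripts actually use):

```
~/src/llama.cpp$ adb shell "cd /data/local/tmp/llama.cpp && \
    LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib \
    ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib \
    GGML_HEXAGON_NDEV=2 GGML_HEXAGON_VERBOSE=1 \
    ./bin/llama-cli -m /data/local/tmp/gguf/OLMoE-1B-7B-0125-Instruct-Q4_0.gguf \
        --device HTP0,HTP1 -ngl 99 -no-cnv -p 'hello'"
```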