# Snapdragon-based Android devices

## How to Build

The easiest way to build llama.cpp for a Snapdragon-based Android device is to use the toolchain Docker image (see github.com/snapdragon-toolchain).
This image includes the Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc.

This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.

```
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3
[d]/> cd /workspace
```

The rest of the Android build process assumes that you're running inside the toolchain container.
Let's build llama.cpp with the CPU, OpenCL, and Hexagon backends via CMake presets:

```
[d]/workspace> cp docs/backend/hexagon/CMakeUserPresets.json .

[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon
Preset CMake variables:
  ANDROID_ABI="arm64-v8a"
  ...
  CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake"
  GGML_HEXAGON="ON"
  GGML_OPENCL="ON"
  GGML_OPENMP="OFF"
  HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
...
-- Including OpenCL backend
-- Including Hexagon backend
...
-- Build files have been written to: /workspace/build-snapdragon

[d]/workspace> cmake --build build-snapdragon
...
[144/356] Performing build step for 'htp-v73'
[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h
[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj
[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj
[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj
...
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so
...
```

To generate an installable "package", simply use `cmake --install`:

```
[d]/workspace> cmake --install build-snapdragon --prefix pkg-adb/llama.cpp
-- Install configuration: "Release"
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-cpu.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-opencl.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-hexagon.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v73.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v75.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v79.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml-htp-v81.so
-- Installing: /workspace/pkg-adb/llama.cpp/lib/libggml.so
...
-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-bench
-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-cli
...
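
You can inspect the resulting package layout before pushing it to the device (a quick sanity check; the exact file list depends on your build configuration):

```
[d]/workspace> ls pkg-adb/llama.cpp
bin  include  lib
```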

## How to Install

For this step, your device needs to be configured for on-device development.
Please see https://developer.android.com/studio/debug/dev-options for details.

Once ADB is enabled, use `adb push` to install the `pkg-adb/llama.cpp` package on the device.
**Note that the toolchain Docker image doesn't include ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**
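
Before pushing, you can confirm that the host's ADB sees the device (standard ADB usage; the serial number below is a placeholder):

```
~/src/llama.cpp$ adb devices
List of devices attached
XXXXXXXXXX	device
```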

```
~/src/llama.cpp$ adb push pkg-adb/llama.cpp /data/local/tmp/
pkg-adb/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
pkg-adb/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
pkg-adb/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
```
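
To verify the install, list the directory on the device (assuming the default `/data/local/tmp` prefix used above):

```
~/src/llama.cpp$ adb shell ls /data/local/tmp/llama.cpp
bin
include
lib
```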

At this point, you should also install some models:

```
~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
...
2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]

~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
```
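
Note: if the `gguf` directory doesn't exist on the device yet, create it first (plain ADB, matching the path used above):

```
~/src/llama.cpp$ adb shell mkdir -p /data/local/tmp/gguf
```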

## How to Run

The easiest way to run the llama.cpp CLI tools is via the provided wrapper scripts, which properly set up all required environment variables.

llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (GPUOpenCL), and Hexagon NPU (HTP0-4).
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.

The Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.
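
For example, to target the Adreno GPU instead of the NPU, pass the corresponding device name (a sketch reusing the wrapper script and the device names listed above):

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=GPUOpenCL ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
```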

Here are some examples of running various llama.cpp tools via ADB.

A simple question for Llama-3.2-1B:

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   225.49 MiB
load_tensors:         HTP0 model buffer size =     0.26 MiB
load_tensors:  HTP0-REPACK model buffer size =   504.00 MiB
...
I hope this helps you understand the world's most popular cookies! [end of text]
...
llama_perf_sampler_print:    sampling time =      30.08 ms /   487 runs   (    0.06 ms per token, 16191.77 tokens per second)
llama_perf_context_print:        load time =     617.94 ms
llama_perf_context_print: prompt eval time =      80.76 ms /    11 tokens (    7.34 ms per token,   136.21 tokens per second)
llama_perf_context_print:        eval time =    9210.59 ms /   475 runs   (   19.39 ms per token,    51.57 tokens per second)
llama_perf_context_print:       total time =    9454.92 ms /   486 tokens
llama_perf_context_print:    graphs reused =        473
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  439 =   225 +     136 +      77                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  504 =   504 +       0 +       0                |
```

A summary request for OLMoE-1B-7B. This is a larger model that requires two HTP sessions/devices:

```
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-cli.sh -f surfing.txt -no-cnv
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   143.86 MiB
load_tensors:         HTP1 model buffer size =     0.23 MiB
load_tensors:  HTP1-REPACK model buffer size =  1575.00 MiB
load_tensors:         HTP0 model buffer size =     0.28 MiB
load_tensors:  HTP0-REPACK model buffer size =  2025.00 MiB
...
llama_context:        CPU  output buffer size =     0.19 MiB
llama_kv_cache:       HTP1 KV buffer size =   238.00 MiB
llama_kv_cache:       HTP0 KV buffer size =   306.00 MiB
llama_kv_cache: size =  544.00 MiB (  8192 cells,  16 layers,  1/1 seqs), K (q8_0):  272.00 MiB, V (q8_0):  272.00 MiB
llama_context:       HTP0 compute buffer size =    15.00 MiB
llama_context:       HTP1 compute buffer size =    15.00 MiB
llama_context:        CPU compute buffer size =    24.56 MiB
...
llama_perf_context_print: prompt eval time =    1730.57 ms /   212 tokens (    8.16 ms per token,   122.50 tokens per second)
llama_perf_context_print:        eval time =    5624.75 ms /   257 runs   (   21.89 ms per token,    45.69 tokens per second)
llama_perf_context_print:       total time =    7377.33 ms /   469 tokens
llama_perf_context_print:    graphs reused =        255
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - HTP1 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  742 =   144 +     544 +      54                |
llama_memory_breakdown_print: |   - HTP1-REPACK        |                 1575 =  1575 +       0 +       0                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                 2025 =  2025 +       0 +       0                |
```

An Op test for MUL_MAT, followed by a benchmark run with `llama-bench`:

```
~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
...
Backend 2/3: HTP0
Device description: Hexagon
Device memory: 2048 MB (2048 MB free)
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK

~/src/llama.cpp-hexagon$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090
| model          |       size | params | backend    | ngl | threads | n_batch | mmap |  test |           t/s |
| ---------------| ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: |
| llama 1B Q4_0  | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 | pp128 | 169.42 ± 1.75 |
| llama 1B Q4_0  | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 |  tg64 |  51.54 ± 1.13 |

build: 6a8cf8914 (6733)
```

## Environment variables

- `GGML_HEXAGON_NDEV=1`
  Controls the number of devices/sessions to allocate. The default is 1.
  Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four.
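
  For example, to split a larger model across two sessions, mirror the `NDEV=2` wrapper usage from the OLMoE run above (the model name here is illustrative):

  ```
  ~/src/llama.cpp$ M=Llama-3.1-8B-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "..."
  ```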

- `GGML_HEXAGON_NHVX=0`
  Controls the number of HVX hardware threads to use. The default is all (the actual number varies depending on the hardware version).

- `GGML_HEXAGON_HOSTBUF=1`
  Controls whether the Hexagon backend allocates host buffers. By default, all buffers except REPACK buffers are host buffers.
  Disabling host buffers (`HB=0` in the wrapper scripts, as in the `test-backend-ops` example above) is required for testing Ops that use REPACK buffers (MUL_MAT and MUL_MAT_ID).

- `GGML_HEXAGON_VERBOSE=1`
  Enables verbose logging of Ops from the backend. Example output:

  ```
  ggml-hex: HTP0 graph-compute n_nodes 2
  ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1
  ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3
  ggml-hex: HTP0 graph-compute n_nodes 1
  ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0
  ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024
  ```

- `GGML_HEXAGON_PROFILE=1`
  Generates a host-side profile for the ggml-hexagon Ops.

- `GGML_HEXAGON_OPMASK=0x0`
  Allows enabling specific stages of the processing pipeline:

  - `0x1` Enable Op Queue (i.e., queuing Ops into the NPU)
  - `0x2` Enable Dynamic Quantizer (if needed for the Op)
  - `0x4` Enable Op Compute (MUL_MAT, etc.)

  Examples:

  - `GGML_HEXAGON_OPMASK=0x1 llama-cli ...` - Ops are enqueued but NPU-side processing is stubbed out
  - `GGML_HEXAGON_OPMASK=0x3 llama-cli ...` - the NPU performs dynamic quantization and skips the rest
  - `GGML_HEXAGON_OPMASK=0x7 llama-cli ...` - full queuing and processing of Ops (default)