# Android
## Build on Android using Termux
[Termux](https://termux.dev/en/) is an Android terminal emulator and Linux environment app (no root required). As of writing, Termux is available experimentally in the Google Play Store; otherwise, it may be obtained directly from the project repo or on F-Droid.

With Termux, you can install and run `llama.cpp` as if the environment were Linux. Once in the Termux shell:
```
$ apt update && apt upgrade -y
$ apt install git cmake
```
Then, follow the [build instructions](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md), specifically for CMake.

Once the binaries are built, download your model of choice (e.g., from Hugging Face). It's recommended to place it in the `~/` directory for best performance:
```
$ curl -L {model-url} -o ~/{model}.gguf
```
Then, if you are not already in the repo directory, `cd` into `llama.cpp` and:
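The exact command is not shown in this diff; what follows is a minimal sketch, assuming the CMake build placed the binaries under `build/bin` and the model was saved as `~/{model}.gguf` (check the binary's `--help` output for the exact flags):

```
# illustrative sketch — adjust the binary path, model name, and flags for your setup
$ ./build/bin/llama-simple -m ~/{model}.gguf -c 4096 -p "{your-prompt}"
```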
Here, we show `llama-simple`, but any of the executables under `examples` should work, in theory. Be sure to set `context-size` to a reasonable number (say, 4096) to start with; otherwise, memory could spike and kill your terminal.

To see what it might look like visually, here's an old demo of an interactive session running on a Pixel 5 phone:
## Building the Project using Android NDK

It's possible to build `llama.cpp` for Android on your host system via CMake and the Android NDK. If you are interested in this path, ensure you already have an environment prepared to cross-compile programs for Android (i.e., install the Android SDK). Note that, unlike desktop environments, the Android environment ships with a limited set of native libraries, and so only those libraries are available to CMake when building with the Android NDK (see: https://developer.android.com/ndk/guides/stable_apis).

Once you're ready and have cloned `llama.cpp`, invoke the following in the project directory:
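The configure command itself is elided in this diff. Judging from the notes and the `armv8.7a` reference below, it is likely a CMake invocation along these lines; `$ANDROID_NDK`, the ABI, the platform level, and the build directory name are placeholders to adjust for your setup:

```
# sketch of the configure step — paths, ABI, and platform level are assumptions
$ cmake \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_C_FLAGS="-march=armv8.7a" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a" \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -B build-android
```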
Notes:
- While later versions of Android NDK ship with OpenMP, it must still be installed by CMake as a dependency, which is not supported at this time
- `llamafile` does not appear to support Android devices (see: https://github.com/Mozilla-Ocho/llamafile/issues/325)

The above command should configure `llama.cpp` with the most performant options for modern devices. Even if your device is not running `armv8.7a`, `llama.cpp` includes runtime checks for available CPU features it can use.

Feel free to adjust the Android ABI for your target. Once the project is configured:
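The build and install commands are likewise not shown here; a sketch, assuming the configure step above used `-B build-android` (`{n}` and `{install-dir}` are placeholders):

```
# sketch — build directory and install prefix are assumptions
$ cmake --build build-android --config Release -j {n}
$ cmake --install build-android --prefix {install-dir} --config Release
```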
After installing, go ahead and download the model of your choice to your host system. Then:
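The transfer and run steps are also elided in this diff. A sketch using `adb`, with `{install-path}` being the install prefix from the previous step and `/data/local/tmp` an assumed device-side directory:

```
# sketch — the device-side directory is an assumption
$ adb push {install-path} /data/local/tmp/
$ adb push {model}.gguf /data/local/tmp/
$ adb shell
```

Then, inside the device shell:

```
$ cd /data/local/tmp
$ LD_LIBRARY_PATH=lib ./bin/llama-simple -m {model}.gguf -c 4096 -p "{your-prompt}"
```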
Be aware that Android will not find the library path `lib` on its own, so we must specify `LD_LIBRARY_PATH` in order to run the installed executables. Android does support `RPATH` in later API levels, so this could change in the future. Refer to the previous section for information about `context-size` (very important!) and running other `examples`.
| `--slot-save-path PATH` | path to save slot kv cache (default: disabled) |
| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted:<br/>https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
| `-sps, --slot-prompt-similarity SIMILARITY` | how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled) |
`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `false`

`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.

**Response format**

Takes a prefix and a suffix and returns the predicted completion as stream.

*Options:*
- `input_prefix`: Set the prefix of the code to infill.
- `input_suffix`: Set the suffix of the code to infill.

It also accepts all the options of `/completion` except `stream` and `prompt`.
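As an illustration (not part of the original document), a request to this endpoint might look like the following, assuming `llama-server` is listening on the default `http://localhost:8080`:

```
# hypothetical example request — host, port, and payload values are assumptions
$ curl -s http://localhost:8080/infill \
    -H "Content-Type: application/json" \
    -d '{"input_prefix": "def remove_non_ascii(s: str) -> str:\n    ", "input_suffix": "\n    return result\n", "n_predict": 64}'
```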
### **GET** `/props`: Get server global properties.
This endpoint is public (no API key check). By default, it is read-only. To make a POST request to change global properties, you need to start the server with `--props`.

**Response format**
```json
{
"system_prompt": "",
"default_generation_settings": { ... },
"total_slots": 1,
"chat_template": ""
}
```
- `system_prompt` - the system prompt (initial prompt of all slots). Please note that this does not take into account the chat template. It will append the prompt at the beginning of the formatted prompt.
- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `chat_template` - the model's original Jinja2 prompt template
### POST `/props`: Change server global properties.
To use this endpoint with the POST method, you need to start the server with `--props`.

*Options:*
- `system_prompt`: Change the system prompt (initial prompt of all slots). Please note that this does not take into account the chat template. It will append the prompt at the beginning of the formatted prompt.
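For illustration (not from the original document), changing the system prompt could then look like this, again assuming the default `http://localhost:8080` and a server started with `--props`:

```
# hypothetical example request — host, port, and prompt text are assumptions
$ curl -s -X POST http://localhost:8080/props \
    -H "Content-Type: application/json" \
    -d '{"system_prompt": "Transcript of a dialog between a User and a helpful Assistant.\nUser:"}'
```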
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
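A minimal request sketch, again assuming the default `http://localhost:8080` (the `model` field is omitted here, since the server uses the model it was started with):

```
# hypothetical example request — host, port, and messages are assumptions
$ curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
          ]
        }'
```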
## More examples