
Commit a15280b

Merge pull request #1340 from annietllnd/executorch-android-llama32
Llama3.2 on Android with Executorch
2 parents 638cfb6 + 806b917 commit a15280b

File tree

6 files changed (+92 −157 lines)

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/1-dev-env-setup.md

Lines changed: 19 additions & 17 deletions
@@ -8,23 +8,23 @@ layout: learningpathall

## Set up your development environment

-In this Learning Path, you will learn how to build and deploy a simple LLM-based chat app to an Android device using ExecuTorch and XNNPACK. You will learn how to build the ExecuTorch runtime for Llama models, build JNI libraries for the Android application, and use the libraries in the application.
+In this Learning Path, you will learn how to build and deploy a simple LLM-based chat app to an Android device using ExecuTorch and XNNPACK with [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai). Arm has worked with the Meta team to integrate KleidiAI into ExecuTorch through XNNPACK. These improvements increase the throughput of quantized LLMs running on Arm chips that contain the i8mm (8-bit integer matrix multiply) processor feature. You will learn how to build the ExecuTorch runtime for Llama models with KleidiAI, build JNI libraries for the Android application, and use the libraries in the application.

The first step is to prepare a development environment with the required software:

- Android Studio (latest version recommended).
-- Android NDK version 25.0.8775105.
+- Android NDK version 28.0.12433566.
- Java 17 JDK.
- Git.
-- Python 3.10.
+- Python 3.10 or later (these instructions have been tested with 3.10 and 3.12).

The instructions assume macOS with Apple Silicon, an x86 Debian, or Ubuntu Linux machine with at least 16GB of RAM.

## Install Android Studio and Android NDK

Follow these steps to install and configure Android Studio:

1. Download and install the latest version of [Android Studio](https://developer.android.com/studio/).

2. Start Android Studio and open the `Settings` dialog.

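The i8mm extension mentioned in the new introduction above is optional on Arm cores, so not every device benefits from the KleidiAI kernels. One quick way to check a connected device is to look for the `i8mm` flag in its CPU features; this is a verification sketch, not part of this commit:

```bash
# Print the CPU feature flags of an attached Android device;
# 'i8mm' must be listed for KleidiAI's int8 matrix-multiply kernels to help.
adb shell grep -m1 Features /proc/cpuinfo
```
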
@@ -49,53 +49,55 @@ curl https://dl.google.com/android/repository/commandlinetools-mac-11076708_late
Unzip the Android command line tools:

```
-unzip commandlinetools.zip
+unzip commandlinetools.zip -d android-sdk
```

-Install the NDK in the directory that Android Studio installed the SDK. This is generally `~/Library/Android/sdk` by default:
+Install the NDK in the directory where Android Studio installed the SDK. This is generally `~/Library/Android/sdk` by default. Set the required environment variables:

```
export ANDROID_HOME="$(realpath ~/Library/Android/sdk)"
-./cmdline-tools/bin/sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;25.0.8775105"
+export PATH=$ANDROID_HOME/cmdline-tools/bin/:$PATH
+sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;28.0.12433566"
+export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
```

## Install Java 17 JDK

Open the [Java SE 17 Archive Downloads](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) page in your browser.

Select an appropriate download for your development machine operating system.

Downloads are available for macOS as well as Linux.

-## Install Git
+## Install Git and cmake

For macOS use [Homebrew](https://brew.sh/):

``` bash
-brew install git
+brew install git cmake
```

For Linux, use the package manager for your distribution:

``` bash
-sudo apt install git-all
+sudo apt install git-all cmake
```

## Install Python 3.10

For macOS:

``` bash
brew install python@3.10
```

For Linux:

``` bash
sudo apt update
-udo apt install software-properties-common -y
+sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa
-sudo apt install Python3.10
+sudo apt install Python3.10 python3.10-venv
```
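The `python3.10-venv` package added above supplies Python's built-in `venv` module. If you prefer a plain virtual environment over the conda environment used later in this Learning Path, a minimal sketch follows; the environment path `~/executorch-venv` is an arbitrary example:

```bash
# Create and activate an isolated Python environment for the build
# (illustrative; the path is arbitrary, not prescribed by this commit).
python3.10 -m venv ~/executorch-venv
source ~/executorch-venv/bin/activate
```
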
You now have the required development tools installed.

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/2-executorch-setup.md

Lines changed: 4 additions & 11 deletions
@@ -34,23 +34,16 @@ conda activate executorch

### Clone ExecuTorch and install the required dependencies

From within the conda environment, run the commands below to download the ExecuTorch repository and install the required packages:

``` bash
-# Clone the ExecuTorch repo from GitHub
git clone https://github.com/pytorch/executorch.git
cd executorch
-
-# Update and pull submodules
git submodule sync
git submodule update --init
-
-# Install ExecuTorch pip package and its dependencies, as well as
-# development tools like CMake.
+./install_requirements.sh
./install_requirements.sh --pybind xnnpack
-
-# Install a few more dependencies
-./examples/models/llama2/install_requirements.sh
+./examples/models/llama/install_requirements.sh
```

-You are now ready to start building the application.
+When these scripts finish successfully, ExecuTorch is all set up. That means it's time to dive into the world of Llama models!
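As a quick sanity check that the pip package landed in the active conda environment, you can try importing it. This is an assumed verification step, not part of the commit:

```bash
# The import succeeds only if the executorch pip package is on this
# environment's Python path.
python3 -c "import executorch; print('ExecuTorch is installed')"
```
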

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/3-Understanding-LLaMA-models.md

Lines changed: 4 additions & 4 deletions
@@ -30,11 +30,11 @@ As Llama 2 and Llama 3 models require at least 4-bit quantization due to the con

## Quantization

One way to create models that fit in smartphone memory is to employ 4-bit groupwise per token dynamic quantization of all the linear layers of the model. *Dynamic quantization* refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Furthermore, weights are statically quantized. In this case, weights are per-channel groupwise quantized with 4-bit signed integers.

For further information, refer to [torchao: PyTorch Architecture Optimization](https://github.com/pytorch-labs/ao/).

The table below evaluates WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).

The results are for two different groupsizes, with max_seq_len 2048, and 1000 samples:

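To make the activation side of this concrete, one common asymmetric min/max formulation (a sketch of the general technique; torchao's exact scheme may differ in its details) computes, per token, for activations $x$ quantized to the 8-bit range $[q_{\min}, q_{\max}] = [-128, 127]$:

$$
s = \frac{\max(x) - \min(x)}{q_{\max} - q_{\min}}, \qquad
z = q_{\min} - \operatorname{round}\!\left(\frac{\min(x)}{s}\right), \qquad
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\ q_{\min},\ q_{\max}\right)
$$

The weights need no runtime pass: each group of weights along a channel is quantized ahead of time to signed 4-bit values in $[-8, 7]$ with its own scale.
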
@@ -43,9 +43,9 @@ The results are for two different groupsizes, with max_seq_len 2048, and 1000 sa
|Llama 2 7B | 9.2 | 10.2 | 10.7
|Llama 3 8B | 7.9 | 9.4 | 9.7

Note that groupsize less than 128 was not enabled, since such a model was still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way.

What this implies for model size is:

1. Embedding table is in FP32.
2. Quantized weights scales are FP32.
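Those two points translate into a rough size estimate (back-of-the-envelope, not an official sizing formula): a linear layer with $n \times k$ weights at group size $g$ needs about

$$
\text{bytes} \approx \underbrace{\frac{nk}{2}}_{\text{4-bit weights}} + \underbrace{\frac{nk}{g} \cdot 4}_{\text{FP32 scales}}
$$

so the scales add $32/g$ bits per weight: roughly 0.25 extra bits at $g = 128$ and 0.5 at $g = 64$, which is part of why smaller group sizes pushed these models past the memory budget.
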

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/4-Prepare-LLaMA-models.md

Lines changed: 25 additions & 98 deletions
@@ -6,121 +6,48 @@ weight: 5
layout: learningpathall
---

-## Download and export the Llama 3 8B model
+## Download and export the Llama 3.2 1B model

-To get started with Llama 3, you obtain the pre-trained parameters by visiting [Meta's Llama Downloads](https://llama.meta.com/llama-downloads/) page. Request the access by filling out your details and read through and accept the Responsible Use Guide. This grants you a license and a download link which is valid for 24 hours. The Llama 3 8B model is used for this part, but the same instructions apply for other options as well with minimal modification.
+To get started with Llama 3, you obtain the pre-trained parameters by visiting [Meta's Llama Downloads](https://llama.meta.com/llama-downloads/) page. Request access by filling out your details, then read through and accept the Responsible Use Guide. This grants you a license and a download link which is valid for 24 hours. The Llama 3.2 1B model is used for this part, but the same instructions apply to other options as well with minimal modification.

-Install the following requirements using a package manager of your choice, for example apt-get:
+Install the `llama-stack` package using `pip`:
```bash
-apt-get install md5sum wget
+pip install llama-stack
```
-
-Clone the Llama models Git repository and install the dependencies:
-
-```bash
-git clone https://github.com/meta-llama/llama-models
-cd llama-models
-pip install -e .
-pip install buck
-```
-Run the script to download, and paste the download link from the email when prompted.
+Run the command to download, and paste the download link from the email when prompted.
```bash
-cd models/llama3_1
-./download.sh
+llama model download --source meta --model-id Llama3.2-1B
```
-You will be asked which models you would like to download. Enter `meta-llama-3.1-8b`.
+
+When the download is finished, the installation path is printed as output.
```output
-**** Model list ***
-- meta-llama-3.1-405b
-- meta-llama-3.1-70b
-- meta-llama-3.1-8b
-- meta-llama-guard-3-8b
-- prompt-guard
+Successfully downloaded model to /<path-to-home>/.llama/checkpoints/Llama3.2-1B
```
-When the download is finished, you should see the following files in the new folder
+
+Verify by viewing the downloaded files under this path:

```bash
-$ ls Meta-Llama-3.1-8B
-consolidated.00.pth params.json tokenizer.model
+ls $HOME/.llama/checkpoints/Llama3.2-1B
+checklist.chk consolidated.00.pth params.json tokenizer.model
```

-{{% notice Note %}}
-1. If you encounter the error "Sorry, we could not process your request at this moment", it might mean you have initiated two license processes simultaneously. Try modifying the affiliation field to work around it.
-2. You may have to run the `download.sh` script as root, or modify the execution privileges with `chmod`.
+{{% notice Working Directory %}}
+The rest of the instructions should be executed from the ExecuTorch base directory.
{{% /notice %}}

-Export model and generate `.pte` file. Run the Python command to export the model:
+Export the model and generate a `.pte` file. Run the Python command to export the model to your current directory:

```bash
-python -m examples.models.llama2.export_llama --checkpoint llama-models/models/llama3_1/Meta-Llama-3.1-8B/consolidated.00.pth -p llama-models/models/llama3_1/Meta-Llama-3.1-8B/params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+python3 -m examples.models.llama.export_llama \
+  --checkpoint $HOME/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \
+  --params $HOME/.llama/checkpoints/Llama3.2-1B/params.json \
+  -kv --use_sdpa_with_kv_cache -X --xnnpack-extended-ops -qmode 8da4w \
+  --group_size 256 -d fp32 \
+  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \
+  --embedding-quantize 4,32 \
+  --output_name="llama3_1B_kv_sdpa_xnn_qe_4_128_1024_embedding_4bit.pte" \
+  --max_seq_length 1024
```

Due to the larger vocabulary size of Llama 3, you should quantize the embeddings with `--embedding-quantize 4,32` to further reduce the model size.

-## Optional: Evaluate Llama 3 model accuracy
-
-You can evaluate model accuracy using the same arguments as above:
-
-``` bash
-python -m examples.models.llama2.eval_llama -c llama-models/models/llama3_1/Meta-Llama-3.1-8B/consolidated.00.pth -p llama-models/models/llama3_1/Meta-Llama-3.1-8B/params.json -t llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model -d fp32 --max_seq_len 2048 --limit 1000
-```
-
-{{% notice Warning %}}
-Model evaluation without a GPU will take a long time. On a MacBook with an M3 chip and 18GB RAM this took 10+ hours.
-{{% /notice %}}
-
-## Validate models on the development machine
-
-Before running models on a smartphone, you can validate them on your development computer.
-
-Follow the steps below to build ExecuTorch and the Llama runner to run models.
-
-1. Build executorch with optimized CPU performance:
-
-``` bash
-cmake -DPYTHON_EXECUTABLE=python \
-  -DCMAKE_INSTALL_PREFIX=cmake-out \
-  -DEXECUTORCH_ENABLE_LOGGING=1 \
-  -DCMAKE_BUILD_TYPE=Release \
-  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-  -DEXECUTORCH_BUILD_XNNPACK=ON \
-  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-  -Bcmake-out .
-
-cmake --build cmake-out -j16 --target install --config Release
-```
-
-The CMake build options are available on [GitHub](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
-
-2. Build the Llama runner:
-
-{{% notice Note %}}
-For Llama 3, add `-DEXECUTORCH_USE_TIKTOKEN=ON` option.
-{{% /notice %}}
-
-Run cmake:
-
-``` bash
-cmake -DPYTHON_EXECUTABLE=python \
-  -DCMAKE_INSTALL_PREFIX=cmake-out \
-  -DCMAKE_BUILD_TYPE=Release \
-  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-  -DEXECUTORCH_BUILD_XNNPACK=ON \
-  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-  -Bcmake-out/examples/models/llama2 \
-  examples/models/llama2
-
-cmake --build cmake-out/examples/models/llama2 -j16 --config Release
-```
-
-3. Run the model:
-
-``` bash
-cmake-out/examples/models/llama2/llama_main --model_path=llama3_kv_sdpa_xnn_qe_4_32.pte --tokenizer_path=llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model --prompt=<prompt>
-```
-
-The run options are available on [GitHub](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L18-L40).
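
To see why the retained `--embedding-quantize 4,32` flag matters for this model family, consider the embedding table alone (back-of-the-envelope arithmetic; the Llama 3 vocabulary size of 128,256 is published, and the 2,048 hidden size of Llama 3.2 1B is assumed here):

$$
\underbrace{128256 \times 2048 \times 4\ \text{B} \approx 1.05\ \text{GB}}_{\text{FP32}}
\qquad \text{vs.} \qquad
\underbrace{128256 \times 2048 \times 0.5\ \text{B} \approx 131\ \text{MB}}_{\text{4-bit}}
$$

plus one FP32 scale per group of 32 values (about 33 MB), so quantizing the embeddings alone recovers close to 0.9 GB on a 1B-parameter model.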

0 commit comments
