
Commit a15280b

Merge pull request #1340 from annietllnd/executorch-android-llama32
Llama3.2 on Android with Executorch
2 parents 638cfb6 + 806b917 commit a15280b

File tree

6 files changed (+92 −157 lines)

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/1-dev-env-setup.md

Lines changed: 19 additions & 17 deletions
@@ -8,23 +8,23 @@ layout: learningpathall

## Set up your development environment

-In this Learning Path, you will learn how to build and deploy a simple LLM-based chat app to an Android device using ExecuTorch and XNNPACK. You will learn how to build the ExecuTorch runtime for Llama models, build JNI libraries for the Android application, and use the libraries in the application.
+In this Learning Path, you will learn how to build and deploy a simple LLM-based chat app to an Android device using ExecuTorch and XNNPACK with [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai). Arm has worked with the Meta team to integrate KleidiAI into ExecuTorch through XNNPACK. These improvements increase the throughput of quantized LLMs running on Arm chips that contain the i8mm (8-bit integer matrix multiply) processor feature. You will learn how to build the ExecuTorch runtime for Llama models with KleidiAI, build JNI libraries for the Android application, and use the libraries in the application.

The first step is to prepare a development environment with the required software:

- Android Studio (latest version recommended).
-- Android NDK version 25.0.8775105.
+- Android NDK version 28.0.12433566.
- Java 17 JDK.
- Git.
-- Python 3.10.
+- Python 3.10 or later (these instructions have been tested with 3.10 and 3.12).

The instructions assume macOS with Apple Silicon, an x86 Debian, or Ubuntu Linux machine with at least 16GB of RAM.

## Install Android Studio and Android NDK

Follow these steps to install and configure Android Studio:

1. Download and install the latest version of [Android Studio](https://developer.android.com/studio/).

2. Start Android Studio and open the `Settings` dialog.

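The i8mm extension mentioned in the new introduction above is optional on Arm cores, so not every device benefits from the KleidiAI kernels. One quick way to check a connected device is to look for the `i8mm` flag in its CPU features; this is a verification sketch, not part of this commit:

```bash
# Print the CPU feature flags of an attached Android device;
# 'i8mm' must be listed for KleidiAI's int8 matrix-multiply kernels to help.
adb shell grep -m1 Features /proc/cpuinfo
```
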
@@ -49,53 +49,55 @@ curl https://dl.google.com/android/repository/commandlinetools-mac-11076708_late
Unzip the Android command line tools:

```
-unzip commandlinetools.zip
+unzip commandlinetools.zip -d android-sdk
```

-Install the NDK in the directory that Android Studio installed the SDK. This is generally `~/Library/Android/sdk` by default:
+Install the NDK in the directory where Android Studio installed the SDK. This is generally `~/Library/Android/sdk` by default. Set the required environment variables:

```
export ANDROID_HOME="$(realpath ~/Library/Android/sdk)"
-./cmdline-tools/bin/sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;25.0.8775105"
+export PATH=$ANDROID_HOME/cmdline-tools/bin/:$PATH
+sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;28.0.12433566"
+export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
```

## Install Java 17 JDK

Open the [Java SE 17 Archive Downloads](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) page in your browser.

Select an appropriate download for your development machine operating system.

Downloads are available for macOS as well as Linux.

-## Install Git
+## Install Git and cmake

For macOS use [Homebrew](https://brew.sh/):

``` bash
-brew install git
+brew install git cmake
```

For Linux, use the package manager for your distribution:

``` bash
-sudo apt install git-all
+sudo apt install git-all cmake
```

## Install Python 3.10

For macOS:

``` bash
brew install python@3.10
```

For Linux:

``` bash
sudo apt update
-udo apt install software-properties-common -y
+sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa
-sudo apt install Python3.10
+sudo apt install Python3.10 python3.10-venv
```
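The `python3.10-venv` package added above supplies Python's built-in `venv` module. If you prefer a plain virtual environment over the conda environment used later in this Learning Path, a minimal sketch follows; the environment path `~/executorch-venv` is an arbitrary example:

```bash
# Create and activate an isolated Python environment for the build
# (illustrative; the path is arbitrary, not prescribed by this commit).
python3.10 -m venv ~/executorch-venv
source ~/executorch-venv/bin/activate
```
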
You now have the required development tools installed.

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/2-executorch-setup.md

Lines changed: 4 additions & 11 deletions
@@ -34,23 +34,16 @@ conda activate executorch

### Clone ExecuTorch and install the required dependencies

From within the conda environment, run the commands below to download the ExecuTorch repository and install the required packages:

``` bash
-# Clone the ExecuTorch repo from GitHub
git clone https://github.com/pytorch/executorch.git
cd executorch
-
-# Update and pull submodules
git submodule sync
git submodule update --init
-
-# Install ExecuTorch pip package and its dependencies, as well as
-# development tools like CMake.
+./install_requirements.sh
./install_requirements.sh --pybind xnnpack
-
-# Install a few more dependencies
-./examples/models/llama2/install_requirements.sh
+./examples/models/llama/install_requirements.sh
```

-You are now ready to start building the application.
+When these scripts finish successfully, ExecuTorch is all set up. That means it's time to dive into the world of Llama models!
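As a quick sanity check that the pip package landed in the active conda environment, you can try importing it. This is an assumed verification step, not part of the commit:

```bash
# The import succeeds only if the executorch pip package is on this
# environment's Python path.
python3 -c "import executorch; print('ExecuTorch is installed')"
```
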

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/3-Understanding-LLaMA-models.md

Lines changed: 4 additions & 4 deletions
@@ -30,11 +30,11 @@ As Llama 2 and Llama 3 models require at least 4-bit quantization due to the con

## Quantization

One way to create models that fit in smartphone memory is to employ 4-bit groupwise per token dynamic quantization of all the linear layers of the model. *Dynamic quantization* refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Furthermore, weights are statically quantized. In this case, weights are per-channel groupwise quantized with 4-bit signed integers.

For further information, refer to [torchao: PyTorch Architecture Optimization](https://github.com/pytorch-labs/ao/).

The table below evaluates WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).

The results are for two different groupsizes, with max_seq_len 2048, and 1000 samples:

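To make the activation side of this concrete, one common asymmetric min/max formulation (a sketch of the general technique; torchao's exact scheme may differ in its details) computes, per token, for activations $x$ quantized to the 8-bit range $[q_{\min}, q_{\max}] = [-128, 127]$:

$$
s = \frac{\max(x) - \min(x)}{q_{\max} - q_{\min}}, \qquad
z = q_{\min} - \operatorname{round}\!\left(\frac{\min(x)}{s}\right), \qquad
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\ q_{\min},\ q_{\max}\right)
$$

The weights need no runtime pass: each group of weights along a channel is quantized ahead of time to signed 4-bit values in $[-8, 7]$ with its own scale.
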
@@ -43,9 +43,9 @@ The results are for two different groupsizes, with max_seq_len 2048, and 1000 sa
|Llama 2 7B | 9.2 | 10.2 | 10.7
|Llama 3 8B | 7.9 | 9.4 | 9.7

Note that groupsize less than 128 was not enabled, since such a model was still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way.

What this implies for model size is:

1. Embedding table is in FP32.
2. Quantized weights scales are FP32.
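Those two points translate into a rough size estimate (back-of-the-envelope, not an official sizing formula): a linear layer with $n \times k$ weights at group size $g$ needs about

$$
\text{bytes} \approx \underbrace{\frac{nk}{2}}_{\text{4-bit weights}} + \underbrace{\frac{nk}{g} \cdot 4}_{\text{FP32 scales}}
$$

so the scales add $32/g$ bits per weight: roughly 0.25 extra bits at $g = 128$ and 0.5 at $g = 64$, which is part of why smaller group sizes pushed these models past the memory budget.
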

content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/4-Prepare-LLaMA-models.md

Lines changed: 25 additions & 98 deletions
@@ -6,121 +6,48 @@ weight: 5
layout: learningpathall
---

-## Download and export the Llama 3 8B model
+## Download and export the Llama 3.2 1B model

-To get started with Llama 3, you obtain the pre-trained parameters by visiting [Meta's Llama Downloads](https://llama.meta.com/llama-downloads/) page. Request the access by filling out your details and read through and accept the Responsible Use Guide. This grants you a license and a download link which is valid for 24 hours. The Llama 3 8B model is used for this part, but the same instructions apply for other options as well with minimal modification.
+To get started with Llama 3, you obtain the pre-trained parameters by visiting [Meta's Llama Downloads](https://llama.meta.com/llama-downloads/) page. Request access by filling out your details, then read through and accept the Responsible Use Guide. This grants you a license and a download link which is valid for 24 hours. The Llama 3.2 1B model is used for this part, but the same instructions apply to other options as well with minimal modification.

-Install the following requirements using a package manager of your choice, for example apt-get:
+Install the `llama-stack` package using `pip`:
```bash
-apt-get install md5sum wget
+pip install llama-stack
```
-
-Clone the Llama models Git repository and install the dependencies:
-
-```bash
-git clone https://github.com/meta-llama/llama-models
-cd llama-models
-pip install -e .
-pip install buck
-```
-Run the script to download, and paste the download link from the email when prompted.
+Run the command to download, and paste the download link from the email when prompted.
```bash
-cd models/llama3_1
-./download.sh
+llama model download --source meta --model-id Llama3.2-1B
```
-You will be asked which models you would like to download. Enter `meta-llama-3.1-8b`.
+
+When the download is finished, the installation path is printed as output.
```output
-**** Model list ***
-- meta-llama-3.1-405b
-- meta-llama-3.1-70b
-- meta-llama-3.1-8b
-- meta-llama-guard-3-8b
-- prompt-guard
+Successfully downloaded model to /<path-to-home>/.llama/checkpoints/Llama3.2-1B
```
-When the download is finished, you should see the following files in the new folder
+
+Verify by viewing the downloaded files under this path:

```bash
-$ ls Meta-Llama-3.1-8B
-consolidated.00.pth params.json tokenizer.model
+ls $HOME/.llama/checkpoints/Llama3.2-1B
+checklist.chk consolidated.00.pth params.json tokenizer.model
```

-{{% notice Note %}}
-1. If you encounter the error "Sorry, we could not process your request at this moment", it might mean you have initiated two license processes simultaneously. Try modifying the affiliation field to work around it.
-2. You may have to run the `download.sh` script as root, or modify the execution privileges with `chmod`.
+{{% notice Working Directory %}}
+The rest of the instructions should be executed from the ExecuTorch base directory.
{{% /notice %}}

-Export model and generate `.pte` file. Run the Python command to export the model:
+Export the model and generate a `.pte` file. Run the Python command to export the model to your current directory:

```bash
-python -m examples.models.llama2.export_llama --checkpoint llama-models/models/llama3_1/Meta-Llama-3.1-8B/consolidated.00.pth -p llama-models/models/llama3_1/Meta-Llama-3.1-8B/params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+python3 -m examples.models.llama.export_llama \
+  --checkpoint $HOME/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \
+  --params $HOME/.llama/checkpoints/Llama3.2-1B/params.json \
+  -kv --use_sdpa_with_kv_cache -X --xnnpack-extended-ops -qmode 8da4w \
+  --group_size 256 -d fp32 \
+  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \
+  --embedding-quantize 4,32 \
+  --output_name="llama3_1B_kv_sdpa_xnn_qe_4_128_1024_embedding_4bit.pte" \
+  --max_seq_length 1024
```

Due to the larger vocabulary size of Llama 3, you should quantize the embeddings with `--embedding-quantize 4,32` to further reduce the model size.

-## Optional: Evaluate Llama 3 model accuracy
-
-You can evaluate model accuracy using the same arguments as above:
-
-``` bash
-python -m examples.models.llama2.eval_llama -c llama-models/models/llama3_1/Meta-Llama-3.1-8B/consolidated.00.pth -p llama-models/models/llama3_1/Meta-Llama-3.1-8B/params.json -t llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model -d fp32 --max_seq_len 2048 --limit 1000
-```
-
-{{% notice Warning %}}
-Model evaluation without a GPU will take a long time. On a MacBook with an M3 chip and 18GB RAM this took 10+ hours.
-{{% /notice %}}
-
-## Validate models on the development machine
-
-Before running models on a smartphone, you can validate them on your development computer.
-
-Follow the steps below to build ExecuTorch and the Llama runner to run models.
-
-1. Build executorch with optimized CPU performance:
-
-``` bash
-cmake -DPYTHON_EXECUTABLE=python \
-  -DCMAKE_INSTALL_PREFIX=cmake-out \
-  -DEXECUTORCH_ENABLE_LOGGING=1 \
-  -DCMAKE_BUILD_TYPE=Release \
-  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-  -DEXECUTORCH_BUILD_XNNPACK=ON \
-  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-  -Bcmake-out .
-
-cmake --build cmake-out -j16 --target install --config Release
-```
-
-The CMake build options are available on [GitHub](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
-
-2. Build the Llama runner:
-
-{{% notice Note %}}
-For Llama 3, add `-DEXECUTORCH_USE_TIKTOKEN=ON` option.
-{{% /notice %}}
-
-Run cmake:
-
-``` bash
-cmake -DPYTHON_EXECUTABLE=python \
-  -DCMAKE_INSTALL_PREFIX=cmake-out \
-  -DCMAKE_BUILD_TYPE=Release \
-  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-  -DEXECUTORCH_BUILD_XNNPACK=ON \
-  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-  -Bcmake-out/examples/models/llama2 \
-  examples/models/llama2
-
-cmake --build cmake-out/examples/models/llama2 -j16 --config Release
-```
-
-3. Run the model:
-
-``` bash
-cmake-out/examples/models/llama2/llama_main --model_path=llama3_kv_sdpa_xnn_qe_4_32.pte --tokenizer_path=llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model --prompt=<prompt>
-```
-
-The run options are available on [GitHub](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L18-L40).
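
To see why the retained `--embedding-quantize 4,32` flag matters for this model family, consider the embedding table alone (back-of-the-envelope arithmetic; the Llama 3 vocabulary size of 128,256 is published, and the 2,048 hidden size of Llama 3.2 1B is assumed here):

$$
\underbrace{128256 \times 2048 \times 4\ \text{B} \approx 1.05\ \text{GB}}_{\text{FP32}}
\qquad \text{vs.} \qquad
\underbrace{128256 \times 2048 \times 0.5\ \text{B} \approx 131\ \text{MB}}_{\text{4-bit}}
$$

plus one FP32 scale per group of 32 values (about 33 MB), so quantizing the embeddings alone recovers close to 0.9 GB on a 1B-parameter model.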

0 commit comments
