diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/_index.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/_index.md
new file mode 100644
index 0000000000..dba8f68506
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/_index.md
@@ -0,0 +1,65 @@
+---
+title: Customer Support Chatbot with Llama and ExecuTorch on Arm-Based Mobile Devices (with Agentic AI Capabilities)
+minutes_to_complete: 60
+
+who_is_this_for: This learning path is for developers with basic knowledge of Python, mobile development, and machine learning concepts. It guides you through creating an on-device customer support chatbot using Meta's Llama models deployed via PyTorch's ExecuTorch runtime, focusing on Arm-based Android devices. The chatbot handles common customer queries (for example, product information and troubleshooting) with low latency, privacy (no cloud dependency), and optimized performance. It also incorporates agentic AI capabilities, transforming the chatbot from reactive (simple Q&A) to proactive and autonomous. Agentic AI enables the bot to plan multi-step actions, use external tools, reason over user intent, and adapt responses dynamically; this is achieved by extending the core LLM with tool-calling mechanisms and multi-agent orchestration.
+
+learning_objectives:
+  - Explain the architecture and capabilities of Llama models (for example, Llama 3.2 1B/3B) for mobile use.
+  - Quantize LLMs (for example, with 4-bit post-training quantization) to reduce model size and enable efficient inference on resource-constrained mobile devices.
+  - Use ExecuTorch to export PyTorch models to the .pte format for on-device deployment.
+  - Leverage Arm-specific optimizations (for example, XNNPACK and KleidiAI) to achieve 2-3x faster inference on Arm-based Android devices.
+  - Implement real-time inference with Llama models, enabling seamless customer support interactions (for example, handling FAQs and troubleshooting).
+
+prerequisites:
+  - A basic understanding of machine learning and deep learning (familiarity with concepts such as supervised learning, neural networks, and transfer learning, plus an understanding of model training, validation, and overfitting).
+  - Familiarity with deep learning frameworks (experience building and training neural networks with PyTorch, and knowledge of Hugging Face Transformers for working with pre-trained LLMs).
+  - An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM.
+  - A USB cable to connect your smartphone to your development machine.
+  - An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, an on-premises Arm server, or an Arm-based laptop.
+  - Android Debug Bridge (adb) installed on your development machine. Follow the steps in [adb](https://developer.android.com/tools/adb) to install Android SDK Platform Tools; the adb tool is included in that package.
+  - Java 17 JDK. Follow the steps in [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) to download and install the JDK on your host machine.
+  - Android Studio. Follow the steps in [Android Studio](https://developer.android.com/studio) to download and install Android Studio on your host machine.
+  - Python 3.10.
+ +author: Parichay Das + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Neoverse + +tools_software_languages: + - LLM + - GenAI + - Python + - PyTorch + - ExecuTorch +operatingsystems: + - Linux + - Windows + - Android + + +further_reading: + - resource: + title: Hugging Face Documentation + link: https://huggingface.co/docs + type: documentation + - resource: + title: PyTorch Documentation + link: https://pytorch.org/docs/stable/index.html + type: documentation + - resource: + title: Android + link: https://www.android.com/ + type: website + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/_next-steps.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/example-picture.png b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/example-picture.png new file mode 100644 index 0000000000..c69844bed4 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/example-picture.png differ diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-1.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-1.md new file mode 100644 index 0000000000..224fa4013e --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-1.md @@ -0,0 +1,34 @@ +--- +title: Overview +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Understanding Llama: Meta’s Large Language Model +Llama is a family of large language models trained using publicly available datasets. These models demonstrate strong performance across a range of natural language processing (NLP) tasks, including language translation, question answering, and text summarization. 
+
+In addition to their analytical capabilities, Llama models can generate human-like, coherent, and contextually relevant text, making them highly effective for applications that rely on natural language generation. Consequently, they serve as powerful tools in areas such as chatbots, virtual assistants, and language translation, as well as in creative and content-driven domains where producing natural and engaging text is essential.
+
+Please note that the models are subject to the [acceptable use policy](https://github.com/meta-llama/llama/blob/main/USE_POLICY.md) and this [responsible use guide](https://github.com/meta-llama/llama/blob/main/RESPONSIBLE_USE_GUIDE.md).
+
+## Quantization
+A practical approach to making models fit within smartphone memory constraints is 4-bit groupwise per-token dynamic quantization of all linear layers. In this technique, dynamic quantization is applied to activations, meaning the quantization parameters are computed at runtime from the observed minimum and maximum activation values. The model weights, by contrast, are statically quantized: each channel is quantized in groups using 4-bit signed integers. This significantly reduces memory usage while maintaining model performance for on-device inference. A small numerical sketch of the groupwise scheme appears at the end of this page.
+
+For further information, refer to [torchao: PyTorch Architecture Optimization](https://github.com/pytorch-labs/ao/).
+
+The table below evaluates WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).
+
+The results are for two different group sizes, with max_seq_len 2048 and 1000 samples:
+
+| Model      | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
+|------------|-----------------|-----------------------|-----------------------|
+| Llama 2 7B | 9.2             | 10.2                  | 10.7                  |
+| Llama 3 8B | 7.9             | 9.4                   | 9.7                   |
+
+Note that a group size smaller than 128 was not enabled in this example, because the model was still too large; current efforts have focused on enabling FP32, and support for FP16 is under way.
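+
+To make the groupwise scheme concrete, the sketch below quantizes a single weight matrix to 4-bit signed integers in groups along each row and measures the reconstruction error. It is a simplified, symmetric-quantization illustration of the idea, not the torchao implementation, and the tensor shape is arbitrary:
+
+```python
+import torch
+
+def quantize_groupwise_4bit(w: torch.Tensor, group_size: int = 128):
+    """Symmetric 4-bit groupwise quantization of a 2-D weight tensor."""
+    rows, cols = w.shape
+    groups = w.reshape(rows, cols // group_size, group_size)
+    # One scale per group, chosen so the largest magnitude maps to 7
+    # (the signed 4-bit range is -8..7).
+    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
+    q = torch.clamp(torch.round(groups / scales), -8, 7)
+    return q.to(torch.int8), scales
+
+w = torch.randn(4096, 4096)
+q, scales = quantize_groupwise_4bit(w)
+w_hat = (q.float() * scales).reshape(w.shape)
+print(f"mean absolute reconstruction error: {(w - w_hat).abs().mean():.5f}")
+```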
diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-2.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-2.md
new file mode 100644
index 0000000000..195ad5f4cc
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-2.md
@@ -0,0 +1,79 @@
+---
+title: Environment Setup
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Android NDK and Android Studio - Environment Setup
+
+#### Platform Requirements
+- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, an on-premises Arm server, or an Arm-based laptop.
+- An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM.
+- A USB cable to connect your smartphone to your development machine.
+
+Install and configure Android Studio with the following steps:
+1. Download and install the latest version of [Android Studio](https://developer.android.com/studio).
+2. Launch Android Studio and open the Settings dialog.
+3. Navigate to Languages & Frameworks → Android SDK.
+4. Under the SDK Platforms tab, ensure that Android 14.0 ("UpsideDownCake") is selected.
+
+Next, install the required version of the Android NDK by first setting up the Android command-line tools.
+Linux:
+```bash
+curl https://dl.google.com/android/repository/commandlinetools-linux-11076708_latest.zip -o commandlinetools.zip
+unzip commandlinetools.zip
+```
+Install the NDK in the same directory where Android Studio installed the SDK; by default this is ~/Library/Android/sdk on macOS and ~/Android/Sdk on Linux. Then configure the necessary environment variables as follows:
+```bash
+export ANDROID_HOME="$(realpath ~/Library/Android/sdk)"
+export PATH=$ANDROID_HOME/cmdline-tools/bin/:$PATH
+sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;28.0.12433566"
+export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
+```
+
+#### Install Java 17 JDK
+1. Open the Java SE 17 Archive [Downloads](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) page in your browser.
+2. Select an appropriate download for your development machine's operating system.
+
+#### Install Git and cmake
+```bash
+sudo apt-get install git cmake
+```
+
+#### Install Python 3.10
+```bash
+sudo apt-get install python3.10
+```
+
+#### Set up ExecuTorch
+ExecuTorch is an end-to-end framework designed to facilitate on-device inference across a wide range of mobile and edge platforms, including wearables, embedded systems, and microcontrollers. As a component of the PyTorch Edge ecosystem, it streamlines the efficient deployment of PyTorch models on edge devices. For further details, refer to the [ExecuTorch Overview](https://pytorch.org/executorch/stable/overview/).
+
+It is recommended to create an isolated Python environment for the ExecuTorch dependencies. Instructions are available for setting up either a Python virtual environment or a Conda virtual environment; you only need to choose one.
+
+##### Install Required Tools (Python environment setup)
+```bash
+python3 -m venv exec_env
+source exec_env/bin/activate
+pip install torch torchvision torchaudio
+pip install executorch
+```
+##### Clone Required Repositories
+```bash
+git clone https://github.com/pytorch/executorch.git
+git clone https://github.com/pytorch/text.git
+```
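+Before downloading any model weights, you can optionally sanity-check that the Python packages resolved correctly inside the virtual environment (both commands are safe to re-run at any time):
+
+```bash
+# Confirm PyTorch imports and report its version
+python -c "import torch; print(torch.__version__)"
+# Confirm the executorch package is installed
+pip show executorch | head -n 2
+```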
+##### Download Pretrained Model (Llama 3.1 Instruct)
+Download the model weights from Meta or Hugging Face; quantization for mobile deployment happens in a later step.
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
+```
+
+##### Verify Android SDK and NDK Paths
+```bash
+ANDROID_SDK_ROOT=/Users/<username>/Library/Android/sdk
+ANDROID_NDK_HOME=$ANDROID_SDK_ROOT/ndk/28.0.12433566
+```
diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-3.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-3.md
new file mode 100644
index 0000000000..c05ddda728
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-3.md
@@ -0,0 +1,161 @@
+---
+title: Model Preparation and Conversion
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+To begin working with Llama 3, the pre-trained model parameters can be accessed through Meta's Llama Downloads page. Users are required to request access by submitting their details and reviewing and accepting the Responsible Use Guide. Upon approval, a license and a download link, valid for 24 hours, are provided. For this exercise, the Llama 3.2 1B Instruct model is used; the same procedure applies to other available variants with only minor modifications.
+
+Convert the model into an ExecuTorch-compatible format optimized for Arm devices.
+## Script the Model
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Hugging Face causal LM classes are generally not fully scriptable, so
+# trace with an example input instead of calling torch.jit.script directly;
+# torchscript=True configures the model to return trace-friendly outputs,
+# and use_cache=False avoids untraceable cache objects. On CPU-only hosts,
+# torch.float32 may be needed instead of float16.
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-3.1-8B-Instruct",
+    torch_dtype=torch.float16,
+    torchscript=True,
+    use_cache=False,
+)
+model.eval()
+
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+example = tokenizer("Hello, how can I help you?", return_tensors="pt")
+
+traced_model = torch.jit.trace(model, (example.input_ids,))
+traced_model.save("llama_exec.pt")
+```
+
+Install the llama-stack package from pip.
+```bash
+pip install llama-stack
+```
+
+Run the command to download, and paste the download link from the email when prompted.
+```bash
+llama model download --source meta --model-id Llama3.2-1B-Instruct
+```
+
+When the download is finished, the installation path is printed as output.
+```
+Successfully downloaded model to <HOME>/.llama/checkpoints/Llama3.2-1B-Instruct
+```
+Verify by viewing the downloaded files under this path:
+```
+ls $HOME/.llama/checkpoints/Llama3.2-1B-Instruct
+checklist.chk consolidated.00.pth params.json tokenizer.model
+```
+
+Export the model and generate a .pte file by running the following command, which saves the resulting file in your current working directory:
+```bash
+python3 -m examples.models.llama.export_llama \
+--checkpoint $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
+--params $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/params.json \
+-kv --use_sdpa_with_kv_cache -X --xnnpack-extended-ops -qmode 8da4w \
+--group_size 64 -d fp32 \
+--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \
+--embedding-quantize 4,32 \
+--output_name="llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte" \
+--max_seq_length 1024 \
+--max_context_length 1024
+```
+
+Because Llama 3 has a larger vocabulary size, it is recommended to quantize the embeddings using the parameter `--embedding-quantize 4,32`. This further reduces memory usage and overall model size.
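+
+As a rough illustration of why embedding quantization matters: Llama 3.2 1B has a 128,256-token vocabulary, so the embedding table alone is large. The arithmetic below assumes a 2,048-wide embedding and fp16 group scales (both assumptions for illustration, not measured values):
+
+```python
+# Back-of-the-envelope footprint estimate for --embedding-quantize 4,32
+vocab, dim, group = 128_256, 2_048, 32
+
+fp32_bytes = vocab * dim * 4                # 4 bytes per fp32 weight
+int4_bytes = vocab * dim // 2               # two 4-bit values per byte
+scale_bytes = (vocab * dim // group) * 2    # one fp16 scale per group of 32
+
+print(f"fp32 embedding table: {fp32_bytes / 2**20:.0f} MiB")
+print(f"4-bit + scales:       {(int4_bytes + scale_bytes) / 2**20:.0f} MiB")
+# Roughly 1002 MiB versus 141 MiB: about a 7x reduction for this one layer.
+```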
+
+###### Load a pre-fine-tuned model (from Hugging Face)
+- Example: `meta-llama/Meta-Llama-3-8B-Instruct` or a customer-support fine-tuned variant
+
+###### Model Optimization for Arm (Understanding Quantization)
+Dynamic quantization:
+- Reduces model precision (for example, 32-bit → 8-bit)
+- Decreases memory footprint (roughly a 4x reduction)
+- Speeds up inference on CPU
+- Incurs minimal accuracy loss for most tasks
+
+###### Apply Dynamic Quantization
+- Create optimize_model.py with the following contents:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from torch.quantization import quantize_dynamic
+import time
+import os
+
+def load_base_model(model_name):
+    """Load the base model in float32 on CPU."""
+    print(f"Loading base model: {model_name}")
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    model = AutoModelForCausalLM.from_pretrained(
+        model_name,
+        torch_dtype=torch.float32,
+        device_map=None,
+        low_cpu_mem_usage=True
+    )
+    model.eval()
+
+    return model, tokenizer
+
+def apply_quantization(model):
+    """Apply dynamic quantization to all linear layers."""
+    print("Applying dynamic quantization...")
+
+    quantized_model = quantize_dynamic(
+        model,
+        {torch.nn.Linear},  # Quantize linear layers
+        dtype=torch.qint8
+    )
+
+    return quantized_model
+
+def test_model(model, tokenizer, prompt):
+    """Test the model with a sample prompt and time the generation."""
+    inputs = tokenizer(prompt, return_tensors="pt")
+
+    start_time = time.time()
+    with torch.no_grad():
+        outputs = model.generate(
+            inputs.input_ids,
+            max_new_tokens=100,
+            do_sample=False,
+            pad_token_id=tokenizer.eos_token_id
+        )
+    inference_time = time.time() - start_time
+
+    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+    return response, inference_time
+
+def main():
+    model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
+
+    # Load base model
+    base_model, tokenizer = load_base_model(model_name)
+
+    # Test base model
+    test_prompt = "How do I track my order?"
+    print("\nTesting base model...")
+    response, base_time = test_model(base_model, tokenizer, test_prompt)
+    print(f"Base model inference time: {base_time:.2f}s")
+
+    # Apply quantization
+    quantized_model = apply_quantization(base_model)
+
+    # Test quantized model
+    print("\nTesting quantized model...")
+    response, quant_time = test_model(quantized_model, tokenizer, test_prompt)
+    print(f"Quantized model inference time: {quant_time:.2f}s")
+    print(f"Speedup: {base_time / quant_time:.2f}x")
+
+    # Save quantized model
+    save_dir = "./models/quantized_llama3"
+    os.makedirs(save_dir, exist_ok=True)
+
+    torch.save(quantized_model.state_dict(), f"{save_dir}/model.pt")
+    tokenizer.save_pretrained(save_dir)
+
+    print(f"\nQuantized model saved to: {save_dir}")
+
+if __name__ == "__main__":
+    main()
+```
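+
+One caveat about saving only the state dict above: to reload it later, the quantized module structure must be rebuilt first, because the saved weights are packed int8 linear parameters that a plain float model cannot load. A minimal sketch, reusing the model name and save directory from the script:
+
+```python
+import torch
+from torch.quantization import quantize_dynamic
+from transformers import AutoModelForCausalLM
+
+# Rebuild the same architecture, re-apply dynamic quantization, then load
+# the saved quantized weights into the matching structure.
+base = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.float32
+)
+quantized = quantize_dynamic(base, {torch.nn.Linear}, dtype=torch.qint8)
+quantized.load_state_dict(torch.load("./models/quantized_llama3/model.pt"))
+quantized.eval()
+```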
diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-4.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-4.md
new file mode 100644
index 0000000000..e2291f4795
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-4.md
@@ -0,0 +1,31 @@
+---
+title: Building the Chatbot Logic
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Conversation Framework (Python prototype)
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+
+def generate_response(model, query, context):
+    # Simple prompt template: inject retrieved context ahead of the user query
+    prompt = f"### Context:\n{context}\n### User Query:\n{query}\n### Assistant Response:"
+    inputs = tokenizer(prompt, return_tensors="pt")
+    outputs = model.generate(**inputs, max_new_tokens=200)
+    return tokenizer.decode(outputs[0], skip_special_tokens=True)
+```
+
+###### Context Memory (Simple JSON Store)
+
+```python
+import json
+import os
+
+MEMORY_FILE = "chat_memory.json"
+
+def update_memory(user_id, query, response):
+    # Start fresh if the store does not exist yet
+    memory = json.load(open(MEMORY_FILE)) if os.path.exists(MEMORY_FILE) else {}
+    # setdefault avoids a KeyError on a user's first message
+    memory.setdefault(user_id, []).append({"query": query, "response": response})
+    with open(MEMORY_FILE, "w") as f:
+        json.dump(memory, f)
+```
diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-5.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-5.md
new file mode 100644
index 0000000000..b155e4245b
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-5.md
@@ -0,0 +1,42 @@
+---
+title: Adding Agentic AI Capabilities
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+Enable the chatbot to perform reasoning, make decisions, and execute actions autonomously.
+
+## Define Agentic Loop
+```python
+class AgenticChatbot:
+    def __init__(self, model):
+        self.model = model
+
+    def observe(self, input):
+        return f"User said: {input}"
+
+    def think(self, observation):
+        # Toy intent detection: normalize the observation so the keyword
+        # checks in act() can match; a real agent would query the LLM here.
+        return observation.lower()
+
+    def act(self, decision):
+        if "refund" in decision:
+            return "Processing refund..."
+        elif "troubleshoot" in decision:
+            return "Let's check your device settings."
+        else:
+            return "Connecting you with an agent."
+
+    def respond(self, query):
+        obs = self.observe(query)
+        thought = self.think(obs)
+        action = self.act(thought)
+        return f"Reasoning: {thought}\nAction: {action}"
+```
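+A quick way to exercise the loop (the model handle is unused by these stub methods, so None suffices here):
+
+```python
+bot = AgenticChatbot(model=None)
+print(bot.respond("I want a refund for my order"))
+# Reasoning: user said: i want a refund for my order
+# Action: Processing refund...
+```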
+## Integrate Llama with Reasoning Loop
+```python
+# Assumes `agent` is an AgenticChatbot instance and that `model` and
+# generate_response come from the previous section.
+def generate_agentic_response(query, context):
+    reasoning = agent.respond(query)
+    model_response = generate_response(model, query, context)
+    return reasoning + "\n\n" + model_response
+```
diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-6.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-6.md
new file mode 100644
index 0000000000..dbd78eff20
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-6.md
@@ -0,0 +1,107 @@
+---
+title: Android Integration
+weight: 7
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+#### Integration
+Build the Llama runner binary for Android by cross-compiling it as described in the steps below.
+
+#### Android NDK
+Configure the environment variable to reference the Android NDK:
+```
+export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
+```
+
+Ensure that $ANDROID_NDK/build/cmake/android.toolchain.cmake is accessible so CMake can perform cross-compilation.
+
+#### Use KleidiAI to build ExecuTorch and the required libraries for Android deployment
+Build ExecuTorch for Android, leveraging the performance optimizations offered by [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai) kernels.
+
+Use cmake to cross-compile ExecuTorch:
+```
+cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+    -DANDROID_ABI=arm64-v8a \
+    -DANDROID_PLATFORM=android-23 \
+    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
+    -DEXECUTORCH_ENABLE_LOGGING=1 \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
+    -DEXECUTORCH_BUILD_XNNPACK=ON \
+    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
+    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
+    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
+    -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_LLM=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
+    -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
+    -DXNNPACK_ENABLE_ARM_BF16=OFF \
+    -DBUILD_TESTING=OFF \
+    -Bcmake-out-android .
+
+cmake --build cmake-out-android -j7 --target install --config Release
+```
+Beginning with the ExecuTorch 0.7 beta, KleidiAI is enabled by default: the option -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON is active, providing built-in support for KleidiAI kernels within ExecuTorch when using XNNPACK.
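+Before building the runner, you can optionally spot-check that the install step produced the ExecuTorch libraries under the install prefix (the exact file list varies by version and build options):
+```
+ls cmake-out-android/lib | head
+```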
+#### Build Llama runner for Android
+```
+cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+    -DANDROID_ABI=arm64-v8a \
+    -DANDROID_PLATFORM=android-23 \
+    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DPYTHON_EXECUTABLE=python \
+    -DEXECUTORCH_BUILD_XNNPACK=ON \
+    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
+    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
+    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
+    -DSUPPORT_REGEX_LOOKAHEAD=ON \
+    -DBUILD_TESTING=OFF \
+    -Bcmake-out-android/examples/models/llama \
+    examples/models/llama
+
+cmake --build cmake-out-android/examples/models/llama -j16 --config Release
+```
+Execute on Android using adb shell. You will need an Arm-based Android smartphone with the i8mm feature and at least 16GB of RAM; the steps below were validated on a Google Pixel 8 Pro.
+
+#### Create New Android Project
+Open Android Studio → New Project → Empty Activity.
+
+#### Add ExecuTorch Runtime to build.gradle
+```
+dependencies {
+    implementation files('libs/executorch.aar')
+}
+```
+
+#### Android phone connection
+Connect your Android device to your computer using a USB cable.
+
+Ensure that USB debugging is enabled on your device. You can follow the Configure on-device developer options guide to enable it.
+
+After enabling USB debugging and connecting the device via USB, run the following command:
+```
+adb devices
+```
+
+#### Push the model, tokenizer, and Llama runner
+```
+adb shell mkdir -p /data/local/tmp/llama
+adb push llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte /data/local/tmp/llama/
+adb push $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/tokenizer.model /data/local/tmp/llama/
+adb push cmake-out-android/examples/models/llama/llama_main /data/local/tmp/llama/
+```
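+Depending on the device, adb push does not always preserve the executable bit, so it is usually worth setting it explicitly before running the binary:
+```
+adb shell chmod +x /data/local/tmp/llama/llama_main
+```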
+#### Run the Model
+```
+adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte --tokenizer_path tokenizer.model --prompt '<|start_header_id|>system<|end_header_id|>\nYour name is Cookie. You are helpful, polite, precise, concise, honest, and good at writing. You always give precise and brief answers up to 32 words<|eot_id|><|start_header_id|>user<|end_header_id|>\nHey Cookie! How are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>' --warmup=1 --cpu_threads=5"
+```
diff --git a/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-7.md b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-7.md
new file mode 100644
index 0000000000..422f8c8d78
--- /dev/null
+++ b/content/learning-paths/embedded-and-microcontrollers/customer-support-chatbot-with-llama-and-executorch-on-arm-based-mobile-devices/how-to-7.md
@@ -0,0 +1,45 @@
+---
+title: Run, Testing and Benchmarking
+weight: 8
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+#### Build the Android Library (AAR)
+You can use the Android demo application included in the ExecuTorch repository, [LlamaDemo](https://github.com/pytorch/executorch/tree/main/examples/android/LlamaDemo), to showcase local inference with ExecuTorch.
+
+Open a terminal and navigate to the root directory of the ExecuTorch repository. Then set the following environment variables:
+
+```bash
+export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
+export ANDROID_ABI=arm64-v8a
+```
+Run the following commands to set up the required JNI library:
+```bash
+pushd extension/android
+./gradlew build
+popd
+pushd examples/demo-apps/android/LlamaDemo
+./gradlew :app:setup
+popd
+```
+Check if the files are available on the phone:
+```bash
+adb shell "ls -la /data/local/tmp/llama/"
+```
+If not, copy them, replacing the placeholders with the paths to your exported model and tokenizer:
+```
+adb shell mkdir -p /data/local/tmp/llama
+adb push <model.pte> /data/local/tmp/llama/
+adb push <tokenizer.model> /data/local/tmp/llama/
+```
+
+#### Build the Android Package Kit using Android Studio
+- Open Android Studio and choose Open an existing Android Studio project.
+- Navigate to examples/demo-apps/android/LlamaDemo and open it.
+- Run the app (^R) to build and launch it on your connected Android device.
+
+#### Measure Inference Latency
+Launch the app and inspect its memory usage with adb, substituting your app's package name if it differs from com.example.chatbot:
+```bash
+adb shell am start -n com.example.chatbot/.MainActivity
+adb shell dumpsys meminfo com.example.chatbot
+```
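+For a rough wall-clock latency number that is independent of the app, you can also time the CLI runner directly on the device (per-token statistics, where printed, vary by ExecuTorch version):
+```bash
+adb shell "cd /data/local/tmp/llama && time ./llama_main \
+  --model_path llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte \
+  --tokenizer_path tokenizer.model \
+  --prompt 'Hello' --warmup=1 --cpu_threads=5"
+```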