diff --git a/docs/source/_static/img/chat.png b/docs/source/_static/img/chat.png
new file mode 100644
index 00000000000..e7ed934519d
Binary files /dev/null and b/docs/source/_static/img/chat.png differ
diff --git a/docs/source/_static/img/chat_response.png b/docs/source/_static/img/chat_response.png
new file mode 100644
index 00000000000..714265276fe
Binary files /dev/null and b/docs/source/_static/img/chat_response.png differ
diff --git a/docs/source/_static/img/llava_example.png b/docs/source/_static/img/llava_example.png
new file mode 100644
index 00000000000..ccac335ee65
Binary files /dev/null and b/docs/source/_static/img/llava_example.png differ
diff --git a/docs/source/_static/img/load_complete_and_start_prompt.png b/docs/source/_static/img/load_complete_and_start_prompt.png
new file mode 100644
index 00000000000..43d81f10d00
Binary files /dev/null and b/docs/source/_static/img/load_complete_and_start_prompt.png differ
diff --git a/docs/source/_static/img/logs.png b/docs/source/_static/img/logs.png
new file mode 100644
index 00000000000..e35227a1c0c
Binary files /dev/null and b/docs/source/_static/img/logs.png differ
diff --git a/docs/source/_static/img/mtk_changes_to_shell_file.png b/docs/source/_static/img/mtk_changes_to_shell_file.png
new file mode 100644
index 00000000000..7fa4e461863
Binary files /dev/null and b/docs/source/_static/img/mtk_changes_to_shell_file.png differ
diff --git a/docs/source/_static/img/mtk_output.png b/docs/source/_static/img/mtk_output.png
new file mode 100644
index 00000000000..e41d54c3561
Binary files /dev/null and b/docs/source/_static/img/mtk_output.png differ
diff --git a/docs/source/_static/img/opening_the_app_details.png b/docs/source/_static/img/opening_the_app_details.png
new file mode 100644
index 00000000000..60494ecc69d
Binary files /dev/null and b/docs/source/_static/img/opening_the_app_details.png differ
diff --git a/docs/source/_static/img/settings_menu.png b/docs/source/_static/img/settings_menu.png
new file mode 100644
index 00000000000..028e6b55cd7
Binary files /dev/null and b/docs/source/_static/img/settings_menu.png differ
diff --git a/docs/source/llm/llama-demo-android.md b/docs/source/llm/llama-demo-android.md
index 023f82baf33..ce2d25a4a89 100644
--- a/docs/source/llm/llama-demo-android.md
+++ b/docs/source/llm/llama-demo-android.md
@@ -1,2 +1,141 @@
-```{include} ../../../examples/demo-apps/android/LlamaDemo/README.md

# ExecuTorch Llama Android Demo App

We’re excited to share that the newly revamped Android demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an Android demo app, and how to exercise the many features ExecuTorch and Llama models have to offer.

This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.

Please dive in and start exploring our demo app today! We look forward to any feedback and are excited to see your innovative ideas.

## Key Concepts
From this demo app, you will learn many key concepts such as:
* How to prepare Llama models, build the ExecuTorch library, and run model inference across delegates
* How to expose the ExecuTorch library via a JNI layer
* The current app-facing capabilities of ExecuTorch

The goal is for you to see the type of support ExecuTorch provides and feel comfortable leveraging it for your use cases.
## Supported Models
The app supports the following models (availability varies by delegate):
* Llama 3.1 8B
* Llama 3 8B
* Llama 2 7B
* LLaVA-1.5 vision model (XNNPACK only)

## Building the APK
First, note that ExecuTorch currently provides support across three delegates. Once you have identified the delegate of your choice, follow the corresponding README for complete end-to-end instructions, from environment setup through exporting the models and building the ExecuTorch libraries and apps to run on device:

| Delegate | Resource |
| ------------- | ------------- |
| XNNPACK (CPU-based library) | [link](docs/delegates/xnnpack_README.md) |
| QNN (Qualcomm AI Accelerators) | [link](docs/delegates/qualcomm_README.md) |
| MediaTek (MediaTek AI Accelerators) | [link](docs/delegates/mediatek_README.md) |

## How to Use the App

This section covers the main steps for using the app, along with code snippets of the ExecuTorch API.

For loading the app, development, and running on device, we recommend Android Studio:
1. Open Android Studio and select "Open an existing Android Studio project" to open examples/demo-apps/android/LlamaDemo.
2. Run the app (^R). This builds and launches the app on the phone.

### Opening the App

Below are the UI features of the app.

Select the settings widget to get started with picking a model, its parameters, and any prompts.
### Select Models and Parameters

Once you have selected the model, tokenizer, and model type, you are ready to click "Load Model" to have the app load the model and return to the main Chat activity.
Optional Parameters:
* Temperature: Defaulted to 0. You can adjust the temperature for the model as well; the model will reload upon any adjustment.
* System Prompt: Without any formatting, you can enter a system prompt, for example "you are a travel assistant" or "give me a response in a few sentences".
* User Prompt: Mainly for advanced users, if you would like to manually input a prompt, you can do so by modifying the `{{user prompt}}` placeholder. You can also modify the special tokens. Once changed, go back to the main Chat activity to send.

> [!TIP]
> Helpful ExecuTorch API in app

```java
// Upon returning to the Main Chat Activity
mModule = new LlamaModule(
    ModelUtils.getModelCategory(mCurrentSettingsFields.getModelType()),
    modelPath,
    tokenizerPath,
    temperature);
int loadResult = mModule.load();
```

* `modelCategory`: Indicates whether it is a text-only or vision model
* `modelPath`: path to the .pte file
* `tokenizerPath`: path to the tokenizer .bin file
* `temperature`: model parameter to adjust the randomness of the model's output
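Before letting the user send a prompt, it is worth checking the status that `load()` returns. Below is a minimal sketch, assuming a return value of 0 indicates success (as the snippet above suggests); the log tag and message are our own, not from the app:

```java
// Sketch: surface load failures instead of failing silently. Assumes a
// non-zero loadResult means the load failed.
if (loadResult != 0) {
    // Common causes (assumption): wrong model/tokenizer path, or a .pte
    // exported for a different delegate than this build supports.
    android.util.Log.e("LlamaDemo", "Model load failed with status " + loadResult);
}
```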
### User Prompt
Once the model has loaded successfully, enter any prompt and click the send (i.e. generate) button to send it to the model.

You can provide follow-up questions as well.
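The `generate()` API shown in the tip below expects a prompt that already carries the model's chat-template formatting. As a rough illustration of what that means for a Llama 3 Instruct checkpoint (the token layout follows Meta's published Llama 3 chat format; the string literals are illustrative, not taken from the app's source):

```java
// Illustrative only: a fully formatted Llama 3 Instruct turn. The demo app
// assembles this for you from the system prompt and user prompt settings.
String systemPrompt = "you are a travel assistant";
String userPrompt = "what can I do in Paris in one day?";
String formattedPrompt =
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        + systemPrompt + "<|eot_id|>"
        + "<|start_header_id|>user<|end_header_id|>\n\n"
        + userPrompt + "<|eot_id|>"
        + "<|start_header_id|>assistant<|end_header_id|>\n\n";
```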
> [!TIP]
> Helpful ExecuTorch API in app
```java
mModule.generate(prompt, sequence_length, MainActivity.this);
```
* `prompt`: User-formatted prompt
* `sequence_length`: Number of tokens to generate in response to the prompt
* `MainActivity.this`: Indicates that the callback functions (`onResult()`, `onStats()`) are implemented in this class.

[*LLaVA-1.5: Only for XNNPACK delegate*]

For the LLaVA-1.5 implementation, select the exported LLaVA .pte and tokenizer files in the Settings menu and load the model. After this, you can send an image from your gallery, or take a live picture, along with a text prompt to the model.
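Image inputs go through an image-capable variant of `generate()`. The sketch below is only an assumption about the shape of that call; the parameter names, their order, and the `getPixels` helper are hypothetical, so check `LlamaModule` in the ExecuTorch Android extension for the actual signature:

```java
// Hypothetical sketch: pass raw pixel data alongside the text prompt.
// The identifiers below are illustrative, not verified against the real API.
int[] pixels = getPixels(bitmap); // hypothetical helper extracting ARGB pixels
mModule.generate(pixels, bitmap.getWidth(), bitmap.getHeight(), /*channels=*/3,
    formattedPrompt, sequenceLength, MainActivity.this);
```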
### Output Generated
To show completion of the follow-up question, here is the complete, detailed response from the model.
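One practical note before the tip below: `generate()` blocks until generation completes, so apps typically invoke it on a worker thread, which means the callbacks fire off the UI thread. A sketch of handling that (the threading assumption and `mResultTextView` are illustrative, not code from the demo):

```java
@Override
public void onResult(String result) {
    // Tokens arrive from the generation thread; hop to the UI thread
    // before touching views. mResultTextView is a hypothetical TextView.
    runOnUiThread(() -> mResultTextView.append(result));
}
```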
> [!TIP]
> Helpful ExecuTorch API in app

Ensure you have the following functions in the callback class that you provided to `mModule.generate()`. For this example, it is `MainActivity.this`.
```java
@Override
public void onResult(String result) {
    // result contains a token from the response;
    // onResult will continue to be invoked until the response is complete
}

@Override
public void onStats(float tps) {
    // tps (tokens per second) stats are provided by the framework
}
```

## Reporting Issues
If you encounter any bugs or issues while following this tutorial, please file an issue on [GitHub](https://github.com/pytorch/executorch/issues/new).
diff --git a/examples/demo-apps/android/ExecuTorchDemo/README.md b/examples/demo-apps/android/ExecuTorchDemo/README.md
index 9af1f5266eb..a60307dd90f 100644
--- a/examples/demo-apps/android/ExecuTorchDemo/README.md
+++ b/examples/demo-apps/android/ExecuTorchDemo/README.md
@@ -78,6 +78,7 @@ cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
   -DEXECUTORCH_BUILD_XNNPACK=ON \
   -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
   -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
+  -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
   -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
   -Bcmake-android-out

@@ -120,6 +121,7 @@ cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
   -DQNN_SDK_ROOT="${QNN_SDK_ROOT}" \
   -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
   -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
+  -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
   -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
   -Bcmake-android-out

diff --git a/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md b/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md
index 1accc28a937..18cd32877ac 100644
--- a/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md
+++ b/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md
@@ -1,6 +1,6 @@
# Building ExecuTorch Android Demo App for Llama running Qualcomm
-This tutorial covers the end to end workflow for building an android demo app using Qualcomm AI accelerators on device.
+This tutorial covers the end-to-end workflow for building an Android demo app using Qualcomm AI accelerators on device.
More specifically, it covers:
1. Export and quantization of Llama models against the Qualcomm backend.
2. Building and linking the libraries required for on-device inference on Android using Qualcomm AI accelerators.

@@ -11,10 +11,10 @@ Verified on Linux CentOS, QNN SDK [v2.26](https://softwarecenter.qualcomm.com/ap
Phone verified: OnePlus 12, Samsung 24+, Samsung 23

## Prerequisites
-* Download and unzip QNN SDK [v2.26](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip)
+* Download and unzip QNN SDK [v2.26](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip)
* Download and unzip the Android NDK [r27](https://developer.android.com/ndk/downloads)
* Android phone with Snapdragon 8 Gen 3 (SM8650) or Gen 2 (SM8550). Gen 1 and lower SoCs might be supported but are not fully validated.
-* Desired Llama model weights in .PTH format. You can download them on HuggingFace ([Example](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)).
+* Desired Llama model weights in .PTH format. You can download them on HuggingFace ([Example](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)).
## Setup ExecuTorch
In this section, we will set up the ExecuTorch repo with Conda for environment management. Make sure you have Conda available on your system (or follow the instructions to install it [here](https://anaconda.org/anaconda/conda)). The commands below were run on Linux (CentOS).

@@ -37,7 +37,7 @@ Install dependencies
./install_requirements.sh
```

-## Setup QNN 
+## Setup QNN
```
# Set these variables correctly for your environment
export ANDROID_NDK_ROOT=$HOME/android-ndk-r27 # Download the Android NDK and unzip it to your home directory

@@ -71,8 +71,8 @@ cmake --build cmake-out -j16 --target install --config Release

-### Setup Llama Runner 
-Next we need to build and compile the Llama runner. This is similar to the requirements for running Llama with XNNPack. 
+### Setup Llama Runner
+Next we need to build and compile the Llama runner. This is similar to the requirements for running Llama with XNNPACK.
```
sh examples/models/llama2/install_requirements.sh
```

@@ -103,7 +103,7 @@ Examples:
# 4 bits weight only quantize
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
```
-If the model is really big, it may require model sharding because the Qualcomm DSP is a 32bit system and has a 4GB size limit . For example for Llama 3 8B models, we need to shard the model into 4, but ExecuTorch still packages it into one PTE file. Here is an example:
+If the model is really big, it may require model sharding, because the Qualcomm DSP is a 32-bit system with a 4GB size limit. For example, for Llama 3 8B models we need to shard the model into 4 shards, but ExecuTorch still packages them into one PTE file. Here is an example:
```
# 8 bits quantization with 4 shards
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --num_sharding 4 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
```
@@ -113,7 +113,7 @@ Note: if you encounter the error below
[ERROR] [Qnn ExecuTorch]: Cannot Open QNN library libQnnHtp.so, with error: libc++.so.1: cannot open shared object file: No such file or directory
```
-Resolve by: 
+Resolve it by one of the following:
* Installing an older QNN version, such as 2.23 or below, and copying libc++ from ${QNN_SDK_ROOT}/lib/x86_64-linux-clang
* Installing it yourself with apt-get

@@ -124,9 +124,9 @@ You could refer to the [QNN SDK document](https://docs.qualcomm.com/bundle/publicres
conda install -c conda-forge libcxx=14.0.0
```

-After installment, you will need to check libc++.so.1 in your LD_LIBRARY_PATH or system lib. Refer to this [PR](https://github.com/pytorch/executorch/issues/5120) for more detail. 
+After installation, check that libc++.so.1 is on your LD_LIBRARY_PATH or in your system lib directory. Refer to this [issue](https://github.com/pytorch/executorch/issues/5120) for more detail.

-You may also wonder what the "--metadata" flag is doing. This flag helps export the model with proper special tokens added that the runner can detect EOS tokens easily. 
+You may also wonder what the `--metadata` flag does. It exports the model with the proper special tokens added so that the runner can easily detect EOS tokens.
Convert tokenizer for Llama 2
```

@@ -179,7 +179,7 @@ Set the following environment variables:
export ANDROID_NDK=<path_to_android_ndk>
export ANDROID_ABI=arm64-v8a
```
-Note: <path_to_android_ndk> is the root for the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ for macOS, and contains NOTICE and README.md. We use <path_to_android_ndk>/build/cmake/android.toolchain.cmake for CMake to cross-compile.
+Note: <path_to_android_ndk> is the root directory of the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ on macOS and contains NOTICE and README.md. We use <path_to_android_ndk>/build/cmake/android.toolchain.cmake for CMake to cross-compile.
Build the Android Java extension code:
```
pushd extension/android

popd
```
@@ -194,7 +194,7 @@ popd
```
Alternatively, you can just run the shell script directly from the root directory:
```
-sh examples/demo-apps/android/LlamaDemo/setup-with-qnn.sh 
+sh examples/demo-apps/android/LlamaDemo/setup-with-qnn.sh
```
This runs the shell script, which configures the required core ExecuTorch, Llama 2/3, and Android libraries, builds them, and copies them to jniLibs. Note: if you are building the Android app mentioned in the next section on a separate machine (e.g. on macOS while building and exporting for the QNN backend on Linux), make sure you copy the AAR file generated by the setup-with-qnn script to "examples/demo-apps/android/LlamaDemo/app/libs" before building the Android app.