Commit fe53d41

shewu-quic authored and facebook-github-bot committed
Qualcomm AI Engine Direct - Add the tutorial to deploy llama3 8B Instruct (#5335)
Summary: Pull Request resolved: #5335
Reviewed By: kirklandsign
Differential Revision: D62619069
Pulled By: cccclai
fbshipit-source-id: dff3e0ef7bc2929619ddd663c6a9c719961dc688
1 parent fcbbef4 commit fe53d41

File tree

3 files changed: +132 -6 lines changed


docs/source/build-run-qualcomm-ai-engine-direct-backend.md

Lines changed: 2 additions & 4 deletions
````diff
@@ -59,9 +59,7 @@ This example is verified with SM8550 and SM8450.
 - Click the "Get Software" button to download a version of QNN SDK.
 - However, at the time of updating this tutorial, the above website does not provide a QNN SDK newer than 2.22.6.
 - Below are public links to download various QNN versions. We hope they will become publicly discoverable soon.
-- [QNN 2.25.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.25.0.240728.zip)
-- [QNN 2.24.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.24.0.240626.zip)
-- [QNN 2.23.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.23.0.24.06.24.zip)
+- [QNN 2.26.0](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip)

 The directory with installed Qualcomm AI Engine Direct SDK looks like:
 ```
````
````diff
@@ -356,7 +354,7 @@ Please refer to `$EXECUTORCH_ROOT/examples/qualcomm/scripts/` and `EXECUTORCH_RO

 ## What is coming?

-- [llama2 and llama3](https://github.com/pytorch/executorch/pull/4030). Note that at the moment of writing, we still suffer from the quantization issue in llama2-7B and llama3-8B cases. Only storiesllama works well.
+- Improve the performance for llama3-8B-Instruct and support batch prefill.
 - We will support pre-compiled binaries from [Qualcomm AI Hub](https://aihub.qualcomm.com/).

 ## FAQ
````

docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md

Lines changed: 128 additions & 0 deletions (new file)

# Building and Running Llama 3 8B Instruct with Qualcomm AI Engine Direct Backend

This tutorial demonstrates how to export Llama 3 8B Instruct for the Qualcomm AI Engine Direct Backend and run the model on a Qualcomm device.

## Prerequisites

- Set up your ExecuTorch repo and environment by following [Setting up ExecuTorch](../getting-started-setup.md).
- Read [Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend](../build-run-qualcomm-ai-engine-direct-backend.md) to understand how to export and run a model with the Qualcomm AI Engine Direct Backend on a Qualcomm device.
- Follow [the README for ExecuTorch llama](https://github.com/pytorch/executorch/tree/main/examples/models/llama2) to learn how to run a llama model on mobile via ExecuTorch.
- A Qualcomm device with 16GB RAM.
  - We are continuing to optimize our memory usage to ensure compatibility with lower-memory devices.
- [Qualcomm AI Engine Direct SDK](https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk) version 2.26.0 or above (a quick way to check your installed version is sketched below).
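
If you are unsure which SDK version you have, recent QNN 2.x releases record it in a `sdk.yaml` manifest at the SDK root; a quick check, assuming your release ships that manifest:

```bash
# Print the installed QNN SDK version (assumes the release ships a sdk.yaml
# manifest at the SDK root, as recent 2.x releases do).
grep version "${QNN_SDK_ROOT}/sdk.yaml"
```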

## Instructions

### Step 1: Prepare the model checkpoint and the optimized matrices from [SpinQuant](https://github.com/facebookresearch/SpinQuant)

1. For the Llama 3 tokenizer and checkpoint, please refer to https://github.com/meta-llama/llama-models/blob/main/README.md for instructions on how to download `tokenizer.model`, `consolidated.00.pth` and `params.json`.
2. To get the optimized matrices, please refer to [SpinQuant on GitHub](https://github.com/facebookresearch/SpinQuant). You can download the optimized rotation matrices in the Quantized Models section; please choose **LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0**.

### Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend

Deploying large language models like Llama 3 on-device presents the following challenges:

1. The model is too large to fit in device memory for inference.
2. Model loading and inference are slow.
3. The model is difficult to quantize.

To address these challenges, we have implemented the following solutions:

1. Use `--pt2e_quantize qnn_16a4w` to quantize activations and weights, reducing the on-disk model size and alleviating memory pressure during inference.
2. Use `--num_sharding 8` to shard the model into sub-parts.
3. Perform graph transformations to convert or decompose operations into more accelerator-friendly operations.
4. Use `--optimized_rotation_path <path_to_optimized_matrix>` to apply the R1 and R2 rotations of [SpinQuant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
5. Use `--calibration_data "<|start_header_id|>system<|end_header_id|..."` so that calibration during quantization of Llama 3 8B Instruct includes the special tokens of the prompt template; one way to assemble this string is sketched after this list. For more details on the prompt template, refer to [the model card of meta llama3 instruct](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).
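
Since the calibration string is long and full of special tokens, building it in a shell variable first makes it easier to audit. The system and user prompts below are just the examples used in this tutorial, and the variable names are for illustration only:

```bash
# Llama 3 Instruct prompt template, assembled piece by piece so the special
# tokens and the literal \n escapes are easy to check. Substitute your own
# prompts as needed; the \n sequences are kept literal, matching the export
# command below.
SYSTEM_PROMPT="You are a funny chatbot."
USER_PROMPT="Could you tell me about Facebook?"
CALIBRATION_DATA="<|start_header_id|>system<|end_header_id|>\n\n${SYSTEM_PROMPT}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n${USER_PROMPT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

You can then pass `--calibration_data "${CALIBRATION_DATA}"` to the export command below.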

To export Llama 3 8B Instruct with the Qualcomm AI Engine Direct Backend, note the following:

1. The host machine needs more than 100GB of memory (RAM + swap space); a quick check is sketched below.
2. The entire process takes a few hours.
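
A rough way to confirm the host has enough combined memory before starting (standard Linux tooling; adjust for your platform):

```bash
# Sum total RAM and swap in GB; for this export it should exceed ~100GB.
free -g | awk '/^Mem:|^Swap:/ {total += $2} END {print total " GB (RAM + swap)"}'
```

With memory confirmed, run the export: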

```bash
# Please note that calibration_data must include the prompt template for the special tokens.
python -m examples.models.llama2.export_llama -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```
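
When the export finishes, a `.pte` artifact is written to the working directory. A quick sanity check on its size (the glob below is illustrative; the exact file name depends on your export settings):

```bash
# The 16a4w-quantized, 8-way-sharded model should be a few GB on disk.
ls -lh ./*.pte
```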

### Step 3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs

1. Build ExecuTorch with the Qualcomm AI Engine Direct Backend for Android:
```bash
cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out .

cmake --build cmake-android-out -j16 --target install --config Release
```
2. Build the llama runner for Android:
```bash
cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}"/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out/examples/models/llama2 examples/models/llama2

cmake --build cmake-android-out/examples/models/llama2 -j16 --config Release
```
3. Run on Android via adb shell.

*Prerequisite*: Make sure you enable USB debugging via Developer Options on your phone.

**3.1 Connect your Android phone**

**3.2 Push the required QNN libraries to the device**
```bash
# Make sure you have write permission on the path below.
DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
```
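
Before moving on, a quick sanity check that the device is visible and the libraries actually landed:

```bash
# The device should be listed with state "device" (not "unauthorized").
adb devices
# All eight QNN libraries pushed above should appear here.
adb shell ls -l ${DEVICE_DIR}
```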

**3.3 Upload the model, tokenizer and llama runner binary to the phone**
```bash
adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
adb push cmake-android-out/examples/models/llama2/llama_main ${DEVICE_DIR}
```

**3.4 Run model**
```bash
adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"
```

You should see output like the following:
```
<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello! I'd be delighted to chat with you about Facebook. Facebook is a social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while he was a student at Harvard University. It was initially called "Facemaker" but later changed to Facebook, which is a combination of the words "face" and "book". The platform was initially intended for people to share their thoughts and share information with their friends, but it quickly grew to become one of the
```

## What is coming?

- Improve the performance of Llama 3 Instruct
- Reduce memory pressure during inference to support 12GB Qualcomm devices
- Support more LLMs

## FAQ

If you encounter any issues while reproducing the tutorial, please file a GitHub issue on the ExecuTorch repo and use the `#qcom_aisw` tag.

docs/source/llm/getting-started.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -587,8 +587,8 @@ I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a ver
 The delegated model should be noticeably faster compared to the non-delegated model.

 For more information regarding backend delegation, see the ExecuTorch guides
-for the [XNNPACK Backend](../tutorial-xnnpack-delegate-lowering.md) and [Core ML
-Backend](../build-run-coreml.md).
+for the [XNNPACK Backend](../tutorial-xnnpack-delegate-lowering.md), [Core ML
+Backend](../build-run-coreml.md) and [Qualcomm AI Engine Direct Backend](build-run-llama3-qualcomm-ai-engine-direct-backend.md).

 ## Quantization
````