# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device

This tutorial assumes that you have a working local copy of the ExecuTorch repo,
and have gone through the steps to install the executorch pip package or have
installed it by building from source.

This tutorial also assumes that you have the Android SDK tools installed and
that you are able to connect to an Android device via `adb`.
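
As an optional check, confirm that `adb` can see your device before proceeding:

```shell
# The device should be listed with state "device" (not "unauthorized" or "offline")
adb devices
```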

## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer

The model checkpoint and tokenizer can be downloaded from the
[Meta Llama website](https://www.llama.com/llama-downloads/).

The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`.
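
As a quick sanity check before exporting, verify that the expected files are
present (assuming the standard layout of the Meta download):

```shell
ls ~/.llama/checkpoints/Llama3.2-1B-Instruct
# Should include: consolidated.00.pth  params.json  tokenizer.model
```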

## Export the Llama 3.2 1B/3B model

First, navigate to the root of the ExecuTorch repo.

```shell
# Navigate to executorch root
cd ~/executorch
```

Then, set some environment variables to describe how the model should be
exported. Feel free to tune the values to your preferences; note that later
commands reference these variables, so run everything in the same shell session.

```shell
export LLM_NAME=Llama3.2 && \
export LLM_SIZE=1B && \
export LLM_SUFFIX="-Instruct" && \
export QUANT=8da4w && \
export BACKEND=vulkan && \
export GROUP_SIZE=64 && \
export CONTEXT_LENGTH=2048
```
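
Later commands derive the exported file name from `QUANT`, `GROUP_SIZE`, and
`CONTEXT_LENGTH`; a quick way to preview the name that will be used:

```shell
# With the default values above this prints: llama3_8da4w_g64_c2048.pte
echo "llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte"
```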

Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that
the `--vulkan-force-fp16` flag is set, which reduces model inference latency at
the cost of some model accuracy. Feel free to remove this flag.

```shell
mkdir -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/ && \
python -m examples.models.llama.export_llama \
  -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
  -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
  -d fp32 --${BACKEND} --vulkan-force-fp16 \
  -qmode ${QUANT} -G ${GROUP_SIZE} \
  --max_seq_length ${CONTEXT_LENGTH} \
  --max_context_length ${CONTEXT_LENGTH} \
  -kv --use_sdpa_with_kv_cache \
  --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --model "llama3_2" \
  --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```
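
As an optional sanity check, confirm that the `.pte` file was written; its
exact size will vary with the options chosen above:

```shell
ls -lh $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/
```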

After exporting the model, push the exported `.pte` file and the tokenizer to
your device. Note that the destination paths use the same environment variables
as the export step.

```shell
adb shell mkdir -p /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND} && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

## Build Core ExecuTorch Components

To be able to run the `.pte` file on device, the core ExecuTorch libraries,
including the Vulkan backend, must first be compiled for Android.
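
The command below reads the `ANDROID_NDK` environment variable, so make sure
it points at a local NDK installation; the path here is only a placeholder:

```shell
# Placeholder path; substitute the location of your NDK installation
export ANDROID_NDK=/path/to/android-ndk
```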

```shell
cmake . \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
  --preset "android-arm64-v8a" \
  -DANDROID_PLATFORM=android-28 \
  -DPYTHON_EXECUTABLE=python \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_PAL_DEFAULT=posix \
  -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
  -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
  -DEXECUTORCH_BUILD_VULKAN=ON \
  -DEXECUTORCH_BUILD_TESTS=OFF \
  -Bcmake-out-android-so && \
cmake --build cmake-out-android-so -j16 --target install --config Release
```

## Build and push the llama runner binary to Android

Then, build a binary that can be used to run the `.pte` file.

```shell
cmake examples/models/llama \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
  -DEXECUTORCH_ENABLE_LOGGING=ON \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_BUILD_TYPE=Release \
  -DPYTHON_EXECUTABLE=python \
  -Bcmake-out-android-so/examples/models/llama && \
cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
```
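
As an optional check (assuming the `file` utility is available on your host),
the resulting binary should report as an AArch64 ELF executable:

```shell
file cmake-out-android-so/examples/models/llama/llama_main
```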

Once the binary is built, it can be pushed to your Android device.

```shell
adb shell mkdir -p /data/local/tmp/etvk/ && \
adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/
```
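
If executing the binary later fails with a permission error, the executable
bit may need to be set:

```shell
adb shell chmod +x /data/local/tmp/etvk/llama_main
```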

## Execute the llama runner binary

Finally, we can execute the lowered `.pte` file on your device.

```shell
adb shell /data/local/tmp/etvk/llama_main \
  --model_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  --tokenizer_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  --temperature=0 --seq_len=400 \
  --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"
```

Here is some sample output captured from a Galaxy S24:

```shell
E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is a short poem I came up with:

"Moonlight whispers secrets to the night
A gentle breeze that rustles the light
The stars up high, a twinkling show
A peaceful world, where dreams grow slow"

I hope you enjoy it!<|eot_id|>

PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
Prompt Tokens: 14 Generated Tokens: 54
Model Load Time: 2.277000 (seconds)
Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second)
Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second)
Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second)
Time to first generated token: 0.164000 (seconds)
Sampling time over 68 tokens: 0.019000 (seconds)
```
