pytorch
diff --git a/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md
@@ -0,0 +1,152 @@
+# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device
+
+This tutorial assumes that you have a working local copy of the ExecuTorch repo,
+and have gone through the steps to install the executorch pip package or have
+installed it by building from source.
+
+This tutorial also assumes that you have the Android SDK tools installed and
+that you are able to connect to an Android device via `adb`.
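+
+As a quick sanity check of these prerequisites, the sketch below reports
+which tools are on your `PATH`. `require_cmd` is a hypothetical helper written
+for this example, and the tool list is an assumption based on the commands used
+later in this tutorial (which also reference the `ANDROID_NDK` environment
+variable).
+
+```shell
+# require_cmd NAME: succeeds if NAME resolves to a command (illustrative helper).
+require_cmd() {
+  command -v "$1" >/dev/null 2>&1
+}
+
+# Tools invoked by the commands in this tutorial.
+for tool in adb cmake python; do
+  if require_cmd "$tool"; then
+    echo "found:   $tool"
+  else
+    echo "missing: $tool" >&2
+  fi
+done
+
+# The CMake commands below read the NDK location from ANDROID_NDK.
+if [ -z "${ANDROID_NDK:-}" ]; then
+  echo "ANDROID_NDK is not set" >&2
+fi
+```
+
+If `adb` is found, `adb devices` should list your device before you continue.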
+
+## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer
+
+The model checkpoint and tokenizer can be downloaded from the
+[Meta Llama website](https://www.llama.com/llama-downloads/).
+
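+The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`
+(or `~/.llama/checkpoints/Llama3.2-3B-Instruct` for the 3B model). The commands
+below expect `consolidated.00.pth`, `params.json`, and `tokenizer.model` in
+that directory.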
+
+## Export the Llama 3.2 1B/3B model
+
+First, navigate to the root of the ExecuTorch repo.
+
+```shell
+# Navigate to executorch root
+cd ~/executorch
+```
+
+Then, set some environment variables to describe how the model should be
+exported. Feel free to tune the values to your preferences.
+
+```shell
+export LLM_NAME=Llama3.2 && \
+export LLM_SIZE=1B && \
+export LLM_SUFFIX="-Instruct" && \
+export QUANT=8da4w && \
+export BACKEND=vulkan && \
+export GROUP_SIZE=64 && \
+export CONTEXT_LENGTH=2048
+```
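+
+To see how these variables combine, the sketch below prints the checkpoint
+directory and the `.pte` file name that the export command in this tutorial
+builds from them. `CKPT_DIR` and `PTE_NAME` are hypothetical helper names used
+only for illustration; the tutorial commands spell the paths out in full.
+
+```shell
+# Same values as above; set LLM_SIZE=3B for the larger model.
+LLM_NAME=Llama3.2
+LLM_SIZE=1B
+LLM_SUFFIX="-Instruct"
+QUANT=8da4w
+BACKEND=vulkan
+GROUP_SIZE=64
+CONTEXT_LENGTH=2048
+
+# Illustrative helpers: where the checkpoint lives and what the export is named.
+CKPT_DIR="$HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}"
+PTE_NAME="llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte"
+
+echo "checkpoint dir: ${CKPT_DIR}"
+echo "exported model: ${CKPT_DIR}/${BACKEND}/${PTE_NAME}"
+```
+
+Changing `QUANT`, `GROUP_SIZE`, or `CONTEXT_LENGTH` changes the file name, so
+keep the same values exported in your shell for the push and run steps.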
+
+Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that
+the `--vulkan-force-fp16` flag is set, which improves model inference latency
+at the cost of model accuracy. Remove this flag if you prefer accuracy over speed.
+
+```shell
+mkdir -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/ && \
+python -m examples.models.llama.export_llama \
+    -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
+    -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
+    -d fp32 --${BACKEND} --vulkan-force-fp16 \
+    -qmode ${QUANT} -G ${GROUP_SIZE} \
+    --max_seq_length ${CONTEXT_LENGTH} \
+    --max_context_length ${CONTEXT_LENGTH} \
+    -kv --use_sdpa_with_kv_cache \
+    --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
+    --model "llama3_2" \
+    --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
+```
+
+After exporting the model, push the exported `.pte` file and the tokenizer to
+your device.
+
+```shell
+adb shell mkdir -p /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND} && \
+adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
+  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model && \
+adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
+  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
+```
+
+## Build Core ExecuTorch Components
+
+To run the `.pte` file on device, first compile the core libraries, including
+the Vulkan backend, for Android. The commands below assume that the
+`ANDROID_NDK` environment variable points to your Android NDK installation.
+
+```shell
+cmake . \
+    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
+    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
+    --preset "android-arm64-v8a" \
+    -DANDROID_PLATFORM=android-28 \
+    -DPYTHON_EXECUTABLE=python \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DEXECUTORCH_PAL_DEFAULT=posix \
+    -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
+    -DEXECUTORCH_BUILD_VULKAN=ON \
+    -DEXECUTORCH_BUILD_TESTS=OFF \
+    -Bcmake-out-android-so && \
+cmake --build cmake-out-android-so -j16 --target install --config Release
+```
+
+## Build and push the llama runner binary to Android
+
+Then, build a binary that can be used to run the `.pte` file.
+
+```shell
+cmake examples/models/llama \
+    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
+    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake  \
+    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
+    -DEXECUTORCH_ENABLE_LOGGING=ON \
+    -DANDROID_ABI=arm64-v8a \
+    -DANDROID_PLATFORM=android-28 \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DPYTHON_EXECUTABLE=python \
+    -Bcmake-out-android-so/examples/models/llama && \
+cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
+```
+
+Once the binary is built, it can be pushed to your Android device.
+
+```shell
+adb shell mkdir -p /data/local/tmp/etvk/ && \
+adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/
+```
+
+## Execute the llama runner binary
+
+Finally, execute the lowered `.pte` file on your device.
+
+```shell
+adb shell /data/local/tmp/etvk/llama_main \
+  --model_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
+  --tokenizer_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
+  --temperature=0 --seq_len=400 \
+  --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"
+```
+
+Here is some sample output captured from a Galaxy S24:
+
+```shell
+E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
+<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
+
+Here is a short poem I came up with:
+
+"Moonlight whispers secrets to the night
+A gentle breeze that rustles the light
+The stars up high, a twinkling show
+A peaceful world, where dreams grow slow"
+
+I hope you enjoy it!<|eot_id|>
+
+PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
+        Prompt Tokens: 14    Generated Tokens: 54
+        Model Load Time:                2.277000 (seconds)
+        Total inference time:           1.189000 (seconds)               Rate:  45.416316 (tokens/second)
+                Prompt evaluation:      0.164000 (seconds)               Rate:  85.365854 (tokens/second)
+                Generated 54 tokens:    1.025000 (seconds)               Rate:  52.682927 (tokens/second)
+        Time to first generated token:  0.164000 (seconds)
+        Sampling time over 68 tokens:   0.019000 (seconds)
+```
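+
+The rates in the report can be reproduced from the millisecond timestamps in
+the `PyTorchObserver` line. The sketch below does so for the sample above; the
+`rate` helper is a hypothetical name written for this example, and the formulas
+are inferred from the sample numbers rather than taken from the runner source.
+
+```shell
+# Values copied from the PyTorchObserver line in the sample output.
+prompt_tokens=14
+generated_tokens=54
+inference_start_ms=1760077802998
+inference_end_ms=1760077804187
+prompt_eval_end_ms=1760077803162
+
+# rate TOKENS ELAPSED_MS: print tokens per second (illustrative helper).
+rate() {
+  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.6f\n", n * 1000 / ms }'
+}
+
+# Generated tokens over the whole inference window.
+total_rate=$(rate "$generated_tokens" $((inference_end_ms - inference_start_ms)))
+# Prompt tokens over the prompt evaluation (prefill) window.
+prompt_rate=$(rate "$prompt_tokens" $((prompt_eval_end_ms - inference_start_ms)))
+# Generated tokens over the window after prompt evaluation.
+decode_rate=$(rate "$generated_tokens" $((inference_end_ms - prompt_eval_end_ms)))
+
+echo "total:  ${total_rate} tokens/second"
+echo "prompt: ${prompt_rate} tokens/second"
+echo "decode: ${decode_rate} tokens/second"
+```
+
+Substituting the values from your own run gives figures you can compare
+across devices or export settings.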