
Commit d011ad8

[ET-VK][docs] Update to the new template
1 parent b577bd8 commit d011ad8

File tree

10 files changed: +925 -255 lines changed


backends/vulkan/README.md

Lines changed: 3 additions & 204 deletions
@@ -1,205 +1,4 @@

Removed (the previous README content):

# Vulkan Backend

The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
built on top of the cross-platform Vulkan GPU API standard. It is primarily
designed to leverage the GPU to accelerate model inference on Android devices,
but can be used on any platform that supports an implementation of Vulkan:
laptops, servers, and edge devices.

::::{note}
The Vulkan delegate is currently under active development, and its components
are subject to change.
::::

## What is Vulkan?

Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
It is designed to offer developers more explicit control over GPUs compared to
previous specifications, in order to reduce overhead and maximize the
capabilities of modern graphics hardware.

Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
desktop and mobile) on the market support Vulkan. Vulkan is also included in
Android from Android 7.0 onwards.

**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say, it
provides a way to execute compute and graphics operations on a GPU, but does not
come with a built-in library of performant compute kernels.

## The Vulkan Compute Library

The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
provide GPU implementations for PyTorch operators via GLSL compute shaders.

The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
The core components of the PyTorch Vulkan backend were forked into ExecuTorch
and adapted for an AOT graph-mode style of model inference (as opposed to
PyTorch, which adopted an eager execution style of model inference).

The components of the Vulkan Compute Library are contained in the
`executorch/backends/vulkan/runtime/` directory. The core components are listed
and described below:

```
runtime/
├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
└── graph/ .................. ComputeGraph class which implements graph mode inference
    └── ops/ ................ Base directory for operator implementations
        ├── glsl/ ........... GLSL compute shaders
        │   ├── *.glsl
        │   └── conv2d.glsl
        └── impl/ ........... C++ code to dispatch GPU compute shaders
            ├── *.cpp
            └── Conv2d.cpp
```

## Features

The Vulkan delegate currently supports the following features:

* **Memory Planning**
  * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
* **Capability Based Partitioning**:
  * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only the supported subgraphs.
* **Support for upper-bound dynamic shapes**:
  * Tensors can change shape between inferences as long as their current shapes are smaller than the bounds specified during lowering (a sketch follows this list).
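
For reference, here is a minimal sketch of how an upper-bound dynamic dimension can be declared at export time via `torch.export`'s `dynamic_shapes` argument. The `Add` module, the `seq_len` dimension name, and the bound of 1024 are placeholder choices for illustration, not values taken from this document:

```python
# A sketch of declaring an upper-bound dynamic dimension at export time.
# The Add module and the bound of 1024 are illustrative, not from this doc.
import torch
from torch.export import Dim, export

class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# Mark dimension 0 of both inputs as dynamic, with an upper bound of 1024.
dim = Dim("seq_len", max=1024)
aten_dialect = export(
    Add(),
    (torch.ones(512), torch.ones(512)),
    dynamic_shapes={"x": {0: dim}, "y": {0: dim}},
)
# The remaining lowering steps (to_edge, to_backend, to_executorch) are the same
# as in the End to End Example below; at runtime the inputs may then take any
# shape up to the declared bound.
```
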
In addition to increasing operator coverage, the following features are
currently in development:

* **Quantization Support**
  * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
* **Memory Layout Management**
  * Memory layout is an important factor in optimizing performance. We plan to introduce graph passes that insert memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
* **Selective Build**
  * We plan to make it possible to control build size by selecting which operators/shaders to include in the build.

## End to End Example

To further understand the features of the Vulkan Delegate and how to use it,
consider the following end to end example with a simple single operator model.

### Compile and lower a model to the Vulkan Delegate

Once ExecuTorch has been set up and installed, the following script can be used
to generate a simple model and lower it to the Vulkan delegate.

```python
# Note: this script is the same as the script from the "Setting up ExecuTorch"
# page, with one minor addition to lower to the Vulkan backend.
import torch
from torch.export import export
from executorch.exir import to_edge

from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner

# Start with a PyTorch model that adds two input tensors (matrices)
class Add(torch.nn.Module):
    def __init__(self):
        super(Add, self).__init__()

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# 1. torch.export: Defines the program with the ATen operator set.
aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))

# 2. to_edge: Make optimizations for Edge devices
edge_program = to_edge(aten_dialect)
# 2.1 Lower to the Vulkan backend
edge_program = edge_program.to_backend(VulkanPartitioner())

# 3. to_executorch: Convert the graph to an ExecuTorch program
executorch_program = edge_program.to_executorch()

# 4. Save the compiled .pte program
with open("vk_add.pte", "wb") as file:
    file.write(executorch_program.buffer)
```

Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
using the `to_backend()` API. The Vulkan Delegate implements the
`VulkanPartitioner` class, which identifies nodes (i.e. operators) in the graph
that are supported by the Vulkan delegate and separates compatible sections of
the model to be executed on the GPU.

This means that a model can be lowered to the Vulkan delegate even if it contains
some unsupported operators; only the supported parts of the graph will be
executed on the GPU.

::::{note}
The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194)
in the Vulkan partitioner code can be inspected to examine which ops are
currently implemented in the Vulkan delegate.
::::
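
One way to check how much of a graph was actually lowered is to print the program after partitioning and look for `executorch_call_delegate` nodes. This is a quick sketch that reuses the `edge_program` from the export script above:

```python
# Continuing from the script above, after edge_program.to_backend(VulkanPartitioner()).
# Subgraphs handed to the Vulkan delegate appear as executorch_call_delegate calls;
# any remaining ATen/edge ops will fall back to the portable CPU kernels.
print(edge_program.exported_program().graph_module.code)
```
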
### Build Vulkan Delegate libraries

The easiest way to build and test the Vulkan Delegate is to build for Android
and test on a local Android device. Android devices have built-in support for
Vulkan, and the Android NDK ships with a GLSL compiler, which is needed to
compile the Vulkan Compute Library's GLSL compute shaders.

The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
when building with CMake.

First, make sure that you have the Android NDK installed; any NDK version past
NDK r19c should work. Note that the examples in this doc have been validated with
NDK r28c. The Android SDK should also be installed so that you have access to `adb`.

The instructions on this page assume that the following environment variables
are set.

```shell
export ANDROID_NDK=<path_to_ndk>
# Select the appropriate Android ABI for your device
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

To build and install ExecuTorch libraries (for Android) with the Vulkan
Delegate:

```shell
# From executorch root directory
(rm -rf cmake-android-out && \
  cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=$ANDROID_ABI \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-android-out && \
  cmake --build cmake-android-out -j16 --target install)
```

### Run the Vulkan model on device

::::{note}
Since operator support is currently limited, only binary arithmetic operators
will run on the GPU. Expect inference to be slow, as the majority of operators
are executed via Portable operators.
::::

Now, the partially delegated model can be executed on your device's GPU!

```shell
# Build a model runner binary linked with the Vulkan delegate libs
cmake --build cmake-android-out --target executor_runner -j32

# Push model to device
adb push vk_add.pte /data/local/tmp/vk_add.pte
# Push binary to device
adb push cmake-android-out/executor_runner /data/local/tmp/runner_bin

# Run the model
adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
```

Added in its place:

# The ExecuTorch Vulkan Backend

Please see the [Vulkan Backend Overview](../../docs/source/backends/vulkan/vulkan-overview.md)
to learn more about the ExecuTorch Vulkan Backend.

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@

# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device

This tutorial assumes that you have a working local copy of the ExecuTorch repo,
and have gone through the steps to install the executorch pip package or have
installed it by building from source.

This tutorial also assumes that you have the Android SDK tools installed and
that you are able to connect to an Android device via `adb`.

## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer

The model checkpoint and tokenizer can be downloaded from the
[Meta Llama website](https://www.llama.com/llama-downloads/).

The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`.
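
As a sanity check (a suggested step, not part of the original instructions), you can confirm that the checkpoint directory contains the files referenced by the export and push commands below:

```shell
# Assumes the default download location used in this tutorial.
# The export and push steps below expect consolidated.00.pth, params.json,
# and tokenizer.model to be present here.
ls ~/.llama/checkpoints/Llama3.2-1B-Instruct
```
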
## Export the Llama 3.2 1B/3B model

First, navigate to the root of the ExecuTorch repo.

```shell
# Navigate to executorch root
cd ~/executorch
```

Then, set some environment variables to describe how the model should be
exported. Feel free to tune the values to your preferences.

```shell
export LLM_NAME=Llama3.2 && \
export LLM_SIZE=1B && \
export LLM_SUFFIX="-Instruct" && \
export QUANT=8da4w && \
export BACKEND=vulkan && \
export GROUP_SIZE=64 && \
export CONTEXT_LENGTH=2048
```

Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that
the `--vulkan-force-fp16` flag is set, which will improve model inference
latency at the cost of model accuracy. Feel free to remove this flag.

```shell
mkdir $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/ && \
python -m examples.models.llama.export_llama \
  -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
  -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
  -d fp32 --${BACKEND} --vulkan-force-fp16 \
  -qmode ${QUANT} -G ${GROUP_SIZE} \
  --max_seq_length ${CONTEXT_LENGTH} \
  --max_context_length ${CONTEXT_LENGTH} \
  -kv --use_sdpa_with_kv_cache \
  --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --model "llama3_2" \
  --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

After exporting the model, push the exported `.pte` file and the tokenizer to
your device.

```shell
adb shell mkdir -p /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND} && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

## Build Core ExecuTorch Components

To run the `.pte` file on device, the core ExecuTorch libraries, including the
Vulkan backend, must first be compiled for Android.

```shell
cmake . \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
  --preset "android-arm64-v8a" \
  -DANDROID_PLATFORM=android-28 \
  -DPYTHON_EXECUTABLE=python \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_PAL_DEFAULT=posix \
  -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
  -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
  -DEXECUTORCH_BUILD_VULKAN=ON \
  -DEXECUTORCH_BUILD_TESTS=OFF \
  -Bcmake-out-android-so && \
cmake --build cmake-out-android-so -j16 --target install --config Release
```

## Build and push the llama runner binary to Android

Then, build a binary that can be used to run the `.pte` file.

```shell
cmake examples/models/llama \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
  -DEXECUTORCH_ENABLE_LOGGING=ON \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_BUILD_TYPE=Release \
  -DPYTHON_EXECUTABLE=python \
  -Bcmake-out-android-so/examples/models/llama && \
cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
```

Once the binary is built, it can be pushed to your Android device.

```shell
adb shell mkdir -p /data/local/tmp/etvk/ && \
adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/
```

## Execute the llama runner binary

Finally, we can execute the lowered `.pte` file on your device.

```shell
adb shell /data/local/tmp/etvk/llama_main \
  --model_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  --tokenizer_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  --temperature=0 --seq_len=400 \
  --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"
```

Here is some sample output captured from a Galaxy S24:

```shell
E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is a short poem I came up with:

"Moonlight whispers secrets to the night
A gentle breeze that rustles the light
The stars up high, a twinkling show
A peaceful world, where dreams grow slow"

I hope you enjoy it!<|eot_id|>

PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
Prompt Tokens: 14 Generated Tokens: 54
Model Load Time: 2.277000 (seconds)
Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second)
Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second)
Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second)
Time to first generated token: 0.164000 (seconds)
Sampling time over 68 tokens: 0.019000 (seconds)
```
