
Commit d011ad8

[ET-VK][docs] Update to the new template
1 parent b577bd8 commit d011ad8

File tree

10 files changed: +925 -255 lines changed


backends/vulkan/README.md

Lines changed: 3 additions & 204 deletions
@@ -1,205 +1,4 @@

Removed (the previous README content):

# Vulkan Backend

The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
built on top of the cross-platform Vulkan GPU API standard. It is primarily
designed to leverage the GPU to accelerate model inference on Android devices,
but can be used on any platform that supports an implementation of Vulkan:
laptops, servers, and edge devices.

::::{note}
The Vulkan delegate is currently under active development, and its components
are subject to change.
::::

## What is Vulkan?

Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
It is designed to offer developers more explicit control over GPUs compared to
previous specifications, in order to reduce overhead and maximize the
capabilities of modern graphics hardware.

Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
desktop and mobile) on the market support Vulkan. Vulkan is also included in
Android from Android 7.0 onwards.

**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say, it
provides a way to execute compute and graphics operations on a GPU, but does not
come with a built-in library of performant compute kernels.

## The Vulkan Compute Library

The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
provide GPU implementations for PyTorch operators via GLSL compute shaders.

The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
The core components of the PyTorch Vulkan backend were forked into ExecuTorch
and adapted for an AOT graph-mode style of model inference (as opposed to
PyTorch, which adopted an eager execution style of model inference).

The components of the Vulkan Compute Library are contained in the
`executorch/backends/vulkan/runtime/` directory. The core components are listed
and described below:

```
runtime/
├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
└── graph/ .................. ComputeGraph class which implements graph mode inference
    └── ops/ ................ Base directory for operator implementations
        ├── glsl/ ........... GLSL compute shaders
        │   ├── *.glsl
        │   └── conv2d.glsl
        └── impl/ ........... C++ code to dispatch GPU compute shaders
            ├── *.cpp
            └── Conv2d.cpp
```

## Features

The Vulkan delegate currently supports the following features:

* **Memory Planning**
  * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
* **Capability Based Partitioning**:
  * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only the supported subgraphs.
* **Support for upper-bound dynamic shapes**:
  * Tensors can change shape between inferences as long as their current shapes are smaller than the bounds specified during lowering (a sketch follows this list).
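
For reference, here is a minimal sketch of how an upper-bound dynamic dimension can be declared at export time via `torch.export`'s `dynamic_shapes` argument. The `Add` module, the `seq_len` dimension name, and the bound of 1024 are placeholder choices for illustration, not values taken from this document:

```python
# A sketch of declaring an upper-bound dynamic dimension at export time.
# The Add module and the bound of 1024 are illustrative, not from this doc.
import torch
from torch.export import Dim, export

class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# Mark dimension 0 of both inputs as dynamic, with an upper bound of 1024.
dim = Dim("seq_len", max=1024)
aten_dialect = export(
    Add(),
    (torch.ones(512), torch.ones(512)),
    dynamic_shapes={"x": {0: dim}, "y": {0: dim}},
)
# The remaining lowering steps (to_edge, to_backend, to_executorch) are the same
# as in the End to End Example below; at runtime the inputs may then take any
# shape up to the declared bound.
```
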
In addition to increasing operator coverage, the following features are
currently in development:

* **Quantization Support**
  * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
* **Memory Layout Management**
  * Memory layout is an important factor in optimizing performance. We plan to introduce graph passes that insert memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
* **Selective Build**
  * We plan to make it possible to control build size by selecting which operators/shaders to include in the build.

## End to End Example

To further understand the features of the Vulkan Delegate and how to use it,
consider the following end to end example with a simple single operator model.

### Compile and lower a model to the Vulkan Delegate

Once ExecuTorch has been set up and installed, the following script can be used
to generate a simple model and lower it to the Vulkan delegate.

```python
# Note: this script is the same as the script from the "Setting up ExecuTorch"
# page, with one minor addition to lower to the Vulkan backend.
import torch
from torch.export import export
from executorch.exir import to_edge

from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner

# Start with a PyTorch model that adds two input tensors (matrices)
class Add(torch.nn.Module):
    def __init__(self):
        super(Add, self).__init__()

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# 1. torch.export: Defines the program with the ATen operator set.
aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))

# 2. to_edge: Make optimizations for Edge devices
edge_program = to_edge(aten_dialect)
# 2.1 Lower to the Vulkan backend
edge_program = edge_program.to_backend(VulkanPartitioner())

# 3. to_executorch: Convert the graph to an ExecuTorch program
executorch_program = edge_program.to_executorch()

# 4. Save the compiled .pte program
with open("vk_add.pte", "wb") as file:
    file.write(executorch_program.buffer)
```

Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
using the `to_backend()` API. The Vulkan Delegate implements the
`VulkanPartitioner` class, which identifies nodes (i.e. operators) in the graph
that are supported by the Vulkan delegate and separates compatible sections of
the model to be executed on the GPU.

This means that a model can be lowered to the Vulkan delegate even if it contains
some unsupported operators; only the supported parts of the graph will be
executed on the GPU.

::::{note}
The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194)
in the Vulkan partitioner code can be inspected to examine which ops are
currently implemented in the Vulkan delegate.
::::
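
One way to check how much of a graph was actually lowered is to print the program after partitioning and look for `executorch_call_delegate` nodes. This is a quick sketch that reuses the `edge_program` from the export script above:

```python
# Continuing from the script above, after edge_program.to_backend(VulkanPartitioner()).
# Subgraphs handed to the Vulkan delegate appear as executorch_call_delegate calls;
# any remaining ATen/edge ops will fall back to the portable CPU kernels.
print(edge_program.exported_program().graph_module.code)
```
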
### Build Vulkan Delegate libraries

The easiest way to build and test the Vulkan Delegate is to build for Android
and test on a local Android device. Android devices have built-in support for
Vulkan, and the Android NDK ships with a GLSL compiler, which is needed to
compile the Vulkan Compute Library's GLSL compute shaders.

The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
when building with CMake.

First, make sure that you have the Android NDK installed; any NDK version past
NDK r19c should work. Note that the examples in this doc have been validated with
NDK r28c. The Android SDK should also be installed so that you have access to `adb`.

The instructions on this page assume that the following environment variables
are set.

```shell
export ANDROID_NDK=<path_to_ndk>
# Select the appropriate Android ABI for your device
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

To build and install ExecuTorch libraries (for Android) with the Vulkan
Delegate:

```shell
# From executorch root directory
(rm -rf cmake-android-out && \
  cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=$ANDROID_ABI \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-android-out && \
  cmake --build cmake-android-out -j16 --target install)
```

### Run the Vulkan model on device

::::{note}
Since operator support is currently limited, only binary arithmetic operators
will run on the GPU. Expect inference to be slow, as the majority of operators
are executed via Portable operators.
::::

Now, the partially delegated model can be executed on your device's GPU!

```shell
# Build a model runner binary linked with the Vulkan delegate libs
cmake --build cmake-android-out --target executor_runner -j32

# Push model to device
adb push vk_add.pte /data/local/tmp/vk_add.pte
# Push binary to device
adb push cmake-android-out/executor_runner /data/local/tmp/runner_bin

# Run the model
adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
```

Added in its place:

# The ExecuTorch Vulkan Backend

Please see the [Vulkan Backend Overview](../../docs/source/backends/vulkan/vulkan-overview.md)
to learn more about the ExecuTorch Vulkan Backend.

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@

# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device

This tutorial assumes that you have a working local copy of the ExecuTorch repo,
and have gone through the steps to install the executorch pip package or have
installed it by building from source.

This tutorial also assumes that you have the Android SDK tools installed and
that you are able to connect to an Android device via `adb`.

## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer

The model checkpoint and tokenizer can be downloaded from the
[Meta Llama website](https://www.llama.com/llama-downloads/).

The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`.
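
As a sanity check (a suggested step, not part of the original instructions), you can confirm that the checkpoint directory contains the files referenced by the export and push commands below:

```shell
# Assumes the default download location used in this tutorial.
# The export and push steps below expect consolidated.00.pth, params.json,
# and tokenizer.model to be present here.
ls ~/.llama/checkpoints/Llama3.2-1B-Instruct
```
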
## Export the Llama 3.2 1B/3B model

First, navigate to the root of the ExecuTorch repo.

```shell
# Navigate to executorch root
cd ~/executorch
```

Then, set some environment variables to describe how the model should be
exported. Feel free to tune the values to your preferences.

```shell
export LLM_NAME=Llama3.2 && \
export LLM_SIZE=1B && \
export LLM_SUFFIX="-Instruct" && \
export QUANT=8da4w && \
export BACKEND=vulkan && \
export GROUP_SIZE=64 && \
export CONTEXT_LENGTH=2048
```

Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that
the `--vulkan-force-fp16` flag is set, which will improve model inference
latency at the cost of model accuracy. Feel free to remove this flag.

```shell
mkdir $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/ && \
python -m examples.models.llama.export_llama \
  -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
  -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
  -d fp32 --${BACKEND} --vulkan-force-fp16 \
  -qmode ${QUANT} -G ${GROUP_SIZE} \
  --max_seq_length ${CONTEXT_LENGTH} \
  --max_context_length ${CONTEXT_LENGTH} \
  -kv --use_sdpa_with_kv_cache \
  --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --model "llama3_2" \
  --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

After exporting the model, push the exported `.pte` file and the tokenizer to
your device.

```shell
adb shell mkdir -p /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND} && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model && \
adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  /data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```

## Build Core ExecuTorch Components

To run the `.pte` file on device, the core ExecuTorch libraries, including the
Vulkan backend, must first be compiled for Android.

```shell
cmake . \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
  --preset "android-arm64-v8a" \
  -DANDROID_PLATFORM=android-28 \
  -DPYTHON_EXECUTABLE=python \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_PAL_DEFAULT=posix \
  -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
  -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
  -DEXECUTORCH_BUILD_VULKAN=ON \
  -DEXECUTORCH_BUILD_TESTS=OFF \
  -Bcmake-out-android-so && \
cmake --build cmake-out-android-so -j16 --target install --config Release
```

## Build and push the llama runner binary to Android

Then, build a binary that can be used to run the `.pte` file.

```shell
cmake examples/models/llama \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
  -DEXECUTORCH_ENABLE_LOGGING=ON \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_BUILD_TYPE=Release \
  -DPYTHON_EXECUTABLE=python \
  -Bcmake-out-android-so/examples/models/llama && \
cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
```

Once the binary is built, it can be pushed to your Android device.

```shell
adb shell mkdir -p /data/local/tmp/etvk/ && \
adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/
```

## Execute the llama runner binary

Finally, we can execute the lowered `.pte` file on your device.

```shell
adb shell /data/local/tmp/etvk/llama_main \
  --model_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${BACKEND}/llama3_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
  --tokenizer_path=/data/local/tmp/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
  --temperature=0 --seq_len=400 \
  --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"
```

Here is some sample output captured from a Galaxy S24:

```shell
E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is a short poem I came up with:

"Moonlight whispers secrets to the night
A gentle breeze that rustles the light
The stars up high, a twinkling show
A peaceful world, where dreams grow slow"

I hope you enjoy it!<|eot_id|>

PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
Prompt Tokens: 14 Generated Tokens: 54
Model Load Time: 2.277000 (seconds)
Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second)
Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second)
Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second)
Time to first generated token: 0.164000 (seconds)
Sampling time over 68 tokens: 0.019000 (seconds)
```
