diff --git a/.gitignore b/.gitignore
index b166f8c9512..54572407274 100644
--- a/.gitignore
+++ b/.gitignore
@@ -62,7 +62,6 @@ xcuserdata/
/include/
/share/
/version.py
-*.csv
*_etdump
# Android
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 71e097042d7..ec616371fea 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -24,8 +24,8 @@ For Apple, please refer to the [iOS documentation](docs/source/using-executorch-
executorch
├── backends - Backend delegate implementations for various hardware targets. Each backend uses partitioner to split the graph into subgraphs that can be executed on specific hardware, quantizer to optimize model precision, and runtime components to execute the graph on target hardware. For details refer to the backend documentation and the Export and Lowering tutorial for more information.
│ ├── apple - Apple-specific backends.
-│ │ ├── coreml - CoreML backend for Apple devices. See doc.
-│ │ └── mps - Metal Performance Shaders backend for Apple devices. See doc.
+│ │ ├── coreml - CoreML backend for Apple devices. See doc.
+│ │ └── mps - Metal Performance Shaders backend for Apple devices. See doc.
│ ├── arm - ARM architecture backends. See doc.
│ ├── cadence - Cadence-specific backends. See doc.
│ ├── example - Example backend implementations.
@@ -33,8 +33,8 @@ executorch
│ ├── openvino - OpenVINO backend for Intel hardware.
│ ├── qualcomm - Qualcomm-specific backends. See doc.
│ ├── transforms - Transformations for backend optimization.
-│ ├── vulkan - Vulkan backend for cross-platform GPU support. See doc.
-│ └── xnnpack - XNNPACK backend for optimized neural network operations. See doc.
+│ ├── vulkan - Vulkan backend for cross-platform GPU support. See doc.
+│ └── xnnpack - XNNPACK backend for optimized neural network operations. See doc.
├── codegen - Tooling to autogenerate bindings between kernels and the runtime.
├── configurations - Configuration files.
├── devtools - Model profiling, debugging, and inspection. Please refer to the tools documentation for more information.
diff --git a/README-wheel.md b/README-wheel.md
index 7ae9b0aa2e0..719f753039f 100644
--- a/README-wheel.md
+++ b/README-wheel.md
@@ -11,8 +11,8 @@ The `executorch` pip package is in beta.
The prebuilt `executorch.runtime` module included in this package provides a way
to run ExecuTorch `.pte` files, with some restrictions:
* Only [core ATen operators](docs/source/ir-ops-set-definition.md) are linked into the prebuilt module
-* Only the [XNNPACK backend delegate](docs/source/backends-xnnpack.md) is linked into the prebuilt module.
-* \[macOS only] [Core ML](docs/source/backends-coreml.md) and [MPS](docs/source/backends-mps.md) backend
+* Only the [XNNPACK backend delegate](docs/source/backends/xnnpack/xnnpack-overview.md) is linked into the prebuilt module.
+* \[macOS only] [Core ML](docs/source/backends/coreml/coreml-overview.md) and [MPS](docs/source/backends/mps/mps-overview.md) backend
are also linked into the prebuilt module.
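+
+For reference, a minimal sketch of loading and running a `.pte` file with the
+prebuilt `executorch.runtime` module might look like this (the model path,
+method name, and input shape below are placeholders):
+
+```python
+import torch
+from executorch.runtime import Runtime
+
+runtime = Runtime.get()
+program = runtime.load_program("model.pte")  # path to an exported .pte file
+method = program.load_method("forward")      # method name defined at export time
+outputs = method.execute([torch.randn(1, 3, 224, 224)])  # example input shape
+print(outputs)
+```
+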
Please visit the [ExecuTorch website](https://pytorch.org/executorch) for
diff --git a/README.md b/README.md
index c7053431813..87bb50b93a1 100644
--- a/README.md
+++ b/README.md
@@ -104,16 +104,14 @@ outputs = method.execute([torch.randn(1, 3, 224, 224)])
Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
-auto outputs = module.forward(tensor);
+auto outputs = module.forward({tensor});
```
**[Swift (iOS)](https://docs.pytorch.org/executorch/main/ios-section.html)**
```swift
-import ExecuTorch
-
let module = Module(filePath: "model.pte")
-let input = Tensor([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
-let outputs = try module.forward(input)
+let input = Tensor([1.0, 2.0, 3.0, 4.0])
+let outputs: [Value] = try module.forward([input])
```
**[Kotlin (Android)](https://docs.pytorch.org/executorch/main/android-section.html)**
@@ -153,8 +151,6 @@ runner->generate("Hello, how are you?", config);
**[Swift (iOS)](https://docs.pytorch.org/executorch/main/llm/run-on-ios.html)**
```swift
-import ExecuTorchLLM
-
let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
$0.sequenceLength = 128
@@ -202,7 +198,7 @@ ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devi
**LLMs:** [Llama 3.2/3.1/3](examples/models/llama/README.md), [Qwen 3](examples/models/qwen3/README.md), [Phi-4-mini](examples/models/phi_4_mini/README.md), [LiquidAI LFM2](examples/models/lfm2/README.md)
-**Multimodal:** [Llava](examples/models/llava/README.md) (vision-language), [Voxtral](examples/models/voxtral/README.md) (audio-language)
+**Multimodal:** [Llava](examples/models/llava/README.md) (vision-language), [Voxtral](examples/models/voxtral/README.md) (audio-language), [Gemma 3](examples/models/gemma3) (vision-language)
**Vision/Speech:** [MobileNetV2](https://github.com/meta-pytorch/executorch-examples/tree/main/mv2), [DeepLabV3](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3), [Whisper](https://github.com/meta-pytorch/executorch-examples/tree/main/whisper/android/WhisperApp)
diff --git a/backends/apple/coreml/README.md b/backends/apple/coreml/README.md
index d063dfc8b71..d72f04da1a1 100644
--- a/backends/apple/coreml/README.md
+++ b/backends/apple/coreml/README.md
@@ -1,7 +1,7 @@
# ExecuTorch Core ML Delegate
This subtree contains the Core ML Delegate implementation for ExecuTorch.
-Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends-coreml.md).
+Core ML is an optimized framework for running machine learning models on Apple devices. The delegate is the mechanism for leveraging the Core ML framework to accelerate operators when running on Apple devices. To learn how to use the CoreML delegate, see the [documentation](https://github.com/pytorch/executorch/blob/main/docs/source/backends/coreml/coreml-overview.md).
## Layout
- `compiler/` : Lowers a module to Core ML backend.
diff --git a/backends/cadence/build_cadence_fusionG3.sh b/backends/cadence/build_cadence_fusionG3.sh
index 93295bc9aa5..ec973401af9 100644
--- a/backends/cadence/build_cadence_fusionG3.sh
+++ b/backends/cadence/build_cadence_fusionG3.sh
@@ -9,7 +9,7 @@ set -euo pipefail
unset CMAKE_PREFIX_PATH
unset XTENSA_CORE
-export XTENSA_CORE=FCV_FG3GP
+export XTENSA_CORE=VANILLA_G3
git submodule sync
git submodule update --init
./backends/cadence/install_requirements.sh
diff --git a/backends/cadence/build_cadence_hifi4.sh b/backends/cadence/build_cadence_hifi4.sh
index 33078b7ff2f..d6c2f3be6d8 100644
--- a/backends/cadence/build_cadence_hifi4.sh
+++ b/backends/cadence/build_cadence_hifi4.sh
@@ -9,7 +9,7 @@ set -euo pipefail
unset CMAKE_PREFIX_PATH
unset XTENSA_CORE
-export XTENSA_CORE=nxp_rt600_RI23_11_newlib
+export XTENSA_CORE=VANILLA_HIFI
git submodule sync
git submodule update --init
./backends/cadence/install_requirements.sh
diff --git a/backends/nxp/README.md b/backends/nxp/README.md
index 8b76d1e276b..b27c054e7c1 100644
--- a/backends/nxp/README.md
+++ b/backends/nxp/README.md
@@ -5,14 +5,14 @@ This subtree contains the ExecuTorch Backend implementation for the
The eIQ® Neutron NPU is a highly scalable accelerator core architecture providing machine learning (ML) acceleration,
able to support common and critical tasks for edge AI such as anomaly detection, speech recognition,
-image classification, object detection, facial recognition, image segmentation, and generative AI use cases like
+image classification, object detection, facial recognition, image segmentation, and generative AI use cases like
large and small language models (LLMs & SLMs) and text-to-speech (TTS).
-The architecture provides power and performance optimized NPUs integrated with NXP's broad portfolio of
+The architecture provides power and performance optimized NPUs integrated with NXP's broad portfolio of
microcontrollers and applications processors.
-The eIQ Neutron NPUs offer support for a wide variety of neural network types such as CNN, RNN, TCN and Transformer
+The eIQ Neutron NPUs offer support for a wide variety of neural network types such as CNN, RNN, TCN and Transformer
networks, as well as the ability to adapt and scale to new model architectures, topologies and layer types introduced
-to AI workloads. ML application development with the eIQ Neutron NPU is fully supported by the
+to AI workloads. ML application development with the eIQ Neutron NPU is fully supported by the
[eIQ machine learning software development environment](https://www.nxp.com/design/design-center/software/eiq-ml-development-environment/eiq-toolkit-for-end-to-end-model-development-and-deployment:EIQ-TOOLKIT).
The eIQ AI SW Stack provides a streamlined development experience for developers and end-users of NXP products.
@@ -22,7 +22,7 @@ At this moment following eIQ® Neutron NPU variants and NXP platforms are suppor
* **eIQ Neutron N3-64**, available on [i.MX RT700](https://www.nxp.com/products/i.MX-RT700)
-In the future the NXP eIQ Neutron Backend will be extended to support [i.MX 9 Application Processors](https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-9-processors:IMX9-PROCESSORS)
+In the future the NXP eIQ Neutron Backend will be extended to support [i.MX 9 Application Processors](https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-9-processors:IMX9-PROCESSORS)
with eIQ Neutron NPU, like the [i.MX 95](https://www.nxp.com/products/iMX95).
@@ -33,7 +33,7 @@ The eIQ Neutron NPU Backend should be considered as prototype quality at this mo
improvements. NXP and the ExecuTorch community is actively developing this codebase.
## Neutron Backend implementation and SW architecture
-Neutron Backend uses the eIQ Neutron Converter as ML compiler to compile the delegated subgraph to Neutron microcode.
+Neutron Backend uses the eIQ Neutron Converter as ML compiler to compile the delegated subgraph to Neutron microcode.
The Neutron Converter accepts the ML model in LiteRT format, for the **eIQ Neutron N3** class therefore the Neutron Backend
uses the LiteRT flatbuffers format as IR between the ExecuTorch and Neutron Converter ML compiler.
@@ -44,10 +44,10 @@ uses the LiteRT flatbuffers format as IR between the ExecuTorch and Neutron Conv
`node_conveters` is structured as single module for each Edge operator.
* `backend/ir/lib` - automatically generated handlers from LiteRT flatbuffers schema.
* `backend/ir/tflite_generator` and `backend/ir/tflite_optimizer` handle the serialization
- of the in-memory built subgraph for delegation into LiteRT/TFLite flatbuffers
+ of the in-memory built subgraph for delegation into LiteRT/TFLite flatbuffers
representation. Code taken from the onnx2tflite tool.
-* `edge_passes` - Various passes operating on Edge dialect level.
-* `quantizer` - Neutron Backend quantizer implementation.
+* `edge_passes` - Various passes operating on Edge dialect level.
+* `quantizer` - Neutron Backend quantizer implementation.
* `runtime` - Neutron Backend runtime implementation. For running compiled on device.
* `tests/` - Unit tests for Neutron backend.
* `tests/converter/node_converter` - Operator level unit tests.
diff --git a/backends/vulkan/README.md b/backends/vulkan/README.md
index e0a953d05fe..b51a736c7df 100644
--- a/backends/vulkan/README.md
+++ b/backends/vulkan/README.md
@@ -1,205 +1,4 @@
-# Vulkan Backend
+# The ExecuTorch Vulkan Backend
-The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
-built on top of the cross-platform Vulkan GPU API standard. It is primarily
-designed to leverage the GPU to accelerate model inference on Android devices,
-but can be used on any platform that supports an implementation of Vulkan:
-laptops, servers, and edge devices.
-
-::::{note}
-The Vulkan delegate is currently under active development, and its components
-are subject to change.
-::::
-
-## What is Vulkan?
-
-Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
-It is designed to offer developers more explicit control over GPUs compared to
-previous specifications in order to reduce overhead and maximize the
-capabilities of the modern graphics hardware.
-
-Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
-desktop and mobile) in the market support Vulkan. Vulkan is also included in
-Android from Android 7.0 onwards.
-
-**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it
-provides a way to execute compute and graphics operations on a GPU, but does not
-come with a built-in library of performant compute kernels.
-
-## The Vulkan Compute Library
-
-The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
-the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
-provide GPU implementations for PyTorch operators via GLSL compute shaders.
-
-The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
-The core components of the PyTorch Vulkan backend were forked into ExecuTorch
-and adapted for an AOT graph-mode style of model inference (as opposed to
-PyTorch which adopted an eager execution style of model inference).
-
-The components of the Vulkan Compute Library are contained in the
-`executorch/backends/vulkan/runtime/` directory. The core components are listed
-and described below:
-
-```
-runtime/
-├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
-└── graph/ .................. ComputeGraph class which implements graph mode inference
- └── ops/ ................ Base directory for operator implementations
- ├── glsl/ ........... GLSL compute shaders
- │ ├── *.glsl
- │ └── conv2d.glsl
- └── impl/ ........... C++ code to dispatch GPU compute shaders
- ├── *.cpp
- └── Conv2d.cpp
-```
-
-## Features
-
-The Vulkan delegate currently supports the following features:
-
-* **Memory Planning**
- * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
-* **Capability Based Partitioning**:
- * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only supported subgraphs
-* **Support for upper-bound dynamic shapes**:
- * Tensors can change shape between inferences as long as its current shape is smaller than the bounds specified during lowering
-
-In addition to increasing operator coverage, the following features are
-currently in development:
-
-* **Quantization Support**
- * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
-* **Memory Layout Management**
- * Memory layout is an important factor to optimizing performance. We plan to introduce graph passes to introduce memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
-* **Selective Build**
- * We plan to make it possible to control build size by selecting which operators/shaders you want to build with
-
-## End to End Example
-
-To further understand the features of the Vulkan Delegate and how to use it,
-consider the following end to end example with a simple single operator model.
-
-### Compile and lower a model to the Vulkan Delegate
-
-Assuming ExecuTorch has been set up and installed, the following script can be
-used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`.
-
-Once ExecuTorch has been set up and installed, the following script can be used
-to generate a simple model and lower it to the Vulkan delegate.
-
-```
-# Note: this script is the same as the script from the "Setting up ExecuTorch"
-# page, with one minor addition to lower to the Vulkan backend.
-import torch
-from torch.export import export
-from executorch.exir import to_edge
-
-from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
-
-# Start with a PyTorch model that adds two input tensors (matrices)
-class Add(torch.nn.Module):
- def __init__(self):
- super(Add, self).__init__()
-
- def forward(self, x: torch.Tensor, y: torch.Tensor):
- return x + y
-
-# 1. torch.export: Defines the program with the ATen operator set.
-aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))
-
-# 2. to_edge: Make optimizations for Edge devices
-edge_program = to_edge(aten_dialect)
-# 2.1 Lower to the Vulkan backend
-edge_program = edge_program.to_backend(VulkanPartitioner())
-
-# 3. to_executorch: Convert the graph to an ExecuTorch program
-executorch_program = edge_program.to_executorch()
-
-# 4. Save the compiled .pte program
-with open("vk_add.pte", "wb") as file:
- file.write(executorch_program.buffer)
-```
-
-Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
-using the `to_backend()` API. The Vulkan Delegate implements the
-`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph
-that are supported by the Vulkan delegate, and separates compatible sections of
-the model to be executed on the GPU.
-
-This means the a model can be lowered to the Vulkan delegate even if it contains
-some unsupported operators. This will just mean that only parts of the graph
-will be executed on the GPU.
-
-
-::::{note}
-The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194)
-Vulkan partitioner code can be inspected to examine which ops are currently
-implemented in the Vulkan delegate.
-::::
-
-### Build Vulkan Delegate libraries
-
-The easiest way to build and test the Vulkan Delegate is to build for Android
-and test on a local Android device. Android devices have built in support for
-Vulkan, and the Android NDK ships with a GLSL compiler which is needed to
-compile the Vulkan Compute Library's GLSL compute shaders.
-
-The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
-when building with CMake.
-
-First, make sure that you have the Android NDK installed; any NDK version past
-NDK r19c should work. Note that the examples in this doc have been validated with
-NDK r27b. The Android SDK should also be installed so that you have access to `adb`.
-
-The instructions in this page assumes that the following environment variables
-are set.
-
-```shell
-export ANDROID_NDK=
-# Select the appropriate Android ABI for your device
-export ANDROID_ABI=arm64-v8a
-# All subsequent commands should be performed from ExecuTorch repo root
-cd
-# Make sure adb works
-adb --version
-```
-
-To build and install ExecuTorch libraries (for Android) with the Vulkan
-Delegate:
-
-```shell
-# From executorch root directory
-(rm -rf cmake-android-out && \
- pp cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
- -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
- -DANDROID_ABI=$ANDROID_ABI \
- -DEXECUTORCH_BUILD_VULKAN=ON \
- -DPYTHON_EXECUTABLE=python \
- -Bcmake-android-out && \
- cmake --build cmake-android-out -j16 --target install)
-```
-
-### Run the Vulkan model on device
-
-::::{note}
-Since operator support is currently limited, only binary arithmetic operators
-will run on the GPU. Expect inference to be slow as the majority of operators
-are being executed via Portable operators.
-::::
-
-Now, the partially delegated model can be executed (partially) on your device's
-GPU!
-
-```shell
-# Build a model runner binary linked with the Vulkan delegate libs
-cmake --build cmake-android-out --target executor_runner -j32
-
-# Push model to device
-adb push vk_add.pte /data/local/tmp/vk_add.pte
-# Push binary to device
-adb push cmake-android-out/executor_runner /data/local/tmp/runner_bin
-
-# Run the model
-adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
-```
+Please see the [Vulkan Backend Overview](../../docs/source/backends/vulkan/vulkan-overview.md)
+to learn more about the ExecuTorch Vulkan Backend.
diff --git a/backends/vulkan/docs/android_demo.md b/backends/vulkan/docs/android_demo.md
deleted file mode 100644
index ff84938b06f..00000000000
--- a/backends/vulkan/docs/android_demo.md
+++ /dev/null
@@ -1,128 +0,0 @@
-# Building and Running ExecuTorch with the Vulkan Backend
-
-The [ExecuTorch Vulkan Delegate](../../../docs/source/native-delegates-executorch-vulkan-delegate.md)
-is a native GPU delegate for ExecuTorch.
-
-
-::::{grid} 2
-:::{grid-item-card} What you will learn in this tutorial:
-:class-card: card-content
-* How to export the Llama3.2-1B parameter model with partial GPU delegation
-* How to execute the partially delegated model on Android
-:::
-:::{grid-item-card} Prerequisites:
-:class-card: card-prerequisites
-* Follow [**Setting up ExecuTorch**](../../../docs/source/getting-started-setup.rst)
-* It is also recommended that you read through [**ExecuTorch Vulkan Delegate**](../../../docs/source/native-delegates-executorch-vulkan-delegate.md) and follow the example in that page
-:::
-::::
-
-## Prerequisites
-
-Note that all the steps below should be performed from the ExecuTorch repository
-root directory, and assumes that you have gone through the steps of setting up
-ExecuTorch.
-
-It is also assumed that the Android NDK and Android SDK is installed, and the
-following environment examples are set.
-
-```shell
-export ANDROID_NDK=
-# Select an appropriate Android ABI for your device
-export ANDROID_ABI=arm64-v8a
-# All subsequent commands should be performed from ExecuTorch repo root
-cd
-# Make sure adb works
-adb --version
-```
-
-## Lowering the Llama3.2-1B model to Vulkan
-
-::::{note}
-The resultant model will only be partially delegated to the Vulkan backend. In
-particular, only binary arithmetic operators (`aten.add`, `aten.sub`,
-`aten.mul`, `aten.div`), matrix multiplication operators (`aten.mm`, `aten.bmm`),
-and linear layers (`aten.linear`) will be executed on the GPU via the Vulkan
-delegate. The rest of the model will be executed using Portable operators.
-
-Operator support for LLaMA models is currently in active development; please
-check out the `main` branch of the ExecuTorch repo for the latest capabilities.
-::::
-
-First, obtain the `consolidated.00.pth`, `params.json` and `tokenizer.model`
-files for the `Llama3.2-1B` model from the [Llama website](https://www.llama.com/llama-downloads/).
-
-Once the files have been downloaded, the `export_llama` script can be used to
-partially lower the Llama model to Vulkan.
-
-```shell
-# The files will usually be downloaded to ~/.llama
-python -m examples.models.llama.export_llama \
- --disable_dynamic_shape --vulkan -kv --use_sdpa_with_kv_cache -d fp32 \
- --model "llama3_2" \
- -c ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \
- -p ~/.llama/checkpoints/Llama3.2-1B/params.json \
- --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
-```
-
-A `vulkan_llama2.pte` file should have been created as a result of running the
-script.
-
-Push the tokenizer binary and `vulkan_llama2.pte` onto your Android device:
-
-```shell
-adb push ~/.llama/tokenizer.model /data/local/tmp/
-adb push vulkan_llama2.pte /data/local/tmp/
-```
-
-## Build and Run the LLaMA runner binary on Android
-
-First, build and install ExecuTorch libraries, then build the LLaMA runner
-binary using the Android NDK toolchain.
-
-```shell
-./install_executorch.sh --clean
-(mkdir cmake-android-out && \
- cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
- -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
- -DANDROID_ABI=$ANDROID_ABI \
- -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
- -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
- -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
- -DEXECUTORCH_BUILD_VULKAN=ON \
- -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
- -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
- -DPYTHON_EXECUTABLE=python \
- -Bcmake-android-out && \
- cmake --build cmake-android-out -j16 --target install)
-
-# Build LLaMA Runner library
-(rm -rf cmake-android-out/examples/models/llama && \
- cmake examples/models/llama \
- -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
- -DANDROID_ABI=$ANDROID_ABI \
- -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
- -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
- -DCMAKE_INSTALL_PREFIX=cmake-android-out \
- -DPYTHON_EXECUTABLE=python \
- -Bcmake-android-out/examples/models/llama && \
- cmake --build cmake-android-out/examples/models/llama -j16)
-```
-
-Finally, push and run the llama runner binary on your Android device. Note that
-your device must have sufficient GPU memory to execute the model.
-
-```shell
-adb push cmake-android-out/examples/models/llama/llama_main /data/local/tmp/llama_main
-
-adb shell /data/local/tmp/llama_main \
- --model_path=/data/local/tmp/vulkan_llama2.pte \
- --tokenizer_path=/data/local/tmp/tokenizer.model \
- --prompt "Hello"
-```
-
-Note that currently model inference will be very slow due to the high amount of
-delegate blobs in the lowered graph, which requires a transfer to and from the
-GPU for each sub graph. Performance is expected to improve drastically as more
-of the model can be lowered to the Vulkan delegate, and techniques such as
-quantization are supported.
diff --git a/backends/xnnpack/README.md b/backends/xnnpack/README.md
index 6e6be7ddb4c..7c6a7ccbc33 100644
--- a/backends/xnnpack/README.md
+++ b/backends/xnnpack/README.md
@@ -134,4 +134,4 @@ create an issue on [github](https://www.github.com/pytorch/executorch/issues).
## See Also
For more information about the XNNPACK Backend, please check out the following resources:
- [XNNPACK Backend](https://pytorch.org/executorch/main/backends-xnnpack)
-- [XNNPACK Backend Internals](https://pytorch.org/executorch/main/backend-delegates-xnnpack-reference)
+- [XNNPACK Backend Internals](https://pytorch.org/executorch/main/backends/xnnpack/backend-delegates-xnnpack-reference)
diff --git a/docs/source/_static/img/swiftpm_xcode1.png b/docs/source/_static/img/swiftpm_xcode1.png
index 4e624ed43df..b9acb23847b 100644
Binary files a/docs/source/_static/img/swiftpm_xcode1.png and b/docs/source/_static/img/swiftpm_xcode1.png differ
diff --git a/docs/source/android-backends.md b/docs/source/android-backends.md
index d506813990b..d4da0966ed9 100644
--- a/docs/source/android-backends.md
+++ b/docs/source/android-backends.md
@@ -16,7 +16,7 @@ Available hardware acceleration backends for Android deployment.
- {doc}`android-qualcomm` — Qualcomm AI Engine (NPU)
- {doc}`android-mediatek` — MediaTek NPU acceleration
- {doc}`android-arm-vgf` — ARM VGF Backend
-- {doc}`android-samsung-exynos` — Samsung Exynos NPU
+- {doc}`backends/samsung/samsung-overview` — Samsung Exynos NPU
```{toctree}
:hidden:
@@ -25,4 +25,4 @@ android-vulkan
android-qualcomm
android-mediatek
android-arm-vgf
-android-samsung-exynos
+backends/samsung/samsung-overview
diff --git a/docs/source/android-examples.md b/docs/source/android-examples.md
index 65580870c57..057fd48bc55 100644
--- a/docs/source/android-examples.md
+++ b/docs/source/android-examples.md
@@ -1,7 +1,7 @@
# Examples & Demos
-- [Working with LLMs - Android Examples](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
-- [Demo Apps](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app)
+- [Working with LLMs - Android Examples](https://github.com/meta-pytorch/executorch-examples/blob/main/llm/android/LlamaDemo/README.md) - ExecuTorch Llama Android Demo App
+- [Demo Apps](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app) - DeepLab v3 model for image segmentation
- {doc}`tutorial-arm-vgf` — Export a simple PyTorch model for the ExecuTorch VGF backend
```{toctree}
diff --git a/docs/source/android-vulkan.md b/docs/source/android-vulkan.md
index 6399ac4ec7c..aa987835989 100644
--- a/docs/source/android-vulkan.md
+++ b/docs/source/android-vulkan.md
@@ -1 +1 @@
-```{include} backends-vulkan.md
+```{include} backends/vulkan/vulkan-overview.md
diff --git a/docs/source/android-xnnpack.md b/docs/source/android-xnnpack.md
index 315dd747006..4a85dec946b 100644
--- a/docs/source/android-xnnpack.md
+++ b/docs/source/android-xnnpack.md
@@ -1 +1 @@
-```{include} backends-xnnpack.md
+```{include} backends/xnnpack/xnnpack-overview.md
diff --git a/docs/source/archive/backends-cadence-legacy.md b/docs/source/archive/backends-cadence-legacy.md
new file mode 100644
index 00000000000..21f60477c63
--- /dev/null
+++ b/docs/source/archive/backends-cadence-legacy.md
@@ -0,0 +1,238 @@
+# Cadence Xtensa Backend (Legacy / Outdated)
+
+```{warning}
+**⚠️ THIS DOCUMENTATION IS OUTDATED AND NO LONGER MAINTAINED**
+
+**For current Cadence backend documentation and support:**
+- Please refer to the up-to-date documentation in [backends-cadence.md](../backends-cadence.md)
+```
+
+---
+# Cadence Xtensa Backend
+
+
+In this tutorial we will walk you through the process of getting set up to build ExecuTorch for an Xtensa HiFi4 DSP and running a simple model on it.
+
+[Cadence](https://www.cadence.com/en_US/home.html) is both a hardware and software vendor, providing solutions for many computational workloads, including to run on power-limited embedded devices. The [Xtensa HiFi4 DSP](https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/hifi-dsps/hifi-4.html) is a Digital Signal Processor (DSP) that is optimized for running audio based neural networks such as wake word detection, Automatic Speech Recognition (ASR), etc.
+
+In addition to the chip, the HiFi4 Neural Network Library ([nnlib](https://github.com/foss-xtensa/nnlib-hifi4)) offers an optimized set of library functions commonly used in NN processing that we utilize in this example to demonstrate how common operations can be accelerated.
+
+On top of being able to run on the Xtensa HiFi4 DSP, another goal of this tutorial is to demonstrate how portable ExecuTorch is and its ability to run on a low-power embedded device such as the Xtensa HiFi4 DSP. This workflow does not require any delegates; it uses custom operators and compiler passes to enhance the model and make it more suitable for running on Xtensa HiFi4 DSPs. A custom [quantizer](https://pytorch.org/tutorials/prototype/quantization_in_pytorch_2_0_export_tutorial.html) is used to represent activations and weights as `uint8` instead of `float` and to call the appropriate operators. Finally, custom kernels optimized with Xtensa intrinsics provide runtime acceleration.
+
+::::{grid} 2
+:::{grid-item-card} What you will learn in this tutorial:
+:class-card: card-prerequisites
+* In this tutorial you will learn how to export a quantized model with a linear operation targeted for the Xtensa HiFi4 DSP.
+* You will also learn how to compile and deploy the ExecuTorch runtime with the kernels required for running the quantized model generated in the previous step on the Xtensa HiFi4 DSP.
+:::
+:::{grid-item-card} Tutorials we recommend you complete before this:
+:class-card: card-prerequisites
+* [Introduction to ExecuTorch](intro-how-it-works.md)
+* [Getting Started](getting-started.md)
+* [Building ExecuTorch with CMake](using-executorch-building-from-source.md)
+:::
+::::
+
+```{note}
+The Linux part of this tutorial has been designed and tested on Ubuntu 22.04 LTS, and requires glibc 2.34. Workarounds are available for other distributions, but will not be covered in this tutorial.
+```
+
+## Prerequisites (Hardware and Software)
+
+In order to be able to successfully build and run ExecuTorch on an Xtensa HiFi4 DSP you'll need the following hardware and software components.
+
+### Hardware
+ - [i.MX RT600 Evaluation Kit](https://www.nxp.com/design/development-boards/i-mx-evaluation-and-development-boards/i-mx-rt600-evaluation-kit:MIMXRT685-EVK)
+
+### Software
+ - x86-64 Linux system (For compiling the DSP binaries)
+ - [MCUXpresso IDE](https://www.nxp.com/design/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-integrated-development-environment-ide:MCUXpresso-IDE)
+ - This IDE is supported on multiple platforms including MacOS. You can use it on any of the supported platforms as you'll only be using this to flash the board with the DSP images that you'll be building later on in this tutorial.
+- [J-Link](https://www.segger.com/downloads/jlink/)
+ - Needed to flash the board with the firmware images. You can install this on the same platform that you installed the MCUXpresso IDE on.
+ - Note: depending on the version of the NXP board, a debug probe other than J-Link might be installed. In any case, flashing is done using the MCUXpresso IDE in a similar way.
+ - [MCUXpresso SDK](https://mcuxpresso.nxp.com/en/select?device=EVK-MIMXRT685)
+ - Download this SDK to your Linux machine, extract it and take a note of the path where you store it. You'll need this later.
+- [Xtensa compiler](https://tensilicatools.com/platform/i-mx-rt600/)
+ - Download this to your Linux machine. This is needed to build ExecuTorch for the HiFi4 DSP.
+- For cases with optimized kernels, the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4).
+
+## Setting up Developer Environment
+
+Step 1. In order to be able to successfully install all the software components specified above users will need to go through the NXP tutorial linked below. Although the tutorial itself walks through a Windows setup, most of the steps translate over to a Linux installation too.
+
+[NXP tutorial on setting up the board and dev environment](https://www.nxp.com/document/guide/getting-started-with-i-mx-rt600-evaluation-kit:GS-MIMXRT685-EVK?section=plug-it-in)
+
+```{note}
+Before proceeding to the next section, users should be able to successfully flash the **dsp_mu_polling_cm33** sample application from the tutorial above and see output on the UART console indicating that the Cortex-M33 and HiFi4 DSP are talking to each other.
+```
+
+Step 2. Make sure you have completed the ExecuTorch setup tutorials linked to at the top of this page.
+
+## Working Tree Description
+
+The working tree is:
+
+```
+executorch
+├── backends
+│ └── cadence
+│ ├── aot
+│ ├── ops_registration
+│ ├── tests
+│ ├── utils
+│ ├── hifi
+│ │ ├── kernels
+│ │ ├── operators
+│ │ └── third-party
+│ │ └── hifi4-nnlib
+│ └── [other cadence DSP families]
+│ ├── kernels
+│ ├── operators
+│ └── third-party
+│ └── [any required lib]
+└── examples
+ └── cadence
+ ├── models
+ └── operators
+```
+
+***AoT (Ahead-of-Time) Components***:
+
+The AoT folder contains all of the Python scripts and functions needed to export the model to an ExecuTorch `.pte` file. In our case, [export_example.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/export_example.py) is an API that takes a model (nn.Module) and representative inputs and runs it through the quantizer (from [quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py)). Then a few compiler passes, also defined in [quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py), will replace operators with custom ones that are supported and optimized on the chip. Any operator required by the model should be defined in [ops_registrations.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/ops_registrations.py) and have corresponding implementations in the other folders.
+
+***Operators***:
+
+The operators folder contains two kinds of operators: existing operators from the [ExecuTorch portable library](https://github.com/pytorch/executorch/tree/main/kernels/portable/cpu) and new operators that define custom computations. The former is simply dispatching the operator to the relevant ExecuTorch implementation, while the latter acts as an interface, setting up everything needed for the custom kernels to compute the outputs.
+
+***Kernels***:
+
+The kernels folder contains the optimized kernels that will run on the HiFi4 chip. They use Xtensa intrinsics to deliver high performance at low-power.
+
+## Build
+
+In this step, you will generate the ExecuTorch program from different models. You'll then use this Program (the `.pte` file) during the runtime build step to bake this Program into the DSP image.
+
+***Simple Model***:
+
+The first, simple model is meant to test that all components of this tutorial are working properly, and simply does an add operation. The generated file is called `add.pte`.
+
+```bash
+cd executorch
+python3 -m examples.portable.scripts.export --model_name="add"
+```
+
+***Quantized Operators***:
+
+The other, more complex models exercise custom operators, including:
+ - a quantized [linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) operation. The model is defined [here](https://github.com/pytorch/executorch/blob/main/examples/cadence/operators/test_quantized_linear_op.py#L30). Linear is the backbone of most Automatic Speech Recognition (ASR) models.
+ - a quantized [conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) operation. The model is defined [here](https://github.com/pytorch/executorch/blob/main/examples/cadence/operators/test_quantized_conv1d_op.py#L40). Convolutions are important in wake word and many denoising models.
+
+In both cases the generated file is called `CadenceDemoModel.pte`.
+
+```bash
+cd executorch
+python3 -m examples.cadence.operators.quantized__op
+```
+
+***Small Model: RNNT predictor***:
+
+The torchaudio [RNNT-emformer](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html) model is an Automatic Speech Recognition (ASR) model, comprised of three different submodels: an encoder, a predictor and a joiner.
+The [predictor](https://github.com/pytorch/executorch/blob/main/examples/cadence/models/rnnt_predictor.py) is a sequence of basic ops (embedding, ReLU, linear, layer norm) and can be exported using:
+
+```bash
+cd executorch
+python3 -m examples.cadence.models.rnnt_predictor
+```
+
+The generated file is called `CadenceDemoModel.pte`.
+
+### Runtime
+
+**Building the DSP firmware image**
+In this step, you'll be building the DSP firmware image that consists of the sample ExecuTorch runner along with the Program generated in the previous step. When loaded onto the DSP, this image will run the model contained in that Program.
+
+***Step 1***. Configure the environment variables needed to point to the Xtensa toolchain that you have installed in the previous step. The three environment variables that need to be set include:
+```bash
+# Directory in which the Xtensa toolchain was installed
+export XTENSA_TOOLCHAIN=/home/user_name/cadence/XtDevTools/install/tools
+# The version of the toolchain that was installed. This is essentially the name of the directory
+# that is present in the XTENSA_TOOLCHAIN directory from above.
+export TOOLCHAIN_VER=RI-2021.8-linux
+# The Xtensa core that you're targeting.
+export XTENSA_CORE=nxp_rt600_RI2021_8_newlib
+```
+
+***Step 2***. Clone the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4), which contains optimized kernels and primitives for HiFi4 DSPs, with `git clone git@github.com:foss-xtensa/nnlib-hifi4.git`.
+
+***Step 3***. Run the CMake build.
+In order to run the CMake build, you need the path to the following:
+- The Program generated in the previous step
+- Path to the NXP SDK root. This should have been installed already in the [Setting up Developer Environment](#setting-up-developer-environment) section. This is the directory that contains folders such as boards, components, devices, and others.
+
+```bash
+cd executorch
+./install_executorch.sh --clean
+mkdir cmake-out
+# prebuild and install executorch library
+cmake -DCMAKE_TOOLCHAIN_FILE=/backends/cadence/cadence.cmake \
+ -DCMAKE_INSTALL_PREFIX=cmake-out \
+ -DCMAKE_BUILD_TYPE=Debug \
+ -DPYTHON_EXECUTABLE=python3 \
+ -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
+ -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=OFF \
+ -DEXECUTORCH_BUILD_PTHREADPOOL=OFF \
+ -DEXECUTORCH_BUILD_CPUINFO=OFF \
+ -Bcmake-out .
+
+cmake --build cmake-out -j --target install --config Debug
+# build cadence runner
+cmake -DCMAKE_BUILD_TYPE=Debug \
+ -DCMAKE_TOOLCHAIN_FILE=/examples/backends/cadence.cmake \
+ -DCMAKE_PREFIX_PATH=/cmake-out \
+ -DMODEL_PATH= \
+ -DNXP_SDK_ROOT_DIR= \
+ -DNN_LIB_BASE_DIR= \
+ -Bcmake-out/examples/cadence \
+ examples/cadence
+
+cmake --build cmake-out/examples/cadence -j8 -t cadence_executorch_example
+```
+
+After successfully running the above step, you should see two binary files in the CMake output directory.
+```bash
+> ls cmake-xt/*.bin
+cmake-xt/dsp_data_release.bin cmake-xt/dsp_text_release.bin
+```
+
+## Deploying and Running on Device
+
+***Step 1***. You now take the DSP binary images generated from the previous step and copy them over into your NXP workspace created in the [Setting up Developer Environment](#setting-up-developer-environment) section. Copy the DSP images into the `dsp_binary` section highlighted in the image below.
+
+
+
+```{note}
+As long as binaries have been built using the Xtensa toolchain on Linux, flashing the board and running on the chip can be done only with the MCUXpresso IDE, which is available on all platforms (Linux, MacOS, Windows).
+```
+
+***Step 2***. Clean your workspace.
+
+***Step 3***. Click **Debug your Project** which will flash the board with your binaries.
+
+On the UART console connected to your board (at a default baud rate of 115200), you should see an output similar to this:
+
+```bash
+> screen /dev/tty.usbmodem0007288234991 115200
+Executed model
+Model executed successfully.
+First 20 elements of output 0
+0.165528 0.331055 ...
+```
+
+## Conclusion and Future Work
+
+In this tutorial, you have learned how to export a quantized operation, build the ExecuTorch runtime and run this model on the Xtensa HiFi4 DSP chip.
+
+The (quantized linear) model in this tutorial is a typical operation appearing in ASR models, and can be extended to a complete ASR model by creating the model as a new test and adding the needed operators/kernels to [operators](https://github.com/pytorch/executorch/blob/main/backends/cadence/hifi/operators) and [kernels](https://github.com/pytorch/executorch/blob/main/backends/cadence/hifi/kernels).
+
+Other models can be created following the same structure, always assuming that operators and kernels are available.
diff --git a/docs/source/backend-delegate-advanced.md b/docs/source/backend-delegate-advanced.md
index 752bd1cdc02..e82e5ee035d 100644
--- a/docs/source/backend-delegate-advanced.md
+++ b/docs/source/backend-delegate-advanced.md
@@ -6,10 +6,6 @@
- {doc}`backend-delegates-integration` — Learn how to integrate a backend delegate into ExecuTorch
-## XNNPACK Reference
-
-- {doc}`backend-delegates-xnnpack-reference` — Deep dive into XNNPACK delegate internals and implementation details
-
## Dependency Management
- {doc}`backend-delegates-dependencies` — Manage third-party dependencies for backend delegates
@@ -27,7 +23,6 @@
:maxdepth: 1
backend-delegates-integration
-backend-delegates-xnnpack-reference
backend-delegates-dependencies
compiler-delegate-and-partitioner
debug-backend-delegate
diff --git a/docs/source/backend-development.md b/docs/source/backend-development.md
index ec5ceb3b37a..40c50a8ad11 100644
--- a/docs/source/backend-development.md
+++ b/docs/source/backend-development.md
@@ -4,7 +4,6 @@
:maxdepth: 1
backend-delegates-integration
-backend-delegates-xnnpack-reference
backend-delegates-dependencies
compiler-delegate-and-partitioner
debug-backend-delegate
diff --git a/docs/source/backends-cadence.md b/docs/source/backends-cadence.md
index 9f15656d39c..667e71ea5a4 100644
--- a/docs/source/backends-cadence.md
+++ b/docs/source/backends-cadence.md
@@ -1,9 +1,12 @@
# Cadence Xtensa Backend
-In this tutorial we will walk you through the process of getting setup to build ExecuTorch for an Xtensa HiFi4 DSP and running a simple model on it.
+In this tutorial we will walk you through the process of getting set up to build ExecuTorch for Cadence Xtensa DSPs and running models on them.
-[Cadence](https://www.cadence.com/en_US/home.html) is both a hardware and software vendor, providing solutions for many computational workloads, including to run on power-limited embedded devices. The [Xtensa HiFi4 DSP](https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/hifi-dsps/hifi-4.html) is a Digital Signal Processor (DSP) that is optimized for running audio based neural networks such as wake word detection, Automatic Speech Recognition (ASR), etc.
+[Cadence](https://www.cadence.com/en_US/home.html) is both a hardware and software vendor, providing solutions for many computational workloads, including to run on power-limited embedded devices. The Cadence backend supports multiple DSP families optimized for different workloads:
+- **HiFi Audio DSPs** (HiFi4/HiFi5): Optimized for audio processing, speech recognition, and wake word detection
+- **Fusion G3 DSPs**: General-purpose AI acceleration
+- **Vision P-Series DSPs**: Specialized for computer vision and CNN workloads
In addition to the chip, the HiFi4 Neural Network Library ([nnlib](https://github.com/foss-xtensa/nnlib-hifi4)) offers an optimized set of library functions commonly used in NN processing that we utilize in this example to demonstrate how common operations can be accelerated.
@@ -67,42 +70,99 @@ The working tree is:
executorch
├── backends
│ └── cadence
-│ ├── aot
-│ ├── ops_registration
-│ ├── tests
-│ ├── utils
-│ ├── hifi
+│ ├── aot # Ahead-of-Time compilation tools
+│ │ ├── compiler.py # Main compilation API
+│ │ ├── export_example.py # Export workflow example
+│ │ ├── quantizer/ # Quantization infrastructure
+│ │ │ ├── quantizer.py # Multiple quantizer implementations
+│ │ │ ├── patterns.py # Quantization patterns
+│ │ │ └── fusion_pass.py # Op fusion pass
+│ │ ├── passes.py # Graph optimization passes
+│ │ ├── functions.yaml # Generic operator definitions
+│ │ ├── functions_hifi.yaml # HiFi-specific definitions
+│ │ ├── functions_fusion_g3.yaml # Fusion G3 definitions
+│ │ └── functions_vision.yaml # Vision-specific definitions
+│ ├── runtime/ # Runtime execution infrastructure
+│ ├── utils/ # Build utilities (FACTO, header gen)
+│ ├── hifi/ # HiFi Audio DSP family (70+ ops)
+│ │ ├── kernels # Optimized HiFi4/HiFi5 kernels
+│ │ ├── operators # HiFi operator implementations
+│ │ └── third-party
+│ │ └── nnlib # Cadence NNLIB integration
+│ ├── fusion_g3/ # Fusion G3 DSP family (25+ ops)
│ │ ├── kernels
│ │ ├── operators
│ │ └── third-party
-│ │ └── hifi4-nnlib
-│ └── [other cadence DSP families]
-│ ├── kernels
-│ ├── operators
-│ └── third-party
-│ └── [any required lib]
+│ │ └── nnlib
+│ ├── vision/ # Vision P-Series DSP family (17+ ops)
+│ │ ├── kernels
+│ │ ├── operators
+│ │ └── third-party # Vision-specific library
+│ └── generic/ # Generic fallback implementations (15+ ops)
+│ └── operators
└── examples
└── cadence
- ├── models
- └── operators
+ ├── models # 9 example models
+ │ ├── rnnt_encoder.py # ASR encoder (ConvEmformer)
+ │ ├── rnnt_predictor.py # ASR predictor
+ │ ├── rnnt_joiner.py # ASR joiner
+ │ ├── wav2vec2.py # Self-supervised speech
+ │ ├── mobilenet_v2.py # Image classification
+ │ ├── resnet18.py # Image classification
+ │ ├── resnet50.py # Image classification
+ │ ├── vision_transformer.py # ViT
+ │ └── babyllama.py # Small LLM
+ └── operators # Operator test examples
+ ├── test_add_op.py # Add operation tests
+ ├── test_quantized_linear_op.py
+ ├── test_quantized_conv1d_op.py
+ ├── test_requantize_op.py
+ └── test_g3_ops.py # FACTO-based G3 tests
```
***AoT (Ahead-of-Time) Components***:
-The AoT folder contains all of the python scripts and functions needed to export the model to an ExecuTorch `.pte` file. In our case, [export_example.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/export_example.py) is an API that takes a model (nn.Module) and representative inputs and runs it through the quantizer (from [quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py)). Then a few compiler passes, also defined in [quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py), will replace operators with custom ones that are supported and optimized on the chip. Any operator needed to compute things should be defined in [ops_registrations.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/ops_registrations.py) and have corresponding implemetations in the other folders.
+The AoT folder contains all of the Python scripts and functions needed to export the model to an ExecuTorch `.pte` file. The main components include:
+
+- **Compiler API** ([compiler.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/compiler.py)): High-level APIs for model compilation including `trace()`, `quantize_pt2()`, `export_to_edge()`, and `export_to_cadence()`. A usage sketch follows this list.
+
+- **Quantizer** ([quantizer/quantizer.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/quantizer/quantizer.py)): Multiple quantization strategies:
+ - `CadenceDefaultQuantizer`: Standard A8W8 (8-bit asymmetric activations, 8-bit weights)
+ - `CadenceWithLayerNormQuantizer`: Adds layer normalization support
+ - `CadenceWakeWordQuantizer`: Optimized for audio wake word models
+ - `CadenceW8A32MixedQuantizer`: Experimental mixed precision (8-bit weights, 32-bit activations)
+ - `CadenceWithSoftmaxQuantizer`: Includes A16 (16-bit activation) softmax
+
+- **Compiler Passes** ([passes.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/passes.py)): Graph optimization passes including operator fusion, replacement, simplification, and reordering.
+
+- **Operator Registrations** ([ops_registrations.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/ops_registrations.py)): Registers 100+ custom Cadence operators with meta kernels for shape inference. Supports quantized operations for conv1d/2d, linear, matmul, layer norm, and more.
+
+- **Export Example** ([export_example.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/export_example.py)): Reference implementation demonstrating the complete export workflow from model to `.pte` file.
+
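+As a rough usage sketch of the compiler APIs named above (illustrative only: the
+import paths and the `(model, example_inputs)` argument order are assumptions, and
+the real signatures live in
+[compiler.py](https://github.com/pytorch/executorch/blob/main/backends/cadence/aot/compiler.py)):
+
+```python
+import torch
+
+# Assumed import location, mirroring the module named above.
+from executorch.backends.cadence.aot.compiler import export_to_cadence, quantize_pt2
+
+
+class SmallLinear(torch.nn.Module):
+    """A tiny single-linear-layer model, used only for illustration."""
+
+    def __init__(self):
+        super().__init__()
+        self.linear = torch.nn.Linear(32, 16)
+
+    def forward(self, x):
+        return self.linear(x)
+
+
+model = SmallLinear().eval()
+example_inputs = (torch.randn(1, 32),)
+
+# Quantize with the default A8W8 scheme (CadenceDefaultQuantizer).
+quantized_module = quantize_pt2(model, example_inputs)
+
+# Lower the quantized module toward a Cadence edge program for .pte serialization.
+cadence_program = export_to_cadence(quantized_module, example_inputs)
+```
+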
+***DSP Family-Specific Implementations***:
-***Operators***:
+Each DSP family has its own optimized operator and kernel implementations:
-The operators folder contains two kinds of operators: existing operators from the [ExecuTorch portable library](https://github.com/pytorch/executorch/tree/main/kernels/portable/cpu) and new operators that define custom computations. The former is simply dispatching the operator to the relevant ExecuTorch implementation, while the latter acts as an interface, setting up everything needed for the custom kernels to compute the outputs.
+- **HiFi**: Extensive support for quantized convolutions (1D/2D, depthwise, dilated), linear, matmul, layer norm, ReLU, add, and more. Uses Cadence NNLIB for optimized primitives.
+
+- **Fusion G3**: General-purpose operations including arithmetic (add, sub, mul, div), activations (sigmoid, tanh, softmax), layer normalization, and tensor manipulation.
+
+- **Vision**: Vision-focused operations including quantized conv, linear, matmul, im2row transformation, and softmax, backed by a custom vision library.
+
+- **Generic**: Reference implementations used as a fallback when DSP-specific optimizations aren't available.
***Kernels***:
-The kernels folder contains the optimized kernels that will run on the HiFi4 chip. They use Xtensa intrinsics to deliver high performance at low-power.
+The kernels folders contain optimized implementations that use Xtensa intrinsics to deliver high performance at low power. Each DSP family has its own kernel implementations tuned to the characteristics of its specific architecture.
## Build
In this step, you will generate the ExecuTorch program from different models. You'll then use this Program (the `.pte` file) during the runtime build step to bake this Program into the DSP image.
+### Model Export Examples
+
+The Cadence backend provides multiple example models covering different use cases:
+
***Simple Model***:
The first, simple model is meant to test that all components of this tutorial are working properly, and simply does an add operation. The generated file is called `add.pte`.
@@ -114,28 +174,79 @@ python3 -m examples.portable.scripts.export --model_name="add"
***Quantized Operators***:
-The other, more complex model are custom operators, including:
- - a quantized [linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) operation. The model is defined [here](https://github.com/pytorch/executorch/blob/main/examples/cadence/operators/test_quantized_linear_op.py#L30). Linear is the backbone of most Automatic Speech Recognition (ASR) models.
- - a quantized [conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) operation. The model is defined [here](https://github.com/pytorch/executorch/blob/main/examples/cadence/operators/test_quantized_conv1d_op.py#L40). Convolutions are important in wake word and many denoising models.
+Test individual quantized operations:
-In both cases the generated file is called `CadenceDemoModel.pte`.
+- **Quantized Linear**: [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) operation (32→16 features). Linear is the backbone of most ASR models.
+ ```bash
+ python3 -m examples.cadence.operators.test_quantized_linear_op
+ ```
-```bash
-cd executorch
-python3 -m examples.cadence.operators.quantized__op
-```
+- **Quantized Conv1D**: [Conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) operation (8→16 channels). Important for wake word and denoising models.
+ ```bash
+ python3 -m examples.cadence.operators.test_quantized_conv1d_op
+ ```
-***Small Model: RNNT predictor***:
+- **Requantize Operation**: Tests dtype conversion between different quantized types.
+ ```bash
+ python3 -m examples.cadence.operators.test_requantize_op
+ ```
-The torchaudio [RNNT-emformer](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html) model is an Automatic Speech Recognition (ASR) model, comprised of three different submodels: an encoder, a predictor and a joiner.
-The [predictor](https://github.com/pytorch/executorch/blob/main/examples/cadence/models/rnnt_predictor.py) is a sequence of basic ops (embedding, ReLU, linear, layer norm) and can be exported using:
+In all cases the generated file is called `CadenceDemoModel.pte`.
-```bash
-cd executorch
-python3 -m examples.cadence.models.rnnt_predictor
-```
+***Speech/Audio Models***:
+
+The torchaudio [RNNT-emformer](https://pytorch.org/audio/stable/tutorials/online_asr_tutorial.html) model is an Automatic Speech Recognition (ASR) model, comprised of three different submodels:
+
+- **RNNT Predictor**: Sequence of basic ops (embedding, ReLU, linear, layer norm)
+ ```bash
+ python3 -m examples.cadence.models.rnnt_predictor
+ ```
+
+- **RNNT Encoder**: ConvEmformer-based encoder with time reduction and transformer layers
+ ```bash
+ python3 -m examples.cadence.models.rnnt_encoder
+ ```
+
+- **RNNT Joiner**: Joint network combining encoder and predictor outputs
+ ```bash
+ python3 -m examples.cadence.models.rnnt_joiner
+ ```
+
+- **Wav2Vec 2.0**: Self-supervised speech representation model
+ ```bash
+ python3 -m examples.cadence.models.wav2vec2
+ ```
+
+***Computer Vision Models***:
+
+- **MobileNet V2**: Efficient image classification
+ ```bash
+ python3 -m examples.cadence.models.mobilenet_v2
+ ```
-The generated file is called `CadenceDemoModel.pte`.
+- **ResNet-18**: Image classification
+ ```bash
+ python3 -m examples.cadence.models.resnet18
+ ```
+
+- **ResNet-50**: Deeper image classification
+ ```bash
+ python3 -m examples.cadence.models.resnet50
+ ```
+
+- **Vision Transformer (ViT)**: Transformer-based vision model
+ ```bash
+ python3 -m examples.cadence.models.vision_transformer
+ ```
+
+***Language Model***:
+
+- **Baby LLaMA**: Small LLM for testing transformer operations on DSP
+ ```bash
+ python3 -m examples.cadence.models.babyllama
+ ```
+
+All model exports generate `CadenceDemoModel.pte` files ready for deployment.
### Runtime
@@ -148,9 +259,21 @@ In this step, you'll be building the DSP firmware image that consists of the sam
export XTENSA_TOOLCHAIN=/home/user_name/cadence/XtDevTools/install/tools
# The version of the toolchain that was installed. This is essentially the name of the directory
# that is present in the XTENSA_TOOLCHAIN directory from above.
-export TOOLCHAIN_VER=RI-2021.8-linux
+export TOOLCHAIN_VER=RI-2023.11-linux
# The Xtensa core that you're targeting.
-export XTENSA_CORE=nxp_rt600_RI2021_8_newlib
+# For HiFi4 (NXP RT600):
+export XTENSA_CORE=VANILLA_HIFI
+# For Fusion G3:
+# export XTENSA_CORE=VANILLA_G3
+# For Vision P6:
+# export XTENSA_CORE=VANILLA_VISION
+```
+
+```{note}
+The Cadence backend supports multiple DSP families (an example CMake invocation follows this note):
+- **HiFi Audio DSPs** (HiFi4/HiFi5): Core `VANILLA_HIFI`, enable with `-DEXECUTORCH_NNLIB_OPT=ON`
+- **Fusion G3 DSPs**: Core `VANILLA_G3`, enable with `-DEXECUTORCH_FUSION_G3_OPT=ON`
+- **Vision P-Series DSPs**: Core `VANILLA_VISION`, enable with `-DEXECUTORCH_VISION_OPT=ON`
```
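+
+For example, a HiFi build would enable the optimized kernels by passing the corresponding flag when configuring the ExecuTorch CMake build. The snippet below is only a sketch; your actual configure step will include additional toolchain and platform options:
+
+```bash
+# Sketch only: enable the HiFi-optimized (nnlib) kernels for a Cadence build.
+# Replace the toolchain file placeholder with your Xtensa CMake toolchain and
+# add the remaining options required by your project.
+cmake -DEXECUTORCH_NNLIB_OPT=ON \
+      -DCMAKE_TOOLCHAIN_FILE=<path-to-xtensa-cmake-toolchain> \
+      -Bcmake-out .
+```
+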
***Step 2***. Clone the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4), which contains optimized kernels and primitives for HiFi4 DSPs, with `git clone git@github.com:foss-xtensa/nnlib-hifi4.git`.
@@ -199,7 +322,7 @@ cmake-xt/dsp_data_release.bin cmake-xt/dsp_text_release.bin
***Step 1***. You now take the DSP binary images generated from the previous step and copy them over into your NXP workspace created in the [Setting up Developer Environment](#setting-up-developer-environment) section. Copy the DSP images into the `dsp_binary` section highlighted in the image below.
-
+
```{note}
As long as binaries have been built using the Xtensa toolchain on Linux, flashing the board and running on the chip can be done only with the MCUXpresso IDE, which is available on all platforms (Linux, MacOS, Windows).
diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md
index bfa17bc9a9c..ddb55f2afec 100644
--- a/docs/source/backends-overview.md
+++ b/docs/source/backends-overview.md
@@ -18,20 +18,20 @@ Backends are the bridge between your exported model and the hardware it runs on.
## Choosing a Backend
-| Backend | Platform(s) | Hardware Type | Typical Use Case |
-|------------------------------------------------|---------------------|---------------|---------------------------------|
-| [XNNPACK](backends-xnnpack) | All | CPU | General-purpose, fallback |
-| [Core ML](/backends/coreml/coreml-overview.md) | iOS, macOS | NPU/GPU/CPU | Apple devices, high performance |
-| [Metal Performance Shaders](backends-mps) | iOS, macOS | GPU | Apple GPU acceleration |
-| [Vulkan ](backends-vulkan) | Android | GPU | Android GPU acceleration |
-| [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs |
-| [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs |
-| [ARM EthosU](backends-arm-ethos-u) | Embedded | NPU | ARM MCUs |
-| [ARM VGF](backends-arm-vgf) | Android | NPU | ARM platforms |
-| [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs |
-| [NXP](backends-nxp) | Embedded | NPU | NXP SoCs |
-| [Cadence](backends-cadence) | Embedded | DSP | DSP-optimized workloads |
-| [Samsung Exynos](backends-samsung-exynos) | Android | NPU | Samsung SoCs |
+| Backend | Platform(s) | Hardware Type | Typical Use Case |
+|-----------------------------------------------------------------|---------------------|---------------|---------------------------------|
+| [XNNPACK](backends/xnnpack/xnnpack-overview.md) | All | CPU | General-purpose, fallback |
+| [Core ML](/backends/coreml/coreml-overview.md) | iOS, macOS | NPU/GPU/CPU | Apple devices, high performance |
+| [Metal Performance Shaders](/backends/mps/mps-overview.md) | iOS, macOS | GPU | Apple GPU acceleration |
+| [Vulkan ](/backends/vulkan/vulkan-overview.md) | Android | GPU | Android GPU acceleration |
+| [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs |
+| [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs |
+| [ARM EthosU](backends-arm-ethos-u) | Embedded | NPU | ARM MCUs |
+| [ARM VGF](backends-arm-vgf) | Android | NPU | ARM platforms |
+| [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs |
+| [NXP](backends-nxp) | Embedded | NPU | NXP SoCs |
+| [Cadence](backends-cadence) | Embedded | DSP | DSP-optimized workloads |
+| [Samsung Exynos](/backends/samsung/samsung-overview.md) | Android | NPU | Samsung SoCs |
**Tip:** For best performance, export a `.pte` file for each backend you plan to support.
@@ -50,10 +50,10 @@ Backends are the bridge between your exported model and the hardware it runs on.
:hidden:
:caption: Backend Overview
-backends-xnnpack
+backends/xnnpack/xnnpack-overview
backends/coreml/coreml-overview
-backends-mps
-backends-vulkan
+backends/mps/mps-overview
+backends/vulkan/vulkan-overview
backends-qualcomm
backends-mediatek
backends-arm-ethos-u
@@ -61,4 +61,4 @@ backends-arm-vgf
build-run-openvino
backends-nxp
backends-cadence
-backends-samsung-exynos
+backends/samsung/samsung-overview
diff --git a/docs/source/backends-vulkan.md b/docs/source/backends-vulkan.md
deleted file mode 100644
index 3ae80950645..00000000000
--- a/docs/source/backends-vulkan.md
+++ /dev/null
@@ -1,205 +0,0 @@
-# Vulkan Backend
-
-The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
-built on top of the cross-platform Vulkan GPU API standard. It is primarily
-designed to leverage the GPU to accelerate model inference on Android devices,
-but can be used on any platform that supports an implementation of Vulkan:
-laptops, servers, and edge devices.
-
-::::{note}
-The Vulkan delegate is currently under active development, and its components
-are subject to change.
-::::
-
-## What is Vulkan?
-
-Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
-It is designed to offer developers more explicit control over GPUs compared to
-previous specifications in order to reduce overhead and maximize the
-capabilities of the modern graphics hardware.
-
-Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
-desktop and mobile) in the market support Vulkan. Vulkan is also included in
-Android from Android 7.0 onwards.
-
-**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it
-provides a way to execute compute and graphics operations on a GPU, but does not
-come with a built-in library of performant compute kernels.
-
-## The Vulkan Compute Library
-
-The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
-the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
-provide GPU implementations for PyTorch operators via GLSL compute shaders.
-
-The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
-The core components of the PyTorch Vulkan backend were forked into ExecuTorch
-and adapted for an AOT graph-mode style of model inference (as opposed to
-PyTorch which adopted an eager execution style of model inference).
-
-The components of the Vulkan Compute Library are contained in the
-`executorch/backends/vulkan/runtime/` directory. The core components are listed
-and described below:
-
-```
-runtime/
-├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
-└── graph/ .................. ComputeGraph class which implements graph mode inference
- └── ops/ ................ Base directory for operator implementations
- ├── glsl/ ........... GLSL compute shaders
- │ ├── *.glsl
- │ └── conv2d.glsl
- └── impl/ ........... C++ code to dispatch GPU compute shaders
- ├── *.cpp
- └── Conv2d.cpp
-```
-
-## Features
-
-The Vulkan delegate currently supports the following features:
-
-* **Memory Planning**
- * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
-* **Capability Based Partitioning**:
- * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only supported subgraphs
-* **Support for upper-bound dynamic shapes**:
- * Tensors can change shape between inferences as long as its current shape is smaller than the bounds specified during lowering
-
-In addition to increasing operator coverage, the following features are
-currently in development:
-
-* **Quantization Support**
- * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
-* **Memory Layout Management**
- * Memory layout is an important factor to optimizing performance. We plan to introduce graph passes to introduce memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
-* **Selective Build**
- * We plan to make it possible to control build size by selecting which operators/shaders you want to build with
-
-## End to End Example
-
-To further understand the features of the Vulkan Delegate and how to use it,
-consider the following end to end example with a simple single operator model.
-
-### Compile and lower a model to the Vulkan Delegate
-
-Assuming ExecuTorch has been set up and installed, the following script can be
-used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`.
-
-Once ExecuTorch has been set up and installed, the following script can be used
-to generate a simple model and lower it to the Vulkan delegate.
-
-```
-# Note: this script is the same as the script from the "Setting up ExecuTorch"
-# page, with one minor addition to lower to the Vulkan backend.
-import torch
-from torch.export import export
-from executorch.exir import to_edge
-
-from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
-
-# Start with a PyTorch model that adds two input tensors (matrices)
-class Add(torch.nn.Module):
- def __init__(self):
- super(Add, self).__init__()
-
- def forward(self, x: torch.Tensor, y: torch.Tensor):
- return x + y
-
-# 1. torch.export: Defines the program with the ATen operator set.
-aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))
-
-# 2. to_edge: Make optimizations for Edge devices
-edge_program = to_edge(aten_dialect)
-# 2.1 Lower to the Vulkan backend
-edge_program = edge_program.to_backend(VulkanPartitioner())
-
-# 3. to_executorch: Convert the graph to an ExecuTorch program
-executorch_program = edge_program.to_executorch()
-
-# 4. Save the compiled .pte program
-with open("vk_add.pte", "wb") as file:
- file.write(executorch_program.buffer)
-```
-
-Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
-using the `to_backend()` API. The Vulkan Delegate implements the
-`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph
-that are supported by the Vulkan delegate, and separates compatible sections of
-the model to be executed on the GPU.
-
-This means the a model can be lowered to the Vulkan delegate even if it contains
-some unsupported operators. This will just mean that only parts of the graph
-will be executed on the GPU.
-
-
-::::{note}
-The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194)
-Vulkan partitioner code can be inspected to examine which ops are currently
-implemented in the Vulkan delegate.
-::::
-
-### Build Vulkan Delegate libraries
-
-The easiest way to build and test the Vulkan Delegate is to build for Android
-and test on a local Android device. Android devices have built in support for
-Vulkan, and the Android NDK ships with a GLSL compiler which is needed to
-compile the Vulkan Compute Library's GLSL compute shaders.
-
-The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
-when building with CMake.
-
-First, make sure that you have the Android NDK installed; any NDK version past
-NDK r19c should work. Note that the examples in this doc have been validated with
-NDK r27b. The Android SDK should also be installed so that you have access to `adb`.
-
-The instructions in this page assumes that the following environment variables
-are set.
-
-```shell
-export ANDROID_NDK=
-# Select the appropriate Android ABI for your device
-export ANDROID_ABI=arm64-v8a
-# All subsequent commands should be performed from ExecuTorch repo root
-cd
-# Make sure adb works
-adb --version
-```
-
-To build and install ExecuTorch libraries (for Android) with the Vulkan
-Delegate:
-
-```shell
-# From executorch root directory
-(rm -rf cmake-android-out && \
- pp cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
- -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
- -DANDROID_ABI=$ANDROID_ABI \
- -DEXECUTORCH_BUILD_VULKAN=ON \
- -DPYTHON_EXECUTABLE=python \
- -Bcmake-android-out && \
- cmake --build cmake-android-out -j16 --target install)
-```
-
-### Run the Vulkan model on device
-
-::::{note}
-Since operator support is currently limited, only binary arithmetic operators
-will run on the GPU. Expect inference to be slow as the majority of operators
-are being executed via Portable operators.
-::::
-
-Now, the partially delegated model can be executed (partially) on your device's
-GPU!
-
-```shell
-# Build a model runner binary linked with the Vulkan delegate libs
-cmake --build cmake-android-out --target vulkan_executor_runner -j32
-
-# Push model to device
-adb push vk_add.pte /data/local/tmp/vk_add.pte
-# Push binary to device
-adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin
-
-# Run the model
-adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
-```
diff --git a/docs/source/backends-xnnpack.md b/docs/source/backends-xnnpack.md
deleted file mode 100644
index 42e76741ec8..00000000000
--- a/docs/source/backends-xnnpack.md
+++ /dev/null
@@ -1,182 +0,0 @@
-# XNNPACK Backend
-
-The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile CPUs. [XNNPACK](https://github.com/google/XNNPACK/tree/master) is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs.
-
-## Features
-
-- Wide operator support on Arm and x86 CPUs, available on any modern mobile phone.
-- Support for a wide variety of quantization schemes and quantized operators.
-- Supports fp32 and fp16 activations.
-- Supports 8-bit quantization.
-
-## Target Requirements
-
-- ARM64 on Android, iOS, macOS, Linux, and Windows.
-- ARMv7 (with NEON) on Android.
-- ARMv6 (with VFPv2) on Linux.
-- x86 and x86-64 (up to AVX512) on Windows, Linux, Android.
-
-## Development Requirements
-
-The XNNPACK delegate does not introduce any development system requirements beyond those required by
-the core ExecuTorch runtime.
-
-----
-
-## Using the XNNPACK Backend
-
-To target the XNNPACK backend during the export and lowering process, pass an instance of the `XnnpackPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision.
-
-```python
-import torch
-import torchvision.models as models
-from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
-from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
-from executorch.exir import to_edge_transform_and_lower
-
-mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
-sample_inputs = (torch.randn(1, 3, 224, 224), )
-
-et_program = to_edge_transform_and_lower(
- torch.export.export(mobilenet_v2, sample_inputs),
- partitioner=[XnnpackPartitioner()],
-).to_executorch()
-
-with open("mv2_xnnpack.pte", "wb") as file:
- et_program.write_to_file(file)
-```
-
-### Partitioner API
-
-The XNNPACK partitioner API allows for configuration of the model delegation to XNNPACK. Passing an `XnnpackPartitioner` instance with no additional parameters will run as much of the model as possible on the XNNPACK backend. This is the most common use-case. For advanced use cases, the partitioner exposes the following options via the [constructor](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/xnnpack_partitioner.py#L31):
-
- - `configs`: Control which operators are delegated to XNNPACK. By default, all available operators all delegated. See [../config/\_\_init\_\_.py](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/config/__init__.py#L66) for an exhaustive list of available operator configs.
- - `config_precisions`: Filter operators by data type. By default, delegate all precisions. One or more of `ConfigPrecisionType.FP32`, `ConfigPrecisionType.STATIC_QUANT`, or `ConfigPrecisionType.DYNAMIC_QUANT`. See [ConfigPrecisionType](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/config/xnnpack_config.py#L24).
- - `per_op_mode`: If true, emit individual delegate calls for every operator. This is an advanced option intended to reduce memory overhead in some contexts at the cost of a small amount of runtime overhead. Defaults to false.
- - `verbose`: If true, print additional information during lowering.
-
-### Testing the Model
-
-After generating the XNNPACK-delegated .pte, the model can be tested from Python using the ExecuTorch runtime python bindings. This can be used to sanity check the model and evaluate numerical accuracy. See [Testing the Model](using-executorch-export.md#testing-the-model) for more information.
-
-----
-
-## Quantization
-
-The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK Library.
-
-### Supported Quantization Schemes
-The XNNPACK delegate supports the following quantization schemes:
-
-- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
- - Supports both static and dynamic activations.
- - Supports per-channel and per-tensor schemes.
- - Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators.
-
-Weight-only quantization is not currently supported on XNNPACK.
-
-### 8-bit Quantization using the PT2E Flow
-
-To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:
-
-1) Create an instance of the `XnnpackQuantizer` class. Set quantization parameters.
-2) Use `torch.export.export` to prepare for quantization.
-3) Call `prepare_pt2e` to prepare the model for quantization.
-4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
-5) Call `convert_pt2e` to quantize the model.
-6) Export and lower the model using the standard flow.
-
-The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.
-
-```python
-import torch
-import torchvision.models as models
-from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
-from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config
-from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
-from executorch.exir import to_edge_transform_and_lower
-from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
-
-model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
-sample_inputs = (torch.randn(1, 3, 224, 224), )
-
-qparams = get_symmetric_quantization_config(is_per_channel=True) # (1)
-quantizer = XNNPACKQuantizer()
-quantizer.set_global(qparams)
-
-training_ep = torch.export.export(model, sample_inputs).module() # (2)
-prepared_model = prepare_pt2e(training_ep, quantizer) # (3)
-
-for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
- prepared_model(cal_sample) # (4) Calibrate
-
-quantized_model = convert_pt2e(prepared_model) # (5)
-
-et_program = to_edge_transform_and_lower( # (6)
- torch.export.export(quantized_model, sample_inputs),
- partitioner=[XnnpackPartitioner()],
-).to_executorch()
-```
-
-See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.
-
-### LLM quantization with quantize_
-
-The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, requiring more advanced quantization. Since quantize_ is not backend aware, it is important to use a config that is compatible with CPU/XNNPACK:
-
-* Quantize embeedings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity)
-* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity)
-
-Below is a simple example, but a more detailed tutorial including accuracy evaluation on popular LLM benchmarks can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch).
-
-```python
-from torchao.quantization.granularity import PerGroup, PerAxis
-from torchao.quantization.quant_api import (
- IntxWeightOnlyConfig,
- Int8DynamicActivationIntxWeightConfig,
- quantize_,
-)
-
-# Quantize embeddings with 8-bits, per channel
-embedding_config = IntxWeightOnlyConfig(
- weight_dtype=torch.int8,
- granularity=PerAxis(0),
-)
-qunatize_(
- eager_model,
- lambda m, fqn: isinstance(m, torch.nn.Embedding),
-)
-
-
-# Quatize linear layers with 8-bit dynamic activations and 4-bit weights
-linear_config = Int8DynamicActivationIntxWeightConfig(
- weight_dtype=torch.int4,
- weight_granularity=PerGroup(32),
-)
-quantize_(eager_model, linear_config)
-```
-
-----
-
-## Runtime Integration
-
-To run the model on-device, use the standard ExecuTorch runtime APIs. See [Running on Device](getting-started.md#running-on-device) for more information.
-
-The XNNPACK delegate is included by default in the published Android, iOS, and pip packages. When building from source, pass `-DEXECUTORCH_BUILD_XNNPACK=ON` when configuring the CMake build to compile the XNNPACK backend.
-
-To link against the backend, add the `xnnpack_backend` CMake target as a build dependency, or link directly against `libxnnpack_backend`. Due to the use of static registration, it may be necessary to link with whole-archive. This can typically be done by passing `"$"` to `target_link_libraries`.
-
-```
-# CMakeLists.txt
-add_subdirectory("executorch")
-...
-target_link_libraries(
- my_target
- PRIVATE executorch
- extension_module_static
- extension_tensor
- optimized_native_cpu_ops_lib
- xnnpack_backend)
-```
-
-No additional steps are necessary to use the backend beyond linking the target. Any XNNPACK-delegated .pte file will automatically run on the registered backend.
diff --git a/docs/source/backends/coreml/coreml-overview.md b/docs/source/backends/coreml/coreml-overview.md
index a08e3ce14ff..bff0cb8994e 100644
--- a/docs/source/backends/coreml/coreml-overview.md
+++ b/docs/source/backends/coreml/coreml-overview.md
@@ -10,12 +10,14 @@ Core ML delegate is the ExecuTorch solution to take advantage of Apple's [Core M
## Target Requirements
Below are the minimum OS requirements on various hardware for running a Core ML-delegated ExecuTorch model:
+
- [macOS](https://developer.apple.com/macos) >= 13.0
- [iOS](https://developer.apple.com/ios/) >= 16.0
- [iPadOS](https://developer.apple.com/ipados/) >= 16.0
- [tvOS](https://developer.apple.com/tvos/) >= 16.0
## Development Requirements
+
To develop you need:
- [macOS](https://developer.apple.com/macos) >= 13.0
@@ -61,7 +63,6 @@ See [Partitioner API](coreml-partitioner.md) for a reference on available partit
The Core ML delegate can also be used as a backend to execute quantized models. See [Core ML Quantization](coreml-quantization.md) for more information on available quantization schemes and APIs.
-
## Backward compatibility
Core ML supports backward compatibility via the [`minimum_deployment_target`](coreml-partitioner.md#coreml-compilespec) option. A model exported with a specific deployment target is guaranteed to work on all deployment targets >= the specified deployment target. For example, a model exported with `coremltools.target.iOS17` will work on iOS 17 or higher.
@@ -91,16 +92,15 @@ target_link_libraries(
No additional steps are necessary to use the backend beyond linking the target. A Core ML-delegated .pte file will automatically run on the registered backend.
-
## Reference
-**→{doc}`coreml-troubleshooting` — Debug common issues.**
+**→{doc}`/backends/coreml/coreml-troubleshooting` — Debug common issues.**
-**→{doc}`coreml-partitioner` — Partitioner options.**
+**→{doc}`/backends/coreml/coreml-partitioner` — Partitioner options.**
-**→{doc}`coreml-quantization` — Supported quantization schemes.**
+**→{doc}`/backends/coreml/coreml-quantization` — Supported quantization schemes.**
-**→{doc}`coreml-op-support` — Supported operators.**
+**→{doc}`/backends/coreml/coreml-op-support` — Supported operators.**
```{toctree}
:maxdepth: 2
diff --git a/docs/source/backends-mps.md b/docs/source/backends/mps/mps-overview.md
similarity index 60%
rename from docs/source/backends-mps.md
rename to docs/source/backends/mps/mps-overview.md
index 184bd88e3a7..a2280defad5 100644
--- a/docs/source/backends-mps.md
+++ b/docs/source/backends/mps/mps-overview.md
@@ -1,55 +1,27 @@
# MPS Backend
-In this tutorial we will walk you through the process of getting setup to build the MPS backend for ExecuTorch and running a simple model on it.
+MPS delegate is the ExecuTorch solution to take advantage of Apple's GPU for on-device ML using the [MPS Graph](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph?language=objc) framework and tuned kernels provided by [MPS](https://developer.apple.com/documentation/metalperformanceshaders?language=objc).
-The MPS backend device maps machine learning computational graphs and primitives on the [MPS Graph](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph?language=objc) framework and tuned kernels provided by [MPS](https://developer.apple.com/documentation/metalperformanceshaders?language=objc).
+## Target Requirements
-::::{grid} 2
-:::{grid-item-card} What you will learn in this tutorial:
-:class-card: card-prerequisites
-* In this tutorial you will learn how to export [MobileNet V3](https://pytorch.org/vision/main/models/mobilenetv3.html) model to the MPS delegate.
-* You will also learn how to compile and deploy the ExecuTorch runtime with the MPS delegate on macOS and iOS.
-:::
-:::{grid-item-card} Tutorials we recommend you complete before this:
-:class-card: card-prerequisites
-* [Introduction to ExecuTorch](intro-how-it-works.md)
-* [Getting Started](getting-started.md)
-* [Building ExecuTorch with CMake](using-executorch-building-from-source.md)
-* [ExecuTorch iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/mv3/apple/ExecuTorchDemo)
-* [ExecuTorch LLM iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple)
-:::
-::::
+Below are the minimum OS requirements on various hardware for running an MPS-delegated ExecuTorch model:
+- [macOS](https://developer.apple.com/macos) >= 12.4
+- [iOS](https://www.apple.com/ios) >= 15.4
+## Development Requirements
+To develop you need:
-## Prerequisites (Hardware and Software)
+- [Xcode](https://developer.apple.com/xcode/) >= 14.1
-In order to be able to successfully build and run a model using the MPS backend for ExecuTorch, you'll need the following hardware and software components:
+Before starting, make sure you install the Xcode Command Line Tools:
-### Hardware:
- - A [mac](https://www.apple.com/mac/) for tracing the model
-
-### Software:
-
- - **Ahead of time** tracing:
- - [macOS](https://www.apple.com/macos/) 12
-
- - **Runtime**:
- - [macOS](https://www.apple.com/macos/) >= 12.4
- - [iOS](https://www.apple.com/ios) >= 15.4
- - [Xcode](https://developer.apple.com/xcode/) >= 14.1
-
-## Setting up Developer Environment
-
-***Step 1.*** Complete the steps in [Getting Started](getting-started.md) to set up the ExecuTorch development environment.
-
-You will also need a local clone of the ExecuTorch repository. See [Building ExecuTorch from Source](using-executorch-building-from-source.html) for instructions. All commands in this document should be run from the executorch repository.
-
-## Build
+```bash
+xcode-select --install
+```
-### AOT (Ahead-of-time) Components
+## Using the MPS Backend
-**Compiling model for MPS delegate**:
-- In this step, you will generate a simple ExecuTorch program that lowers MobileNetV3 model to the MPS delegate. You'll then pass this Program (the `.pte` file) during the runtime to run it using the MPS backend.
+In this step, you will generate a simple ExecuTorch program that lowers the MobileNetV3 model to the MPS delegate. You'll then pass this program (the `.pte` file) to the runtime to run it using the MPS backend.
```bash
cd executorch
@@ -121,7 +93,7 @@ python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --generate_
python3 -m devtools.inspector.inspector_cli --etdump_path etdump.etdp --etrecord_path etrecord.bin
```
-## Deploying and Running on Device
+## Runtime Integration
***Step 1***. Create the ExecuTorch core and MPS delegate frameworks to link on iOS
```bash
@@ -146,8 +118,3 @@ From the same page, include the needed libraries for the MPS delegate:
- `Metal.framework`
In this tutorial, you have learned how to lower a model to the MPS delegate, build the mps_executor_runner and run a lowered model through the MPS delegate, or directly on device using the MPS delegate static library.
-
-
-## Frequently encountered errors and resolution.
-
-If you encountered any bugs or issues following this tutorial please file a bug/issue on the [ExecuTorch repository](https://github.com/pytorch/executorch/issues), with hashtag **#mps**.
diff --git a/docs/source/backends/samsung/samsung-op-support-table.csv b/docs/source/backends/samsung/samsung-op-support-table.csv
new file mode 100644
index 00000000000..7d925c43400
--- /dev/null
+++ b/docs/source/backends/samsung/samsung-op-support-table.csv
@@ -0,0 +1,45 @@
+Operator,Quantization,Constraints
+add,static int8,
+avg_pool2d,static int8,"ceil_mode=False, divisor_override=pooling_region"
+batch_norm,static int8,
+bmm,static int8,
+cat,static int8,at most 1 constant tensor
+clamp,static int8,
+constant_pad_nd,static int8,padding_value=0.0 only
+conv2d,static int8,constant weights
+dequantize_per_channel,,
+dequantize_per_tensor,,
+div,static int8,
+embedding,static int8,
+expand_copy,,"expanding at most one axis, new dimensions must be size 1"
+gelu,static int8,
+getitem,,
+hardsigmoid,static int8,
+hardswish,static int8,
+hardtanh,static int8,
+layer_norm,static int8,norm at last axis only
+leaky_relu,static int8,
+linear,static int8,constant weights
+log_softmax,static int8,
+max_pool2d,static int8,"ceil_mode=False, indices not supported"
+maximum,,
+mean_dim,static int8,
+minimum,,
+mul,static int8,
+permute,static int8,
+pixel_shuffle,,
+quantize_per_channel,,
+quantize_per_tensor,,
+relu,static int8,
+reshape,static int8,
+rsqrt,static int8,
+select,static int8,
+slice_copy,static int8,
+softmax,static int8,
+sqrt,static int8,
+squeeze,static int8,
+sub,static int8,
+to_copy,,memory_format=contiguous only
+unsqueeze,static int8,
+upsample_bilinear2d,static int8,
+upsample_nearest2d,static int8,
diff --git a/docs/source/backends/samsung/samsung-op-support.rst b/docs/source/backends/samsung/samsung-op-support.rst
new file mode 100644
index 00000000000..ecccd565021
--- /dev/null
+++ b/docs/source/backends/samsung/samsung-op-support.rst
@@ -0,0 +1,11 @@
+================
+Operator Support
+================
+
+This page lists the PyTorch operators currently supported by the Samsung Exynos backend.
+
+.. csv-table:: Operator Support
+ :file: samsung-op-support-table.csv
+ :header-rows: 1
+ :widths: 25 15 55
+ :align: center
diff --git a/docs/source/backends/samsung/samsung-overview.md b/docs/source/backends/samsung/samsung-overview.md
new file mode 100644
index 00000000000..8b0dea0c696
--- /dev/null
+++ b/docs/source/backends/samsung/samsung-overview.md
@@ -0,0 +1,117 @@
+# Samsung Exynos Backend
+
+ExecuTorch's Samsung Exynos backend enables the execution of ExecuTorch models on
+Samsung SoCs via the NPU/DSP. The delegate is built on top of the
+[Samsung Exynos AI Litecore SDK](https://soc-developer.semiconductor.samsung.com/global/development/ai-litecore).
+
+## Features
+
+- Wide range of operator support
+- Supported inference precisions:
+ - FP16
+ - 8-bit statically quantized (int8/uint8)
+ - 16-bit statically quantized (int16/uint16)
+
+## Target Requirements
+
+Currently, the Samsung Exynos backend is supported only for devices with the
+following chipsets:
+
+- Exynos 2500 (E9955)
+
+## Development Requirements
+
+The [Samsung Exynos AI Litecore SDK](https://soc-developer.semiconductor.samsung.com/global/development/ai-litecore)
+is required to build the Exynos backend from source, and is also required to
+export models to the Exynos delegate.
+
+----
+
+## Using the Samsung Exynos Backend
+
+To target the Exynos backend during the export and lowering process, pass an instance of
+the `EnnPartitioner` to `to_edge_transform_and_lower`. The example below
+demonstrates this process using the MobileNet V2 model from torchvision.
+
+```python
+import torch
+import torchvision.models as models
+from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
+from executorch.backends.samsung.partition.enn_partitioner import EnnPartitioner
+from executorch.backends.samsung.serialization.compile_options import (
+ gen_samsung_backend_compile_spec,
+)
+from executorch.exir import to_edge_transform_and_lower
+
+mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
+sample_inputs = (torch.randn(1, 3, 224, 224), )
+
+chipset = "E9955"
+compile_specs = [gen_samsung_backend_compile_spec(chipset)]
+
+et_program = to_edge_transform_and_lower(
+ torch.export.export(mobilenet_v2, sample_inputs),
+ partitioner=[EnnPartitioner(compile_specs)],
+).to_executorch()
+
+with open("mv2_exynos.pte", "wb") as file:
+ et_program.write_to_file(file)
+```
+
+See [Partitioner API](samsung-partitioner.md) for a reference on available partitioner options.
+
+----
+
+## Quantization
+
+The Samsung Exynos backend supports statically quantized models with 8-bit and 16-bit
+integral types.
+
+See [Samsung Exynos Quantization](samsung-quantization.md) for more
+information on available quantization schemes and APIs.
+
+----
+
+## Runtime Integration
+
+To run the model on-device, use the standard ExecuTorch runtime APIs.
+
+The Exynos backend is currently not available in any of ExecuTorch's published packages.
+To use it, build ExecuTorch from source and pass `-DEXECUTORCH_BUILD_EXYNOS=ON` when
+configuring the CMake build. See [Running on Device](/getting-started.md#running-on-device)
+for more information.
+
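+As a rough sketch (assuming the Exynos AI Litecore SDK is already set up; your configure step will also need any toolchain and SDK options required for your target), enabling the backend might look like:
+
+```bash
+# Sketch only: configure ExecuTorch with the Exynos backend enabled, then build and install.
+cmake -DEXECUTORCH_BUILD_EXYNOS=ON -Bcmake-out .
+cmake --build cmake-out -j16 --target install
+```
+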
+Then, to link against the backend, add the `executorch_backends` CMake target as a build
+dependency.
+
+```
+# CMakeLists.txt
+add_subdirectory("executorch")
+...
+target_link_libraries(
+ my_target
+ PRIVATE executorch
+ executorch_backends
+ ...
+)
+```
+
+No additional steps are necessary to use the backend beyond linking the target. Any
+Exynos-delegated .pte file will automatically run on the registered backend.
+
+## Reference
+
+**→{doc}`samsung-partitioner` — Partitioner options.**
+
+**→{doc}`samsung-quantization` — Supported quantization schemes.**
+
+**→{doc}`samsung-op-support` — Supported operators.**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: Exynos Backend
+
+samsung-partitioner
+samsung-quantization
+samsung-op-support
+```
diff --git a/docs/source/backends/samsung/samsung-partitioner.md b/docs/source/backends/samsung/samsung-partitioner.md
new file mode 100644
index 00000000000..eb84a795551
--- /dev/null
+++ b/docs/source/backends/samsung/samsung-partitioner.md
@@ -0,0 +1,29 @@
+# Partitioner API
+
+The `EnnPartitioner` API is the primary entrypoint when exporting a model to the Samsung
+Exynos backend. The partitioner is responsible for determining which parts of the model
+should be lowered to the backend and also provides an interface for configuring the
+behaviour of the backend.
+
+Currently, the configuration options for `EnnPartitioner` can be generated automatically
+using the `gen_samsung_backend_compile_spec` API. For instance,
+
+```python
+from executorch.backends.samsung.partition.enn_partitioner import EnnPartitioner
+from executorch.backends.samsung.serialization.compile_options import (
+ gen_samsung_backend_compile_spec,
+)
+
+from executorch.exir import to_edge_transform_and_lower
+
+chipset = "E9955"
+compile_specs = [gen_samsung_backend_compile_spec(chipset)]
+
+et_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[EnnPartitioner(compile_specs)],
+).to_executorch()
+```
+
+At the moment, only `"E9955"` is supported as a valid chipset name, which corresponds to
+the Exynos 2500 SoC. Support for additional chipsets will be added in the future.
diff --git a/docs/source/backends/samsung/samsung-quantization.md b/docs/source/backends/samsung/samsung-quantization.md
new file mode 100644
index 00000000000..ad4b50cb93d
--- /dev/null
+++ b/docs/source/backends/samsung/samsung-quantization.md
@@ -0,0 +1,60 @@
+# Quantization
+
+The Exynos backend currently supports executing statically quantized 8-bit models.
+
+### 8-bit quantization with the PT2E quantization flow
+
+To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:
+
+1) Create an instance of the `EnnQuantizer` class and set the desired quantization behaviour.
+2) Use `torch.export.export` to obtain a graph module representation of the source model.
+3) Use `prepare_pt2e` to prepare the model for quantization.
+4) Execute the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+5) Use `convert_pt2e` to quantize the model.
+6) Export and lower the model using the standard export flow.
+
+The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using
+the same export flow as non-quantized models. As it is a regular PyTorch model, it can
+also be used to evaluate the accuracy of the quantized model using standard PyTorch
+techniques.
+
+The below example shows how to quantize a MobileNetV2 model using the PT2E quantization flow.
+
+```python
+import torch
+import torchvision.models as models
+from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
+
+from executorch.backends.samsung.partition.enn_partitioner import EnnPartitioner
+from executorch.backends.samsung.quantizer.quantizer import EnnQuantizer, Precision
+
+from executorch.exir import to_edge_transform_and_lower
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
+sample_inputs = (torch.randn(1, 3, 224, 224), )
+
+# Currently, "A8W8" is the only supported precision mode
+precision = "A8W8"
+is_per_channel = True
+is_qat = False
+
+quantizer = EnnQuantizer()
+quantizer.set_quant_params(precision, is_per_channel, is_qat) # (1)
+
+training_ep = torch.export.export(model, sample_inputs).module() # (2)
+prepared_model = prepare_pt2e(training_ep, quantizer) # (3)
+
+for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
+ prepared_model(cal_sample) # (4) Calibrate
+
+quantized_model = convert_pt2e(prepared_model) # (5)
+
+et_program = to_edge_transform_and_lower( # (6)
+ torch.export.export(quantized_model, sample_inputs),
+ partitioner=[EnnPartitioner()],
+).to_executorch()
+```
+
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html)
+for more information.
diff --git a/docs/source/backends/template/README.md b/docs/source/backends/template/README.md
new file mode 100644
index 00000000000..e7cb037bd6c
--- /dev/null
+++ b/docs/source/backends/template/README.md
@@ -0,0 +1,53 @@
+# Backend Documentation Template
+
+This template provides a standardized structure and starting point for backend documentation. It is intended to provide a uniform experience for users while allowing for backends to customize their documentation as needed.
+
+## Template Structure
+
+The template includes the following files:
+
+### Required Pages
+
+- `backend-overview.md` - Main backend overview and introduction
+
+### Recommended Pages
+
+- `backend-quantization.md` - Quantization support and API documentation
+- `backend-partitioner.md` - Partitioner API reference
+- `op-support.csv` - Operator support data in CSV format
+
+### Optional Pages (and Subsections)
+
+- `backend-troubleshooting.md` - Common issues and troubleshooting guide
+- `backend-op-support.rst` - Operator support documentation (RST format)
+- `backend-arch-internals.md` - Architecture and internals documentation
+- `tutorials/backend-tutorials.md` - Tutorial sub-section
+ - Use this sub-section to provide tutorials for your backend.
+ - Tutorials should explain how a user can accomplish a task, in a step by step manner.
+ - Some examples might include:
+ - An end to end example of lowering and running a model on a specific platform.
+- `tutorials/backend-guides.md` - Guides sub-section
+ - Use this sub-section to provide guides or how-tos for backend-specific functionality.
+ - Guides should focus on providing information and building conceptual understanding, rather than giving step by step directions.
+ - Some examples might include:
+ - LLM attention management / static attention
+ - Performance optimization guide
+
+## Using the Template
+
+To use this template for a new backend:
+
+1. Copy the entire `template` directory contents to your backend's documentation directory
+2. Rename files to match your backend name (e.g., `backend-overview.md` → `mybackend-overview.md`)
+3. Populate the content for your backend (see the example commands below).
+
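+For example, creating documentation for a hypothetical backend named `mybackend` might look like the following (a sketch only; adjust paths and file names as needed):
+
+```bash
+# Sketch only: copy the template and rename its files for a backend called "mybackend".
+cp -r docs/source/backends/template docs/source/backends/mybackend
+cd docs/source/backends/mybackend
+mv backend-overview.md mybackend-overview.md
+mv backend-quantization.md mybackend-quantization.md
+# ...rename the remaining files in the same way, then fill in their contents.
+```
+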
+### Additional Customization
+
+You may need to:
+- Add backend-specific sections to any file
+- Remove sections that don't apply to your backend
+- Update the operator support CSV with your backend's supported operators
+- Add backend-specific images or diagrams
+- Update cross-references and links
+
+Try to keep the landing page (`backend-overview.md`) simple and straightforward. Use the child pages and sections to provide more detailed information.
diff --git a/docs/source/backends/template/backend-arch-internals.md b/docs/source/backends/template/backend-arch-internals.md
new file mode 100644
index 00000000000..66c4a27eb4e
--- /dev/null
+++ b/docs/source/backends/template/backend-arch-internals.md
@@ -0,0 +1,8 @@
+# {BACKEND_NAME} Architecture and Internals
+
+This page covers internal implementation details of the backend, and is mainly aimed at contributors and heavy power users. This is an optional page for each backend and has no set structure.
+
+Some topics to consider:
+ * High-level design of the backend
+ * Details on the lowering flow
+ * Internal debugging tools and techniques
diff --git a/docs/source/backend-template.md b/docs/source/backends/template/backend-overview.md
similarity index 62%
rename from docs/source/backend-template.md
rename to docs/source/backends/template/backend-overview.md
index bf992c1ffab..666b70e1584 100644
--- a/docs/source/backend-template.md
+++ b/docs/source/backends/template/backend-overview.md
@@ -4,7 +4,7 @@ Provide a brief overview/description of the backend. At a high-level, what does
## Features
-List high-level features of backend, such as general operator and hardware support.
+List high-level features of backend, such as operator and hardware support.
## Target Requirements
@@ -18,27 +18,37 @@ What software and hardware is needed to create a .PTE file targeting this backen
This section describes the steps users need to take in order to generate a .PTE targeting this backend. Include a full code sample for exporting and lowering a model to this backend. Make sure relevant imports for the backend partitioner are included.
-### Partitioner API
+## Runtime Integration
-What options, if any, does the partitioner take? Are there any other export-time configurations that can be applied? Document each option.
+This section is intended to tell the user all of the steps they'll need to take to be able to run a .PTE file on-device that is targeting the given backend.
+- What CMake targets should they link to?
+- How is this backend compiled from source?
+- Is the backend bundled by default in iOS and/or Android pre-built libraries?
-### Quantization
+## Reference
-What quantization schemes does this backend support? Consider including the following, as appropriate.
-- What operators are supported?
-- Number of bits?
-- Static vs dynamic activations?
-- Weight only vs activations + weights?
-- Symmetric vs asymmetric weights?
-- Per-tensor, per-chanel, group/blockwise?
+**→{doc}`backend-partitioner` — Partitioner options.**
-If using a PT2E quantizer, document how to initialize the quantizer and all relevant configs and options.
+**→{doc}`backend-quantization` — Supported quantization schemes.**
-Include a code snippet demonstrating how to perform quantization for this backend. Document, or link to, a description of the parameters that the user can specify.
+**→{doc}`backend-troubleshooting` — Debug common issues.**
-## Runtime Integration
+**→{doc}`backend-arch-internals` — Backend internals.**
-This section is intended to tell the user all of the steps they'll need to take to be able to run a .PTE file on-device that is targeting the given backend.
-- What CMake targets should they link to?
-- How is this backend compiled from source?
-- Is the backend bundled by default in iOS and/or Android pre-built libraries?
+**→{doc}`tutorials/backend-tutorials` — Tutorials.**
+
+**→{doc}`guides/backend-guides` — Guides.**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: {BACKEND} Backend
+
+backend-troubleshooting
+backend-partitioner
+backend-quantization
+backend-op-support
+backend-arch-internals
+tutorials/backend-tutorials
+guides/backend-guides
+```
diff --git a/docs/source/backends/template/backend-partitioner.rst b/docs/source/backends/template/backend-partitioner.rst
new file mode 100644
index 00000000000..981e5744aed
--- /dev/null
+++ b/docs/source/backends/template/backend-partitioner.rst
@@ -0,0 +1,25 @@
+==============================
+{BACKEND_NAME} Partitioner API
+==============================
+
+Document the partitioner API for the backend, including configuration options and compile specs.
+
+- ``option1``: Description of the option and values.
+- ``option2``: Description of the second option.
+- ``option3``: Description of the third option.
+
+{ADDITIONAL_PARTITIONER_DETAILS}
+
+================
+Operator Support
+================
+
+This page lists the operators supported by the {BACKEND_NAME} backend. Operators are the building blocks of the ML model. See `IRs `_ for more information on the PyTorch operator set.
+
+{OPERATOR_SUPPORT_NOTES}
+
+.. csv-table:: Operator Support
+ :file: op-support.csv
+ :header-rows: 1
+ :widths: 20 15 30 30
+ :align: center
diff --git a/docs/source/backends/template/backend-quantization.md b/docs/source/backends/template/backend-quantization.md
new file mode 100644
index 00000000000..4997a56e248
--- /dev/null
+++ b/docs/source/backends/template/backend-quantization.md
@@ -0,0 +1,31 @@
+# {BACKEND_NAME} Quantization
+
+Document quantization schemes and flows for the backend. This should include a description of each scheme and a code example to perform quantization. Example sections for PT2E and quantize_ are included below, to be replaced with details for the target backend.
+
+For each supported quantization scheme, include the following:
+ * What is the quantization scheme?
+ * How are weights quantized?
+ * How are activations quantized? Static or dynamic?
+ * How many bits?
+ * What is the granularity? Per-tensor, per-channel, group/block-wise?
+ * What are the steps to quantize a model with this scheme?
+ * Include a code sample.
+ * If the quantization flow only supports a small set of operators - for example, linear only - note this.
+
+### Supported Quantization Schemes
+The {BACKEND_NAME} delegate supports the following quantization schemes:
+
+- {QUANTIZATION_SCHEME_1}
+- {QUANTIZATION_SCHEME_2}
+
+### {QUANTIZATION_METHOD_1} using the PT2E Flow
+
+[Description]
+
+[Code Sample]
+
+### LLM Quantization with quantize_
+
+[Description]
+
+[Code Sample]
diff --git a/docs/source/backends/template/backend-troubleshooting.md b/docs/source/backends/template/backend-troubleshooting.md
new file mode 100644
index 00000000000..851c04f34ea
--- /dev/null
+++ b/docs/source/backends/template/backend-troubleshooting.md
@@ -0,0 +1,15 @@
+# {BACKEND_NAME} Troubleshooting
+
+This page describes common issues that you may encounter when using the {BACKEND_NAME} backend and how to debug and resolve them.
+
+## {COMMON_ISSUE_1}
+
+{ISSUE_DESCRIPTION_1}
+
+{SOLUTION_STEPS_1}
+
+## {COMMON_ISSUE_2}
+
+{ISSUE_DESCRIPTION_2}
+
+{SOLUTION_STEPS_2}
diff --git a/docs/source/backends/template/guides/backend-basic-guide.md b/docs/source/backends/template/guides/backend-basic-guide.md
new file mode 100644
index 00000000000..44f86d8bd4d
--- /dev/null
+++ b/docs/source/backends/template/guides/backend-basic-guide.md
@@ -0,0 +1,3 @@
+# Using {FEATURE} on {BACKEND_NAME}
+
+This is a placeholder guide.
diff --git a/docs/source/backends/template/guides/backend-guides.md b/docs/source/backends/template/guides/backend-guides.md
new file mode 100644
index 00000000000..dbeaf25742a
--- /dev/null
+++ b/docs/source/backends/template/guides/backend-guides.md
@@ -0,0 +1,10 @@
+# {BACKEND_NAME} Guides
+
+**→{doc}`{backend_name}-basic-guide` — Guide description.**
+
+```{toctree}
+:hidden:
+:maxdepth: 1
+
+{backend_name}-basic-guide
+```
diff --git a/docs/source/backends/template/op-support.csv b/docs/source/backends/template/op-support.csv
new file mode 100644
index 00000000000..66af56d6a44
--- /dev/null
+++ b/docs/source/backends/template/op-support.csv
@@ -0,0 +1,6 @@
+Operator,Compute DType,Quantization,Constraints
+{OPERATOR_1},{DTYPE_SUPPORT_1},{QUANTIZATION_SUPPORT_1},{CONSTRAINTS_1}
+{OPERATOR_2},{DTYPE_SUPPORT_2},{QUANTIZATION_SUPPORT_2},{CONSTRAINTS_2}
+{OPERATOR_3},{DTYPE_SUPPORT_3},{QUANTIZATION_SUPPORT_3},{CONSTRAINTS_3}
+{OPERATOR_4},{DTYPE_SUPPORT_4},{QUANTIZATION_SUPPORT_4},{CONSTRAINTS_4}
+{OPERATOR_5},{DTYPE_SUPPORT_5},{QUANTIZATION_SUPPORT_5},{CONSTRAINTS_5}
diff --git a/docs/source/backends/template/tutorials/backend-basic-tutorial.md b/docs/source/backends/template/tutorials/backend-basic-tutorial.md
new file mode 100644
index 00000000000..23d76857116
--- /dev/null
+++ b/docs/source/backends/template/tutorials/backend-basic-tutorial.md
@@ -0,0 +1,91 @@
+# Preparing a Model for {BACKEND_NAME}
+
+This is a placeholder tutorial.
+
+## Step 1: Environment Setup
+
+This tutorial is intended to be run from a {SUPPORTED_HOST_OS} and uses Conda for Python environment management. For full setup details and system requirements, see [Getting Started with ExecuTorch](/getting-started).
+
+Create a Conda environment and install the ExecuTorch Python package.
+```bash
+conda create -y --name executorch python=3.12
+conda activate executorch
+pip install executorch
+```
+
+{ADDITIONAL_SETUP_STEPS}
+
+## Step 2: Model Preparation
+
+Create a Python file named `export_{model_filename}.py`. This script will load the {EXAMPLE_MODEL} model from {MODEL_SOURCE} and create a {BACKEND_NAME}-targeted .pte file.
+
+```py
+# export_{model_filename}.py
+from executorch.backends.{backend_name}.partition.{backend_name}_partitioner import {BackendName}Partitioner
+from executorch.exir import to_edge_transform_and_lower
+import torch
+import {MODEL_IMPORT}
+```
+
+### Model Instantiation and Example Inputs
+
+Instantiate the {EXAMPLE_MODEL} model from [{MODEL_SOURCE}]({MODEL_SOURCE_URL}). The export process also needs an example model input to trace the model. The model takes {MODEL_INPUT_DESCRIPTION}, so we'll create {INPUT_TUPLE_DESCRIPTION}.
+```py
+model = {MODEL_INSTANTIATION_CODE}
+example_inputs = ({EXAMPLE_INPUTS},)
+```
+
+### Lower the Model
+
+Next, export and lower the model to ExecuTorch. Note that the `{BackendName}Partitioner` passed to the `partitioner` parameter tells ExecuTorch to target the {BACKEND_NAME} backend.
+```py
+exported_program = torch.export.export(model, example_inputs)
+
+executorch_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[{BackendName}Partitioner()],
+).to_executorch()
+
+executorch_program.save("{model_filename}_{backend_name}.pte")
+```
+
+### Run the Script
+
+Save the above script as `export_{model_filename}.py` and run it. You should see a file named `{model_filename}_{backend_name}.pte` in the current directory.
+```bash
+python export_{model_filename}.py
+```
+
+## Step 3: Running the Model
+
+The .pte file created in the previous step can be run on a variety of devices, including {SUPPORTED_PLATFORMS}. ExecuTorch provides runtime APIs and language bindings for a variety of platforms. This tutorial will demonstrate running the model on a desktop using the Python runtime.
+
+### Smoke Test
+
+First, we'll verify that the model loads and runs correctly by running the model with {TEST_INPUT_DESCRIPTION}. Create a new script, named `run_{model_filename}.py`, and add the following code.
+```py
+# run_{model_filename}.py
+
+from executorch.runtime import Runtime
+import torch
+
+runtime = Runtime.get()
+
+input_tensor = {TEST_INPUT_TENSOR}
+program = runtime.load_program("{model_filename}_{backend_name}.pte")
+method = program.load_method("forward")
+outputs = method.execute([input_tensor])[0]
+
+print(outputs)
+```
+
+When running the script with `python run_{model_filename}.py`, you should see {EXPECTED_OUTPUT_DESCRIPTION} printed to the console.
+```
+{EXPECTED_OUTPUT_EXAMPLE}
+```
+
+## Next Steps
+
+ - See [Edge Platforms](/edge-platforms-section) to deploy the .pte file on {SUPPORTED_PLATFORMS}.
+ - See [Model Export and Lowering](/using-executorch-export) for more information on model preparation.
+ - See [{BACKEND_NAME} Overview](/backends/{backend_name}/{backend_name}-overview) for more information about the {BACKEND_NAME} backend.
diff --git a/docs/source/backends/template/tutorials/backend-tutorials.md b/docs/source/backends/template/tutorials/backend-tutorials.md
new file mode 100644
index 00000000000..15e226dd5c5
--- /dev/null
+++ b/docs/source/backends/template/tutorials/backend-tutorials.md
@@ -0,0 +1,10 @@
+# {BACKEND_NAME} Tutorials
+
+**→{doc}`{backend_name}-basic-tutorial` — Lower and run a model on the {BACKEND_NAME} backend.**
+
+```{toctree}
+:hidden:
+:maxdepth: 1
+
+{backend_name}-basic-tutorial
+```
diff --git a/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md
new file mode 100644
index 00000000000..cb14c72331e
--- /dev/null
+++ b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md
@@ -0,0 +1,159 @@
+# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device
+
+This tutorial assumes that you have a working local copy of the ExecuTorch repo,
+and have gone through the steps to install the executorch pip package or have
+installed it by building from source.
+
+This tutorial also assumes that you have the Android SDK tools installed and
+that you are able to connect to an Android device via `adb`.
+
+Finally, the Android NDK should also be installed, and your environment should
+have a variable `ANDROID_NDK` that points to the root directory of the NDK.
+
+```shell
+export ANDROID_NDK=
+```
+
+## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer
+
+The model checkpoint and tokenizer can be downloaded from the
+[Meta Llama website](https://www.llama.com/llama-downloads/).
+
+The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`.
+
+## Export the Llama 3.2 1B/3B model
+
+First, navigate to the root of the ExecuTorch repo.
+
+```shell
+# Navigate to executorch root
+cd ~/executorch
+```
+
+Then, set some environment variables to describe how the model should be
+exported. Feel free to tune the values to your preferences.
+
+```shell
+export LLM_NAME=Llama3.2 && \
+export LLM_SIZE=1B && \
+export LLM_SUFFIX="-Instruct" && \
+export QUANT=8da4w && \
+export BACKEND=vulkan && \
+export GROUP_SIZE=64 && \
+export CONTEXT_LENGTH=2048
+```
+
+Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that
+the `--vulkan-force-fp16` flag is set, which will improve model inference
+latency at the cost of model accuracy. Feel free to remove this flag.
+
+```shell
+python -m examples.models.llama.export_llama \
+ -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
+ -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
+ -d fp32 --${BACKEND} \
+ -qmode ${QUANT} -G ${GROUP_SIZE} \
+ --max_seq_length ${CONTEXT_LENGTH} \
+ --max_context_length ${CONTEXT_LENGTH} \
+ -kv --use_sdpa_with_kv_cache \
+ --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
+ --model "llama3_2" \
+ --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
+
+```
+
+After exporting the model, push the exported `.pte` file and the tokenizer to
+your device.
+
+```shell
+adb shell mkdir -p /data/local/tmp/llama && \
+adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \
+ /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model && \
+adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
+ /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
+```
+
+## Build Core ExecuTorch Components
+
+To run the `.pte` file on device, the core ExecuTorch libraries, including the
+Vulkan backend, must first be compiled for Android.
+
+```shell
+cmake . \
+ -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
+ -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+ -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
+ --preset "android-arm64-v8a" \
+ -DANDROID_PLATFORM=android-28 \
+ -DPYTHON_EXECUTABLE=python \
+ -DCMAKE_BUILD_TYPE=Release \
+ -DEXECUTORCH_PAL_DEFAULT=posix \
+ -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
+ -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
+ -DEXECUTORCH_BUILD_VULKAN=ON \
+ -DEXECUTORCH_BUILD_TESTS=OFF \
+ -Bcmake-out-android-so && \
+cmake --build cmake-out-android-so -j16 --target install --config Release
+```
+
+## Build and push the llama runner binary to Android
+
+Then, build a binary that can be used to run the `.pte` file.
+
+```shell
+cmake examples/models/llama \
+ -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
+ -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+ -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
+ -DEXECUTORCH_ENABLE_LOGGING=ON \
+ -DANDROID_ABI=arm64-v8a \
+ -DANDROID_PLATFORM=android-28 \
+ -DCMAKE_BUILD_TYPE=Release \
+ -DPYTHON_EXECUTABLE=python \
+ -Bcmake-out-android-so/examples/models/llama && \
+cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release
+```
+
+Once the binary is built, it can be pushed to your Android device.
+
+```shell
+adb shell mkdir -p /data/local/tmp/etvk/ && \
+adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/
+```
+
+## Execute the llama runner binary
+
+Finally, we can execute the lowered `.pte` file on your device.
+
+```shell
+adb shell /data/local/tmp/etvk/llama_main \
+ --model_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \
+ --tokenizer_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model \
+ --temperature=0 --seq_len=400 --warmup \
+ --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\"
+```
+
+Here is some sample output captured from a Galaxy S24:
+
+```shell
+E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I'
+<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
+
+Here is a short poem I came up with:
+
+"Moonlight whispers secrets to the night
+A gentle breeze that rustles the light
+The stars up high, a twinkling show
+A peaceful world, where dreams grow slow"
+
+I hope you enjoy it!<|eot_id|>
+
+PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
+ Prompt Tokens: 14 Generated Tokens: 54
+ Model Load Time: 2.277000 (seconds)
+ Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second)
+ Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second)
+ Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second)
+ Time to first generated token: 0.164000 (seconds)
+ Sampling time over 68 tokens: 0.019000 (seconds)
+```
diff --git a/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md
new file mode 100644
index 00000000000..07982d81c1c
--- /dev/null
+++ b/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md
@@ -0,0 +1,144 @@
+# Executing and profiling an ExecuTorch Vulkan model on device
+
+This tutorial assumes that you have a working local copy of the ExecuTorch repo,
+and have gone through the steps to install the executorch pip package or have
+installed it by building from source.
+
+This tutorial also assumes that you have the Android SDK tools installed and
+that you are able to connect to an Android device via `adb`.
+
+Finally, the Android NDK should also be installed, and your environment should
+have a variable `ANDROID_NDK` that points to the root directory of the NDK.
+
+```shell
+export ANDROID_NDK=
+```
+
+## Lower a model to ExecuTorch Vulkan and obtain the `.pte` file
+
+The commands in this tutorial are assumed to be executed from ExecuTorch's root
+directory.
+
+```shell
+cd ~/executorch
+```
+
+For this tutorial, we will use the export script in
+[`executorch/examples/vulkan/export.py`](https://github.com/pytorch/executorch/tree/main/examples/vulkan);
+however, any method of generating a `.pte` file will suffice. Here, the
+InceptionV3 model is exported.
+
+```shell
+python -m examples.vulkan.export --model_name=ic3 -o . -fp16
+```
+
+After exporting, there should be a file called `ic3_vulkan.pte` in the root
+directory of ExecuTorch. Feel free to modify the `-o` argument of the script to
+control where the `.pte` file will be stored.
+
+Then, push the `.pte` file to device.
+
+```shell
+adb shell mkdir -p /data/local/tmp/etvk/models/ && \
+adb push ic3_vulkan.pte /data/local/tmp/etvk/models/ic3_vulkan.pte
+```
+
+## Build the `executor_runner` binary and push to device
+
+To run the `.pte` file on device, the core ExecuTorch libraries, including the
+Vulkan backend, must first be compiled for Android. Note that
+`-DEXECUTORCH_ENABLE_EVENT_TRACER=ON` is used to turn on profiling, and
+`-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON` is used to build the runner binary that
+will be used to execute and profile the `.pte` file.
+
+```shell
+cmake . \
+ -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
+ -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+ -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
+ --preset "android-arm64-v8a" \
+ -DANDROID_PLATFORM=android-28 \
+ -DPYTHON_EXECUTABLE=python \
+ -DCMAKE_BUILD_TYPE=Release \
+ -DEXECUTORCH_PAL_DEFAULT=posix \
+ -DEXECUTORCH_BUILD_LLAMA_JNI=ON \
+ -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
+ -DEXECUTORCH_BUILD_VULKAN=ON \
+ -DEXECUTORCH_BUILD_TESTS=OFF \
+ -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON \
+ -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
+ -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \
+ -Bcmake-out-android-so && \
+cmake --build cmake-out-android-so -j16 --target install --config Release
+```
+
+Once the build completes, we can push the runner binary to device.
+
+```shell
+adb push cmake-out-android-so/executor_runner /data/local/tmp/etvk/executor_runner
+```
+
+## Execute the `.pte` file
+
+Finally, we can execute the lowered `.pte` file on your device. To do a quick
+test run of the model without profiling:
+
+```shell
+adb shell /data/local/tmp/etvk/executor_runner \
+ --model_path /data/local/tmp/etvk/models/ic3_vulkan.pte
+```
+
+Now, with profiling:
+
+```shell
+MODEL_NAME=ic3 && \
+BACKEND=vulkan && \
+NUM_ITERS=3 && \
+adb shell mkdir -p /data/local/tmp/etvk/etdumps/ && \
+adb shell /data/local/tmp/etvk/executor_runner \
+ --model_path /data/local/tmp/etvk/models/${MODEL_NAME}_${BACKEND}.pte \
+ --num_executions=${NUM_ITERS} \
+ --etdump_path /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \
+adb pull /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp ${MODEL_NAME}_${BACKEND}.etdp && \
+adb shell rm /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \
+python devtools/inspector/inspector_cli.py \
+ --etdump_path ${MODEL_NAME}_${BACKEND}.etdp
+```
+
+Here is some sample (tailed) output from a Samsung Galaxy S24:
+
+```shell
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 165 │ Execute │ conv2d_clamp_half_163 │ 0.345082 │ 0.346164 │ 0.346247 │ 0.345748 │ 0.344812 │ 0.346268 │ [] │ True │ │ [2081488974948084, 2081488995911052, 2081489016763676] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 166 │ Execute │ conv2d_clamp_half_164 │ 0.306124 │ 0.30654 │ 0.306998 │ 0.306557 │ 0.30602 │ 0.307112 │ [] │ True │ │ [2081488975294716, 2081488996256228, 2081489017110204] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 167 │ Execute │ set_zero_int32_165 │ 0.00240245 │ 0.00244403 │ 0.00248561 │ 0.00244403 │ 0.00239205 │ 0.002496 │ [] │ True │ │ [2081488975601100, 2081488996563132, 2081489017417680] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 168 │ Execute │ concat_2_texture3d_half_166 │ 0.0122305 │ 0.01248 │ 0.0125634 │ 0.0124108 │ 0.0121682 │ 0.0125842 │ [] │ True │ │ [2081488975603960, 2081488996565940, 2081489017420436] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 169 │ Execute │ set_zero_int32_167 │ 0.00157056 │ 0.00161195 │ 0.00161214 │ 0.00159478 │ 0.00156021 │ 0.00161219 │ [] │ True │ │ [2081488975616804, 2081488996578888, 2081489017432968] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 170 │ Execute │ concat_3_texture3d_half_168 │ 0.0420369 │ 0.0423281 │ 0.0427857 │ 0.0423974 │ 0.0419641 │ 0.0429001 │ [] │ True │ │ [2081488975618728, 2081488996580864, 2081489017434944] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 171 │ Execute │ update_concat_offset_3_int32_169 │ 0.00261035 │ 0.00265193 │ 0.00265212 │ 0.00263468 │ 0.00259995 │ 0.00265217 │ [] │ True │ │ [2081488975661992, 2081488996623556, 2081489017477272] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 172 │ Execute │ concat_1_texture3d_half_170 │ 0.00758157 │ 0.00774789 │ 0.00803914 │ 0.00779994 │ 0.00753999 │ 0.00811195 │ [] │ True │ │ [2081488975664956, 2081488996626572, 2081489017480288] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 173 │ Execute │ mean2d_half_171 │ 0.0147889 │ 0.0148721 │ 0.0150384 │ 0.0149067 │ 0.0147681 │ 0.01508 │ [] │ True │ │ [2081488975673432, 2081488996634476, 2081489017488400] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 174 │ Execute │ view_half_172 │ 0.00644803 │ 0.00644803 │ 0.00653119 │ 0.00648268 │ 0.00644803 │ 0.00655198 │ [] │ True │ │ [2081488975688876, 2081488996649712, 2081489017503532] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 175 │ Execute │ view_half_173 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ [] │ True │ │ [2081488975695688, 2081488996656524, 2081489017510448] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 176 │ Execute │ linear_naive_texture3d_half_174 │ 0.586726 │ 0.590096 │ 0.595338 │ 0.590876 │ 0.585884 │ 0.596648 │ [] │ True │ │ [2081488975700940, 2081488996661776, 2081489017515700] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 177 │ Execute │ image_to_nchw_texture3d_half_float_175 │ 0.00270395 │ 0.00270414 │ 0.00274572 │ 0.00272139 │ 0.00270391 │ 0.00275612 │ [] │ True │ │ [2081488976297952, 2081488997248024, 2081489018106160] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 178 │ Execute │ DELEGATE_CALL │ 20.8864 │ 20.9461 │ 21.5925 │ 21.1906 │ 20.8715 │ 21.7541 │ [] │ False │ │ [358395625, 380178646, 401147657] │
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤
+│ 179 │ Execute │ Method::execute │ 20.8867 │ 20.9464 │ 21.593 │ 21.191 │ 20.8718 │ 21.7547 │ [] │ False │ │ [358395521, 380178542, 401147552] │
+╘═════╧════════════════════╧════════════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧════════════╧═══════════════════╧═════════════════════════╧════════════════════════════════════════════════════════╛
+```
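+
+The tabular output above comes from the inspector CLI. If you prefer to work
+with the profiling data programmatically, the ExecuTorch devtools `Inspector`
+API can load the same ETDump. Below is a minimal sketch; it assumes the file
+pulled in the previous step is named `ic3_vulkan.etdp`.
+
+```python
+# inspect_etdump.py -- example helper script (name is illustrative)
+from executorch.devtools import Inspector
+
+# Load the ETDump that was pulled from the device.
+inspector = Inspector(etdump_path="ic3_vulkan.etdp")
+
+# Print the same per-event timing table produced by inspector_cli.
+inspector.print_data_tabular()
+```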
diff --git a/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md b/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md
new file mode 100644
index 00000000000..953c93a9c12
--- /dev/null
+++ b/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md
@@ -0,0 +1,13 @@
+# Vulkan Backend Tutorials
+
+**→{doc}`etvk-profiling-tutorial` — Execute and profile a Vulkan-delegated model on device.**
+
+**→{doc}`etvk-llama-tutorial` — Export Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and run it on device.**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: Tutorials
+
+etvk-profiling-tutorial
+etvk-llama-tutorial
+```
diff --git a/docs/source/backends/vulkan/vulkan-op-support-table.csv b/docs/source/backends/vulkan/vulkan-op-support-table.csv
new file mode 100644
index 00000000000..34d2ece924a
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-op-support-table.csv
@@ -0,0 +1,113 @@
+Namespace,Operator,Notes
+aten,_log_softmax,
+aten,_native_batch_norm_legit_no_training,
+aten,_softmax,
+aten,_to_copy,dtype conversion between float types only
+aten,_weight_int8pack_mm,
+aten,abs,
+aten,add,
+aten,addmm,
+aten,amax,keepdim=True required; max 2D reductions
+aten,amin,keepdim=True required; max 2D reductions
+aten,arange,
+aten,avg_pool2d,
+aten,bmm,
+aten,cat,
+aten,clamp,
+aten,clone,
+aten,constant_pad_nd,
+aten,convolution,batch=1 for 2D conv; no transposed 1D conv; no 3D conv
+aten,cos,
+aten,div,
+aten,div.Tensor_mode,
+aten,embedding,
+aten,eq,
+aten,exp,
+aten,expand_copy,no resize support
+aten,flip,
+aten,full,
+aten,full_like,
+aten,ge,
+aten,gelu,
+aten,gt,
+aten,hardshrink,
+aten,hardtanh,
+aten,index_select,
+aten,le,
+aten,leaky_relu,
+aten,linear,
+aten,lt,
+aten,max_pool2d,
+aten,max_pool2d_with_indices,
+aten,mean,keepdim=True required; max 2D reductions
+aten,minimum,
+aten,mm,
+aten,native_group_norm,
+aten,native_layer_norm,resize supported
+aten,neg,
+aten,ones,
+aten,ones_like,
+aten,permute,
+aten,permute_copy,
+aten,pow,
+aten,relu,
+aten,repeat,
+aten,round,
+aten,rsqrt,
+aten,scalar_tensor,
+aten,select_copy,
+aten,sigmoid,
+aten,sin,
+aten,slice_copy,
+aten,split,
+aten,split_with_sizes_copy,
+aten,sqrt,
+aten,squeeze_copy,
+aten,sub,
+aten,sum,keepdim=True required; max 2D reductions
+aten,t_copy,
+aten,tanh,
+aten,unsqueeze_copy,
+aten,upsample_bilinear2d,
+aten,upsample_nearest2d,
+aten,view_copy,
+aten,zeros,
+aten,zeros_like,
+aten,_assert_scalar,removed via graph pass
+aten,sym_constrain_range_for_size,removed via graph pass
+aten,sym_size,
+dim_order_ops,_clone_dim_order,no dtype conversion; removable if no dtype change
+dim_order_ops,_to_dim_order_copy,no dtype conversion; removable if no dtype change
+llama,custom_sdpa,
+llama,sdpa_with_kv_cache,
+llama,update_cache,
+operator,add,
+operator,eq,
+operator,ge,
+operator,getitem,
+operator,gt,
+operator,le,
+operator,lt,
+quantized_decomposed,choose_qparams,
+quantized_decomposed,choose_qparams_per_token_asymmetric,
+quantized_decomposed,dequantize_per_channel,
+quantized_decomposed,dequantize_per_tensor,
+quantized_decomposed,dequantize_per_token,
+quantized_decomposed,quantize_per_channel,
+quantized_decomposed,quantize_per_tensor,
+quantized_decomposed,quantize_per_token,
+torchao,choose_qparams_affine,
+torchao,dequantize_affine,
+torchao,quantize_affine,
+et_vk,add_q8ta_q8ta_q8to,no resize support
+et_vk,apply_rotary_emb,
+et_vk,conv2d_q8ta_q8csw_q8to,no resize support
+et_vk,conv2d_q8ta_q8csw_q8to_dw,no resize support
+et_vk,conv_with_clamp,batch=1 for 2D conv; no transposed 1D conv
+et_vk,dequantize_q8to_from_conv2d,no resize support
+et_vk,grid_priors,
+et_vk,linear_dq8ca_q4gsw,
+et_vk,linear_q4gsw,
+et_vk,linear_q8ta_q8csw,
+et_vk,linear_qcs4w,
+et_vk,quantize_q8ta_for_conv2d,no resize support
diff --git a/docs/source/backends/vulkan/vulkan-op-support.rst b/docs/source/backends/vulkan/vulkan-op-support.rst
new file mode 100644
index 00000000000..547f7f9dc6c
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-op-support.rst
@@ -0,0 +1,46 @@
+================
+Operator Support
+================
+
+This page lists the operators currently supported by the Vulkan backend. The
+source of truth for this information is `op_registry.py `_,
+which is used by the Vulkan Partitioner to determine which operators should be
+lowered to the Vulkan backend and additionally describes the capabilities of
+each operator implementation.
+
+If an operator used in your model is not in this list, feel free to create a
+feature request on Github and we will do our best to add an implementation for
+the operator.
+
+The namespace of an operator describes where it originates from:
+
+* **aten** - operators in this namespace correspond 1:1 to operators in PyTorch's
+ `ATen library `_.
+ They all support fp16 and fp32 dtypes at a minimum.
+* **dim_order_ops** - these operators are inserted when lowering to ExecuTorch in
+ order to manage optimal tensor memory layouts. They are typically removed,
+ since the Vulkan backend manages optimal tensor representations internally.
+* **llama** - custom ops targeted for LLM inference. These are typically inserted
+ by model source transformations applied to a `nn.Module` and are not invoked
+ directly by a PyTorch model.
+* **operator** - these operators work with symbolic integers, which are also
+ supported by the Vulkan backend.
+* **quantized_decomposed** / **torchao** - these ops are introduced by quantization
+ workflows (either torchao's `quantize_` API or the PT2E quantization flow).
+ They typically represent quantizing/dequantizing a tensor, or choosing the
+ quantization parameters for a tensor. In practice, most instances of these
+ operators will be fused into a custom op in the **et_vk** namespace.
+* **et_vk** - these are custom operators implemented only in the Vulkan backend.
+ They typically represent quantized variants of **aten** operators, or fusions
+ of common operator patterns. They are inserted by operator fusion graph passes
+ when lowering to the Vulkan backend.
+
+All operators support dynamic input shapes unless otherwise noted (i.e. "no
+resize support"). The expectation is that over time, all operators will be able
+to support dynamic shapes.
+
+.. csv-table:: Vulkan Backend Operator Support
+ :file: vulkan-op-support-table.csv
+ :header-rows: 1
+ :widths: 25 25 75
+ :align: left
diff --git a/docs/source/backends/vulkan/vulkan-overview.md b/docs/source/backends/vulkan/vulkan-overview.md
new file mode 100644
index 00000000000..ede7d330e4b
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-overview.md
@@ -0,0 +1,163 @@
+# Vulkan Backend
+
+The ExecuTorch Vulkan (ET-VK) backend enables ExecuTorch models to execute on
+GPUs via the cross-platform [Vulkan API](https://www.vulkan.org/). Although
+Vulkan API support is almost ubiquitous among modern GPUs, the ExecuTorch Vulkan
+backend is currently developed with a specific focus on **Android GPUs**.
+
+## Features
+
+- Wide operator support via an in-tree [GLSL compute shader library](https://github.com/pytorch/executorch/tree/main/backends/vulkan/runtime/graph/ops/glsl)
+- Support for models that require dynamic shapes
+- Support for FP32 and FP16 inference modes
+- Support for quantized linear layers with 8-bit/4-bit weights and 8-bit dynamically quantized activations
+- Support for quantized linear layers with 8-bit/4-bit weights and FP32/FP16 activations
+
+Note that the Vulkan backend is under active development, and its GLSL compute
+shader library is being consistently expanded over time. Additional support for
+quantized operators (such as quantized convolution) and additional quantization
+modes is on the way.
+
+## Target Requirements
+
+- Supports Vulkan 1.1
+
+## Development Requirements
+
+To contribute to the Vulkan delegate, the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android)
+must be installed on the development system. After installation, the `glslc` binary must
+be found in your `PATH` in order to compile Vulkan shaders. This can be checked by
+running
+
+```sh
+glslc --version
+```
+
+If this is not the case after completing the Vulkan SDK installation, you may have to
+go into `~/VulkanSDK//` and run
+
+```sh
+source setup-env.sh
+```
+
+or alternatively,
+
+```sh
+python install_vulkan.py
+```
+
+The [Android NDK](https://developer.android.com/ndk/downloads) must also be installed.
+Any NDK version past NDK r17c should suffice.
+
+----
+
+## Using the Vulkan Backend
+
+To lower a model to the Vulkan backend during the export and lowering process,
+pass an instance of `VulkanPartitioner` to `to_edge_transform_and_lower`. The
+example below demonstrates this process using the MobileNet V2 model from
+torchvision.
+
+```python
+import torch
+import torchvision.models as models
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+from executorch.exir import to_edge_transform_and_lower
+
+from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
+
+mobilenet_v2 = models.mobilenetv2.mobilenet_v2(
+ weights=MobileNet_V2_Weights.DEFAULT
+).eval()
+
+sample_inputs = (torch.randn(1, 3, 224, 224),)
+
+exported_program = torch.export.export(mobilenet_v2, sample_inputs)
+
+etvk_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[VulkanPartitioner()],
+).to_executorch()
+
+with open("mv2_vulkan.pte", "wb") as file:
+ etvk_program.write_to_file(file)
+```
+
+See [Partitioner API](vulkan-partitioner.md)
+for a reference on available partitioner options.
+
+----
+
+## Quantization
+
+The Vulkan delegate currently supports execution of quantized linear layers.
+See [Vulkan Quantization](vulkan-quantization.md)
+for more information on available quantization schemes and APIs.
+
+----
+
+## Runtime Integration
+
+To run the model on-device, use the standard ExecuTorch runtime APIs.
+
+For integration in Android applications, the Vulkan backend is included in the
+[executorch-android-vulkan](https://mvnrepository.com/artifact/org.pytorch/executorch-android-vulkan)
+package.
+
+When building from source, pass `-DEXECUTORCH_BUILD_VULKAN=ON` when configuring
+the CMake build to compile the Vulkan backend. See [Running on Device](/getting-started.md#running-on-device)
+for more information.
+
+To link against the backend, add the `executorch_backends` CMake target as a
+build dependency, or link directly against `libvulkan_backend`. Due to the use
+of static initialization to register available compute shaders and operators,
+the library must be linked with `--whole-archive`.
+
+```cmake
+# CMakeLists.txt
+find_package(executorch CONFIG REQUIRED COMPONENTS vulkan_backend executorch_backends)
+
+...
+target_link_libraries(
+ my_target
+ PRIVATE
+ executorch
+ executorch_backends
+ ...
+)
+
+# Ensure that unused code is not discarded. The required linker options may be
+# different depending on the target platform. Typically, the
+# executorch_target_link_options_shared_lib function from
+# executorch/tools/cmake/Utils.cmake can be used to set the required linker
+# options.
+target_link_options(
+ executorch_backends INTERFACE "SHELL:LINKER:--whole-archive \
+ $ \
+ LINKER:--no-whole-archive"
+)
+```
+
+No additional steps are necessary to use the backend beyond linking the target.
+Any Vulkan-delegated .pte file will automatically run on the registered backend.
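+
+As a quick sanity check that the backend is available, a lowered `.pte` file
+can also be run from Python. This is a minimal sketch and assumes the
+ExecuTorch Python bindings were built with the Vulkan backend enabled; if the
+backend is missing, the runtime will report a "Backend VulkanBackend is not
+registered" error (see [Troubleshooting](vulkan-troubleshooting.md)).
+
+```python
+import torch
+from executorch.runtime import Runtime
+
+runtime = Runtime.get()
+program = runtime.load_program("mv2_vulkan.pte")
+method = program.load_method("forward")
+
+# Run with a random input matching the shape used during export.
+outputs = method.execute([torch.randn(1, 3, 224, 224)])
+print(outputs[0].shape)
+```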
+
+## Additional Resources
+
+**→{doc}`/backends/vulkan/vulkan-partitioner`**
+
+**→{doc}`/backends/vulkan/vulkan-quantization`**
+
+**→{doc}`/backends/vulkan/vulkan-troubleshooting`**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: Vulkan Backend
+
+vulkan-partitioner
+vulkan-quantization
+vulkan-op-support
+vulkan-troubleshooting
+
+tutorials/vulkan-tutorials
+```
diff --git a/docs/source/backends/vulkan/vulkan-partitioner.md b/docs/source/backends/vulkan/vulkan-partitioner.md
new file mode 100644
index 00000000000..566ec491b47
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-partitioner.md
@@ -0,0 +1,55 @@
+# Partitioner API
+
+[VulkanPartitioner](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/vulkan_partitioner.py)
+is a Python class that controls what operators in a model can or should be
+delegated to the Vulkan backend. It is the primary entrypoint to the Vulkan
+backend and is also used to configure the behaviour of the Vulkan backend.
+
+## Usage
+
+For most use-cases, constructing `VulkanPartitioner()` with no arguments is
+sufficient. In this case, the partitioner will lower as much of the model to
+the Vulkan backend as possible.
+
+```python
+etvk_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
+
+## Common Config Options
+
+Generally, the Vulkan backend is configured by passing a `compile_options`
+dictionary to `VulkanPartitioner()`, for example:
+
+```python
+compile_options = {
+ "require_dynamic_shapes": True,
+ "force_fp16": True,
+}
+
+etvk_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[VulkanPartitioner(compile_options)],
+).to_executorch()
+```
+
+### `require_dynamic_shapes`
+
+If a model is expected to use dynamic shapes, then it is recommended to set the
+`"require_dynamic_shapes"` key in `compile_options`.
+
+Not all operators in Vulkan support dynamic shapes at the moment, although the
+majority do. This flag will prevent operators that don't support dynamic shapes
+from being lowered to Vulkan.
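+
+Below is a minimal sketch showing how this option might be combined with a
+dynamic batch dimension declared at export time. The model is a hypothetical
+single-layer example, and the `dynamic_shapes` key assumes the forward
+argument is named `x`.
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+from executorch.exir import to_edge_transform_and_lower
+
+
+class TinyLinearModule(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear = torch.nn.Linear(128, 64)
+
+    def forward(self, x):
+        return torch.relu(self.linear(x))
+
+
+model = TinyLinearModule().eval()
+sample_inputs = (torch.randn(4, 128),)
+
+# Declare the batch dimension of the input as dynamic.
+batch = torch.export.Dim("batch", min=1, max=32)
+exported_program = torch.export.export(
+    model, sample_inputs, dynamic_shapes={"x": {0: batch}}
+)
+
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner({"require_dynamic_shapes": True})],
+).to_executorch()
+```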
+
+### `force_fp16`
+
+This option causes the Vulkan backend to internally convert all FP32 tensors to
+FP16. This can improve inference latency and memory footprint at the cost of
+model accuracy.
+
+FP32 input tensors will be automatically converted to FP16 upon entering the
+Vulkan backend, and FP16 outputs will automatically be converted to FP32 as
+they are returned.
diff --git a/docs/source/backends/vulkan/vulkan-quantization.md b/docs/source/backends/vulkan/vulkan-quantization.md
new file mode 100644
index 00000000000..89c9f7514b0
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-quantization.md
@@ -0,0 +1,163 @@
+# Quantization
+
+The Vulkan backend currently supports execution of quantized linear layers,
+where weights are symmetrically quantized to 8-bit or 4-bit with per output
+channel or per group quantization scales.
+
+Support for additional quantized operators and quantization schemes (e.g. static
++ dynamic quantized convolution, support for statically quantized linear) is
+under active development and will be added soon.
+
+### 4-bit quantization with torchao `quantize_`
+
+The `quantize_` API from [torchao](https://github.com/pytorch/ao) allows for
+more advanced quantization schemes, and is the quantization workflow needed to
+access 4-bit quantization. 4-bit quantization is commonly used for LLMs.
+
+Two options are available to execute linear layers with 4-bit quantization:
+
+1. Dynamically quantized activations via `Int8DynamicActivationIntxWeightConfig`
+2. Weight only quantization via `IntxWeightOnlyConfig`
+
+Dynamically quantized activations can provide a significant latency improvement
+over weight only quantization, since they allow GPUs to leverage
+accelerated integer dot product instructions when computing matrix
+multiplication.
+
+Below is a simple example of quantizing a sequence of linear layers using
+the `quantize_` API.
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+
+from executorch.exir import to_edge_transform_and_lower
+from torchao.quantization.granularity import PerGroup
+from torchao.quantization.quant_api import (
+ Int8DynamicActivationIntxWeightConfig,
+ IntxWeightOnlyConfig,
+ quantize_,
+)
+from torchao.utils import unwrap_tensor_subclass
+
+
+class LinearSequenceModule(torch.nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.linear1 = torch.nn.Linear(128, 64, bias=False)
+ self.linear2 = torch.nn.Linear(64, 32, bias=False)
+ self.linear3 = torch.nn.Linear(32, 16, bias=False)
+
+ def forward(self, x):
+ x = self.linear1(x)
+ x = self.linear2(x)
+ x = self.linear3(x)
+ return x
+
+
+linear_sequence_module = LinearSequenceModule()
+
+M = 32
+sample_inputs = (torch.randn(M, 128),)
+
+group_size = 32
+
+q_config_8da4w = Int8DynamicActivationIntxWeightConfig(
+ weight_dtype=torch.int4, weight_granularity=PerGroup(group_size)
+)
+
+# Alternative: 4-bit weight-only quantization config (not applied below)
+q_config_4w = IntxWeightOnlyConfig(
+ weight_dtype=torch.int4, granularity=PerGroup(group_size)
+)
+
+quantize_(linear_sequence_module, q_config_8da4w)
+unwrap_tensor_subclass(linear_sequence_module)
+
+# Regular export path from here
+exported_program = torch.export.export(linear_sequence_module, sample_inputs)
+
+etvk_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
+
+### 8-bit quantization with PT2E quantization
+
+For 8-bit quantized linear layers, currently the only quantization scheme
+supported is weight only quantization, with weights that are symmetrically
+quantized to 8 bits with per output channel quantization scales.
+
+To access this quantization mode, the PT2E quantization flow must be used. At a
+high level, the steps to quantize a model are:
+
+1) Create an instance of the `VulkanQuantizer` class and specify desired quantization behaviour
+2) Use `torch.export.export` to prepare for quantization.
+3) Call `prepare_pt2e` to prepare the exported graph for quantization.
+4) Execute the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+5) Call `convert_pt2e` to quantize the model.
+6) Export and lower the model using the standard flow.
+
+For example:
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+
+from executorch.backends.vulkan.quantizer.vulkan_quantizer import (
+ get_symmetric_quantization_config,
+ VulkanQuantizer,
+)
+
+from executorch.exir import to_edge_transform_and_lower
+
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+
+class LinearSequenceModule(torch.nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.linear1 = torch.nn.Linear(128, 64, bias=False)
+ self.linear2 = torch.nn.Linear(64, 32, bias=False)
+ self.linear3 = torch.nn.Linear(32, 16, bias=False)
+
+ def forward(self, x):
+ x = self.linear1(x)
+ x = self.linear2(x)
+ x = self.linear3(x)
+ return x
+
+
+linear_sequence_module = LinearSequenceModule()
+
+M = 32
+# Create sample inputs
+sample_inputs = (torch.randn(M, 128),)
+
+# Setup quantizer
+quantizer = VulkanQuantizer()
+quantizer.set_global(get_symmetric_quantization_config(is_dynamic=False, weight_bits=8))
+
+# Export the model
+exported_program = torch.export.export(linear_sequence_module, sample_inputs)
+graph_module = exported_program.module()
+
+# Quantize the exported program with PT2E quantization flow
+quantized_module = prepare_pt2e(graph_module, quantizer)
+# Calibrate. In practice, this would be done by iterating over a real dataset
+quantized_module(*sample_inputs)
+quantized_module = convert_pt2e(quantized_module)
+
+# Export once more
+exported_program = torch.export.export(quantized_module, sample_inputs)
+
+# Lower to vulkan
+etvk_program = to_edge_transform_and_lower(
+ exported_program,
+ partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
diff --git a/docs/source/backends/vulkan/vulkan-troubleshooting.md b/docs/source/backends/vulkan/vulkan-troubleshooting.md
new file mode 100644
index 00000000000..9845f588004
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-troubleshooting.md
@@ -0,0 +1,57 @@
+# Troubleshooting
+
+This page describes common issues that you may encounter when using the Vulkan
+backend and how to debug and resolve them.
+
+## Vulkan Backend Not Found
+
+If you try to execute a .pte file that has been lowered to the Vulkan backend
+and you see an error like:
+
+```shell
+E 00:00:00.366934 executorch:method.cpp:74] Backend VulkanBackend is not registered.
+```
+
+This error indicates the Vulkan backend is not registered with the runtime. This
+can happen because the backend was not compiled or linked, or because the
+registration code was optimized out.
+
+First, make sure that when building ExecuTorch, cmake is configured with
+`-DEXECUTORCH_BUILD_VULKAN=ON`.
+
+Next, make sure that your application is linking the `vulkan_backend` target,
+or the `executorch_backends` target.
+
+Finally, ensure that `vulkan_backend` or `executorch_backends` is being linked
+with the equivalent of `--whole-archive`.
+
+## Slow Performance
+
+Performance issues can be caused by a variety of factors:
+
+* A key compute shader (most often convolution or linear) is not performing well
+ on your target GPU
+* Unsupported operators are causing too many graph breaks
+* An existing operator is lacking support for some memory layout or storage type
+ resulting in a high number of copies being inserted to ensure tensors are in
+ a required representation for the next operator
+
+If you experience poor on-device performance for a particular model, please
+obtain some profiling data while running your model. The
+[profiling tutorial](./tutorials/etvk-profiling-tutorial.md) can
+be a good reference for how to do this.
+
+Then, please file an issue on Github with the following details:
+
+* The device(s) you have tested with, and which devices exhibit poor performance
+ running the model
+* The profiling data collected from executing the model
+* The release version of ExecuTorch you are using, or the commit hash you built
+ from if you built from source
+* If available, an export script that can be used to export your model to aid
+ in reproducing the issue
+* If available, the `.pte` file you are testing with to aid in reproducing the
+ issue.
+
+We will do our best to patch performance problems in the Vulkan backend and
+help you resolve your issue.
diff --git a/docs/source/backends/xnnpack/op-support.csv b/docs/source/backends/xnnpack/op-support.csv
new file mode 100644
index 00000000000..5350fed8d12
--- /dev/null
+++ b/docs/source/backends/xnnpack/op-support.csv
@@ -0,0 +1,47 @@
+Operator,Compute DType,Quantization,Constraints
+_to_dim_order_copy,"fp16, fp32",,no dtype conversion
+abs,"fp16, fp32",,
+add,"fp16, fp32",PT2E: static int8,alpha=1
+avg_pool2d,"fp16, fp32",PT2E: static int8,"ceil_mode=False, count_include_pad=False, divisor_override=pooling_region"
+bmm,"fp16, fp32",,
+cat,"fp16, fp32",PT2E: static int8,
+ceil,"fp16, fp32",,
+clamp,"fp16, fp32",,
+constant_pad_nd,"fp16, fp32",,no negative padding values
+conv1d,"fp16, fp32","PT2E: static or dynamic int8 activations
+8-bit weights, symmetric per-tensor or per-channel",constant weights
+conv2d,"fp16, fp32","PT2E: static or dynamic int8 activations
+8-bit weights, symmetric per-tensor or per-channel",constant weights
+dequantize_per_tensor,"fp16, fp32",,
+div,"fp16, fp32",,
+elu,"fp16, fp32",,
+exp,"fp16, fp32",,
+floor,"fp16, fp32",,
+gelu,"fp16, fp32",,
+hardswish,"fp16, fp32",,
+hardtanh,"fp16, fp32",,
+leaky_relu,"fp16, fp32",,
+linear,"fp16, fp32","PT2E: static or dynamic int8 activations
+8-bit weights, symmetric per-tensor or per-channel
+
+quantize\_: 8-bit dynamic activations
+4-bit groupwise weights",constant weights
+log,"fp16, fp32",,
+max_pool2d,"fp16, fp32",,"stride ≤ kernel_size, ceil_mode only for static shapes"
+maximum,"fp16, fp32",,
+mean,"fp16, fp32",,"4D tensors only; dims=[-2,-1] or [-1,-2]"
+minimum,"fp16, fp32",,
+mul,"fp16, fp32",PT2E: static int8,
+neg,"fp16, fp32",,
+permute_copy,"fp16, fp32",,
+pow,"fp16, fp32",,power=2 only
+quantize_per_tensor,"fp16, fp32",,
+relu,"fp16, fp32",,
+rsqrt,"fp16, fp32",,
+sigmoid,"fp16, fp32",,
+slice_copy,"fp16, fp32",,"no zero-dim tensors, no dynamic shapes"
+softmax,"fp16, fp32",,dim must be last dimension
+sqrt,"fp16, fp32",,
+sub,"fp16, fp32",,alpha=1
+tanh,"fp16, fp32",,
+upsample_bilinear2d,"fp16, fp32",,no dynamic output sizes
diff --git a/docs/source/backend-delegates-xnnpack-reference.md b/docs/source/backends/xnnpack/xnnpack-arch-internals.md
similarity index 90%
rename from docs/source/backend-delegates-xnnpack-reference.md
rename to docs/source/backends/xnnpack/xnnpack-arch-internals.md
index 8b4338e703c..52bcd3704cb 100644
--- a/docs/source/backend-delegates-xnnpack-reference.md
+++ b/docs/source/backends/xnnpack/xnnpack-arch-internals.md
@@ -1,4 +1,4 @@
-# XNNPACK Delegate Internals
+# Architecture and Internals
This is a high-level overview of the ExecuTorch XNNPACK backend delegate. This high performance delegate is aimed to reduce CPU inference latency for ExecuTorch models. We will provide a brief introduction to the XNNPACK library and explore the delegate’s overall architecture and intended use cases.
@@ -6,18 +6,18 @@ This is a high-level overview of the ExecuTorch XNNPACK backend delegate. This h
XNNPACK is a library of highly-optimized neural network operators for ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, and macOS environments. It is an open source project, you can find more information about it on [github](https://github.com/google/XNNPACK).
## What are ExecuTorch delegates?
-A delegate is an entry point for backends to process and execute parts of the ExecuTorch program. Delegated portions of ExecuTorch models hand off execution to backends. The XNNPACK backend delegate is one of many available in ExecuTorch. It leverages the XNNPACK third-party library to accelerate ExecuTorch programs efficiently across a variety of CPUs. More detailed information on the delegates and developing your own delegates is available [here](compiler-delegate-and-partitioner.md). It is recommended that you get familiar with that content before continuing on to the Architecture section.
+A delegate is an entry point for backends to process and execute parts of the ExecuTorch program. Delegated portions of ExecuTorch models hand off execution to backends. The XNNPACK backend delegate is one of many available in ExecuTorch. It leverages the XNNPACK third-party library to accelerate ExecuTorch programs efficiently across a variety of CPUs. More detailed information on the delegates and developing your own delegates is available [here](/compiler-delegate-and-partitioner.md). It is recommended that you get familiar with that content before continuing on to the Architecture section.
## Architecture
-
+
### Ahead-of-time
In the ExecuTorch export flow, lowering to the XNNPACK delegate happens at the `to_backend()` stage. In this stage, the model is partitioned by the `XnnpackPartitioner`. Partitioned sections of the graph are converted to a XNNPACK specific graph represenationed and then serialized via flatbuffer. The serialized flatbuffer is then ready to be deserialized and executed by the XNNPACK backend at runtime.
-
+
#### Partitioner
-The partitioner is implemented by backend delegates to mark nodes suitable for lowering. The `XnnpackPartitioner` lowers using node targets and module metadata. Some more references for partitioners can be found [here](compiler-delegate-and-partitioner.md)
+The partitioner is implemented by backend delegates to mark nodes suitable for lowering. The `XnnpackPartitioner` lowers using node targets and module metadata. Some more references for partitioners can be found [here](/compiler-delegate-and-partitioner.md)
##### Module-based partitioning
@@ -54,7 +54,7 @@ After partitioning the lowerable subgraphs from the model, The XNNPACK delegate
The XNNPACK delegate uses flatbuffer for serialization. In order to improve runtime performance, the XNNPACK delegate’s flatbuffer [schema](https://github.com/pytorch/executorch/blob/main/backends/xnnpack/serialization/schema.fbs) mirrors the XNNPACK Library’s graph level API calls. The serialized data are arguments to XNNPACK’s APIs, so that at runtime, the XNNPACK execution graph can efficiently be created with successive calls to XNNPACK’s APIs.
### Runtime
-The XNNPACK backend’s runtime interfaces with the ExecuTorch runtime through the custom `init` and `execute` function. Each delegated subgraph is contained in an individually serialized XNNPACK blob. When the model is initialized, ExecuTorch calls `init` on all XNNPACK Blobs to load the subgraph from serialized flatbuffer. After, when the model is executed, each subgraph is executed via the backend through the custom `execute` function. To read more about how delegate runtimes interface with ExecuTorch, refer to this [resource](compiler-delegate-and-partitioner.md).
+The XNNPACK backend’s runtime interfaces with the ExecuTorch runtime through the custom `init` and `execute` function. Each delegated subgraph is contained in an individually serialized XNNPACK blob. When the model is initialized, ExecuTorch calls `init` on all XNNPACK Blobs to load the subgraph from serialized flatbuffer. After, when the model is executed, each subgraph is executed via the backend through the custom `execute` function. To read more about how delegate runtimes interface with ExecuTorch, refer to this [resource](/compiler-delegate-and-partitioner.md).
#### **XNNPACK Library**
@@ -70,7 +70,7 @@ Since weight packing creates an extra copy of the weights inside XNNPACK, We fre
When executing the XNNPACK subgraphs, we prepare the tensor inputs and outputs and feed them to the XNNPACK runtime graph. After executing the runtime graph, the output pointers are filled with the computed tensors.
#### **Profiling**
-We have enabled basic profiling for the XNNPACK delegate that can be enabled with the compiler flag `-DEXECUTORCH_ENABLE_EVENT_TRACER` (add `-DENABLE_XNNPACK_PROFILING` for additional details). With ExecuTorch's Developer Tools integration, you can also now use the Developer Tools to profile the model. You can follow the steps in [Using the ExecuTorch Developer Tools to Profile a Model](tutorials/devtools-integration-tutorial) on how to profile ExecuTorch models and use Developer Tools' Inspector API to view XNNPACK's internal profiling information. An example implementation is available in the `executor_runner` (see [tutorial here](tutorial-xnnpack-delegate-lowering.md#profiling)).
+We have enabled basic profiling for the XNNPACK delegate that can be enabled with the compiler flag `-DEXECUTORCH_ENABLE_EVENT_TRACER` (add `-DENABLE_XNNPACK_PROFILING` for additional details). With ExecuTorch's Developer Tools integration, you can also now use the Developer Tools to profile the model. You can follow the steps in [Using the ExecuTorch Developer Tools to Profile a Model](/tutorials/devtools-integration-tutorial) on how to profile ExecuTorch models and use Developer Tools' Inspector API to view XNNPACK's internal profiling information. An example implementation is available in the `executor_runner` (see [tutorial here](/tutorial-xnnpack-delegate-lowering.md#profiling)).
[comment]: <> (TODO: Refactor quantizer to a more official quantization doc)
@@ -142,5 +142,5 @@ def _qdq_quantized_linear(
You can read more indepth explanations on PyTorch 2 quantization [here](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html).
## See Also
-- [Integrating XNNPACK Delegate in Android AAR](using-executorch-android.md)
-- [Complete the Lowering to XNNPACK Tutorial](tutorial-xnnpack-delegate-lowering.md)
+- [Integrating XNNPACK Delegate in Android AAR](/using-executorch-android.md)
+- [Complete the Lowering to XNNPACK Tutorial](/tutorial-xnnpack-delegate-lowering.md)
diff --git a/docs/source/xnnpack-delegate-architecture.png b/docs/source/backends/xnnpack/xnnpack-delegate-architecture.png
similarity index 100%
rename from docs/source/xnnpack-delegate-architecture.png
rename to docs/source/backends/xnnpack/xnnpack-delegate-architecture.png
diff --git a/docs/source/xnnpack-et-flow-diagram.png b/docs/source/backends/xnnpack/xnnpack-et-flow-diagram.png
similarity index 100%
rename from docs/source/xnnpack-et-flow-diagram.png
rename to docs/source/backends/xnnpack/xnnpack-et-flow-diagram.png
diff --git a/docs/source/backends/xnnpack/xnnpack-overview.md b/docs/source/backends/xnnpack/xnnpack-overview.md
new file mode 100644
index 00000000000..5ef92c81126
--- /dev/null
+++ b/docs/source/backends/xnnpack/xnnpack-overview.md
@@ -0,0 +1,100 @@
+# XNNPACK Backend
+
+The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile devices. [XNNPACK](https://github.com/google/XNNPACK/tree/master) is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs.
+
+## Features
+
+- Wide operator support on Arm and x86 CPUs, available on any modern mobile phone.
+- Support for a wide variety of quantization schemes and quantized operators.
+- Supports fp32 and fp16 activations.
+- Supports 8-bit quantization.
+
+## Target Requirements
+
+- ARM64 on Android, iOS, macOS, Linux, and Windows.
+- ARMv7 (with NEON) on Android.
+- ARMv6 (with VFPv2) on Linux.
+- x86 and x86-64 (up to AVX512) on Windows, Linux, Android.
+
+## Development Requirements
+
+The XNNPACK delegate does not introduce any development system requirements beyond those required by
+the core ExecuTorch runtime.
+
+----
+
+## Using the XNNPACK Backend
+
+To target the XNNPACK backend during the export and lowering process, pass an instance of the `XnnpackPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision.
+
+```python
+import torch
+import torchvision.models as models
+from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.exir import to_edge_transform_and_lower
+
+mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
+sample_inputs = (torch.randn(1, 3, 224, 224), )
+
+et_program = to_edge_transform_and_lower(
+ torch.export.export(mobilenet_v2, sample_inputs),
+ partitioner=[XnnpackPartitioner()],
+).to_executorch()
+
+with open("mv2_xnnpack.pte", "wb") as file:
+ et_program.write_to_file(file)
+```
+
+See [Partitioner API](/backends/xnnpack/xnnpack-partitioner) for a reference on available partitioner options.
+
+----
+
+## Quantization
+
+The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. See [XNNPACK Quantization](/backends/xnnpack/xnnpack-quantization) for more information on available quantization schemes and APIs.
+
+----
+
+## Runtime Integration
+
+To run the model on-device, use the standard ExecuTorch runtime APIs.
+
+The XNNPACK delegate is included by default in the published Android, iOS, and pip packages. When building from source, pass `-DEXECUTORCH_BUILD_XNNPACK=ON` when configuring the CMake build to compile the XNNPACK backend. See [Running on Device](/getting-started.md#running-on-device) for more information.
+
+To link against the backend, add the `executorch_backends` CMake target as a build dependency, or link directly against `libxnnpack_backend`. Due to the use of static registration, it may be necessary to link with whole-archive. This can typically be done by passing `"$"` to `target_link_libraries`.
+
+```
+# CMakeLists.txt
+add_subdirectory("executorch")
+...
+target_link_libraries(
+ my_target
+ PRIVATE executorch
+ executorch_backends
+ ...
+)
+```
+
+No additional steps are necessary to use the backend beyond linking the target. Any XNNPACK-delegated .pte file will automatically run on the registered backend.
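+
+Since the XNNPACK backend is included in the prebuilt pip package, a quick way
+to verify the lowered model is to run it with the Python runtime. Below is a
+minimal sketch, reusing the `mv2_xnnpack.pte` file produced above.
+
+```python
+import torch
+from executorch.runtime import Runtime
+
+runtime = Runtime.get()
+program = runtime.load_program("mv2_xnnpack.pte")
+method = program.load_method("forward")
+
+# Run with a random image-shaped input; MobileNet V2 produces 1000 logits.
+outputs = method.execute([torch.randn(1, 3, 224, 224)])
+print(outputs[0].shape)  # torch.Size([1, 1000])
+```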
+
+## Reference
+
+**→{doc}`/backends/xnnpack/xnnpack-troubleshooting` — Debug common issues.**
+
+**→{doc}`/backends/xnnpack/xnnpack-partitioner` — Partitioner options and supported operators.**
+
+**→{doc}`/backends/xnnpack/xnnpack-quantization` — Supported quantization schemes.**
+
+**→{doc}`/backends/xnnpack/xnnpack-arch-internals` — XNNPACK backend internals.**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: XNNPACK Backend
+
+xnnpack-partitioner
+xnnpack-quantization
+xnnpack-troubleshooting
+xnnpack-arch-internals
+```
diff --git a/docs/source/backends/xnnpack/xnnpack-partitioner.rst b/docs/source/backends/xnnpack/xnnpack-partitioner.rst
new file mode 100644
index 00000000000..a0881aa3a6a
--- /dev/null
+++ b/docs/source/backends/xnnpack/xnnpack-partitioner.rst
@@ -0,0 +1,24 @@
+===============
+Partitioner API
+===============
+
+The XNNPACK partitioner API allows for configuration of the model delegation to XNNPACK. Passing an ``XnnpackPartitioner`` instance with no additional parameters will run as much of the model as possible on the XNNPACK backend. This is the most common use-case. For advanced use cases, the partitioner exposes the following options via the `constructor `_ (a usage sketch follows the list):
+
+- ``configs``: Control which operators are delegated to XNNPACK. By default, all available operators are delegated. See `../config/__init__.py `_ for an exhaustive list of available operator configs.
+- ``config_precisions``: Filter operators by data type. By default, delegate all precisions. One or more of ``ConfigPrecisionType.FP32``, ``ConfigPrecisionType.STATIC_QUANT``, or ``ConfigPrecisionType.DYNAMIC_QUANT``. See `ConfigPrecisionType `_.
+- ``per_op_mode``: If true, emit individual delegate calls for every operator. This is an advanced option intended to reduce memory overhead in some contexts at the cost of a small amount of runtime overhead. Defaults to false.
+- ``verbose``: If true, print additional information during lowering.
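+
+A usage sketch is shown below. The parameter names come from the list above;
+the import path for ``ConfigPrecisionType`` is an assumption and may differ
+between releases.
+
+.. code-block:: python
+
+   from executorch.backends.xnnpack.partition.config.xnnpack_config import (
+       ConfigPrecisionType,
+   )
+   from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
+       XnnpackPartitioner,
+   )
+
+   # Delegate only fp32 and dynamically quantized operators, emitting one
+   # delegate call per operator and printing extra information while lowering.
+   partitioner = XnnpackPartitioner(
+       config_precisions=[
+           ConfigPrecisionType.FP32,
+           ConfigPrecisionType.DYNAMIC_QUANT,
+       ],
+       per_op_mode=True,
+       verbose=True,
+   )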
+
+================
+Operator Support
+================
+
+This section lists the operators supported by the XNNPACK backend. Operators are the building blocks of the ML model. See `IRs `_ for more information on the PyTorch operator set.
+
+All operators support dynamic input shapes unless otherwise noted.
+
+.. csv-table:: Operator Support
+ :file: op-support.csv
+ :header-rows: 1
+ :widths: 20 15 30 30
+ :align: center
diff --git a/docs/source/backends/xnnpack/xnnpack-quantization.md b/docs/source/backends/xnnpack/xnnpack-quantization.md
new file mode 100644
index 00000000000..e3a02d4bffc
--- /dev/null
+++ b/docs/source/backends/xnnpack/xnnpack-quantization.md
@@ -0,0 +1,94 @@
+# Quantization
+
+The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK Library.
+
+### Supported Quantization Schemes
+The XNNPACK delegate supports the following quantization schemes:
+
+- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
+ - Supports both static and dynamic activations.
+ - Supports per-channel and per-tensor schemes.
+ - Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators.
+
+Weight-only quantization is not currently supported on XNNPACK.
+
+### 8-bit Quantization using the PT2E Flow
+
+To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:
+
+1) Create an instance of the `XNNPACKQuantizer` class. Set quantization parameters.
+2) Use `torch.export.export` to prepare for quantization.
+3) Call `prepare_pt2e` to prepare the model for quantization.
+4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+5) Call `convert_pt2e` to quantize the model.
+6) Export and lower the model using the standard flow.
+
+The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.
+
+```python
+import torch
+import torchvision.models as models
+from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
+from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.exir import to_edge_transform_and_lower
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
+sample_inputs = (torch.randn(1, 3, 224, 224), )
+
+qparams = get_symmetric_quantization_config(is_per_channel=True) # (1)
+quantizer = XNNPACKQuantizer()
+quantizer.set_global(qparams)
+
+training_ep = torch.export.export(model, sample_inputs).module() # (2)
+prepared_model = prepare_pt2e(training_ep, quantizer) # (3)
+
+for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
+ prepared_model(cal_sample) # (4) Calibrate
+
+quantized_model = convert_pt2e(prepared_model) # (5)
+
+et_program = to_edge_transform_and_lower( # (6)
+ torch.export.export(quantized_model, sample_inputs),
+ partitioner=[XnnpackPartitioner()],
+).to_executorch()
+```
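+
+Because the output of `convert_pt2e` is still a callable PyTorch model, a quick sanity check against the original eager model can be done before lowering. This is a minimal sketch reusing `model`, `quantized_model`, and `sample_inputs` from the example above:
+
+```python
+with torch.no_grad():
+    eager_out = model(*sample_inputs)
+    quant_out = quantized_model(*sample_inputs)
+
+# A rough proxy for quantization error; use a proper evaluation set for real accuracy checks.
+print(torch.nn.functional.mse_loss(quant_out, eager_out))
+```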
+
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.
+
+### LLM quantization with quantize_
+
+The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) `quantize_` API. This is most commonly used for LLMs, which require more advanced quantization. Since `quantize_` is not backend-aware, it is important to use a config that is compatible with CPU/XNNPACK:
+
+* Quantize embeddings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity)
+* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity)
+
+Below is a simple example, but a more detailed tutorial including accuracy evaluation on popular LLM benchmarks can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch).
+
+```python
+import torch
+
+from torchao.quantization.granularity import PerGroup, PerAxis
+from torchao.quantization.quant_api import (
+    IntxWeightOnlyConfig,
+    Int8DynamicActivationIntxWeightConfig,
+    quantize_,
+)
+
+# eager_model is the eager-mode PyTorch LLM to be quantized.
+
+# Quantize embeddings with 8-bit weights, per channel
+embedding_config = IntxWeightOnlyConfig(
+    weight_dtype=torch.int8,
+    granularity=PerAxis(0),
+)
+quantize_(
+    eager_model,
+    embedding_config,
+    lambda m, fqn: isinstance(m, torch.nn.Embedding),
+)
+
+# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
+linear_config = Int8DynamicActivationIntxWeightConfig(
+    weight_dtype=torch.int4,
+    weight_granularity=PerGroup(32),
+)
+quantize_(eager_model, linear_config)
+```
diff --git a/docs/source/backends/xnnpack/xnnpack-troubleshooting.md b/docs/source/backends/xnnpack/xnnpack-troubleshooting.md
new file mode 100644
index 00000000000..508acc06351
--- /dev/null
+++ b/docs/source/backends/xnnpack/xnnpack-troubleshooting.md
@@ -0,0 +1,25 @@
+# Troubleshooting
+
+This page describes common issues that you may encounter when using the XNNPACK backend and how to debug and resolve them.
+
+## XNNPACK Backend Not Found
+
+This error indicates the XNNPACK backend is not registered with the runtime. This can happen because the backend was not compiled or linked, or because the registration code was optimized out.
+
+The XNNPACK backend is built by default for Python, Android, iOS, and in most CMake presets.
+
+* Set the `EXECUTORCH_BUILD_XNNPACK=ON` CMake option when building from source.
+ * Either by passing the option during CMake configuration or setting it inside the user CMake logic before including ExecuTorch.
+ * See [Building from Source](/using-executorch-building-from-source).
+* On iOS, link the `backend_xnnpack` [framework](/using-executorch-ios).
+* If the backend is still not found, link with `WHOLE_ARCHIVE`.
+  * Pass `"$<LINK_LIBRARY:WHOLE_ARCHIVE,xnnpack_backend>"` to `target_link_libraries` in CMake.
+
+## Slow Performance
+
+ * Try reducing the thread count using [_unsafe_reset_threadpool](/using-executorch-faqs.md#inference-is-slow-performance-troubleshooting).
+ * Small models may benefit from using fewer threads than default.
+ * Try values between 1 and 4 threads and measure performance on your model.
+ * Use [op-level profiling](/tutorials/devtools-integration-tutorial) to understand which operators are taking the most time (see the sketch after this list).
+ * The XNNPACK backend provides operator-level timing for delegated operators.
+ * See general performance troubleshooting tips in [Performance Troubleshooting](/using-executorch-faqs.md#inference-is-slow-performance-troubleshooting).
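+
+As a rough sketch of the profiling workflow, assuming you have already generated an ETDump and ETRecord as described in the devtools tutorial (the file names below are placeholders):
+
+```python
+from executorch.devtools import Inspector
+
+# Load profiling data collected from a run of the delegated model.
+inspector = Inspector(etdump_path="model.etdp", etrecord="etrecord.bin")
+
+# Print per-operator timing, including XNNPACK-delegated operators.
+inspector.print_data_tabular()
+```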
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 31abdef2820..78268c8d053 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -264,7 +264,8 @@
"export-overview": "using-executorch-export.html",
"runtime-build-and-cross-compilation": "using-executorch-building-from-source.html",
"tutorials/export-to-executorch-tutorial": "../using-executorch-export.html",
- "build-run-vulkan": "backends-vulkan.html",
+ "build-run-vulkan": "backends/vulkan/vulkan-overview.html",
+ "backends-vulkan": "backends/vulkan/vulkan-overview.html",
"executorch-arm-delegate-tutorial": "backends-arm-ethos-u.html",
"build-run-coreml": "backends-coreml.html",
"build-run-mediatek-backend": "backends-mediatek.html",
diff --git a/docs/source/desktop-xnnpack.md b/docs/source/desktop-xnnpack.md
index 315dd747006..4a85dec946b 100644
--- a/docs/source/desktop-xnnpack.md
+++ b/docs/source/desktop-xnnpack.md
@@ -1 +1 @@
-```{include} backends-xnnpack.md
+```{include} backends/xnnpack/xnnpack-overview.md
diff --git a/docs/source/edge-platforms-section.md b/docs/source/edge-platforms-section.md
index 2b9ee2131de..209986507fa 100644
--- a/docs/source/edge-platforms-section.md
+++ b/docs/source/edge-platforms-section.md
@@ -59,12 +59,13 @@ Key features:
## Next Steps
After choosing your platform:
+
- **{doc}`backends-section`** - Deep dive into backend selection and optimization
- **{doc}`llm/working-with-llms`** - Working with Large Language Models on edge devices
```{toctree}
:hidden:
-:maxdepth: 2
+:maxdepth: 3
:caption: Edge Platforms
android-section
diff --git a/docs/source/embedded-section.md b/docs/source/embedded-section.md
index 5636a7546dc..aac64190030 100644
--- a/docs/source/embedded-section.md
+++ b/docs/source/embedded-section.md
@@ -26,6 +26,8 @@ Start here for C++ development with ExecuTorch runtime APIs and essential tutori
- {doc}`tutorial-arm-ethos-u` — Export a simple PyTorch model for the ExecuTorch Ethos-U backend
- {doc}`raspberry_pi_llama_tutorial` — Deploy a LLaMA model on a Raspberry Pi
+- {doc}`pico2_tutorial` — Deploy a demo MNIST model on the Raspberry Pi Pico 2
+
```{toctree}
:hidden:
@@ -38,3 +40,4 @@ using-executorch-building-from-source
embedded-backends
tutorial-arm-ethos-u
raspberry_pi_llama_tutorial
+pico2_tutorial
diff --git a/docs/source/examples-end-to-end-to-lower-model-to-delegate.md b/docs/source/examples-end-to-end-to-lower-model-to-delegate.md
index 4ef6bcd0d6e..fd14d718531 100644
--- a/docs/source/examples-end-to-end-to-lower-model-to-delegate.md
+++ b/docs/source/examples-end-to-end-to-lower-model-to-delegate.md
@@ -19,7 +19,7 @@ There are three flows for delegating a program to a backend:
is good for reusing lowered modules exported from other flows.
1. Lower parts of a module according to a partitioner. This is good for
lowering models that include both lowerable and non-lowerable nodes, and is
- the most streamlined procecss.
+ the most streamlined process.
### Flow 1: Lowering the whole module
diff --git a/docs/source/getting-started.md b/docs/source/getting-started.md
index 80672ac9d14..845db806e02 100644
--- a/docs/source/getting-started.md
+++ b/docs/source/getting-started.md
@@ -1,5 +1,5 @@
# Getting Started with ExecuTorch
-This section is intended to describe the necessary steps to take PyTorch model and run it using ExecuTorch. To use the framework, you will typically need to take the following steps:
+This section is intended to describe the necessary steps to take a PyTorch model and run it using ExecuTorch. To use the framework, you will typically need to take the following steps:
- Install the ExecuTorch python package and runtime libraries.
- Export the PyTorch model for the target hardware configuration.
- Run the model using the ExecuTorch runtime APIs on your development platform.
@@ -10,9 +10,9 @@ The following are required to install the ExecuTorch host libraries, needed to e
- Python 3.10 - 3.12
- g++ version 7 or higher, clang++ version 5 or higher, or another C++17-compatible toolchain.
-- Linux (x86_64 or ARM64) or macOS (ARM64).
+- Linux (x86_64 or ARM64), macOS (ARM64), or Windows (x86_64).
- Intel-based macOS systems require building PyTorch from source (see [Building From Source](using-executorch-building-from-source.md) for instructions).
- - Windows is supported via WSL.
+- On Windows, Visual Studio 2022 or later.
## Installation
To use ExecuTorch, you will need to install both the Python package and the appropriate platform-specific runtime libraries. Pip is the recommended way to install the ExecuTorch python package.
@@ -25,6 +25,7 @@ pip install executorch
To build the framework from source, see [Building From Source](using-executorch-building-from-source.md). Backend delegates may require additional dependencies. See the appropriate backend documentation for more information.
+> **_NOTE:_** On Windows, ExecuTorch requires a [Visual Studio Developer Powershell](https://learn.microsoft.com/en-us/visualstudio/ide/reference/command-prompt-powershell?view=vs-2022). Running from outside of a developer prompt will manifest as errors related to CL.exe.
@@ -44,7 +45,7 @@ ExecuTorch provides hardware acceleration for a wide variety of hardware. The mo
For mobile use cases, consider using XNNPACK for Android and Core ML or XNNPACK for iOS as a first step. See [Hardware Backends](backends-overview.md) for more information.
### Exporting
-Exporting is done using Python APIs. ExecuTorch provides a high degree of customization during the export process, but the typical flow is as follows. This example uses the MobileNet V2 image classification model implementation in torchvision, but the process supports any [export-compliant](https://pytorch.org/docs/stable/export.html) PyTorch model. For users working with Hugging Face models,
+Exporting is done using Python APIs. ExecuTorch provides a high degree of customization during the export process, but the typical flow is as follows. This example uses the MobileNet V2 image classification model implementation in torchvision, but the process supports any [export-compliant](https://pytorch.org/docs/stable/export.html) PyTorch model. For Hugging Face models,
you can find a list of supported models in the [*huggingface/optimum-executorch*](https://github.com/huggingface/optimum-executorch) repo.
```python
@@ -76,7 +77,7 @@ Quantization can also be done at this stage to reduce model size and runtime. Qu
After successfully generating a .pte file, it is common to use the Python runtime APIs to validate the model on the development platform. This can be used to evaluate model accuracy before running on-device.
-For the MobileNet V2 model from torchvision used in this example, image inputs are expected as a normalized, float32 tensor with a dimensions of (batch, channels, height, width). The output See [torchvision.models.mobilenet_v2](https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html) for more information on the input and output tensor format for this model.
+For the MobileNet V2 model from torchvision used in this example, image inputs are expected as a normalized, float32 tensor with a dimensions of (batch, channels, height, width). The output is a tensor containing class logits. See [torchvision.models.mobilenet_v2](https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html) for more information on the input and output tensor format for this model.
```python
import torch
@@ -103,7 +104,7 @@ print(torch.allclose(output[0], eager_reference_output, rtol=1e-3, atol=1e-5))
For complete examples of exporting and running the model, please refer to our [examples GitHub repository](https://github.com/meta-pytorch/executorch-examples/tree/main/mv2/python).
-Additionally, if you work with Hugging Face models, the [*huggingface/optimum-executorch*](https://github.com/huggingface/optimum-executorch) library simplifies running these models end-to-end with ExecuTorch, using familiar Hugging Face APIs. Visit the repository for specific examples and supported models.
+Additionally, for Hugging Face models, the [*huggingface/optimum-executorch*](https://github.com/huggingface/optimum-executorch) library simplifies running these models end-to-end with ExecuTorch using familiar Hugging Face APIs. Visit the repository for specific examples and supported models.
@@ -131,7 +132,7 @@ dependencies {
```
#### Runtime APIs
-Models can be loaded and run using the `Module` class:
+Models can be loaded and run from Java or Kotlin using the `Module` class.
```java
import org.pytorch.executorch.EValue;
import org.pytorch.executorch.Module;
@@ -147,8 +148,11 @@ EValue[] output = model.forward(input_evalue);
float[] scores = output[0].toTensor().getDataAsFloatArray();
```
+Note that the [C++](#c) APIs can be used when targeting Android native.
+
For a full example of running a model on Android, see the [DeepLabV3AndroidDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo). For more information on Android development, including building from source, a full description of the Java APIs, and information on using ExecuTorch from Android native code, see [Using ExecuTorch on Android](using-executorch-android.md).
+
### iOS
#### Installation
@@ -165,22 +169,27 @@ For more information on iOS integration, including an API reference, logging set
ExecuTorch provides C++ APIs, which can be used to target embedded or mobile devices. The C++ APIs provide a greater level of control compared to other language bindings, allowing for advanced memory management, data loading, and platform integration.
#### Installation
-CMake is the preferred build system for the ExecuTorch C++ runtime. To use with CMake, clone the ExecuTorch repository as a subdirectory of your project, and use CMake's `add_subdirectory("executorch")` to include the dependency. The `executorch` target, as well as kernel and backend targets will be made available to link against. The runtime can also be built standalone to support diverse toolchains. See [Using ExecuTorch with C++](using-executorch-cpp.md) for a detailed description of build integration, targets, and cross compilation.
+CMake is the preferred build system for the ExecuTorch C++ runtime. To use with CMake, clone the ExecuTorch repository as a subdirectory of your project, and use CMake's `add_subdirectory("executorch")` to include the dependency. The `executorch` target, as well as kernel and backend targets will be made available to link against. The runtime can also be built standalone to support diverse toolchains. See [Using ExecuTorch with C++](using-executorch-cpp.md) and [Building from Source](using-executorch-building-from-source.md) for a detailed description of build integration, targets, and cross compilation.
```
git clone -b release/1.0 https://github.com/pytorch/executorch.git
```
-```python
+```cmake
+# Set CMAKE_CXX_STANDARD to 17 or above.
+set(CMAKE_CXX_STANDARD 17)
+
# CMakeLists.txt
+set(EXECUTORCH_BUILD_PRESET_FILE ${CMAKE_SOURCE_DIR}/executorch/tools/cmake/preset/llm.cmake)
+# Set other ExecuTorch options here.
+
add_subdirectory("executorch")
...
target_link_libraries(
my_target
PRIVATE executorch
- extension_module_static
- extension_tensor
- optimized_native_cpu_ops_lib
- xnnpack_backend)
+ executorch::backends
+ executorch::extensions
+ executorch::kernels)
```
diff --git a/docs/source/ios-coreml.md b/docs/source/ios-coreml.md
index 48271326d87..ff6551aa0c2 100644
--- a/docs/source/ios-coreml.md
+++ b/docs/source/ios-coreml.md
@@ -1 +1 @@
-```{include} backends-coreml.md
+```{include} backends/coreml/coreml-overview.md
diff --git a/docs/source/ios-mps.md b/docs/source/ios-mps.md
index d6f305d33aa..13717675ba5 100644
--- a/docs/source/ios-mps.md
+++ b/docs/source/ios-mps.md
@@ -1 +1 @@
-```{include} backends-mps.md
+```{include} backends/mps/mps-overview.md
diff --git a/docs/source/ios-xnnpack.md b/docs/source/ios-xnnpack.md
index 315dd747006..4a85dec946b 100644
--- a/docs/source/ios-xnnpack.md
+++ b/docs/source/ios-xnnpack.md
@@ -1 +1 @@
-```{include} backends-xnnpack.md
+```{include} backends/xnnpack/xnnpack-overview.md
diff --git a/docs/source/kernel-library-selective-build.md b/docs/source/kernel-library-selective-build.md
index 666206acb94..edec9567b7b 100644
--- a/docs/source/kernel-library-selective-build.md
+++ b/docs/source/kernel-library-selective-build.md
@@ -61,7 +61,7 @@ gen_selected_ops(
ROOT_OPS # comma separated operator names to be selected
INCLUDE_ALL_OPS # boolean flag to include all operators
OPS_FROM_MODEL # path to a pte file of model to select operators from
- DTYPE_SELECTIVE_BUILD # boolean flag to enable dtye selection
+ DTYPE_SELECTIVE_BUILD # boolean flag to enable dtype selection
)
```
diff --git a/docs/source/llm/export-llm-optimum.md b/docs/source/llm/export-llm-optimum.md
new file mode 100644
index 00000000000..1a104f77bc4
--- /dev/null
+++ b/docs/source/llm/export-llm-optimum.md
@@ -0,0 +1,171 @@
+# Exporting LLMs with HuggingFace's Optimum ExecuTorch
+
+[Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) provides a streamlined way to export Hugging Face transformer models to ExecuTorch format. It offers seamless integration with the Hugging Face ecosystem, making it easy to export models directly from the Hugging Face Hub.
+
+## Overview
+
+Optimum ExecuTorch supports a much wider variety of model architectures compared to ExecuTorch's native `export_llm` API. While `export_llm` focuses on a limited set of highly optimized models (Llama, Qwen, Phi, and SmolLM) with advanced features like SpinQuant and attention sink, Optimum ExecuTorch can export diverse architectures including Gemma, Mistral, GPT-2, BERT, T5, Whisper, Voxtral, and many others.
+
+### Use Optimum ExecuTorch when:
+- You need to export models beyond the limited set supported by `export_llm`
+- You want to export directly from Hugging Face Hub model IDs, including model variants such as fine-tunes
+- You want a simpler interface with Hugging Face ecosystem integration
+
+### Use export_llm when:
+- Working with one of the highly optimized supported models (Llama, Qwen, Phi, SmolLM)
+- You need advanced optimizations like SpinQuant or attention sink
+- You need pt2e quantization for QNN/CoreML/Vulkan backends
+- Working with Llama models requiring custom checkpoints
+
+See [Exporting LLMs](export-llm.md) for details on using the native `export_llm` API.
+
+## Prerequisites
+
+### Installation
+
+First, clone and install Optimum ExecuTorch from source:
+
+```bash
+git clone https://github.com/huggingface/optimum-executorch.git
+cd optimum-executorch
+pip install '.[dev]'
+```
+
+For access to the latest features and optimizations, install dependencies in dev mode:
+
+```bash
+python install_dev.py
+```
+
+This installs `executorch`, `torch`, `torchao`, `transformers`, and other dependencies from nightly builds or source.
+
+## Supported Models
+
+Optimum ExecuTorch supports a wide range of model architectures including decoder-only LLMs (Llama, Qwen, Gemma, Mistral, etc.), multimodal models, vision models, audio models (Whisper), encoder models (BERT, RoBERTa), and seq2seq models (T5).
+
+For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models).
+
+## Export Methods
+
+Optimum ExecuTorch offers two ways to export models:
+
+### Method 1: CLI Export
+
+The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format.
+
+#### Basic Export
+
+```bash
+optimum-cli export executorch \
+ --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
+ --task "text-generation" \
+ --recipe "xnnpack" \
+ --output_dir="./smollm2_exported"
+```
+
+#### With Optimizations
+
+Add custom SDPA, KV cache optimization, and quantization:
+
+```bash
+optimum-cli export executorch \
+ --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
+ --task "text-generation" \
+ --recipe "xnnpack" \
+ --use_custom_sdpa \
+ --use_custom_kv_cache \
+ --qlinear 8da4w \
+ --qembedding 8w \
+ --output_dir="./smollm2_exported"
+```
+
+#### Available CLI Arguments
+
+Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`.
+
+For the complete list of arguments, run:
+```bash
+optimum-cli export executorch --help
+```
+
+## Optimization Options
+
+### Custom Operators
+
+Optimum ExecuTorch includes custom SDPA (~3x speedup) and custom KV cache (~2.5x speedup) operators. Enable with `--use_custom_sdpa` and `--use_custom_kv_cache`.
+
+### Quantization
+
+Optimum ExecuTorch uses [TorchAO](https://github.com/pytorch/ao) for quantization. Common options:
+- `--qlinear 8da4w`: int8 dynamic activation + int4 weight (recommended)
+- `--qembedding 4w` or `--qembedding 8w`: int4/int8 embedding quantization
+
+Example:
+```bash
+optimum-cli export executorch \
+ --model "meta-llama/Llama-3.2-1B" \
+ --task "text-generation" \
+ --recipe "xnnpack" \
+ --use_custom_sdpa \
+ --use_custom_kv_cache \
+ --qlinear 8da4w \
+ --qembedding 4w \
+ --output_dir="./llama32_1b"
+```
+
+### Backend Support
+
+Supported backends: `xnnpack` (CPU), `coreml` (Apple GPU), `portable` (baseline), `cuda` (NVIDIA GPU). Specify with `--recipe`.
+
+## Exporting Different Model Types
+
+Optimum ExecuTorch supports various model architectures with different tasks:
+
+- **Decoder-only LLMs**: Use `--task text-generation`
+- **Multimodal LLMs**: Use `--task multimodal-text-to-text`
+- **Seq2Seq models** (T5): Use `--task text2text-generation`
+- **ASR models** (Whisper): Use `--task automatic-speech-recognition`
+
+For detailed examples of exporting each model type, see the [Optimum ExecuTorch export guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md).
+
+## Running Exported Models
+
+### Verifying Output with Python
+
+After exporting, you can verify the model output in Python before deploying to device using classes from `modeling.py`, such as the `ExecuTorchModelForCausalLM` class for LLMs:
+
+```python
+from optimum.executorch import ExecuTorchModelForCausalLM
+from transformers import AutoTokenizer
+
+# Load the exported model
+model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported")
+tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
+
+# Generate text
+generated_text = model.text_generation(
+ tokenizer=tokenizer,
+ prompt="Once upon a time",
+ max_seq_len=128,
+)
+print(generated_text)
+```
+
+### Running on Device
+
+After verifying your model works correctly, deploy it to device:
+
+- [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime
+- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices
+- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices
+
+## Performance
+
+For performance benchmarks and on-device metrics, see the [Optimum ExecuTorch benchmarks](https://github.com/huggingface/optimum-executorch#-benchmarks-on-mobile-devices) and the [ExecuTorch Benchmark Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fexecutorch).
+
+## Additional Resources
+
+- [Optimum ExecuTorch GitHub](https://github.com/huggingface/optimum-executorch) - Full documentation and examples
+- [Supported Models](https://github.com/huggingface/optimum-executorch#-supported-models) - Complete model list
+- [Export Guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md) - Detailed export examples
+- [TorchAO Quantization](https://github.com/pytorch/ao) - Quantization library documentation
diff --git a/docs/source/llm/export-llm.md b/docs/source/llm/export-llm.md
index 05328afbd43..108e357a3e1 100644
--- a/docs/source/llm/export-llm.md
+++ b/docs/source/llm/export-llm.md
@@ -20,6 +20,8 @@ As of this doc, the list of supported LLMs include the following:
The up-to-date list of supported LLMs can be found in the code [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32).
+**Note:** If you need to export models that are not on this list or other model architectures (such as Gemma, Mistral, BERT, T5, Whisper, etc.), see [Exporting LLMs with Optimum](export-llm-optimum.md) which supports a much wider variety of models from Hugging Face Hub.
+
## The export_llm API
`export_llm` is ExecuTorch's high-level export API for LLMs. In this tutorial, we will focus on exporting Llama 3.2 1B using this API. `export_llm`'s arguments are specified either through CLI args or through a yaml configuration whose fields are defined in [`LlmConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py). To call `export_llm`:
diff --git a/docs/source/llm/getting-started.md b/docs/source/llm/getting-started.md
index 6b6f9d96df7..95caae6ddd9 100644
--- a/docs/source/llm/getting-started.md
+++ b/docs/source/llm/getting-started.md
@@ -18,8 +18,12 @@ To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up
Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) exporting the LLM to a `.pte` file and (2) running the `.pte` file using our C++ APIs or Swift/Java bindings.
-- [Exporting LLMs](export-llm.md)
+### Exporting
+- [Exporting LLMs](export-llm.md) - Export using ExecuTorch's native `export_llm` API with advanced optimizations
+- [Exporting LLMs with Optimum](export-llm-optimum.md) - Export Hugging Face models with broader architecture support
- [Exporting custom LLMs](export-custom-llm.md)
+
+### Running
- [Running with C++](run-with-c-plus-plus.md)
- [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
- [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md)
diff --git a/docs/source/llm/run-on-ios.md b/docs/source/llm/run-on-ios.md
index 88ad94c38d3..f096995fca9 100644
--- a/docs/source/llm/run-on-ios.md
+++ b/docs/source/llm/run-on-ios.md
@@ -80,17 +80,22 @@ do {
#### Generating
-Generate up to a given number of tokens from an initial prompt. The callback block is invoked once per token as it’s produced.
+Generate tokens from an initial prompt, configured with an `ExecuTorchLLMConfig` object. The callback block is invoked once per token as it’s produced.
Objective-C:
```objectivec
+ExecuTorchLLMConfig *config = [[ExecuTorchLLMConfig alloc] initWithBlock:^(ExecuTorchLLMConfig *c) {
+ c.temperature = 0.8;
+ c.sequenceLength = 2048;
+}];
+
NSError *error = nil;
-BOOL success = [runner generate:@"Once upon a time"
- sequenceLength:50
- withTokenCallback:^(NSString *token) {
- NSLog(@"Generated token: %@", token);
- }
- error:&error];
+BOOL success = [runner generateWithPrompt:@"Once upon a time"
+ config:config
+ tokenCallback:^(NSString *token) {
+ NSLog(@"Generated token: %@", token);
+ }
+ error:&error];
if (!success) {
NSLog(@"Generation failed: %@", error);
}
@@ -99,7 +104,10 @@ if (!success) {
Swift:
```swift
do {
- try runner.generate("Once upon a time", sequenceLength: 50) { token in
+ try runner.generate("Once upon a time", Config {
+ $0.temperature = 0.8
+ $0.sequenceLength = 2048
+ }) { token in
print("Generated token:", token)
}
} catch {
@@ -121,6 +129,136 @@ Swift:
runner.stop()
```
+#### Resetting
+
+To clear the prefilled tokens from the KV cache and reset generation stats, call:
+
+Objective-C:
+```objectivec
+[runner reset];
+```
+
+Swift:
+```swift
+runner.reset()
+```
+
+### MultimodalRunner
+
+The `ExecuTorchLLMMultimodalRunner` class (bridged to Swift as `MultimodalRunner`) provides an interface for loading and running multimodal models that can accept a sequence of text, image, and audio inputs.
+
+#### Multimodal Inputs
+
+Inputs are provided as an array of `ExecuTorchLLMMultimodalInput` (or `MultimodalInput` in Swift). You can create inputs from a plain string for text, `ExecuTorchLLMImage` for images (`Image` in Swift), and `ExecuTorchLLMAudio` for audio features (`Audio` in Swift).
+
+Objective-C:
+```objectivec
+ExecuTorchLLMMultimodalInput *textInput = [ExecuTorchLLMMultimodalInput inputWithText:@"What's in this image?"];
+
+NSData *imageData = ...; // Your raw image bytes
+ExecuTorchLLMImage *image = [[ExecuTorchLLMImage alloc] initWithData:imageData width:336 height:336 channels:3];
+ExecuTorchLLMMultimodalInput *imageInput = [ExecuTorchLLMMultimodalInput inputWithImage:image];
+```
+
+Swift:
+```swift
+let textInput = MultimodalInput("What's in this image?")
+
+let imageData: Data = ... // Your raw image bytes
+let image = Image(data: imageData, width: 336, height: 336, channels: 3)
+let imageInput = MultimodalInput(image)
+
+let audioFeatureData: Data = ... // Your raw audio feature bytes
+let audio = Audio(float: audioFeatureData, batchSize: 1, bins: 128, frames: 3000)
+let audioInput = MultimodalInput(audio)
+```
+
+#### Initialization
+
+Create a runner by specifying the paths to your multimodal model and its tokenizer.
+
+Objective-C:
+```objectivec
+NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"llava" ofType:@"pte"];
+NSString *tokenizerPath = [[NSBundle mainBundle] pathForResource:@"llava_tokenizer" ofType:@"bin"];
+
+ExecuTorchLLMMultimodalRunner *runner = [[ExecuTorchLLMMultimodalRunner alloc] initWithModelPath:modelPath
+ tokenizerPath:tokenizerPath];
+```
+
+Swift:
+```swift
+let modelPath = Bundle.main.path(forResource: "llava", ofType: "pte")!
+let tokenizerPath = Bundle.main.path(forResource: "llava_tokenizer", ofType: "bin")!
+
+let runner = MultimodalRunner(modelPath: modelPath, tokenizerPath: tokenizerPath)
+```
+
+#### Loading
+
+Explicitly load the model before generation.
+
+Objective-C:
+```objectivec
+NSError *error = nil;
+BOOL success = [runner loadWithError:&error];
+if (!success) {
+ NSLog(@"Failed to load: %@", error);
+}
+```
+
+Swift:
+```swift
+do {
+ try runner.load()
+} catch {
+ print("Failed to load: \(error)")
+}
+```
+
+#### Generating
+
+Generate tokens from an ordered array of multimodal inputs.
+
+Objective-C:
+```objectivec
+NSArray *inputs = @[textInput, imageInput];
+
+ExecuTorchLLMConfig *config = [[ExecuTorchLLMConfig alloc] initWithBlock:^(ExecuTorchLLMConfig *c) {
+ c.sequenceLength = 768;
+}];
+
+NSError *error = nil;
+BOOL success = [runner generateWithInputs:inputs
+ config:config
+ tokenCallback:^(NSString *token) {
+ NSLog(@"Generated token: %@", token);
+ }
+ error:&error];
+if (!success) {
+ NSLog(@"Generation failed: %@", error);
+}
+```
+
+Swift:
+```swift
+let inputs = [textInput, imageInput]
+
+do {
+ try runner.generate(inputs, Config {
+ $0.sequenceLength = 768
+ }) { token in
+ print("Generated token:", token)
+ }
+} catch {
+ print("Generation failed:", error)
+}
+```
+
+#### Stopping and Resetting
+
+The stop and reset methods for `MultimodalRunner` behave identically to those on `TextRunner`.
+
## Demo
Get hands-on with our [etLLM iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) to see the LLM runtime APIs in action.
diff --git a/docs/source/llm/working-with-llms.md b/docs/source/llm/working-with-llms.md
index 4c238f7ae5c..e4088efd12b 100644
--- a/docs/source/llm/working-with-llms.md
+++ b/docs/source/llm/working-with-llms.md
@@ -11,6 +11,7 @@ Learn how to export LLM models and deploy them across different platforms and ru
getting-started
export-llm
+export-llm-optimum
export-custom-llm
run-with-c-plus-plus
build-run-llama3-qualcomm-ai-engine-direct-backend
diff --git a/docs/source/pico2_tutorial.md b/docs/source/pico2_tutorial.md
new file mode 100644
index 00000000000..7098df11b05
--- /dev/null
+++ b/docs/source/pico2_tutorial.md
@@ -0,0 +1,198 @@
+# Pico2: A simple MNIST Tutorial
+
+Deploy your PyTorch models directly to the Raspberry Pi Pico2 microcontroller with ExecuTorch.
+
+## What You'll Build
+
+A 28×28 MNIST digit classifier running on memory-constrained, low-power microcontrollers:
+
+- Input: ASCII art digits (0, 1, 4, 7)
+- Output: Real-time predictions via USB serial
+- Memory: <400KB total footprint
+
+## Prerequisites
+
+- Complete the [Environment Setup section](https://docs.pytorch.org/executorch/1.0/using-executorch-building-from-source.html).
+
+- Accept the Arm EULA agreement and set up the toolchain as described in the [Arm Ethos-U backend documentation](https://docs.pytorch.org/executorch/1.0/backends-arm-ethos-u.html#development-requirements).
+
+- Verify the Arm toolchain:
+
+```bash
+which arm-none-eabi-gcc # --> arm/ethos-u-scratch/arm-gnu-toolchain-13.3.rel1-x86_64-arm-none-eabi/bin/
+```
+
+## Step 1: Generate a .pte File from the Example Model
+
+- Use the [provided example model](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist.py)
+
+```bash
+python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
+```
+
+- **Note:** This is a hand-crafted MNIST classifier (proof of concept), not a production-trained model. This tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors.
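+
+For reference, the export flow in such a script boils down to roughly the following. This is a simplified sketch, not the actual `export_mlp_mnist.py`: the real script hand-crafts its weights, and the class and layer sizes here are made up for illustration.
+
+```python
+import torch
+from executorch.exir import to_edge_transform_and_lower
+
+class TinyMLP(torch.nn.Module):
+    """Illustrative tiny MLP for 28x28 MNIST inputs flattened to 784 values."""
+    def __init__(self):
+        super().__init__()
+        self.fc1 = torch.nn.Linear(784, 16)
+        self.fc2 = torch.nn.Linear(16, 10)
+
+    def forward(self, x):
+        return self.fc2(torch.relu(self.fc1(x)))
+
+model = TinyMLP().eval()
+sample_inputs = (torch.zeros(1, 784),)
+
+# Export, lower to an ExecuTorch program, and save the .pte file.
+et_program = to_edge_transform_and_lower(
+    torch.export.export(model, sample_inputs),
+).to_executorch()
+
+with open("balanced_tiny_mlp_mnist.pte", "wb") as f:
+    f.write(et_program.buffer)
+```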
+
+## Step 2: Build Firmware for Pico2
+
+```bash
+# Generate model
+
+python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
+
+# Build Pico2 firmware (one command!)
+
+./executorch/examples/rpi/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte # This creates executorch_pico.uf2, a firmware image for Pico2
+```
+
+Output: **executorch_pico.uf2** firmware file (examples/raspberry_pi/pico2/build/)
+
+**Note:** The `build_firmware_pico.sh` script converts the given model `.pte` into a C hex array via this helper [script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/pte_to_array.py). The generated C code is then compiled into the final `.uf2` binary, which is flashed to the Pico2.
+
+## Step 3: Flash to Pico2
+
+1. Hold the BOOTSEL button on the Pico2.
+2. Connect USB → it mounts as the `RPI-RP2` drive.
+3. Drag & drop the `executorch_pico.uf2` file.
+4. Release BOOTSEL → the Pico2 reboots with your model.
+
+## Step 4: Verify Deployment
+
+**Success indicators:**
+
+- LED blinks 10× at 500ms → Model running ✅
+- LED blinks 10× at 100ms → Error, check serial ❌
+
+**View predictions:**
+
+```bash
+# Connect serial terminal
+screen /dev/tty.usbmodem1101 115200
+
+# Expected output looks something like:
+
+=== Digit 7 ===
+############################
+############################
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+ ####
+####
+###
+
+Input stats: 159 white pixels out of 784 total
+Running neural network inference...
+✅ Neural network results:
+ Digit 0: 370.000
+ Digit 1: 0.000
+ Digit 2: -3.000
+ Digit 3: -3.000
+ Digit 4: 860.000
+ Digit 5: -3.000
+ Digit 6: -3.000
+ Digit 7: 1640.000 ← PREDICTED
+ Digit 8: -3.000
+ Digit 9: -3.000
+
+PREDICTED: 7 (Expected: 7) ✅ CORRECT!
+```
+
+## Memory Optimization Tips
+
+### Pico2 Constraints
+
+- 520KB SRAM (runtime memory)
+- 4MB Flash (model storage)
+- Keep models small to fit within these limits.
+
+### Common Issues
+
+- "Memory allocation failed" → Reduce model size and use quantization
+- "Operator missing" → Use selective build: ^--operators=add,mul,relu^
+- "Import error" → Check ^arm-none-eabi-gcc^ toolchain setup.
+
+In order to resolve some of the issues above, refer to the following guides:
+
+- [ExecuTorch Quantization Optimization Guide](https://docs.pytorch.org/executorch/1.0/quantization-optimization.html)
+- [Model Export & Lowering](https://docs.pytorch.org/executorch/1.0/using-executorch-export.html) and
+- [Selective Build support](https://docs.pytorch.org/executorch/1.0/kernel-library-selective-build.html)
+
+### Firmware Size Analysis
+
+```bash
+# From the ExecuTorch repository root:
+ls -al examples/raspberry_pi/pico2/build/executorch_pico.elf
+```
+
+- **Overall section sizes**
+
+```bash
+arm-none-eabi-size -A examples/raspberry_pi/pico2/build/executorch_pico.elf
+```
+
+- **Detailed section breakdown**
+
+```bash
+arm-none-eabi-objdump -h examples/raspberry_pi/pico2/build/executorch_pico.elf
+```
+
+- **Symbol sizes (largest consumers)**
+
+```bash
+arm-none-eabi-nm --print-size --size-sort --radix=d examples/raspberry_pi/pico2/build/executorch_pico.elf | tail -20
+```
+
+### Model Memory Footprint
+
+- **Model data specifically**
+
+```bash
+arm-none-eabi-nm --print-size --size-sort --radix=d examples/raspberry_pi/pico2/build/executorch_pico.elf | grep -i model
+```
+
+- **Check what's in .bss (uninitialized data)**
+
+```bash
+arm-none-eabi-objdump -t examples/raspberry_pi/pico2/build/executorch_pico.elf | grep ".bss" | head -10
+```
+
+- **Memory map overview**
+
+```bash
+arm-none-eabi-readelf -l examples/raspberry_pi/pico2/build/executorch_pico.elf
+```
+
+## Next Steps
+
+### Scale up your deployment
+
+- Use a real, production-trained model
+- Optimize further → INT8 quantization, pruning
+
+### Happy Inference!
+
+**Result:** PyTorch model → Pico2 deployment in 4 simple steps 🚀
+Total tutorial time: ~15 minutes
+
+**Conclusion:** A complete PyTorch → ExecuTorch → Pico2 MNIST demo deployment, with real-time inference on a memory-constrained, low-power microcontroller.
diff --git a/docs/source/platforms-desktop.md b/docs/source/platforms-desktop.md
index acbdb06a6b6..ba22786576f 100644
--- a/docs/source/platforms-desktop.md
+++ b/docs/source/platforms-desktop.md
@@ -9,15 +9,15 @@ ExecuTorch supports desktop and laptop deployment across Linux, macOS, and Windo
## Available Backends by Platform
### Linux
-- [XNNPACK (CPU)](backends-xnnpack)
+- [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md)
- [OpenVINO (Intel)](build-run-openvino)
- [ARM Ethos-U (ARM64)](backends-arm-ethos-u)
### macOS
- [CoreML (recommended)](backends-coreml)
- [MPS (Apple Silicon)](backends-mps)
-- [XNNPACK (CPU)](backends-xnnpack)
+- [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md)
### Windows
-- [XNNPACK (CPU)](backends-xnnpack)
+- [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md)
- [OpenVINO (Intel)](build-run-openvino)
diff --git a/docs/source/quantization-overview.md b/docs/source/quantization-overview.md
index 4ff8d34a4a8..81b15f6c8bb 100644
--- a/docs/source/quantization-overview.md
+++ b/docs/source/quantization-overview.md
@@ -28,8 +28,8 @@ These quantizers usually support configs that allow users to specify quantizatio
Not all quantization options are supported by all backends. Consult backend-specific guides for supported quantization modes and configuration, and how to initialize the backend-specific PT2E quantizer:
-* [XNNPACK quantization](backends-xnnpack.md#quantization)
-* [CoreML quantization](backends-coreml.md#quantization)
+* [XNNPACK quantization](backends/xnnpack/xnnpack-quantization.md)
+* [CoreML quantization](backends/coreml/coreml-quantization.md)
* [QNN quantization](backends-qualcomm.md#step-2-optional-quantize-your-model)
diff --git a/docs/source/success-stories.md b/docs/source/success-stories.md
index cba874132c6..cddfaa6c5a6 100644
--- a/docs/source/success-stories.md
+++ b/docs/source/success-stories.md
@@ -6,51 +6,121 @@ Discover how organizations are leveraging ExecuTorch to deploy AI models at scal
---
-## 🎯 Featured Success Stories
+## Featured Success Stories
::::{grid} 1
:gutter: 3
-:::{grid-item-card} **🚀 Story 1: [Title Placeholder]**
+:::{grid-item-card} **Meta's Family of Apps**
:class-header: bg-primary text-white
-**Industry:** [Industry]
-**Hardware:** [Hardware Platform]
-**Impact:** [Key Metrics]
+**Industry:** Social Media & Messaging
+**Hardware:** Android & iOS Devices
+**Impact:** Billions of users, latency reduction
-[Placeholder Description] - Brief overview of the challenge, solution, and results achieved.
+Powers Instagram, WhatsApp, Facebook, and Messenger with real-time on-device AI for content ranking, recommendations, and privacy-preserving features at scale.
-
-[Read Full Story →](#story-1-details)
+[Read Blog →](https://engineering.fb.com/2025/07/28/android/executorch-on-device-ml-meta-family-of-apps/)
:::
-:::{grid-item-card} **⚡ Story 2: [Title Placeholder]**
+:::{grid-item-card} **Meta Quest & Ray-Ban Smart Glasses**
:class-header: bg-success text-white
-**Industry:** [Industry]
-**Hardware:** [Hardware Platform]
-**Impact:** [Key Metrics]
+**Industry:** AR/VR & Wearables
+**Hardware:** Quest 3, Ray-Ban Meta Smart Glasses, Meta Ray-Ban Display
-[Placeholder Description] - Brief overview of the challenge, solution, and results achieved.
+Enables immersive mixed reality with real-time computer vision, hand tracking, voice commands, and translation on power-constrained wearable devices.
+:::
+:::{grid-item-card} **Liquid AI: Efficient, Flexible On-Device Intelligence**
+:class-header: bg-info text-white
+**Industry:** Artificial Intelligence / Edge Computing
+**Hardware:** CPU via PyTorch ExecuTorch
+**Impact:** 2× faster inference, lower latency, seamless multimodal deployment
-[Read Full Story →](#story-2-details)
+Liquid AI builds foundation models that make AI work where the cloud can't. In its LFM2 series, the team uses PyTorch ExecuTorch within the LEAP Edge SDK to deploy high-performance multimodal models efficiently across devices. ExecuTorch provides the flexibility to support custom architectures and processing pipelines while reducing inference latency through graph optimization and caching. Together, they enable faster, more efficient, privacy-preserving AI that runs entirely on the edge.
+
+[Read Blog →](https://www.liquid.ai/blog/how-liquid-ai-uses-executorch-to-power-efficient-flexible-on-device-intelligence)
:::
-:::{grid-item-card} **🧠 Story 3: [Title Placeholder]**
-:class-header: bg-info text-white
+:::{grid-item-card} **PrivateMind: Complete Privacy with On-Device AI**
+:class-header: bg-warning text-white
-**Industry:** [Industry]
-**Hardware:** [Hardware Platform]
-**Impact:** [Key Metrics]
+**Industry:** Privacy & Personal Computing
+**Hardware:** iOS & Android Devices
+**Impact:** 100% on-device processing
-[Placeholder Description] - Brief overview of the challenge, solution, and results achieved.
+PrivateMind delivers a fully private AI assistant using ExecuTorch's .pte format. Built with React Native ExecuTorch, it supports LLaMA, Qwen, Phi-4, and custom models with offline speech-to-text and PDF chat capabilities.
+[Visit →](https://privatemind.swmansion.com)
+:::
+
+:::{grid-item-card} **NimbleEdge: On-Device Agentic AI Platform**
+:class-header: bg-danger text-white
+
+**Industry:** AI Infrastructure
+**Hardware:** iOS & Android Devices
+**Impact:** 30% higher TPS on iOS, faster time-to-market with Qwen/Gemma models
-[Read Full Story →](#story-3-details)
+NimbleEdge successfully integrated ExecuTorch with its open-source DeliteAI platform to enable agentic workflows orchestrated in Python on mobile devices. The extensible ExecuTorch ecosystem allowed implementation of on-device optimization techniques leveraging contextual sparsity. ExecuTorch significantly accelerated the release of "NimbleEdge AI" for iOS, enabling models like Qwen 2.5 with tool calling support and achieving up to 30% higher transactions per second.
+
+[Visit →](https://nimbleedge.com) • [Blog →](https://www.nimbleedge.com/blog/meet-nimbleedge-ai-the-first-truly-private-on-device-assistant) • [iOS App →](https://apps.apple.com/in/app/nimbleedge-ai/id6746237456)
:::
::::
---
+
+## Featured Ecosystem Integrations and Interoperability
+
+::::{grid} 2 2 3 3
+:gutter: 2
+
+:::{grid-item-card} **Hugging Face Transformers**
+:class-header: bg-secondary text-white
+
+Popular models from Hugging Face easily export to ExecuTorch format for on-device deployment.
+
+[Learn More →](https://github.com/huggingface/optimum-executorch/)
+:::
+
+:::{grid-item-card} **React Native ExecuTorch**
+:class-header: bg-secondary text-white
+
+Declarative toolkit for running AI models and LLMs in React Native apps with privacy-first, on-device execution.
+
+[Explore →](https://docs.swmansion.com/react-native-executorch/) • [Blog →](https://expo.dev/blog/how-to-run-ai-models-with-react-native-executorch)
+:::
+
+:::{grid-item-card} **torchao**
+:class-header: bg-secondary text-white
+
+PyTorch-native quantization and optimization library for preparing efficient models for ExecuTorch deployment.
+
+[Blog →](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/) • [Qwen Example →](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4) • [Phi Example →](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4)
+:::
+
+:::{grid-item-card} **Unsloth**
+:class-header: bg-secondary text-white
+
+Optimize LLM fine-tuning with faster training and reduced VRAM usage, then deploy efficiently with ExecuTorch.
+
+[Example Model →](https://huggingface.co/metascroy/Qwen3-4B-int8-int4-unsloth)
+:::
+
+::::
+
+---
+
+## Featured Demos
+
+- **Text and Multimodal LLM demo mobile apps** - Text (Llama, Qwen3, Phi-4) and multimodal (Gemma3, Voxtral) mobile demo apps. [Try →](https://github.com/meta-pytorch/executorch-examples/tree/main/llm)
+
+- **Voxtral** - Deploy audio-text-input LLM on CPU (via XNNPACK) and on CUDA. [Try →](https://github.com/pytorch/executorch/blob/main/examples/models/voxtral/README.md)
+
+- **LoRA adapter** - Export two LoRA adapters that share a single foundation weight file, saving memory and disk space. [Try →](https://github.com/meta-pytorch/executorch-examples/tree/main/program-data-separation/cpp/lora_example)
+
+- **OpenVINO from Intel** - Deploy [Yolo12](https://github.com/pytorch/executorch/tree/main/examples/models/yolo12), [Llama](https://github.com/pytorch/executorch/tree/main/examples/openvino/llama), and [Stable Diffusion](https://github.com/pytorch/executorch/tree/main/examples/openvino/stable_diffusion) on [OpenVINO from Intel](https://www.intel.com/content/www/us/en/developer/articles/community/optimizing-executorch-on-ai-pcs.html).
+
+*Want to showcase your demo? [Submit here →](https://github.com/pytorch/executorch/issues)*
diff --git a/docs/source/tutorial-xnnpack-delegate-lowering.md b/docs/source/tutorial-xnnpack-delegate-lowering.md
index 3fb079f24d6..5c88246b0ba 100644
--- a/docs/source/tutorial-xnnpack-delegate-lowering.md
+++ b/docs/source/tutorial-xnnpack-delegate-lowering.md
@@ -12,7 +12,7 @@ In this tutorial, you will learn how to export an XNNPACK lowered Model and run
:class-card: card-prerequisites
* [Setting up ExecuTorch](getting-started-setup.rst)
* [Model Lowering Tutorial](tutorials/export-to-executorch-tutorial)
-* [ExecuTorch XNNPACK Delegate](backends-xnnpack.md)
+* [ExecuTorch XNNPACK Delegate](backends/xnnpack/xnnpack-overview.md)
:::
::::
diff --git a/docs/source/tutorials_source/bundled_program.bp b/docs/source/tutorials_source/bundled_program.bp
deleted file mode 100644
index 8afe3cfee26..00000000000
Binary files a/docs/source/tutorials_source/bundled_program.bp and /dev/null differ
diff --git a/docs/source/using-executorch-android.md b/docs/source/using-executorch-android.md
index cdeb2417a5f..e097722b8e6 100644
--- a/docs/source/using-executorch-android.md
+++ b/docs/source/using-executorch-android.md
@@ -1,12 +1,20 @@
+
# Using ExecuTorch on Android
-To use from Android, ExecuTorch provides Java/Kotlin API bindings and Android platform integration, available as an AAR file.
+🚀 Quick Start: __New to ExecuTorch?__ Jump to [Using AAR from Maven Central](#using-aar-from-maven-central) for the fastest setup, then see the [Runtime Integration](#runtime-integration) example.
-Note: This page covers Android app integration through the AAR library. The ExecuTorch C++ APIs can also be used from Android native, and the documentation can be found on [this page about cross compilation](using-executorch-building-from-source.md#cross-compilation).
+To use from Android, ExecuTorch provides Java/Kotlin API bindings and Android platform integration, available as an AAR file.
+Note: This page covers Android app integration through the AAR library. The ExecuTorch C++ APIs can also be used from Android native, and the documentation can be found on [this page about cross compilation](using-executorch-building-from-source.md#cross-compilation).
## Installation
-All ExecuTorch Android libraries are packaged into an [Android library (AAR)](https://developer.android.com/studio/projects/android-library), `executorch.aar` for both generic (image/audio processing) and LLM (LLaMA) use case. In each release, prebuilt AAR artifacts are uploaded to [Maven](https://repo.maven.apache.org/maven2/org/pytorch/executorch-android/) and S3. Users can also build the AAR from source.
+__Choose your installation method:__
+
+- __[Maven Central](#using-aar-from-maven-central)__ (recommended): Easiest for most developers
+- __[Direct AAR file](#using-aar-file-directly)__: For specific versions or offline development
+- __[Build from source](#building-from-source)__: For custom backends or contributions
+
+All ExecuTorch Android libraries are packaged into an [Android library (AAR)](https://developer.android.com/studio/projects/android-library), `executorch.aar`, for both the generic (image/audio processing) and LLM (LLaMA) use cases. In each release, prebuilt AAR artifacts are uploaded to [Maven](https://repo.maven.apache.org/maven2/org/pytorch/executorch-android/) and S3. Users can also build the AAR from source.
### Contents of library
@@ -14,52 +22,63 @@ The AAR artifact contains the Java library for users to integrate with their Jav
- [Java library](https://github.com/pytorch/executorch/tree/main/extension/android/executorch_android/src/main/java/org/pytorch/executorch)
- JNI contains the JNI binding for the corresponding Java code, and ExecuTorch native library, including
- - core ExecuTorch runtime libraries
+ - Core ExecuTorch runtime libraries
- XNNPACK backend
- Portable kernels
- Optimized kernels
- Quantized kernels
- LLaMa-specific Custom ops library.
-- Comes with two ABI variants, arm64-v8a and x86\_64.
+- Comes with two ABI variants, arm64-v8a and x86_64.
The AAR library can be used for generic Android device with arm64-v8a or x86_64 architecture. It can be used across form factors, including phones, tablets, tv boxes, etc, as it does not contain any UI components.
## Using AAR from Maven Central
-ExecuTorch is available on [Maven Central](https://mvnrepository.com/artifact/org.pytorch/executorch-android).
-
-Simply add the target [`org.pytorch:executorch-android:${executorch_version}`](https://repo.maven.apache.org/maven2/org/pytorch/executorch-android/${executorch_version}/) to your Android app dependency (build.gradle), and build your app.
+✅ Recommended for most developers
+
+ExecuTorch is available on [Maven Central](https://mvnrepository.com/artifact/org.pytorch/executorch-android).
+
+Simply add the target `org.pytorch:executorch-android:${executorch_version}` to your Android app dependency (build.gradle), and build your app. For example:
-For example:
-```
-# app/build.gradle.kts
+```kotlin
+// app/build.gradle.kts
dependencies {
- implementation("org.pytorch:executorch-android:${executorch_version}")
+implementation("org.pytorch:executorch-android:${executorch_version}")
}
```
-Note: If you want to use release v0.5.0, please use dependency `org.pytorch:executorch-android:0.5.1`.
-
-Click the screenshot below to watch the *demo video* on how to add the package and run a simple ExecuTorch model with Android Studio.
+Note: If you want to use release v1.0.0, please use the dependency `org.pytorch:executorch-android:1.0.0`.
+
+Click the screenshot below to watch the *demo video* on how to add the package and run a simple ExecuTorch model with Android Studio.
-
+
## Using AAR file directly
You can also directly specify an AAR file in the app. We upload pre-built AAR to S3 during each release, or as a snapshot.
-### Released versions (recommended)
+### Latest Released versions (Recommended)
+
+Starting from [v1.0.0](https://github.com/pytorch/executorch/releases/tag/v1.0.0), a separate `executorch.aar` library is available for each backend:
+
+| AAR | SHASUMS | Backend |
+| ------- | --- | ------- |
+| [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-xnnpack/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-xnnpack/executorch.aar.sha256sums) | [XNNPACK](backends/xnnpack/xnnpack-overview.md) |
+| [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-qnn/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-qnn/executorch.aar.sha256sums) | [Qualcomm AI Engine](backends-qualcomm.md) |
+| [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-vulkan/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/1.0.0-vulkan/executorch.aar.sha256sums) | [Vulkan](backends/vulkan/vulkan-overview.md) |
+
+### Older Released versions
+
+Download an older released version:
| Version | AAR | SHASUMS |
| ------- | --- | ------- |
-| [${executorch_version}](https://github.com/pytorch/executorch/releases/tag/${executorch_version}) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/${executorch_version}/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/${executorch_version}/executorch.aar.sha256sums) |
+| [v0.7.0](https://github.com/pytorch/executorch/releases/tag/v0.7.0) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/v0.7.0/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/v0.7.0/executorch.aar.sha256sums) |
| [v0.6.0](https://github.com/pytorch/executorch/releases/tag/v0.6.0) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/v0.6.0/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/v0.6.0/executorch.aar.sha256sums) |
| [v0.5.0](https://github.com/pytorch/executorch/releases/tag/v0.5.0) | [executorch.aar](https://ossci-android.s3.amazonaws.com/executorch/release/v0.5.0-rc3/executorch.aar) | [executorch.aar.sha256sums](https://ossci-android.s3.amazonaws.com/executorch/release/v0.5.0-rc3/executorch.aar.sha256sums) |
### Snapshots from main branch
Starting from 2025-04-12, you can download nightly `main` branch snapshots:
+
* `executorch.aar`: `https://ossci-android.s3.amazonaws.com/executorch/release/snapshot-{YYYYMMDD}/executorch.aar`
* `executorch.aar.sha256sums`: `https://ossci-android.s3.amazonaws.com/executorch/release/snapshot-{YYYYMMDD}/executorch.aar.sha256sums`
* Replace `YYYYMMDD` with the actual date you want to use.
@@ -77,11 +96,11 @@ We aim to make every daily snapshot available and usable. However, for best stab
## Using AAR file
To add the AAR file to your app:
-1. Download the AAR.
-2. Add it to your gradle build rule as a file path.
+1. Download the AAR.
+2. Add it to your gradle build rule as a file path.
+
+An AAR file itself does not contain dependency info, unlike the Maven package, which is bundled with a pom.xml. The Java package requires `fbjni` and `soloader`, and currently requires users to explicitly declare these dependencies. Therefore, two more `dependencies` entries are required in the gradle rule:
-An AAR file itself does not contain dependency info, unlike the Maven one which bundled with pom.xml. The Java package requires `fbjni` and `soloader`, and currently requires users to explicitly declare the dependency. Therefore, two more `dependencies` in gradle rule is required:
-```
+```kotlin
implementation("com.facebook.soloader:soloader:0.10.5")
implementation("com.facebook.fbjni:fbjni:0.7.0")
```
@@ -89,18 +108,20 @@ implementation("com.facebook.fbjni:fbjni:0.7.0")
### Example usage
In your app working directory, such as executorch-examples/llm/android/LlamaDemo,
-```
+
+```sh
mkdir -p app/libs
curl https://ossci-android.s3.amazonaws.com/executorch/release/${executorch_version}/executorch.aar -o app/libs/executorch.aar
```
And include it in gradle:
-```
-# app/build.gradle.kts
+
+```kotlin
+// app/build.gradle.kts
dependencies {
- implementation(files("libs/executorch.aar"))
- implementation("com.facebook.soloader:soloader:0.10.5")
- implementation("com.facebook.fbjni:fbjni:0.7.0")
+    implementation(files("libs/executorch.aar"))
+    implementation("com.facebook.soloader:soloader:0.10.5")
+    implementation("com.facebook.fbjni:fbjni:0.7.0")
}
```
@@ -108,52 +129,62 @@ Now you can compile your app with the ExecuTorch Android library.
## Building from Source
-`scripts/build_android_library.sh` is a helper script to build the Java library (into .jar), native library (into .so), and the packaged AAR file.
-
-You need Android [SDK](https://developer.android.com/studio) and [NDK](https://developer.android.com/ndk/downloads) to use it.
-
-Current NDK version used in ExecuTorch CI: r27b.
+`scripts/build_android_library.sh` is a helper script to build the Java library (into .jar), the native library (into .so), and the packaged AAR file.
-You need to set `ANDROID_HOME` to Android SDK home and `ANDROID_NDK` to the correct NDK root (containing NOTICE file).
+
+You need the Android [SDK](https://developer.android.com/studio) and [NDK](https://developer.android.com/ndk/downloads) to use it.
+
+Current NDK version used in ExecuTorch CI: r28c.
+
+You need to set `ANDROID_HOME` to the Android SDK home and `ANDROID_NDK` to the correct NDK root (containing the NOTICE file).
-```
+```sh
export ANDROID_HOME=/path/to/sdk
export ANDROID_NDK=/path/to/ndk
sh scripts/build_android_library.sh
```
-Currently, XNNPACK backend is always built with the script.
+NOTE: Currently, the XNNPACK backend is always built by the script.
### Optional environment variables
-Optionally, set these environment variables before running `build_android_library.sh`.
+Optionally, set these environment variables before running `build_android_library.sh`.
-#### ANDROID_ABIS
-Set environment variable `ANDROID_ABIS` to either `arm64-v8a` or `x86_64` if you only need to build the native library for one ABI only.
-```
+- __ANDROID_ABIS__
+
+Set the environment variable `ANDROID_ABIS` to either `arm64-v8a` or `x86_64` if you only need to build the native library for a single ABI.
+
+```sh
export ANDROID_ABIS=arm64-v8a
-# or
-# export ANDROID_ABIS=x86_64
+```
+
+Or:
+
+```sh
+export ANDROID_ABIS=x86_64
+```
+
+And then run the script.
+
+```sh
sh scripts/build_android_library.sh
```
-#### EXECUTORCH_CMAKE_BUILD_TYPE
-Set environment variable `EXECUTORCH_CMAKE_BUILD_TYPE` to `Release` or `Debug` based on your needs.
+- __EXECUTORCH_CMAKE_BUILD_TYPE__
+
+Set the environment variable `EXECUTORCH_CMAKE_BUILD_TYPE` to `Release` or `Debug` based on your needs.
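+
+For example, a minimal sketch of a release build based on the description above:
+
+```sh
+export EXECUTORCH_CMAKE_BUILD_TYPE=Release
+sh scripts/build_android_library.sh
+```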
-#### Using MediaTek backend
+- __Using MediaTek backend__
-To use [MediaTek backend](backends-mediatek.md),
-after installing and setting up the SDK, set `NEURON_BUFFER_ALLOCATOR_LIB` and `NEURON_USDK_ADAPTER_LIB` to the corresponding path.
+To use the [MediaTek backend](backends-mediatek.md), after installing and setting up the SDK, set `NEURON_BUFFER_ALLOCATOR_LIB` and `NEURON_USDK_ADAPTER_LIB` to the corresponding paths.
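+
+For example, a sketch assuming the SDK is already installed (both paths are placeholders for the libraries shipped with your MediaTek SDK):
+
+```sh
+export NEURON_BUFFER_ALLOCATOR_LIB=/path/to/neuron/buffer_allocator_library.so
+export NEURON_USDK_ADAPTER_LIB=/path/to/neuron/usdk_adapter_library.so
+# Then run: sh scripts/build_android_library.sh
+```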
-#### Using Qualcomm AI Engine Backend
+- __Using Qualcomm AI Engine Backend__
-To use [Qualcomm AI Engine Backend](backends-qualcomm.md#qualcomm-ai-engine-backend),
-after installing and setting up the SDK, set `QNN_SDK_ROOT` to the corresponding path.
+To use the [Qualcomm AI Engine Backend](backends-qualcomm.md#qualcomm-ai-engine-backend), after installing and setting up the SDK, set `QNN_SDK_ROOT` to the corresponding path.
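+
+For example, a sketch assuming the SDK is already installed (the path is a placeholder for your Qualcomm AI Engine Direct SDK root):
+
+```sh
+export QNN_SDK_ROOT=/path/to/qualcomm/ai-engine-direct-sdk
+# Then run: sh scripts/build_android_library.sh
+```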
-#### Using Vulkan Backend
+- __Using Vulkan Backend__
-To use [Vulkan Backend](backends-vulkan.md#vulkan-backend),
-set `EXECUTORCH_BUILD_VULKAN` to `ON`.
+To use the Vulkan backend, set `EXECUTORCH_BUILD_VULKAN` to `ON`.
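+
+For example, a minimal sketch based on the description above:
+
+```sh
+export EXECUTORCH_BUILD_VULKAN=ON
+sh scripts/build_android_library.sh
+```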
## Android Backends
@@ -161,11 +192,12 @@ The following backends are available for Android:
| Backend | Type | Doc |
| ------- | -------- | --- |
-| [XNNPACK](https://github.com/google/XNNPACK) | CPU | [Doc](backends-xnnpack.md) |
+| [XNNPACK](https://github.com/google/XNNPACK) | CPU | [Doc](backends/xnnpack/xnnpack-overview.md) |
| [MediaTek NeuroPilot](https://neuropilot.mediatek.com/) | NPU | [Doc](backends-mediatek.md) |
| [Qualcomm AI Engine](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) | NPU | [Doc](backends-qualcomm.md) |
-| [Vulkan](https://www.vulkan.org/) | GPU | [Doc](backends-vulkan.md) |
+| [Vulkan](https://www.vulkan.org/) | GPU | [Doc](backends/vulkan/vulkan-overview.md) |
+Start with XNNPACK (CPU backend) for maximum compatibility, then add hardware-specific backends for optimization.
## Runtime Integration
@@ -175,26 +207,27 @@ Here is an example code sample in Java that demonstrates how to integrate ExecuT
import org.pytorch.executorch.EValue;
import org.pytorch.executorch.Module;
import org.pytorch.executorch.Tensor;
-
public class MainActivity extends Activity {
- private Module module;
-
- @Override
- protected void onCreate(Bundle savedInstanceState) {
- super.onCreate(savedInstanceState);
- // Load the ExecuTorch module
- Module module = Module.load("/data/local/tmp/add.pte");
- Tensor tensor1 = Tensor.fromBlob(new float[] {1.0f}, new long[] {1});
- Tensor tensor2 = Tensor.fromBlob(new float[] {20.0f}, new long[] {1});
-
- EValue eValue1 = EValue.from(tensor1);
- EValue eValue2 = EValue.from(tensor2);
- float result = module.forward(eValue1, eValue2)[0].toTensor().getDataAsFloatArray()[0];
- }
+  private Module module;
+
+  @Override
+  protected void onCreate(Bundle savedInstanceState) {
+    super.onCreate(savedInstanceState);
+
+    // Load the ExecuTorch module
+    Module module = Module.load("/data/local/tmp/add.pte");
+
+    Tensor tensor1 = Tensor.fromBlob(new float[] {1.0f}, new long[] {1});
+    Tensor tensor2 = Tensor.fromBlob(new float[] {20.0f}, new long[] {1});
+
+    EValue eValue1 = EValue.from(tensor1);
+    EValue eValue2 = EValue.from(tensor2);
+
+    // Run the forward method and read the first element of the output tensor
+    float result = module.forward(eValue1, eValue2)[0].toTensor().getDataAsFloatArray()[0];
+  }
}
```
-Push the corresponding pte file to the phone:
+Push the corresponding `.pte` file to your Android device:
+
```sh
adb push extension/module/test/resources/add.pte /data/local/tmp/
```
diff --git a/docs/source/using-executorch-building-from-source.md b/docs/source/using-executorch-building-from-source.md
index 48901f62a76..aa71d8248c5 100644
--- a/docs/source/using-executorch-building-from-source.md
+++ b/docs/source/using-executorch-building-from-source.md
@@ -5,6 +5,7 @@ Even if you don't use CMake directly, CMake can emit scripts for other format
like Make, Ninja or Xcode. For information, see [cmake-generators(7)](https://cmake.org/cmake/help/latest/manual/cmake-generators.7.html).
## System Requirements
+
### Operating System
ExecuTorch is tested on the following systems, although it should also work in similar environments.
@@ -16,10 +17,11 @@ ExecuTorch is tested on the following systems, although it should also work in s
* macOS (x86_64/ARM64)
* Big Sur (11.0)+
* Windows (x86_64)
+ * Windows 10+ with Visual Studio 2022+ and [Clang-CL](https://learn.microsoft.com/en-us/cpp/build/clang-support-msbuild?view=msvc-170)
* Windows Subsystem for Linux (WSL) with any of the Linux options
- * Windows 10+ with Visual Studio 2022+ (experimental)
### Software Requirements
+
* `conda` or another virtual environment manager
- `conda` is recommended as it provides cross-language
support and integrates smoothly with `pip` (Python's built-in package manager)
@@ -27,16 +29,19 @@ ExecuTorch is tested on the following systems, although it should also work in s
* `g++` version 7 or higher, `clang++` version 5 or higher, or another
C++17-compatible toolchain.
* `python` version 3.10-3.12
-* `Xcode Command Line Tools` (macOS only)
* `ccache` (optional) - A compiler cache that speeds up recompilation
+* **macOS**
+ - `Xcode Command Line Tools`
+* **Windows**
+ - `Visual Studio Clang Tools` - See [Clang/LLVM support in Visual Studio](https://learn.microsoft.com/en-us/cpp/build/clang-support-msbuild?view=msvc-170).
-Additional dependencies will be installed automatically when running the [Python installation](#building-the-python-package).
+Additional dependencies will be automatically installed when running the [Python installation](#building-the-python-package).
Note that the cross-compilable core runtime code supports a wider range of
-toolchains, down to C++17. See the [Runtime Overview](runtime-overview.md) for
+toolchains, down to C++17. See [Runtime Overview](runtime-overview.md) for
portability details.
## Environment Setup
- Clone the ExecuTorch repository from GitHub and create a conda environment as follows. Venv can be used in place on conda.
+ Clone the ExecuTorch repository from GitHub and create a conda environment. Venv can be used in place of conda.
```bash
git clone -b release/1.0 https://github.com/pytorch/executorch.git
cd executorch
@@ -44,6 +49,13 @@ portability details.
conda activate executorch
```
+> **_NOTE:_** Additional Windows Setup
+>
+> ExecuTorch requires symlinks to be enabled to build the Python components. To enable symlinks, run the following command before cloning the repository. Missing symlinks will manifest as an error related to `version.py` when running `pip install .`. See [src/README.md](https://github.com/pytorch/executorch/blob/main/src/README.md) for more information.
+> ```bash
+> git config --system core.symlinks true
+> ```
+
## Building the Python package
@@ -60,7 +72,7 @@ portability details.
* `--clean`: Removes build artifacts.
* `--editable`: Install the ExecuTorch python package in editable mode (see [Editable Install](#editable-install)).
* `--minimal`: Install only the minimal set of dependencies required to run ExecuTorch. Do not install dependencies for examples.
- * `--use-pt-pinned-commit`: Install the pinned PyTorch commit. When not specified, the latest PyTorch nightly build is installed.
+ * `--use-pt-pinned-commit`: Install the pinned PyTorch commit or release version. When not specified, the latest PyTorch nightly build is installed.
For Intel-based macOS systems, use `--use-pt-pinned-commit --minimal`. As PyTorch does not provide pre-built binaries for Intel Mac, installation requires building PyTorch from source. Instructions can be found in [PyTorch Installation](https://github.com/pytorch/pytorch#installation).
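+
+For example, a minimal sketch combining the flags described above for an Intel Mac install:
+
+```bash
+./install_executorch.sh --use-pt-pinned-commit --minimal
+```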
@@ -71,6 +83,13 @@ portability details.
CMAKE_ARGS="-DEXECUTORCH_BUILD_MPS=ON" ./install_executorch.sh
```
+### Verify the Build
+
+To verify that the Python components are installed correctly, run the following command. This will create a file named `mv2_xnnpack_fp32.pte` in the current directory for the MobileNet V2 model with the XNNPACK backend. If it completes without error, the ExecuTorch Python components are installed successfully.
+```bash
+python -m executorch.examples.xnnpack.aot_compiler --model_name="mv2" --delegate
+```
+
### Editable Install
For development, include the `--editable` flag, which allows for local changes to ExecuTorch Python code to be reflected without a re-install. Note that when C++ files are modified, you will need to re-run the full installation to reflect the changes.
```bash
@@ -112,47 +131,39 @@ portability details.
## Building the C++ Runtime
-The ExecuTorch C++ runtime is built using CMake. It can be compiled standalone to run examples, added as a CMake dependency, or cross-compiled for Android, iOS, or embedded platforms.
+The ExecuTorch runtime uses CMake as the build system. When using ExecuTorch from C++ user code with CMake, adding ExecuTorch as a submodule and referencing via CMake `add_subdirectory` will build the runtime as part of the user build.
-### Configuring
+When user code is not using CMake, the runtime can be built standalone and linked. The CMake options described below apply in both cases. Scripts are also provided for [Android AAR](#cross-compiling-for-android) and [iOS framework](#cross-compiling-for-ios) builds.
-Configuration should be done after cloning, pulling the upstream repo, or changing build options. Once this is done, you won't need to do it again until you pull from the upstream repo or modify any CMake-related files.
+| Use Case | How to Build |
+| :------------------------- | :--------------------------------------------------------------------------------- |
+| C++ with user CMake | Use CMake `add_subdirectory`. |
+| C++ without user CMake | Build ExecuTorch standalone with CMake. Link libraries with the user build. |
+| Android with Java/Kotlin | Use [scripts/build_android_library.sh](#cross-compiling-for-android). |
+| Android with C++ | Follow C++ build steps, [cross-compile for Android](#cross-compiling-for-android). |
+| iOS | Use [scripts/build_ios_frameworks.sh](#cross-compiling-for-ios). |
-```bash
-# cd to the root of the executorch repo
-cd executorch
-
-# Clean and configure the CMake build system. It's good practice to do this
-# whenever cloning or pulling the upstream repo.
-./install_executorch.sh --clean
-(mkdir cmake-out && cd cmake-out && cmake ..)
-```
+### Configuring
-### Building
+Configuration should be done after cloning, pulling the upstream repo, or changing build options. Once this is done, you won't need to do it again until you pull from the upstream repo or modify any CMake-related files.
-Build all targets with `cmake --build`.
+When building as a submodule as part of a user CMake build, ExecuTorch CMake options can be specified either as part of the user CMake configuration or in user CMake code.
+CMake configuration for standalone runtime build:
```bash
-# cd to the root of the executorch repo
-cd executorch
-
-# Build using the configuration that you previously generated under the
-# `cmake-out` directory.
-#
-# NOTE: The `-j` argument specifies how many jobs/processes to use when
-# building, and tends to speed up the build significantly. It's typical to use
-# "core count + 1" as the `-j` value.
-cmake --build cmake-out -j9
+mkdir cmake-out
+cmake -B cmake-out --preset [preset] [options]
+cmake --build cmake-out -j10
```
-> **_TIP:_** For faster rebuilds, consider installing ccache (see [Compiler Cache section](#compiler-cache-ccache) above). On first builds, ccache populates its cache. Subsequent builds with the same compiler flags can be significantly faster.
-
-### Build Presets
+#### Build Presets
-ExecuTorch provides fine-grained control over what is built, as described in [Build Options](#build-options). These options are grouped into CMake presets to cover common scenarios, while providing the ability to override individual options. Presets can be specified when configuring CMake by specifying `--preset [name]` when configuring.
+ExecuTorch provides fine-grained control over what is built, as described in [Build Options](#build-options). These options are grouped into CMake presets to cover common scenarios while preserving the ability to override individual options. Presets can be specified when configuring CMake by specifying `--preset [name]` when configuring.
Preset values for common scenarios are listed below. Using a platform preset is recommended to avoid needing to specify many fine-grained build options.
+ * `android-arm64-v8a` - Build features and backends common for arm64-v8a Android targets.
+ * `android-x86_64` - Build features and backends common for x86_64 Android targets.
* `arm-baremetal` - Build for bare-metal ARM targets.
* `ios` - Build features and backends common for iOS targets.
* `macos` - Build features and backends common for Mac targets.
@@ -161,77 +172,34 @@ Preset values for common scenarios are listed below. Using a platform preset is
* `profiling` - Build the ExecuTorch runtime with profiling enabled.
* `zephyr` - Build for Zephyr RTOS.
+User CMake:
+```cmake
+set(EXECUTORCH_BUILD_PRESET_FILE ${CMAKE_SOURCE_DIR}/executorch/tools/cmake/preset/llm.cmake)
+```
+
+Standalone build:
```bash
# Configure the build with the ios preset.
cmake .. --preset ios
```
-### CMake Targets and Libraries
-
-To link against the ExecuTorch framework from CMake, the following top-level targets are exposed:
-
- * `executorch::backends`: Contains all configured backends.
- * `executorch::extensions`: Contains all configured extensions.
- * `executorch::kernels`: Contains all configured kernel libraries.
-
-The backends, extensions, and kernels included in these targets are controlled by the various `EXECUTORCH_` CMake options specified by the build. Using these targets will automatically pull in the required dependencies to use the configured features.
-
-### Running an Example Model
+#### Build Options
-The example `executor_runner` binary can be used to run a model and sanity-check the build. Run the following commands to generate and run a simple model.
-You should see the message "Model executed successfully" followed by the output values.
+CMake options can be used for fine-grained control of the build type, to control which features are built, and to configure functionality such as logging. Options are typically specified during CMake configuration. Default values of each option are set by the active preset, but can be overridden by specifying the option when configuring.
-``` bash
-python -m examples.portable.scripts.export --model_name="add"
-./cmake-out/executor_runner --model_path add.pte
-```
+Note that many build options require other options to be enabled. This may require enabling multiple options to enable a given feature. The CMake build output will provide an error message when a required option is not enabled.
+User CMake:
+```cmake
+set(EXECUTORCH_BUILD_XNNPACK ON)
```
-I 00:00:00.000526 executorch:executor_runner.cpp:82] Model file add.pte is loaded.
-I 00:00:00.000595 executorch:executor_runner.cpp:91] Using method forward
-I 00:00:00.000612 executorch:executor_runner.cpp:138] Setting up planned buffer 0, size 48.
-I 00:00:00.000669 executorch:executor_runner.cpp:161] Method loaded.
-I 00:00:00.000685 executorch:executor_runner.cpp:171] Inputs prepared.
-I 00:00:00.000764 executorch:executor_runner.cpp:180] Model executed successfully.
-I 00:00:00.000770 executorch:executor_runner.cpp:184] 1 outputs:
-Output 0: tensor(sizes=[1], [2.])
-```
-
-### Compiler Cache (ccache)
-
-ExecuTorch automatically detects and enables [ccache](https://ccache.dev/) if it's installed. This significantly speeds up recompilation by caching previously compiled objects:
-
-- If ccache is detected, you'll see: `ccache found and enabled for faster builds`
-- If ccache is not installed, you'll see: `ccache not found, builds will not be cached`
-
-To install ccache:
+Standalone build:
```bash
-# Ubuntu/Debian
-sudo apt install ccache
-
-# macOS
-brew install ccache
-
-# CentOS/RHEL
-sudo yum install ccache
-# or
-sudo dnf install ccache
+cmake .. -DEXECUTORCH_BUILD_XNNPACK=ON
```
-No additional configuration is needed - the build system will automatically use ccache when available.
-
-See [CMakeLists.txt](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt)
-
-
-
-## Build Options
-
-CMake options can be used to for fine-grained control of build type, control which features are built, and configure functionality, such as logging. Options are typically specified during CMake configuration. Default values of each option are set by the active preset, but can be overridden by specifying the option when configuring.
-
-Note that many build options require other options to be enabled. This may require enabling multiple options to enable a given feature. The CMake build output will provide an error message when a required option is not enabled.
-
-#### Build Type
+##### Build Type
The CMake build is typically set to `Debug` or `Release`. For production use or profiling, release mode should be used to improve performance and reduce binary size. It disables program verification and executorch logging and adds optimizations flags. The `EXECUTORCH_OPTIMIZE_SIZE` flag can be used to further optimize for size with a small performance tradeoff.
@@ -240,7 +208,7 @@ The CMake build is typically set to `Debug` or `Release`. For production use or
cmake .. -DCMAKE_BUILD_TYPE=Release
```
-#### Backends
+##### Backends
Typically, each hardware backend exposes a CMake option to control whether the backend is built. See backend-specific documentation for more details.
@@ -260,7 +228,7 @@ Typically, each hardware backend exposes a CMake option to control whether the b
cmake .. -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_VULKAN=ON
```
-#### Extensions
+##### Extensions
ExecuTorch extensions provide optional functionality outside of the core runtime. As the core runtime is designed to run in constrained environments, these features are typically disabled by default. Extensions include higher-level APIs (Module and Tensor), multi-threading support (Threadpool), training, and more.
@@ -281,7 +249,7 @@ ExecuTorch extensions provide optional functionality outside of the core runtime
cmake .. -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON
```
-#### Logging
+##### Logging
Logging is enabled by default in debug builds and disabled in release. When enabled, the default log level is Info. Both log enable and level can be overriden with options. See [Logging](using-executorch-runtime-integration.md#logging). Disabling logging and decreasing log verbosity will reduce binary size by stripping unused strings from the build.
@@ -293,7 +261,39 @@ Logging is enabled by default in debug builds and disabled in release. When enab
cmake .. -DEXECUTORCH_ENABLE_LOGGING=ON -DEXECUTORCH_LOG_LEVEL=debug
```
-#### Output Libraries
+### Building
+
+Build all targets with `cmake --build`.
+
+```bash
+# cd to the root of the executorch repo
+cd executorch
+
+# Build using the configuration that you previously generated under the
+# `cmake-out` directory.
+#
+# NOTE: The `-j` argument specifies how many jobs/processes to use when
+# building, and tends to speed up the build significantly. It's typical to use
+# "core count + 1" as the `-j` value.
+cmake --build cmake-out -j9
+```
+
+> **_TIP:_** For faster rebuilds, consider installing ccache (see [Compiler Cache section](#compiler-cache-ccache) above). On first builds, ccache populates its cache. Subsequent builds with the same compiler flags can be significantly faster.
+
+
+
+
+## CMake Targets and Output Libraries
+
+To link against the ExecuTorch framework from CMake, the following top-level targets are exposed:
+
+ * `executorch::backends`: Contains all configured backends.
+ * `executorch::extensions`: Contains all configured extensions.
+ * `executorch::kernels`: Contains all configured kernel libraries.
+
+The backends, extensions, and kernels included in these targets are controlled by the various `EXECUTORCH_` CMake options specified by the build. Using these targets will automatically pull in the required dependencies to use the configured features.
+
+### Linking Without CMake
To link against the runtime from outside of the CMake ecosystem, the runtime can be first built with CMake and then linked directly. A few of the relevant top-level targets are described below. Note that this is a more involved process than using CMake and is only recommended when using CMake is not viable.
@@ -312,6 +312,26 @@ To link against the runtime from outside of the CMake ecosystem, the runtime can
Backends typically introduce additional targets. See backend-specific documentation for more details.
+### Verify the Build
+
+To verify the build, ExecuTorch optionally compiles a simple, stand-alone model runner that runs PTE files with all-ones input tensors. It is not enabled by default in most presets, but can be enabled by configuring with `-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON`.
+
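+A minimal configuration sketch that enables the runner (the `macos` preset here is only an example; pick the preset for your platform):
+
+```bash
+cmake -B cmake-out --preset macos \
+  -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
+  -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON
+cmake --build cmake-out -j10
+```
+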
+Once compiled, invoke the runner with a sample PTE (such as the one generated by [verifying the Python build](#verify-the-build)).
+```bash
+cmake-out/executor_runner --model_path=mv2_xnnpack_fp32.pte
+```
+
+If the runner runs successfully, you should see output similar to the following:
+```
+I 00:00:00.043703 executorch:executor_runner.cpp:379] Model executed successfully 1 time(s) in 15.013292 ms.
+I 00:00:00.043720 executorch:executor_runner.cpp:383] 1 outputs:
+Output 0: tensor(sizes=[1, 1000], [
+ -0.509859, 0.300644, 0.0953884, 0.147724, 0.231202, 0.338554, 0.206888, -0.0575762, -0.389273, -0.0606864,
+ ...,
+ 0.421219, 0.100447, -0.506771, -0.115824, -0.693017, -0.183262, 0.154781, -0.410684, 0.0119296, 0.449713,
+])
+```
+
## Cross-Compiling for Android
@@ -325,8 +345,7 @@ Backends typically introduce additional targets. See backend-specific documentat
### Building the AAR
-With the NDK installed, the `build_android_library.sh` script will build the ExecuTorch Java AAR. This file contains the ExecuTorch Java bindings
-and native code. See [Using the AAR File](using-executorch-android.md#using-aar-file) for usage.
+With the NDK installed, the `build_android_library.sh` script will build the ExecuTorch Java AAR, which contains ExecuTorch Java bindings. See [Using the AAR File](using-executorch-android.md#using-aar-file) for usage.
```bash
export ANDROID_ABIS=arm64-v8a
@@ -335,36 +354,21 @@ mkdir -p $BUILD_AAR_DIR
sh scripts/build_android_library.sh
```
-### Building the Example Runner
+### Android Native
-The native executor runner can be cross-compiled for android and deployed via ADB. This step is intended as
-an example of CMake cross compilation and is not necessary for integration into an app.
+To use the ExecuTorch runtime from native Android C++ code, the runtime can be cross-compiled for Android. The recommended approach is to add ExecuTorch as a submodule of the user project and use [CMake](https://developer.android.com/ndk/guides/cmake) for the native build. The above steps for C++ with CMake can be followed.
+For direct cross-compilation, the ExecuTorch runtime can be configured to build with the NDK toolchain:
```bash
-# Run the following lines from the `executorch/` folder
-./install_executorch.sh --clean
-mkdir cmake-android-out && cd cmake-android-out
-
# point -DCMAKE_TOOLCHAIN_FILE to the location where ndk is installed
-cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a ..
-
-cd ..
-cmake --build cmake-android-out -j9
-
-adb shell mkdir -p /data/local/tmp/executorch
-# push the binary to an Android device
-adb push cmake-android-out/executor_runner /data/local/tmp/executorch
-# push the model file
-adb push add.pte /data/local/tmp/executorch
-
-adb shell "/data/local/tmp/executorch/executor_runner --model_path /data/local/tmp/executorch/add.pte"
+cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a ..
```
## Cross-Compiling for iOS
-For iOS, we'll build [frameworks](https://developer.apple.com/documentation/xcode/creating-a-multi-platform-binary-framework-bundle) instead of static libraries. The frameworks contain the compiled ExecuTorch runtime and public headers.
+iOS binaries are built as [frameworks](https://developer.apple.com/documentation/xcode/creating-a-multi-platform-binary-framework-bundle) instead of static libraries. The frameworks contain the compiled ExecuTorch runtime and public headers.
### Pre-requisites
@@ -385,119 +389,36 @@ xcode-select --install
```
Run the above command with `--help` flag to learn more on how to build additional backends
-(like [Core ML](backends-coreml.md), [MPS](backends-mps.md) or XNNPACK), etc.
+(like [Core ML](backends/coreml/coreml-overview.md), [MPS](backends/mps/mps-overview.md) or XNNPACK), etc.
Note that some backends may require additional dependencies and certain versions of Xcode and iOS.
See backend-specific documentation for more details.
2. Copy over the generated `.xcframework` bundles to your Xcode project, link them against
your targets and don't forget to add an extra linker flag `-all_load`.
-Check out the [iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/mv3/apple/ExecuTorchDemo) tutorial for more info.
-
-
-
-## Building on Windows
-
-ExecuTorch provides experimental support for native Windows builds.
-
-> **_NOTE:_** All commands should be executed on Windows powershell in administrator mode.
-
-### Environment Setup
-
-#### Pre-requisites
+See the [iOS Demo App](https://github.com/meta-pytorch/executorch-examples/tree/main/mv3/apple/ExecuTorchDemo) tutorial for example usage of the ExecuTorch frameworks.
-1. Install miniconda for Windows from the [official website](https://docs.conda.io/en/latest/miniconda.html).
-2. Install Git for Windows from the [official website](https://git-scm.com/download/win).
-3. Install ClangCL for Windows from the [official website](https://learn.microsoft.com/en-us/cpp/build/clang-support-msbuild?view=msvc-170) or through a [Visual Studio](https://learn.microsoft.com/en-us/cpp/build/clang-support-msbuild?view=msvc-170) or [Visual Studio Code](https://code.visualstudio.com/docs/cpp/config-clang-mac) installation.
+## Compiler Cache (ccache)
-#### Clone and Configure Environment
-
-```bash
-git config --global core.symlinks true
-git clone --recurse -submodules https://github.com/pytorch/executorch.git
-cd executorch
-conda create -yn et python=3.12
-conda activate et
-```
-
-If Conda is not available, run conda-hook.ps1, where `$miniconda_dir` is the directory where miniconda is installed.
-This is `“C:\Users\\AppData\Local”` by default.
-
-```bash
-$miniconda_dir\\shell\\condabin\\conda-hook.ps1
-```
-
-### Build the Python Package
-
-Run `install_executorch.bat` to build and install the ExecuTorch Python package and runtime bindings.
-
-```bash
-cd executorch
-./install_executorch.bat
-```
-
-> **_NOTE_** Many components are not currently buildable on Windows. These instructions install a very minimal ExecuTorch which can be used as a sanity check.
+ExecuTorch automatically detects and enables [ccache](https://ccache.dev/) if it's installed. This significantly speeds up recompilation by caching previously compiled objects:
-### Build the C++ Runtime
+- If ccache is detected, you'll see: `ccache found and enabled for faster builds`
+- If ccache is not installed, you'll see: `ccache not found, builds will not be cached`
+To install ccache:
```bash
-del -Recurse -Force cmake-out; `
-cmake . `
- -DCMAKE_INSTALL_PREFIX=cmake-out `
- -DPYTHON_EXECUTABLE=$miniconda_dir\\envs\\et\\python.exe `
- -DCMAKE_PREFIX_PATH=$miniconda_dir\\envs\\et\\Lib\\site-packages `
- -DCMAKE_BUILD_TYPE=Release `
- -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON `
- -DEXECUTORCH_BUILD_FLATC=ON `
- -DEXECUTORCH_BUILD_PYBIND=OFF `
- -DEXECUTORCH_BUILD_XNNPACK=ON `
- -DEXECUTORCH_BUILD_KERNELS_LLM=ON `
- -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON `
- -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON `
- -DEXECUTORCH_ENABLE_LOGGING=ON `
- -T ClangCL `
- -Bcmake-out; `
-cmake --build cmake-out -j64 --target install --config Release
-```
-
-> **_NOTE_** `$miniconda_dir` is the directory where you installed miniconda. This is `“C:\Users\\AppData\Local”` by default.
-
-### Running an Example Model
-
-To validate the installation by running a model, create a file named export_mv2.py. Then, run the powershell commands to export and run the model.
-The expected output is a tensor of size 1x1000, containing class scores.
-
-```py
-# export_mv2.py
-import torch
-from executorch.exir import to_edge_transform_and_lower
-from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
-from torchvision.models import mobilenet_v2
-from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
-
-mv2 = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
-example_inputs = (torch.randn((1, 3, 224, 224)),)
-
-program = to_edge_transform_and_lower(
- torch.export.export(model, example_inputs)
-).to_executorch()
-
-with open("mv2_xnnpack.pte", "wb") as file:
- executorch_program.write_to_file(file)
-```
+# Ubuntu/Debian
+sudo apt install ccache
-```bash
-python .\\export_mv2.py
-.\\cmake-out\\backends\\xnnpack\\Release\\xnn_executor_runner.exe --model_path=.\\mv2_xnnpack.pte
-```
+# macOS
+brew install ccache
-```bash
-Output 0: tensor(sizes=[1, 1000], [
- -0.50986, 0.30064, 0.0953904, 0.147726, 0.231205, 0.338555, 0.206892, -0.0575775, … ])
+# CentOS/RHEL
+sudo yum install ccache
+# or
+sudo dnf install ccache
```
-## Next Steps
+No additional configuration is needed; the build system will automatically use ccache when available.
-* [Selective Build](kernel-library-selective-build.md) to link only kernels used by the program. This can provide significant binary size savings.
-* Tutorials on building [Android](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app) and [iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/mv3/apple/ExecuTorchDemo) demo apps.
-* Tutorials on deploying applications to embedded devices such as [ARM Cortex-M/Ethos-U](backends-arm-ethos-u.md) and [XTensa HiFi DSP](backends-cadence.md).
+See [CMakeLists.txt](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt) for details.
diff --git a/docs/source/using-executorch-export.md b/docs/source/using-executorch-export.md
index 7abf5cbd30a..ae73cb5aeac 100644
--- a/docs/source/using-executorch-export.md
+++ b/docs/source/using-executorch-export.md
@@ -32,10 +32,10 @@ As part of the .pte file creation process, ExecuTorch identifies portions of the
Commonly used hardware backends are listed below. For mobile, consider using XNNPACK for Android and XNNPACK or Core ML for iOS. To create a .pte file for a specific backend, pass the appropriate partitioner class to `to_edge_transform_and_lower`. See the appropriate backend documentation and the [Export and Lowering](#export-and-lowering) section below for more information.
-- [XNNPACK (CPU)](backends-xnnpack.md)
-- [Core ML (iOS)](backends-coreml.md)
-- [Metal Performance Shaders (iOS GPU)](backends-mps.md)
-- [Vulkan (Android GPU)](backends-vulkan.md)
+- [XNNPACK (CPU)](backends/xnnpack/xnnpack-overview.md)
+- [Core ML (iOS)](backends/coreml/coreml-overview.md)
+- [Metal Performance Shaders (iOS GPU)](backends/mps/mps-overview.md)
+- [Vulkan (Android GPU)](backends/vulkan/vulkan-overview.md)
- [Qualcomm NPU](backends-qualcomm.md)
- [MediaTek NPU](backends-mediatek.md)
- [Arm Ethos-U NPU](backends-arm-ethos-u.md)
diff --git a/docs/source/using-executorch-ios.md b/docs/source/using-executorch-ios.md
index 15ccef8d8a1..f5d520f9874 100644
--- a/docs/source/using-executorch-ios.md
+++ b/docs/source/using-executorch-ios.md
@@ -18,7 +18,9 @@ The ExecuTorch Runtime for iOS and macOS (ARM64) is distributed as a collection
Link your binary with the ExecuTorch runtime and any backends or kernels used by the exported ML model. It is recommended to link the core runtime to the components that use ExecuTorch directly, and link kernels and backends against the main app target.
-**Note:** To access logs, link against the Debug build of the ExecuTorch runtime, i.e., the `executorch_debug` framework. For optimal performance, always link against the Release version of the deliverables (those without the `_debug` suffix), which have all logging overhead removed.
+**Note:** You may need to add extra linker flags to the build settings of the components that link against ExecuTorch backends or kernels so that they register properly at app startup. See the [Linkage](#Linkage) section for more details.
+
+**Note:** To access logs, link against the Debug build of the ExecuTorch runtime, i.e., the `executorch_debug` framework. For optimal performance, always link against the Release version of the deliverables (those without the `_debug` suffix), which have all logging overhead removed. See the [Logging](#Logging) section for more details.
### Swift Package Manager
@@ -26,7 +28,7 @@ The prebuilt ExecuTorch runtime, backend, and kernels are available as a [Swift
#### Xcode
-In Xcode, go to `File > Add Package Dependencies`. Paste the URL of the [ExecuTorch repo](https://github.com/pytorch/executorch) into the search bar and select it. Make sure to change the branch name to the desired ExecuTorch version in format "swiftpm-", (e.g. "swiftpm-0.7.0"), or a branch name in format "swiftpm-." (e.g. "swiftpm-0.8.0-20250801") for a [nightly build](https://ossci-ios.s3.amazonaws.com/list.html) on a specific date.
+In Xcode, go to `File > Add Package Dependencies`. Paste the URL of the [ExecuTorch repo](https://github.com/pytorch/executorch) into the search bar and select it. Make sure to change the branch name to the desired ExecuTorch version, e.g. "swiftpm-1.0.0", or to a dated branch, e.g. "swiftpm-1.1.0-20251101", for a [nightly build](https://ossci-ios.s3.amazonaws.com/list.html) on a specific date.

@@ -59,7 +61,7 @@ let package = Package(
],
dependencies: [
// Use "swiftpm-." branch name for a nightly build.
- .package(url: "https://github.com/pytorch/executorch.git", branch: "swiftpm-0.7.0")
+ .package(url: "https://github.com/pytorch/executorch.git", branch: "swiftpm-1.0.0")
],
targets: [
.target(
@@ -70,6 +72,10 @@ let package = Package(
.product(name: "kernels_optimized", package: "executorch"),
// Add other backends and kernels as needed.
]),
+ linkerSettings: [
+ // Force load all symbols from static libraries to trigger backends and kernels registration
+ .unsafeFlags(["-Wl,-all_load"])
+ ]
]
)
```
@@ -107,7 +113,7 @@ git clone -b release/1.0 https://github.com/pytorch/executorch.git --depth 1 --r
python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip
```
-4. Install the required dependencies, including those needed for the backends like [Core ML](backends-coreml.md) or [MPS](backends-mps.md), if you plan to build them later:
+4. Install the required dependencies, including those needed for the backends like [Core ML](backends/coreml/coreml-overview.md) or [MPS](backends/mps/mps-overview.md), if you plan to build them later:
```bash
./install_requirements.sh
diff --git a/examples/models/voxtral/README.md b/examples/models/voxtral/README.md
index 8cac4264bba..f793e8251ef 100644
--- a/examples/models/voxtral/README.md
+++ b/examples/models/voxtral/README.md
@@ -36,6 +36,64 @@ optimum-cli export executorch \
This exports Voxtral with XNNPack backend acceleration and 4-bit weight/8-bit activation linear quantization.
+## CUDA Support
+If your environment has CUDA support, you can enable the runner to run on CUDA for improved performance. Follow the export and runtime commands below:
+
+### Exporting with CUDA
+```
+optimum-cli export executorch \
+ --model "mistralai/Voxtral-Mini-3B-2507" \
+ --task "multimodal-text-to-text" \
+ --recipe "cuda" \
+ --dtype bfloat16 \
+ --device cuda \
+ --max_seq_len 1024 \
+ --output_dir="voxtral"
+```
+
+This will generate:
+- `model.pte` - The exported model
+- `aoti_cuda_blob.ptd` - The CUDA kernel blob required for runtime
+
+Furthermore, we support several quantization formats on CUDA.
+For example, to export Voxtral with int4 weights and int4mm for linear layers, you can use the following command:
+```
+optimum-cli export executorch \
+ --model "mistralai/Voxtral-Mini-3B-2507" \
+ --task "multimodal-text-to-text" \
+ --recipe "cuda" \
+ --dtype bfloat16 \
+ --device cuda \
+ --max_seq_len 1024 \
+ --qlinear 4w \
+ --qlinear_encoder 4w \
+ --qlinear_packing_format tile_packed_to_4d \
+ --qlinear_encoder_packing_format tile_packed_to_4d \
+ --output_dir="voxtral"
+```
+
+See the "Building the multimodal runner" section below for instructions on building with CUDA support, and the "Running the model" section for runtime instructions.
+
+## Metal Support
+On Apple Silicon, you can enable the runner to run on Metal. Follow the export and runtime commands below:
+
+### Exporting with Metal
+```
+optimum-cli export executorch \
+ --model "mistralai/Voxtral-Mini-3B-2507" \
+ --task "multimodal-text-to-text" \
+ --recipe "metal" \
+ --dtype bfloat16 \
+ --max_seq_len 1024 \
+ --output_dir="voxtral"
+```
+
+This will generate:
+- `model.pte` - The exported model
+- `aoti_metal_blob.ptd` - The Metal kernel blob required for runtime
+
+See the "Building the multimodal runner" section below for instructions on building with Metal support, and the "Running the model" section for runtime instructions.
+
# Running the model
To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
The Voxtral runner will do the following things:
@@ -52,7 +110,12 @@ We provide a simple way to transform raw audio data into a mel spectrogram by ex
```
# Export a preprocessor that can handle audio up to 5 mins (300s).
-python -m executorch.extension.audio.mel_spectrogram --feature_size 128 --stack_output --max_audio_len 300 --output_file voxtral_preprocessor.pte
+
+python -m executorch.extension.audio.mel_spectrogram \
+ --feature_size 128 \
+ --stack_output \
+ --max_audio_len 300 \
+ --output_file voxtral_preprocessor.pte
```
## Building the multimodal runner
@@ -64,6 +127,46 @@ cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out -
cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -Bcmake-out/examples/models/voxtral examples/models/voxtral && cmake --build cmake-out/examples/models/voxtral -j16 --config Release
```
+### Building for CUDA
+```
+# Install ExecuTorch with CUDA support
+CMAKE_ARGS="-DEXECUTORCH_BUILD_CUDA=ON" ./install_executorch.sh
+
+# Build the multimodal runner with CUDA
+cmake --preset llm \
+ -DEXECUTORCH_BUILD_CUDA=ON \
+ -DCMAKE_INSTALL_PREFIX=cmake-out \
+ -DCMAKE_BUILD_TYPE=Release \
+ -Bcmake-out -S.
+cmake --build cmake-out -j16 --target install --config Release
+
+cmake -DEXECUTORCH_BUILD_CUDA=ON \
+ -DCMAKE_BUILD_TYPE=Release \
+ -Sexamples/models/voxtral \
+ -Bcmake-out/examples/models/voxtral/
+cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
+```
+
+### Building for Metal
+```
+# Install ExecuTorch with Metal support
+CMAKE_ARGS="-DEXECUTORCH_BUILD_METAL=ON" ./install_executorch.sh
+
+# Build the multimodal runner with Metal
+cmake --preset llm \
+ -DEXECUTORCH_BUILD_METAL=ON \
+ -DCMAKE_INSTALL_PREFIX=cmake-out \
+ -DCMAKE_BUILD_TYPE=Release \
+ -Bcmake-out -S.
+cmake --build cmake-out -j16 --target install --config Release
+
+cmake -DEXECUTORCH_BUILD_METAL=ON \
+ -DCMAKE_BUILD_TYPE=Release \
+ -Sexamples/models/voxtral \
+ -Bcmake-out/examples/models/voxtral/
+cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
+```
+
## Running the model
You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).
```
@@ -71,11 +174,26 @@ You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](h
--model_path path/to/model.pte \
--tokenizer_path path/to/tekken.json \
--prompt "What can you tell me about this audio?" \
- --audio_path path/to/audio_input.bin \
- --processor_path path/to/voxtral_preprocessor.pte # If you're passing raw audio file in audio_path
+ --audio_path path/to/audio_input.wav \
+ --processor_path path/to/voxtral_preprocessor.pte
```
-Example output:
+### Running with preprocessed audio (.bin file)
+If you already have a preprocessed mel spectrogram saved as a `.bin` file, you can skip the preprocessor:
+```
+./cmake-out/examples/models/voxtral/voxtral_runner \
+ --model_path path/to/model.pte \
+ --tokenizer_path path/to/tekken.json \
+ --prompt "What can you tell me about this audio?" \
+ --audio_path path/to/preprocessed_audio.bin
+```
+
+### Running on CUDA or Metal
+Add the `--data_path` argument to provide the appropriate data blob to the commands above:
+- For CUDA: `--data_path path/to/aoti_cuda_blob.ptd`
+- For Metal: `--data_path path/to/aoti_metal_blob.ptd`
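+
+For example, a CUDA invocation might look like the following sketch (all paths are placeholders; the Metal case is identical apart from the blob file):
+
+```
+./cmake-out/examples/models/voxtral/voxtral_runner \
+  --model_path path/to/model.pte \
+  --tokenizer_path path/to/tekken.json \
+  --prompt "What can you tell me about this audio?" \
+  --audio_path path/to/audio_input.wav \
+  --processor_path path/to/voxtral_preprocessor.pte \
+  --data_path path/to/aoti_cuda_blob.ptd
+```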
+
+# Example output
```
The speaker in this audio seems to be talking about their concerns about a device called the model or maybe they're just talking about the model in general. They mention that the model was trained with the speaker for inference, which suggests that
the model was trained based on the speaker's data or instructions. They also mention that the volume is quite small, which could imply that the speaker is trying to control the volume of the model's output, likely because they are concerned about how loud the model's responses might
@@ -89,6 +207,7 @@ I 00:00:24.036822 executorch:stats.h:147] Time to first generated token:
I 00:00:24.036828 executorch:stats.h:153] Sampling time over 487 tokens: 0.099000 (seconds)
```
+# Generating audio input
You can easily produce an `.bin` for the audio input in Python like this:
```
# t = some torch.Tensor
@@ -101,3 +220,13 @@ You can also produce raw audio file as follows (for Option A):
```
ffmpeg -i audio.mp3 -f f32le -acodec pcm_f32le -ar 16000 audio_input.bin
```
+
+### Generating a .wav file on Mac
+On macOS, you can use the built-in `say` command to generate speech audio and convert it to a `.wav` file:
+```
+# Generate audio using text-to-speech
+say -o call_samantha_hall.aiff "Call Samantha Hall"
+
+# Convert to .wav format
+afconvert -f WAVE -d LEI16 call_samantha_hall.aiff call_samantha_hall.wav
+```
diff --git a/examples/nxp/README.md b/examples/nxp/README.md
index 8a6ba39c091..336a0e9189b 100644
--- a/examples/nxp/README.md
+++ b/examples/nxp/README.md
@@ -4,11 +4,11 @@ format and delegate the model computation to eIQ Neutron NPU using the eIQ Neutr
## Layout
* `experimental/` - contains CifarNet model example.
-* `models` - demo models instantiation used in examples
+* `models` - demo model instantiation used in the examples.
* `aot_neutron_compile.py` - script with end-to-end ExecuTorch AoT Neutron Backend workflow.
* `README.md` - this file.
-* `run_aot_example.sh` - utility script to launch _aot_neutron_compile.py_. Primarily for CI purpose.
-* `setup.sh` - setup script to install NeutronBackend dependencies.
+* `run_aot_example.sh` - utility script to launch _aot_neutron_compile.py_. Primarily for CI purposes.
+* `setup.sh` - setup script to install Neutron Backend dependencies.
## Setup
Please finish tutorial [Setting up ExecuTorch](https://pytorch.org/executorch/main/getting-started-setup).
@@ -23,24 +23,24 @@ $ ./examples/nxp/setup.sh
* MobileNetV2
## PyTorch Model Delegation to Neutron Backend
-First we will start with an example script converting the model. This example show the CifarNet model preparation.
-It is the same model which is part of the `example_cifarnet` in
+First we will start with an example script converting the model. This example shows the CifarNet model preparation.
+It is the same model that is part of the `example_cifarnet` in
[MCUXpresso SDK](https://www.nxp.com/design/design-center/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-software-development-kit-sdk:MCUXpresso-SDK).
-The NXP MCUXpresso software and tools offer comprehensive development solutions designed to help accelerate embedded
-system development of applications based on MCUs from NXP. The MCUXpresso SDK includes a flexible set of peripheral
+The NXP MCUXpresso software and tools offer comprehensive development solutions designed to help accelerate embedded
+system development of applications based on MCUs from NXP. The MCUXpresso SDK includes a flexible set of peripheral
drivers designed to speed up and simplify development of embedded applications.
The steps are expected to be executed from the `executorch` root folder.
-1. Run the `aot_neutron_compile.py` example with the `cifar10` model
+1. Run the `aot_neutron_compile.py` example with the `cifar10` model
```commandline
$ python -m examples.nxp.aot_neutron_compile --quantize \
- --delegate --neutron_converter_flavor SDK_25_06 -m cifar10
+ --delegate --neutron_converter_flavor SDK_25_09 -m cifar10
```
-2. It will generate you `cifar10_nxp_delegate.pte` file which can be used with the MCUXpresso SDK `cifarnet_example`
+2. It will generate a `cifar10_nxp_delegate.pte` file, which can be used with the MCUXpresso SDK `cifarnet_example`
project, presented [here](https://mcuxpresso.nxp.com/mcuxsdk/latest/html/middleware/eiq/executorch/docs/nxp/topics/example_applications.html#how-to-build-and-run-executorch-cifarnet-example).
This project will guide you through the process of deploying your PTE model to the device.
To get the MCUXpresso SDK follow this [guide](https://mcuxpresso.nxp.com/mcuxsdk/latest/html/middleware/eiq/executorch/docs/nxp/topics/getting_mcuxpresso.html),
-use the MCUXpresso SDK v25.06.00.
+use the MCUXpresso SDK v25.09.00.
diff --git a/examples/raspberry_pi/pico2/README.md b/examples/raspberry_pi/pico2/README.md
index 976754d6c5e..e9da5a7fd1d 100644
--- a/examples/raspberry_pi/pico2/README.md
+++ b/examples/raspberry_pi/pico2/README.md
@@ -4,44 +4,48 @@ This document outlines the steps required to run a simple MNIST digit recognitio
## Demo Model: Hand-crafted MNIST Classifier
-The included `export_mlp_mnist.py` creates a demonstration model with hand-crafted weights (not production-trained). This tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors.
+The included `export_mlp_mnist.py` (in `examples/raspberry_pi/pico2`) creates a demonstration model with hand-crafted weights (not production-trained). This tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors.
Note: This is a proof-of-concept. For production use, train your model on real MNIST data.
-## Bring Your Own Model
+## Bring Your Own Model and Deploy
This demo demonstrates ExecuTorch's ability to bring your own PyTorch model and deploy it to Pico2 with one simple script. The complete pipeline works from any PyTorch model to a runnable binary:
-### Train your PyTorch model
+- Use the existing demo model (`examples/raspberry_pi/pico2/export_mlp_mnist.py`) or bring your own model
+- Build firmware with one command and pass the model file (.pte) as an argument
+- Deploy directly to Pico2
-Export using `torch.export()` and `to_edge()`
-Build firmware with one command
-Deploy directly to Pico2
+### Important Caveats
-#### Important Caveats:
-
-- Memory constraints - Models must fit in 520KB SRAM
+- Memory constraints - Models must fit in 520KB SRAM (Pico2)
- Missing operators - Some ops may not be supported
-- Selective builds - Include only operators your model uses
+- Selective builds - Include only operators your model uses if you want to reduce binary size
## Memory Constraints & Optimization
-- Critical: Pico2 has limited memory:
-- 520KB SRAM (on-chip static RAM)
-- 4MB QSPI Flash (onboard storage)
+- Critical: Pico2 has limited memory
+ - 520KB SRAM (on-chip static RAM)
+ - 4MB QSPI Flash (onboard storage)
### Always apply optimization techniques on large models that do not fit in Pico2 memory:
Large models will not fit. Keep your `.pte` files small!
+
- Quantization (INT8, INT4)
- Model pruning
- Operator fusion
- Selective builds (include only needed operators)
-For more details , refer to the [ExecuTorch Quantization Optimization Guide](https://docs.pytorch.org/executorch/1.0/quantization-optimization.html), [Model Export & Lowering](https://docs.pytorch.org/executorch/1.0/using-executorch-export.html) and [Selective Build support](https://docs.pytorch.org/executorch/1.0/kernel-library-selective-build.html)
+
+For more details, refer to the following guides:
+
+- [ExecuTorch Quantization Optimization Guide](https://docs.pytorch.org/executorch/1.0/quantization-optimization.html)
+- [Model Export & Lowering](https://docs.pytorch.org/executorch/1.0/using-executorch-export.html)
+- [Selective Build support](https://docs.pytorch.org/executorch/1.0/kernel-library-selective-build.html)
## (Prerequisites) Prepare the Environment for Arm
Setup executorch development environment. Also see instructions for setting up the environment for Arm.
-Make sure you have the toolchain configured correctly. Refer to this [setup](https://docs.pytorch.org/executorch/1.0/backends-arm-ethos-u.html#development-requirements) for more details.
+Make sure you have the toolchain configured correctly. Refer to this [setup](https://docs.pytorch.org/executorch/main/backends-arm-ethos-u.html#development-requirements) for more details.
```bash
which arm-none-eabi-gcc
@@ -73,6 +77,7 @@ Hold the BOOTSEL button on Pico2 and connect to your computer. It mounts as `RPI
### Verify Execution
The Pico2 LED blinks 10 times at 500ms intervals for successful execution. Via serial terminal, you'll see:
+
```bash
...
...
@@ -134,9 +139,11 @@ Running neural network inference...
### Debugging via Serial Terminal
On macOS/Linux:
+
```bash
screen /dev/tty.usbmodem1101 115200
```
+
Replace `/dev/tty.usbmodem1101` with your device path. If LED blinks 10 times at 100ms intervals, check logs for errors, but if it blinks 10 times at 500ms intervals, it is successful!
-Result: A complete PyTorch → ExecuTorch → Pico2 demo neural network deployment! 🚀
+Result: A complete PyTorch → ExecuTorch → Pico2 demo MNIST deployment! 🚀
diff --git a/examples/vulkan/README.md b/examples/vulkan/README.md
index 71fdd0e4183..7831809be69 100644
--- a/examples/vulkan/README.md
+++ b/examples/vulkan/README.md
@@ -1,80 +1,84 @@
-# Vulkan Delegate Export Examples
+# Example export script for the ExecuTorch Vulkan backend
-This directory contains scripts for exporting models with the Vulkan delegate in ExecuTorch. Vulkan delegation allows you to run your models on devices with Vulkan-capable GPUs, potentially providing significant performance improvements over CPU execution.
+This directory contains `export.py`, a utility script that can be used to export
+models registered in [`executorch/examples/models/__init__.py`](https://github.com/pytorch/executorch/blob/main/examples/models/__init__.py)
+to the Vulkan backend.
-## Scripts
+## Usage
-- `export.py`: Basic export script for models to use with Vulkan delegate
-- `aot_compiler.py`: Advanced export script with quantization support
+Note that all example commands are assumed to be executed from the executorch root directory.
-## Usage
+```shell
+cd ~/executorch
+```
### Basic Export
-```bash
-python -m executorch.examples.vulkan.export -m -o
+For example, to export MobileNet V2:
+
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR}
```
-### Export with Quantization (Experimental)
+This will create a file named `mv2_vulkan.pte` in the specified output directory.
-```bash
-python -m executorch.examples.vulkan.aot_compiler -m -q -o
-```
+### With dynamic shape support
-### Dynamic Shape Support
+To enable exporting with dynamic shapes, simply add the `-d` flag.
-```bash
-python -m executorch.examples.vulkan.export -m -d -o
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d
```
-### Additional Options
+### Export a bundled pte
-- `-s/--strict`: Export with strict mode (default: True)
-- `-a/--segment_alignment`: Specify segment alignment in hex (default: 0x1000)
-- `-e/--external_constants`: Save constants in external .ptd file (default: False)
-- `-r/--etrecord`: Generate and save an ETRecord to the given file location
+Use the `-b` flag to export a bundled PTE file (i.e. `.bpte`). This is a `.pte`
+file with bundled test cases that can be used for correctness checking.
-## Examples
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d -b
+```
-```bash
-# Export MobileNetV2 with Vulkan delegate
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -o ./exported_models
+This will create a file called `mv2_vulkan.bpte` in the specified output directory.
-# Export MobileNetV3 with quantization
-python -m executorch.examples.vulkan.aot_compiler -m mobilenet_v3 -q -o ./exported_models
+### With correctness testing
-# Export with dynamic shapes
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -d -o ./exported_models
+The script can also execute the exported and lowered model via pybindings to
+check output correctness before writing the output file.
-# Export with ETRecord for debugging
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -r ./records/mobilenet_record.etrecord -o ./exported_models
-```
+To enable this, ensure that your machine:
-## Supported Operations
+1. Has the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android) installed
+2. Has Vulkan drivers
-The Vulkan delegate supports various operations including:
+Additionally, you will need to install the executorch python package from
+source, since the Vulkan backend is not included by default in the pip package.
-- Basic arithmetic (add, subtract, multiply, divide)
-- Activations (ReLU, Sigmoid, Tanh, etc.)
-- Convolutions (Conv1d, Conv2d, ConvTranspose2d)
-- Pooling operations (MaxPool2d, AvgPool2d)
-- Linear/Fully connected layers
-- BatchNorm, GroupNorm
-- Various tensor operations (cat, reshape, permute, etc.)
+```shell
+CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON" ./install_executorch.sh -e
+```
-For a complete list of supported operations, refer to the Vulkan delegate implementation in the ExecuTorch codebase.
+Once these conditions are fulfilled, the `--test` flag can be passed to the
+script.
-## Debugging and Optimization
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d --test
+```
-If you encounter issues with Vulkan delegation:
+You should see output like:
-1. Use `-r/--etrecord` to generate an ETRecord for debugging
-2. Check if your operations are supported by the Vulkan delegate
-3. Ensure your Vulkan drivers are up to date
-4. Try using the export script with `--strict False` if strict mode causes issues
+```shell
+INFO:root:✓ Model test PASSED - outputs match reference within tolerance
+```
-## Requirements
+### Quantization support
-- Vulkan runtime libraries (libvulkan.so.1)
-- A Vulkan-capable GPU with appropriate drivers
-- PyTorch with Vulkan support
+Support for quantization is under active development and will be added soon!