Before building ExecuTorch, it is highly recommended to create an isolated Python environment. This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs:
```bash
sudo apt update
python3 -m venv pyenv
source pyenv/bin/activate
```
Keep your Python virtual environment activated while you complete the next steps. This ensures all dependencies install in the correct location.
## Download the ExecuTorch source code
Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched:
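A minimal sketch of that command sequence, assuming the upstream `pytorch/executorch` repository and a `v1.0.0` release tag:

```bash
# Clone the repository and check out the v1.0.0 release tag
git clone https://github.com/pytorch/executorch.git
cd executorch
git checkout v1.0.0
# Fetch all required submodules (backends such as XNNPACK and their dependencies)
git submodule update --init --recursive
```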
{{% notice Note %}}
The instructions in this Learning Path were tested on ExecuTorch v1.0.0. Commands or configuration options might differ in later releases.
{{% /notice %}}
## Build and install the ExecuTorch Python components
Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
Run the following command from your ExecuTorch workspace:
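In recent ExecuTorch releases this step is wrapped in a helper script at the repository root; the script name below follows that convention but treat it as an assumption and check your checkout if it differs:

```bash
# Build and install the ExecuTorch Python bindings into the active
# virtual environment (this also compiles the C++ runtime)
./install_executorch.sh
```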
## Overview
In this section, you'll cross-compile ExecuTorch for an Arm64 (AArch64) target with XNNPACK and KleidiAI support. Cross-compiling builds all binaries and libraries for your Arm device, even if your development system uses x86_64. This process lets you run and test ExecuTorch on Arm hardware, taking advantage of Arm-optimized performance features.
## Install the cross-compilation toolchain
On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, which is a fast build backend commonly used by CMake:
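A sketch using Ubuntu/Debian package names (other distributions package the toolchain under different names):

```bash
# Install the AArch64 GNU cross-compilers and the Ninja build backend
sudo apt install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build
```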
The cross-build is configured through CMake options, including the following:

| CMake option | Description |
|---|---|
|`EXECUTORCH_BUILD_XNNPACK`| Builds the XNNPACK backend, which provides highly optimized CPU operators (such as GEMM and convolution) for Arm64 platforms. |
|`EXECUTORCH_XNNPACK_ENABLE_KLEIDI`| Enables Arm KleidiAI acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
|`EXECUTORCH_BUILD_DEVTOOLS`| Builds developer tools such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
|`EXECUTORCH_BUILD_EXTENSION_MODULE`| Builds the Module API extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
|`EXECUTORCH_BUILD_EXTENSION_TENSOR`| Builds the Tensor API extension, providing convenience functions for creating, manipulating, and managing tensors in the C++ runtime. |
|`EXECUTORCH_BUILD_KERNELS_OPTIMIZED`| Enables building optimized kernel implementations for better performance on supported architectures. |
|`EXECUTORCH_ENABLE_EVENT_TRACER`| Enables the event tracing feature, which records performance and operator timing information for runtime analysis. |
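To show how these options fit together, here is a sketch of the configuration step, assuming a `build-arm64` build directory and a placeholder `<toolchain-file>` for your AArch64 CMake toolchain definition; the exact invocation may differ:

```bash
# Configure an AArch64 cross-build with Ninja and the options above
mkdir -p build-arm64 && cd build-arm64
cmake .. -GNinja \
  -DCMAKE_TOOLCHAIN_FILE=<toolchain-file> \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
  -DEXECUTORCH_BUILD_DEVTOOLS=ON \
  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_ENABLE_EVENT_TRACER=ON
```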
## Build ExecuTorch
Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:
```bash
cmake --build . -j$(nproc)
```
CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.
## Locate the executor_runner binary
If the build completes successfully, you should see the main benchmarking and profiling utility, `executor_runner`, under:
```output
build-arm64/executor_runner
```
You’ll use `executor_runner` in later sections to execute and profile ExecuTorch models directly from the command line on your Arm64 target. This standalone binary lets you run models using the XNNPACK backend with KleidiAI acceleration, making it easy to benchmark and analyze performance on Arm devices.
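As a quick sketch of what that looks like on the target (the `--model_path` flag follows the `executor_runner` defaults, but treat it as an assumption for your build):

```bash
# On the Arm64 device, run an exported .pte model through the runtime
./executor_runner --model_path model.pte
```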
---
title: Accelerate ExecuTorch operators with KleidiAI micro-kernels
weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Understand how KleidiAI micro-kernels integrate with ExecuTorch
ExecuTorch uses XNNPACK as its main CPU backend to run and optimize operators like convolutions, matrix multiplications, and fully connected layers.
KleidiAI SME (Scalable Matrix Extension) micro-kernels are integrated into XNNPACK to boost performance on supported Arm platforms. These micro-kernels accelerate operators that use specific data types and quantization settings in ExecuTorch models.
When an operator matches a supported configuration, ExecuTorch automatically uses the KleidiAI-optimized path for faster execution. If an operator is not supported by KleidiAI, ExecuTorch falls back to the standard XNNPACK implementation. This ensures your models always run correctly, even if they do not use KleidiAI acceleration.
## Identify operators that can benefit from KleidiAI acceleration
In ExecuTorch v1.0.0, several operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration. Whether a given operator instance is dispatched to a KleidiAI micro-kernel depends on its configuration, including:

- Quantization schemes (for example, symmetric/asymmetric, per-tensor/per-channel)
- Tensor memory layout and alignment
- Kernel dimensions and stride settings
The following section provides detailed information on which operator configurations can benefit from KleidiAI acceleration, along with their corresponding data type and quantization support.
## Overview
In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
These models will be used later with `executor_runner` to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels.
## Define a linear benchmark model with PyTorch for ExecuTorch
You’ll create a minimal PyTorch model containing a single `torch.nn.Linear` layer. Keeping the model this small generates operator nodes that map directly to KleidiAI-accelerated GEMM kernels, so you can quickly test different GEMM implementations and see how each one performs on Arm-based hardware. If you run into errors, check that your PyTorch and ExecuTorch versions are up to date and that you’re using the correct data types for your target GEMM variant. By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.
```python
# Minimal sketch of the DemoLinearModel described below; the exact
# listing is a reconstruction from the surrounding text.
import torch

class DemoLinearModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A single 256x256 fully connected (Linear) layer
        self.linear = torch.nn.Linear(256, 256)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)
```
This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.
### Export FP16 and FP32 models for pf16_gemm and pf32_gemm variants
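As a sketch of what such an export looks like (the module paths and API names follow ExecuTorch v1.0 conventions, but treat the details as assumptions rather than the exact export script):

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Export the FP32 variant; an FP16 export would call model.half()
# and use FP16 example inputs instead
model = DemoLinearModel().eval()
example_inputs = (torch.randn(1, 256),)

exported = torch.export.export(model, example_inputs)

# Lower to the XNNPACK backend so that eligible nodes can dispatch to
# KleidiAI GEMM micro-kernels at runtime
program = to_edge_transform_and_lower(
    exported, partitioner=[XnnpackPartitioner()]
).to_executorch()

with open("linear_fp32.pte", "wb") as f:
    f.write(program.buffer)
```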
{{% notice Note %}}
These ETRecord files are essential for subsequent model inspection and performance analysis.
{{% /notice %}}
## Run the benchmark model export script for ExecuTorch
Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4).
This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format.
Great job! You now have a complete set of benchmark models exported for multiple GEMM variants and quantization levels. You’re ready to move on and measure performance using ExecuTorch and KleidiAI micro-kernels on Arm-based hardware.