
Commit ec28558

Merge pull request #2585 from madeline-underwood/kleidi
Kleidi_JA to sign off
2 parents a79e014 + ed894c9 commit ec28558

File tree: 10 files changed, +84/-87 lines

Lines changed: 14 additions & 15 deletions
@@ -1,16 +1,15 @@
 ---
-title: Environment setup
+title: Set up your environment
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
 
-### Python Environment Setup
+## Set up your Python environment
 
-Before building ExecuTorch, it is highly recommended to create an isolated Python environment.
-This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs.
+Before building ExecuTorch, it is highly recommended to create an isolated Python environment. This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs:
 
 ```bash
 sudo apt update
@@ -19,11 +18,11 @@ python3 -m venv pyenv
 source pyenv/bin/activate
 
 ```
-Once activated, all subsequent steps should be executed within this Python virtual environment.
+Keep your Python virtual environment activated while you complete the next steps. This ensures all dependencies install in the correct location.
 
-### Download the ExecuTorch Source Code
+## Download the ExecuTorch source code
 
-Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.
+Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched:
 
 ```bash
 export WORKSPACE=$HOME
@@ -33,27 +32,27 @@ git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.g
 ```
 
 {{% notice Note %}}
-The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases.
+The instructions in this Learning Path were tested on ExecuTorch v1.0.0. Commands or configuration options might differ in later releases.
 {{% /notice %}}
 
-### Build and Install the ExecuTorch Python Components
+## Build and install the ExecuTorch Python components
 
-Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment.
-This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
+Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
 
 Run the following command from your ExecuTorch workspace:
 ```bash
 cd $WORKSPACE/executorch
 CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh
 
 ```
-This will build ExecuTorch and its dependencies using cmake, enabling optional developer utilities such as ETDump and Inspector.
-
-### Verify the Installation
-After the build completes successfully, verify that ExecuTorch was installed into your current Python environment:
+This builds ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.
+
+## Verify the installation
+After the build completes, check that ExecuTorch is installed in your active Python environment. Run the following command:
 
 ```bash
 python -c "import executorch; print('Executorch build and install successfully.')"
 ```
 
-If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels.
+If you see the success message, your environment is ready. You can now move on to cross-compiling and preparing to profile KleidiAI micro-kernels.
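The one-line verification above can be generalized into a small stdlib-only helper that checks several packages at once without importing them. This sketch is illustrative and not part of the Learning Path; the package names it checks are assumptions about your environment.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if module_name is importable in the current environment."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # "executorch" and "torch" are the packages this setup is expected to provide.
    for name in ("executorch", "torch"):
        print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

Unlike `python -c "import executorch"`, `find_spec` only consults the import machinery, so it reports availability without running any package initialization code.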

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md

Lines changed: 17 additions & 19 deletions
@@ -6,21 +6,21 @@ weight: 3
 layout: learningpathall
 ---
 
+## Overview
 
-In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled.
-Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
+In this section, you'll cross-compile ExecuTorch for an Arm64 (AArch64) target with XNNPACK and KleidiAI support. Cross-compiling builds all binaries and libraries for your Arm device, even if your development system uses x86_64. This process lets you run and test ExecuTorch on Arm hardware, taking advantage of Arm-optimized performance features.
 
-### Install the Cross-Compilation Toolchain
-On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
+## Install the cross-compilation toolchain
+On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, which is a fast build backend commonly used by CMake:
 ```bash
 sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
 ```
 
-### Run CMake Configuration
+## Run CMake configuration
 
 Use CMake to configure the ExecuTorch build for the AArch64 target.
 
-The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.
+The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI:
 
 ```bash
 
@@ -53,34 +53,32 @@ cmake -GNinja \
 
 ```
 
-#### Key Build Options
+## Key build options
 
 | **CMake Option** | **Description** |
 | --- | --- |
-| `EXECUTORCH_BUILD_XNNPACK` | Builds the **XNNPACK backend**, which provides highly optimized CPU operators (GEMM, convolution, etc.) for Arm64 platforms. |
-| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables **Arm KleidiAI** acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
-| `EXECUTORCH_BUILD_DEVTOOLS` | Builds **developer tools** such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
-| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the **Module API** extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
-| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the **Tensor API** extension, providing convenience functions for creating, manipulating, and managing tensors in C++ runtime. |
-| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building **optimized kernel implementations** for better performance on supported architectures. |
-| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the **event tracing** feature, which records performance and operator timing information for runtime analysis. |
+| `EXECUTORCH_BUILD_XNNPACK` | Builds the XNNPACK backend, which provides highly optimized CPU operators (such as GEMM and convolution) for Arm64 platforms. |
+| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables Arm KleidiAI acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
+| `EXECUTORCH_BUILD_DEVTOOLS` | Builds developer tools such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
+| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the Module API extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
+| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the Tensor API extension, providing convenience functions for creating, manipulating, and managing tensors in the C++ runtime. |
+| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building optimized kernel implementations for better performance on supported architectures. |
+| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the event tracing feature, which records performance and operator timing information for runtime analysis. |
 
 
 
-### Build ExecuTorch
+## Build ExecuTorch
 Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:
 
 ```bash
 cmake --build . -j$(nproc)
 ```
 CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.
 
-### Locate the executor_runner Binary
+## Locate the executor_runner binary
 If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:
 
 ```output
 build-arm64/executor_runner
 ```
-You will use executor_runner in the later sections on your Arm64 target as standalone binary used to execute and profile ExecuTorch models directly from the command line.
-This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration.
+You’ll use `executor_runner` in later sections to execute and profile ExecuTorch models directly from the command line on your Arm64 target. This standalone binary lets you run models using the XNNPACK backend with KleidiAI acceleration, making it easy to benchmark and analyze performance on Arm devices.
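A quick way to confirm the binary really targets AArch64, without copying it to the device, is to inspect its ELF header (this is what the `file` command does). The sketch below is illustrative only; the binary path is a hypothetical location from the build step, and the `e_machine` codes are taken from the ELF specification (62 = x86-64, 183 = AArch64).

```python
import struct

# e_machine values from the ELF specification: 62 = x86-64, 183 = AArch64.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}

def elf_machine(header: bytes) -> str:
    """Decode the target architecture from the first 20 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # The 16-byte e_ident block is followed by e_type (2 bytes),
    # then e_machine as a little-endian 16-bit value at offset 18.
    (e_machine,) = struct.unpack_from("<H", header, 18)
    return ELF_MACHINES.get(e_machine, f"unknown({e_machine})")

# Usage (hypothetical path from the build step above):
#   with open("build-arm64/executor_runner", "rb") as f:
#       print(elf_machine(f.read(20)))
```

If this reports `x86_64` for your `executor_runner`, the CMake configuration picked up the host compiler instead of the cross toolchain.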

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md

Lines changed: 11 additions & 11 deletions
@@ -1,19 +1,19 @@
 ---
-title: KleidiAI micro-kernels support in ExecuTorch
+title: Accelerate ExecuTorch operators with KleidiAI micro-kernels
 weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.
+## Understand how KleidiAI micro-kernels integrate with ExecuTorch
 
-Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
+ExecuTorch uses XNNPACK as its main CPU backend to run and optimize operators like convolutions, matrix multiplications, and fully connected layers.
 
-These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
+KleidiAI SME (Scalable Matrix Extension) micro-kernels are integrated into XNNPACK to boost performance on supported Arm platforms. These micro-kernels accelerate operators that use specific data types and quantization settings in ExecuTorch models.
 
-When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path.
+When an operator matches a supported configuration, ExecuTorch automatically uses the KleidiAI-optimized path for faster execution. If an operator is not supported by KleidiAI, ExecuTorch falls back to the standard XNNPACK implementation. This ensures your models always run correctly, even if they do not use KleidiAI acceleration.
 
-Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models.
+## Identify operators that can benefit from KleidiAI acceleration
 
 In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration:
 - XNNFullyConnected – Fully connected (dense) layers
@@ -23,15 +23,15 @@ In ExecuTorch v1.0.0, the following operator types are implemented through the X
 However, not all instances of these operators are accelerated by KleidiAI.
 
 Acceleration eligibility depends on several operator attributes and backend support, including:
-- Data types (e.g., float32, int8, int4)
-- Quantization schemes (e.g., symmetric/asymmetric, per-tensor/per-channel)
+- Data types (for example, float32, int8, int4)
+- Quantization schemes (for example, symmetric/asymmetric, per-tensor/per-channel)
 - Tensor memory layout and alignment
 - Kernel dimensions and stride settings
 
 The following section provides detailed information on which operator configurations can benefit from KleidiAI acceleration, along with their corresponding data type and quantization support.
 
 
-### XNNFullyConnected
+## XNNFullyConnected
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
 | --- | --- | --- | --- |
@@ -42,14 +42,14 @@ The following section provides detailed information on which operator configurat
 | qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 |
 
 
-### XNNConv2d
+## XNNConv2d
 | XNNPACK GEMM Variant | Input DataType | Filter DataType | Output DataType |
 | --- | --- | --- | --- |
 | pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |
 | pqs8_qc8w_gemm | Asymmetric INT8 quantization (NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization (NHWC) |
 
 
-### XNNBatchMatrixMultiply
+## XNNBatchMatrixMultiply
 | XNNPACK GEMM Variant | Input A DataType | Input B DataType | Output DataType |
 | --- | --- | --- | --- |
 | pf32_gemm | FP32 | FP32 | FP32 |
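To make the quantization terminology in these tables concrete, here is a minimal pure-Python sketch of symmetric per-channel INT8 weight quantization, the weight scheme listed for `pqs8_qc8w_gemm`. It is illustrative only; XNNPACK and KleidiAI use their own packed data layouts and rounding rules internally.

```python
def quantize_per_channel_int8(weights):
    """Symmetric per-channel INT8 quantization: one scale per output
    channel (row), with the zero-point fixed at 0."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row) or 1.0  # avoid divide-by-zero on all-zero rows
        scale = max_abs / 127.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize_per_channel(quantized, scales):
    """Recover approximate float values from INT8 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

A per-tensor scheme would instead share a single scale across all rows; the per-channel form usually preserves more accuracy when channel magnitudes differ widely.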

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md

Lines changed: 10 additions & 12 deletions
@@ -6,18 +6,16 @@ weight: 5
 layout: learningpathall
 ---
 
+## Overview
+
 In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
 
 To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
 
 These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels.
 
-### Define a Simple Linear Benchmark Model
-
-The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer.
-This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels.
-
-By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
+## Define a linear benchmark model with PyTorch for ExecuTorch
+The goal is to create a minimal PyTorch model containing a single `torch.nn.Linear` layer. This lets you generate operator nodes that map directly to KleidiAI-accelerated GEMM kernels and quickly compare how each GEMM implementation performs on Arm-based hardware. By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.
 
 
 ```python
@@ -38,7 +36,7 @@ class DemoLinearModel(torch.nn.Module):
 ```
 This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.
 
-### Export FP16/FP32 model for pf16_gemm and pf32_gemm
+## Export FP16 and FP32 models for pf16_gemm and pf32_gemm variants
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
 | --- | --- | --- | --- |
@@ -89,7 +87,7 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm")
 
 ```
 
-### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm
+## Export INT8 quantized models for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variants
 INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy.
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
@@ -152,7 +150,7 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm");
 
 ```
 
-### Export INT4 quantized model for qp8_f32_qb4w_gemm
+## Export INT4 quantized model for qp8_f32_qb4w_gemm variant
 This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels.
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
@@ -214,7 +212,7 @@ These ETRecord files are essential for subsequent model inspection and performan
 {{%/notice%}}
 
 
-### Run the Complete Benchmark Model Export Script
+## Run the benchmark model export script for ExecuTorch
 Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4).
 This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format.
 
@@ -224,7 +222,7 @@ chmod +x export-linear-model.py
 python3 ./export-linear-model.py
 ```
 
-### Verify the Generated Files
+## Verify exported ExecuTorch and KleidiAI model files
 After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory:
 
 ```bash
@@ -240,4 +238,4 @@ linear_model_qp8_f32_qb4w_gemm.pte
 linear_model_qp8_f32_qc8w_gemm.etrecord
 linear_model_qp8_f32_qc8w_gemm.pte
 ```
-At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels.
+You now have a complete set of benchmark models exported for multiple GEMM variants and quantization levels. You’re ready to move on and measure performance using ExecuTorch and KleidiAI micro-kernels on Arm-based hardware.
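As a reference for what each benchmark model computes, a fully connected layer is the GEMM y = x·Wᵀ + b. The pure-Python sketch below is illustrative only (it is not part of the export script) and mirrors the 256×256 layer at toy size so the arithmetic is easy to follow.

```python
def linear(x, weight, bias):
    """Fully connected layer: x is [batch][in], weight is [out][in],
    bias is [out]; returns [batch][out] = x @ weight^T + bias."""
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]

x = [[1.0, 2.0, 3.0]]          # batch of 1, 3 input features
weight = [[1.0, 0.0, 0.0],     # 2 output features, 3 inputs each
          [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]
print(linear(x, weight, bias))  # [[1.5, 4.5]]
```

The GEMM variants in the tables above all compute this same product; they differ only in the data types and quantization applied to `x`, `weight`, and the output.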
