Commit 53aabad

Merge pull request #2572 from pareenaverma/content_review
Tech review of ET KleidiAI profiling LP
2 parents f084481 + 56c20fd commit 53aabad

10 files changed

+138
-69
lines changed

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md

Lines changed: 16 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -10,45 +10,50 @@ layout: learningpathall
1010
### Python Environment Setup
1111

1212
Before building ExecuTorch, it is highly recommended to create an isolated Python environment.
13-
This prevents dependency conflicts with your system Python installation and ensures a clean build environment.
13+
This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs.
1414

1515
```bash
16-
cd $WORKSPACE
16+
sudo apt update
17+
sudo apt install -y python3 python3.12-dev python3-venv build-essential cmake
1718
python3 -m venv pyenv
1819
source pyenv/bin/activate
1920

2021
```
21-
All subsequent steps should be executed within this Python virtual environment.
22+
Once activated, all subsequent steps should be executed within this Python virtual environment.
2223

2324
### Download the ExecuTorch Source Code
2425

2526
Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.
2627

27-
```bash
28+
```bash
29+
export WORKSPACE=$HOME
2830
cd $WORKSPACE
2931
git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git
3032

3133
```
3234

33-
> **Note:**
34-
> The instructions in this guide are based on **ExecuTorch v1.0.0**.
35-
> Commands or configuration options may differ in later releases.
35+
{{% notice Note %}}
36+
The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases.
37+
{{% /notice %}}
3638

3739
### Build and Install the ExecuTorch Python Components
3840

39-
Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled.
41+
Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment.
42+
This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
4043

44+
Run the following command from your ExecuTorch workspace:
4145
```bash
4246
cd $WORKSPACE/executorch
4347
CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh
4448

4549
```
50+
This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.
4651

47-
This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.
48-
49-
After installation completes successfully, you can verify the environment by running:
52+
### Verify the Installation
53+
After the build completes successfully, verify that ExecuTorch was installed into your current Python environment:
5054

5155
```bash
5256
python -c "import executorch; print('ExecuTorch built and installed successfully.')"
5357
```
5458

59+
If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels.
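
The one-line import check above can be extended into a slightly more informative sanity check. The following is a minimal sketch, not part of the learning path: the helper names `in_virtualenv` and `module_available` are hypothetical, and the script only reports whether the interpreter appears to run inside a virtual environment and whether the `executorch` package can be located.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter prefix differs from the base prefix (venv active)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def module_available(name: str) -> bool:
    """Return True if the named top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("executorch importable:", module_available("executorch"))
```

If either line reports `False`, re-activating the environment with `source pyenv/bin/activate` and rerunning the install script is a reasonable first step.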

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md

Lines changed: 17 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,26 @@
11
---
2-
title: Cross-Compile ExecuTorch for the Aarch64 platform
2+
title: Cross-Compile ExecuTorch for the AArch64 platform
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

99

10-
This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled.
11-
All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc).
10+
In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled.
11+
Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
1212

13+
### Install the Cross-Compilation Toolchain
14+
On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
15+
```bash
16+
sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
17+
```
1318

1419
### Run CMake Configuration
1520

16-
Use CMake to configure the ExecuTorch build for Aarch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration:
21+
Use CMake to configure the ExecuTorch build for the AArch64 target.
22+
23+
The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.
1724

1825
```bash
1926

@@ -61,18 +68,19 @@ cmake -GNinja \
6168

6269

6370
### Build ExecuTorch
71+
Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:
6472

6573
```bash
6674
cmake --build . -j$(nproc)
67-
6875
```
76+
CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.
6977

70-
If the build completes successfully, you should find the executor_runner binary under the directory:
78+
### Locate the executor_runner Binary
79+
If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:
7180

72-
```bash
81+
```output
7382
build-arm64/executor_runner
74-
7583
```
76-
84+
You will use executor_runner in later sections on your Arm64 target. It is a standalone binary that executes and profiles ExecuTorch models directly from the command line.
7785
This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration.
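
Before copying the binary to the target, it can be worth confirming that the cross-compile actually produced an AArch64 executable rather than a host binary. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and it relies only on the ELF header layout, where the 16-bit `e_machine` field sits at byte offset 18 and the value 183 identifies AArch64.

```python
import struct
import sys

EM_AARCH64 = 183  # e_machine value assigned to AArch64 in the ELF specification


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine follows e_ident (16 bytes) and e_type (2 bytes)
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return machine


def is_aarch64_binary(path: str) -> bool:
    """Read the ELF header of the file at path and compare its machine type."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == EM_AARCH64


if __name__ == "__main__" and len(sys.argv) > 1:
    print(sys.argv[1], "is AArch64:", is_aarch64_binary(sys.argv[1]))
```

Saved under a name of your choice, the script takes the path to the binary as its only argument, for example `build-arm64/executor_runner`.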
7886

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,9 @@ weight: 4
55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8-
ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization.
8+
ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.
99

10-
Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
10+
Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
1111

1212
These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
1313

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md

Lines changed: 29 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -6,14 +6,16 @@ weight: 5
66
layout: learningpathall
77
---
88

9-
In the previous section, we discussed that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
9+
In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
1010

11-
To evaluate the performance of these variants across different hardware platforms, we will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
11+
To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
1212

13+
These models will be used later with executor_runner to measure throughput and latency and to capture ETDump traces for various KleidiAI micro-kernels.
1314

14-
### Fully connected benchmark model
15+
### Define a Simple Linear Benchmark Model
1516

16-
In the following example model, we use simple model to generate nodes that can be accelerated by Kleidiai.
17+
The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer.
18+
This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels.
1719

1820
By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
1921

@@ -34,8 +36,9 @@ class DemoLinearModel(torch.nn.Module):
3436
return (torch.randn(1, 256, dtype=dtype),)
3537

3638
```
39+
This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.
3740

38-
### Export FP16/FP32 model for pf16_gemm/pf32_gemm Variants
41+
### Export FP16/FP32 model for pf16_gemm and pf32_gemm
3942

4043
| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
4144
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
@@ -86,15 +89,16 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm")
8689

8790
```
8891

89-
### Export int8 quantized model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variant
92+
### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm
93+
INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy.
9094

9195
| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
9296
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
9397
| qp8_f32_qc8w_gemm | Asymmetric INT8 per-row quantization | Per-channel symmetric INT8 quantization | FP32 |
9498
| pqs8_qc8w_gemm | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
9599

96100

97-
The following code demonstrates how to quantized a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variant to accelerate computation:
101+
The following code demonstrates how to quantize a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm variants to accelerate computation:
98102

99103
```python
100104

@@ -148,7 +152,9 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm");
148152

149153
```
150154

151-
### Export int4 quantized model for qp8_f32_qb4w_gemm variant
155+
### Export INT4 quantized model for qp8_f32_qb4w_gemm
156+
This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels.
157+
152158
| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
153159
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
154160
| qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 |
@@ -200,17 +206,26 @@ def export_int4_quantize_model(dynamic: bool, model_name: str):
200206
etrecord.save(etr_file)
201207

202208
export_int4_quantize_model(False,"linear_model_qp8_f32_qb4w_gemm");
203-
204-
205209
```
206210

207-
**NOTE:**
208-
211+
{{%notice Note%}}
209212
When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file.
210213
These ETRecord files are essential for subsequent model inspection and performance analysis using the ExecuTorch Inspector API.
214+
{{%/notice%}}
211215

212216

213-
After running this script, both the PTE model file and the etrecord file are generated.
217+
### Run the Complete Benchmark Model Export Script
218+
Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4).
219+
This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format.
220+
221+
```bash
222+
wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-linear-model.py
223+
chmod +x export-linear-model.py
224+
python3 ./export-linear-model.py
225+
```
226+
227+
### Verify the Generated Files
228+
After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory:
214229

215230
``` bash
216231
$ ls model/ -1
@@ -225,5 +240,4 @@ linear_model_qp8_f32_qb4w_gemm.pte
225240
linear_model_qp8_f32_qc8w_gemm.etrecord
226241
linear_model_qp8_f32_qc8w_gemm.pte
227242
```
228-
229-
The complete source code is available [here](../export-linear-model.py).
243+
At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels.
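
Because later analysis needs each .pte file together with its .etrecord file, a quick pairing check can catch an incomplete export early. This is a minimal sketch assuming the exported files live in a `model/` directory as in the listing above; the function name `missing_etrecords` is hypothetical.

```python
from pathlib import Path


def missing_etrecords(model_dir: str) -> list[str]:
    """Return names of .pte files in model_dir that have no matching .etrecord file."""
    directory = Path(model_dir)
    return sorted(
        pte.name
        for pte in directory.glob("*.pte")
        if not pte.with_suffix(".etrecord").exists()
    )


if __name__ == "__main__":
    if Path("model").is_dir():
        unpaired = missing_etrecords("model")
        print("all models paired" if not unpaired else f"missing ETRecord for: {unpaired}")
```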

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md

Lines changed: 18 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -6,22 +6,22 @@ weight: 6
66
layout: learningpathall
77
---
88

9-
In the previous section, we discussed that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
9+
In the previous section, you saw that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
1010

1111

1212
| XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType |
1313
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
1414
| pqs8_qc8w_gemm | Asymmetric INT8 quantization(NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) |
1515
| pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |
1616

17-
To evaluate the performance of Conv2d operators across multiple hardware platforms, we create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.
17+
To evaluate the performance of Conv2d operators across multiple hardware platforms, you will create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.
1818

1919

20-
### INT8-quantized Conv2d benchmark model
20+
### INT8-Quantized Conv2d benchmark model
2121

2222
The following example defines a simple model to generate INT8-quantized Conv2d nodes that can be accelerated by KleidiAI.
2323

24-
By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
24+
By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.
2525

2626

2727
```python
@@ -100,7 +100,7 @@ export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm");
100100

101101
### PointwiseConv2d benchmark model
102102

103-
In the following example model, we use simple model to generate pointwise Conv2d nodes that can be accelerated by Kleidiai.
103+
In the following example model, you will use a simple model to generate pointwise Conv2d nodes that can be accelerated by KleidiAI.
104104

105105
As before, input parameters can be adjusted to simulate real-world model behavior.
106106

@@ -158,10 +158,21 @@ export_pointwise_model("pointwise_conv2d_pf32_gemm")
158158

159159
```
160160

161-
**NOTES:**
162-
161+
{{%notice Note%}}
163162
When exporting models, the generate_etrecord option is enabled to produce the .etrecord file alongside the .pte model file.
164163
These ETRecord files are essential for subsequent model analysis and performance evaluation.
164+
{{%/notice%}}
165+
166+
167+
### Run the Complete Benchmark Model Script
168+
Rather than executing each block by hand, download and run the full export script. It will generate both Conv2d variants, run quantization (INT8) where applicable, partition to XNNPACK, lower, and export to ExecuTorch .pte together with .etrecord metadata.
169+
170+
```bash
171+
wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py
172+
chmod +x export-conv2d.py
173+
python3 ./export-conv2d.py
174+
```
175+
### Validate Outputs
165176

166177
After running this script, both the PTE model file and the etrecord file are generated.
167178

@@ -173,4 +184,3 @@ pointwise_conv2d_pf32_gemm.etrecord
173184
pointwise_conv2d_pf32_gemm.pte
174185
```
175186

176-
The complete source code is available [here](../export-conv2d.py).
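
The statement in this file that a pointwise (1×1) Conv2d can use matrix-multiplication micro-kernels follows from the arithmetic: each output pixel is a dot product of the input channels with each filter, so flattening the pixels into rows turns the whole layer into one GEMM. The sketch below illustrates this equivalence in plain Python on nested lists. It is not how XNNPACK or KleidiAI implement it, and all function names are hypothetical.

```python
def pointwise_conv(x, weight):
    """1x1 convolution: x is H x W x C_in (nested lists), weight is C_out x C_in."""
    return [
        [[sum(px[c] * w_row[c] for c in range(len(px))) for w_row in weight] for px in row]
        for row in x
    ]


def matmul(a, b):
    """Plain (M x K) @ (K x N) matrix product on nested lists."""
    return [
        [sum(a_row[k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
        for a_row in a
    ]


def pointwise_conv_as_gemm(x, weight):
    """Same result via a single GEMM: flatten pixels to rows, multiply by transposed weights."""
    height, width = len(x), len(x[0])
    pixels = [px for row in x for px in row]          # (H*W) x C_in
    weight_t = [list(col) for col in zip(*weight)]    # C_in x C_out
    flat = matmul(pixels, weight_t)                   # (H*W) x C_out
    return [flat[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    x = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [0, 1, 2]]]  # 2 x 2 image, 3 channels
    weight = [[1, 0, -1], [2, 1, 0]]                       # 2 filters, 3 channels each
    print(pointwise_conv(x, weight) == pointwise_conv_as_gemm(x, weight))
```

The NHWC layout noted in the table above keeps each pixel's channels contiguous, which is what makes the flattening step a simple reshape.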

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md

Lines changed: 16 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6,9 +6,9 @@ weight: 7
66
layout: learningpathall
77
---
88

9-
In the previous section, we discussed that the Batch Matrix Multiply operator supports multiple GEMM (General Matrix Multiplication) variants.
9+
The Batch Matrix Multiply operator (torch.bmm) under XNNPACK lowers to GEMM and, when shapes and dtypes match supported patterns, can dispatch to KleidiAI micro-kernels on Arm.
1010

11-
To evaluate the performance of these variants across different hardware platforms, we construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis.
11+
To evaluate the performance of these variants across different hardware platforms, you will construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis.
1212

1313

1414
### Matrix multiply benchmark model
@@ -72,11 +72,22 @@ export_mutrix_mul_model(torch.float32,"matrix_mul_pf32_gemm")
7272

7373
```
7474

75-
**NOTE:**
76-
75+
{{%notice Note%}}
7776
When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file.
7877
These ETRecord files are essential for subsequent model analysis and performance evaluation.
78+
{{%/notice%}}
79+
80+
### Run the Complete Benchmark Model Script
81+
Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script.
82+
This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation.
83+
84+
```bash
85+
wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-matrix-mul.py
86+
chmod +x export-matrix-mul.py
87+
python3 ./export-matrix-mul.py
88+
```
7989

90+
### Verify the output
8091

8192
After running this script, both the PTE model file and the etrecord file are generated.
8293

@@ -87,5 +98,6 @@ model/matrix_mul_pf16_gemm.pte
8798
model/matrix_mul_pf32_gemm.etrecord
8899
model/matrix_mul_pf32_gemm.pte
89100
```
101+
These files are the inputs for upcoming executor_runner benchmarks, where you’ll measure and compare KleidiAI micro-kernel performance.
90102

91103
The complete source code is available [here](../export-matrix-mul.py).

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md

Lines changed: 17 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,25 +1,35 @@
11
---
2-
title: Run model and generate the etdump
2+
title: Run model and generate the ETDump
33
weight: 8
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
After generating the model, we can now run it on an ARM64 platform using the following command:
9+
### Copy artifacts to your Arm64 target
10+
From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device:
11+
12+
```bash
13+
scp $WORKSPACE/build-arm64/executor_runner <arm_user>@<arm_host>:~/bench/
14+
scp -r model/ <arm_user>@<arm_host>:~/bench/
15+
```
16+
17+
### Run a model and emit ETDump
18+
Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte).
19+
The flags below tell executor_runner where to write the ETDump and how many times to execute.
1020

1121
```bash
12-
cd $WORKSPACE
13-
/build-arm64/executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1
22+
cd ~/bench
23+
./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1
1424

1525
```
1626

1727
You can adjust the number of execution threads and the number of times the model is invoked.
1828

1929

20-
You should see output similar to the example below.
30+
You should see logs like:
2131

22-
```bash
32+
```output
2333
D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace
2434
D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0
2535
D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED
@@ -42,6 +52,6 @@ OutputX 0: tensor(sizes=[1, 256], [
4252
I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'.
4353
4454
```
55+
If execution succeeds, an ETDump file is created next to your model. You will load the .etdump in the next section and analyze which operators dispatched to KleidiAI and how each micro-kernel performed.
4556

46-
If the execution is successful, an etdump file will also be generated.
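
To profile every exported model rather than just one, the same command can be generated per .pte file. The following is a minimal sketch that only builds the argument lists and does not execute anything. It reuses the flags shown in the command above; the function name `runner_commands` and the default `./executor_runner` path are assumptions for illustration.

```python
from pathlib import Path


def runner_commands(model_dir: str, runner: str = "./executor_runner",
                    num_executions: int = 1, cpu_threads: int = 1) -> list[list[str]]:
    """Build one executor_runner argument list per .pte file found in model_dir."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return []
    commands = []
    for pte in sorted(directory.glob("*.pte")):
        commands.append([
            runner,
            "-etdump_path", str(pte.with_suffix(".etdump")),  # ETDump written next to the model
            "-model_path", str(pte),
            f"-num_executions={num_executions}",
            "-cpu_threads", str(cpu_threads),
        ])
    return commands


if __name__ == "__main__":
    for cmd in runner_commands("model"):
        print(" ".join(cmd))
```

Each printed line can be run from `~/bench` on the Arm64 target, or the lists can be passed to `subprocess.run` if Python is available there.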
4757
