Commit ed894c9

Refactor documentation: improve clarity and consistency in cross-compilation, ExecuTorch integration, and benchmarking sections
1 parent 9960589 commit ed894c9

File tree

6 files changed: +23 −18 lines changed

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

 ## Overview

-In this section, youll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled. Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
+In this section, you'll cross-compile ExecuTorch for an Arm64 (AArch64) target with XNNPACK and KleidiAI support. Cross-compiling builds all binaries and libraries for your Arm device, even if your development system uses x86_64. This process lets you run and test ExecuTorch on Arm hardware, taking advantage of Arm-optimized performance features.

 ## Install the cross-compilation toolchain
 On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, which is a fast build backend commonly used by CMake:
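The toolchain installation mentioned in this hunk's context line could look like the following on an Ubuntu host; the package names are assumed for Debian/Ubuntu and are not taken from this commit, so adjust them for your distribution:

```shell
# Assumed Debian/Ubuntu package names for the GNU AArch64 cross toolchain,
# Ninja, and CMake; verify against your distribution's package index.
sudo apt-get update
sudo apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake

# Sanity check: the cross-compiler should report an aarch64 target triple.
aarch64-linux-gnu-gcc --dumpmachine
```

This is a setup fragment, not part of the commit; the Learning Path itself provides the authoritative commands.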
@@ -75,7 +75,7 @@ cmake --build . -j$(nproc)
 ```
 CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.

-## Locate the executor_runner Binary
+## Locate the executor_runner binary
 If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:

 ```output

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md

Lines changed: 5 additions & 5 deletions
@@ -5,15 +5,15 @@ weight: 4
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.
+## Understand how KleidiAI micro-kernels integrate with ExecuTorch

-Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
+ExecuTorch uses XNNPACK as its main CPU backend to run and optimize operators like convolutions, matrix multiplications, and fully connected layers.

-These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
+KleidiAI SME (Scalable Matrix Extension) micro-kernels are integrated into XNNPACK to boost performance on supported Arm platforms. These micro-kernels accelerate operators that use specific data types and quantization settings in ExecuTorch models.

-When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path.
+When an operator matches a supported configuration, ExecuTorch automatically uses the KleidiAI-optimized path for faster execution. If an operator is not supported by KleidiAI, ExecuTorch falls back to the standard XNNPACK implementation. This ensures your models always run correctly, even if they do not use KleidiAI acceleration.

-Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models.
+## Supported operator types in ExecuTorch v1.0.0

 In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration:
 - XNNFullyConnected – Fully connected (dense) layers
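The dispatch-with-fallback behavior this hunk describes can be modeled as a small conceptual sketch. All names below are illustrative only and do not correspond to real ExecuTorch or XNNPACK symbols:

```python
# Conceptual model of operator dispatch: KleidiAI path when a configuration
# is supported, standard XNNPACK fallback otherwise. The supported set here
# is hypothetical, chosen only to illustrate the decision logic.
KLEIDIAI_SUPPORTED = {
    ("fully_connected", "int8"),
    ("conv2d_pointwise", "int8"),
    ("batch_matmul", "fp16"),
}

def dispatch(op: str, dtype: str) -> str:
    """Return which kernel path a given operator configuration would take."""
    if (op, dtype) in KLEIDIAI_SUPPORTED:
        return "kleidiai"   # optimized micro-kernel path
    return "xnnpack"        # fallback keeps every model functionally correct

print(dispatch("fully_connected", "int8"))  # -> kleidiai
print(dispatch("conv2d", "fp32"))           # -> xnnpack
```

The point of the sketch is that fallback is per-operator: a single model can mix KleidiAI-accelerated and plain XNNPACK nodes.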

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ weight: 6
 layout: learningpathall
 ---

+## Understand Conv2d benchmark variants and KleidiAI acceleration
+
 In the previous section, you saw that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
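The reason matrix-multiplication micro-kernels apply to pointwise Conv2d is that a 1×1 convolution is exactly a matmul over flattened spatial positions. A minimal pure-Python sketch (no PyTorch dependency; shapes and data are made up for illustration) checks this equivalence:

```python
# Show that a 1x1 convolution over [C_in][H][W] with weights [C_out][C_in]
# equals a (C_out x C_in) @ (C_in x H*W) matrix multiply, reshaped back.

def conv1x1(x, w):
    """Direct 1x1 convolution. x: [C_in][H][W], w: [C_out][C_in]."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w[co][ci] * x[ci][i][j] for ci in range(c_in))
              for j in range(wd)] for i in range(h)]
            for co in range(len(w))]

def conv1x1_as_matmul(x, w):
    """Flatten HxW into one axis, do a plain matmul, reshape back."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    flat = [[x[ci][i][j] for i in range(h) for j in range(wd)]
            for ci in range(c_in)]                      # C_in x (H*W)
    prod = [[sum(w[co][ci] * flat[ci][p] for ci in range(c_in))
             for p in range(h * wd)] for co in range(len(w))]
    return [[[prod[co][i * wd + j] for j in range(wd)] for i in range(h)]
            for co in range(len(w))]

x = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 channels, 2x2
w = [[1.0, 0.5], [2.0, -1.0]]                             # 2 out, 2 in
assert conv1x1(x, w) == conv1x1_as_matmul(x, w)
```

This equivalence is why a GEMM micro-kernel can serve the pointwise Conv2d operator directly.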

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md

Lines changed: 2 additions & 1 deletion
@@ -6,6 +6,7 @@ weight: 7
 layout: learningpathall
 ---

+## Learn how batch matrix multiply accelerates deep learning on Arm

 The batch matrix multiply operator (`torch.bmm`) is commonly used for efficient matrix operations in deep learning models. When running on Arm systems with XNNPACK, this operator is lowered to a general matrix multiplication (GEMM) implementation. If your input shapes and data types match supported patterns, XNNPACK can automatically dispatch these operations to KleidiAI micro-kernels, which are optimized for Arm hardware.
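As a reference for what `torch.bmm` computes before it is lowered to GEMM, here is a minimal pure-Python sketch (illustrative only, with made-up data):

```python
def bmm(a, b):
    """Batch matrix multiply: a is [B][M][K], b is [B][K][N] -> [B][M][N].
    Each batch entry is an independent GEMM, which is why one GEMM
    micro-kernel (such as KleidiAI's) can serve the whole operator."""
    out = []
    for ab, bb in zip(a, b):
        m, k, n = len(ab), len(bb), len(bb[0])
        out.append([[sum(ab[i][p] * bb[p][j] for p in range(k))
                     for j in range(n)] for i in range(m)])
    return out

a = [[[1, 2], [3, 4]]]   # batch of 1, 2x2 matrices
b = [[[5, 6], [7, 8]]]
print(bmm(a, b))         # -> [[[19, 22], [43, 50]]]
```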

@@ -78,7 +79,7 @@ When exporting models, the **generate_etrecord** option is enabled to produce th
 These ETRecord files are essential for subsequent model analysis and performance evaluation.
 {{%/notice%}}

-### Run the complete benchmark model script
+## Run the complete benchmark model script
 Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script.
 This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation:

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md

Lines changed: 4 additions & 2 deletions
@@ -1,12 +1,14 @@
 ---
-title: Analyzing ETRecord and ETDump
+title: Analyze ETRecord and ETDump
 weight: 9

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-You will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and when eligible it was accelerated by KleidiAI micro-kernels.
+## Overview
+
+In this section, you will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and, when eligible, accelerated by KleidiAI micro-kernels.

 The Inspector analyzes the runtime data from the ETDump file and maps it to the corresponding operators in the Edge Dialect Graph.
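The correlation step this hunk describes typically looks like the sketch below, based on the ExecuTorch devtools documentation. The module path and constructor arguments may differ between ExecuTorch versions, and the file paths are placeholders for artifacts produced in earlier sections:

```python
# Sketch: correlate runtime events (.etdump) with the lowered graph (.etrecord)
# using the ExecuTorch Inspector. Paths are placeholders; run where both
# artifacts exist. Guarded so the sketch degrades gracefully off-target.
try:
    from executorch.devtools import Inspector

    inspector = Inspector(
        etdump_path="etdump.etdp",   # runtime profiling data from executor_runner
        etrecord="model.etrecord",   # ahead-of-time graph and backend mapping
    )
    # Per-operator timing mapped back to the Edge Dialect graph; delegated
    # nodes show which backend (e.g. XNNPACK) executed them.
    inspector.print_data_tabular()
except ImportError:
    print("executorch is not installed on this host; run this on your analysis setup")
```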

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md

Lines changed: 8 additions & 8 deletions
@@ -1,19 +1,19 @@
 ---
-title: Benchmark a KleidiAI Micro-kernel in ExecuTorch
+title: Benchmark a KleidiAI micro-kernel in ExecuTorch

 minutes_to_complete: 30

-who_is_this_for: This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 (AArch64) platforms supporting SME/SME2 instructions.
+who_is_this_for: This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 platforms supporting SME/SME2 instructions.

 learning_objectives:
-    - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled including SME/SME2 instructions
-    - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions
-    - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data
-    - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior
+    - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions
+    - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions
+    - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data
+    - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior

 prerequisites:
-    - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space.
-    - An Arm64 target system with support for SME or SME2. Refer to [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support).
+    - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space
+    - An Arm64 target system with support for SME or SME2 - see the Learning Path [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support)

 author: Qixiang Xu
