
Commit ec28558

Merge pull request #2585 from madeline-underwood/kleidi
Kleidi_JA to sign off
2 parents a79e014 + ed894c9 commit ec28558

File tree: 10 files changed, +84/-87 lines

Lines changed: 14 additions & 15 deletions
@@ -1,16 +1,15 @@
 ---
-title: Environment setup
+title: Set up your environment
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
 
-### Python Environment Setup
+## Set up your Python environment
 
-Before building ExecuTorch, it is highly recommended to create an isolated Python environment.
-This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs.
+Before building ExecuTorch, it is highly recommended to create an isolated Python environment. This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs:
 
 ```bash
 sudo apt update
@@ -19,11 +18,11 @@ python3 -m venv pyenv
 source pyenv/bin/activate
 
 ```
-Once activated, all subsequent steps should be executed within this Python virtual environment.
+Keep your Python virtual environment activated while you complete the next steps. This ensures all dependencies install in the correct location.
 
-### Download the ExecuTorch Source Code
+## Download the ExecuTorch source code
 
-Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.
+Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched:
 
 ```bash
 export WORKSPACE=$HOME
@@ -33,27 +32,27 @@ git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.g
 ```
 
 {{% notice Note %}}
-The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases.
+The instructions in this Learning Path were tested on ExecuTorch v1.0.0. Commands or configuration options might differ in later releases.
 {{% /notice %}}
 
-### Build and Install the ExecuTorch Python Components
+## Build and install the ExecuTorch Python components
 
-Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment.
-This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
+Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment. This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
 
 Run the following command from your ExecuTorch workspace:
 ```bash
 cd $WORKSPACE/executorch
 CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh
 
 ```
-This will build ExecuTorch and its dependencies using cmake, enabling optional developer utilities such as ETDump and Inspector.
-
-### Verify the Installation
-After the build completes successfully, verify that ExecuTorch was installed into your current Python environment:
+This builds ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.
+
+## Verify the installation
+After the build completes, check that ExecuTorch is installed in your active Python environment. Run the following command:
 
 ```bash
 python -c "import executorch; print('Executorch build and install successfully.')"
 ```
 
-If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels.
+If you see the success message, your environment is ready. You can now move on to cross-compiling and preparing to profile KleidiAI micro-kernels.
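The one-line verification above can be generalized into a small stdlib-only helper that checks several packages at once without importing them. This sketch is illustrative and not part of the Learning Path; the package names it checks are assumptions about your environment.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if module_name is importable in the current environment."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # "executorch" and "torch" are the packages this setup is expected to provide.
    for name in ("executorch", "torch"):
        print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

Unlike `python -c "import executorch"`, `find_spec` only consults the import machinery, so it reports availability without running any package initialization code.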

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md

Lines changed: 17 additions & 19 deletions
@@ -6,21 +6,21 @@ weight: 3
 layout: learningpathall
 ---
 
+## Overview
 
-In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled.
-Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
+In this section, you'll cross-compile ExecuTorch for an Arm64 (AArch64) target with XNNPACK and KleidiAI support. Cross-compiling builds all binaries and libraries for your Arm device, even if your development system uses x86_64. This process lets you run and test ExecuTorch on Arm hardware, taking advantage of Arm-optimized performance features.
 
-### Install the Cross-Compilation Toolchain
-On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
+## Install the cross-compilation toolchain
+On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, which is a fast build backend commonly used by CMake:
 ```bash
 sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
 ```
 
-### Run CMake Configuration
+## Run CMake configuration
 
 Use CMake to configure the ExecuTorch build for the AArch64 target.
 
-The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.
+The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI:
 
 ```bash
 
@@ -53,34 +53,32 @@ cmake -GNinja \
 
 ```
 
-#### Key Build Options
+## Key build options
 
 | **CMake Option** | **Description** |
 | --- | --- |
-| `EXECUTORCH_BUILD_XNNPACK` | Builds the **XNNPACK backend**, which provides highly optimized CPU operators (GEMM, convolution, etc.) for Arm64 platforms. |
-| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables **Arm KleidiAI** acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
-| `EXECUTORCH_BUILD_DEVTOOLS` | Builds **developer tools** such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
-| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the **Module API** extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
-| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the **Tensor API** extension, providing convenience functions for creating, manipulating, and managing tensors in C++ runtime. |
-| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building **optimized kernel implementations** for better performance on supported architectures. |
-| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the **event tracing** feature, which records performance and operator timing information for runtime analysis. |
+| `EXECUTORCH_BUILD_XNNPACK` | Builds the XNNPACK backend, which provides highly optimized CPU operators (such as GEMM and convolution) for Arm64 platforms. |
+| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables Arm KleidiAI acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
+| `EXECUTORCH_BUILD_DEVTOOLS` | Builds developer tools such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
+| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the Module API extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
+| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the Tensor API extension, providing convenience functions for creating, manipulating, and managing tensors in the C++ runtime. |
+| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building optimized kernel implementations for better performance on supported architectures. |
+| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the event tracing feature, which records performance and operator timing information for runtime analysis. |
 
 
 
-### Build ExecuTorch
+## Build ExecuTorch
 Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:
 
 ```bash
 cmake --build . -j$(nproc)
 ```
 CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.
 
-### Locate the executor_runner Binary
+## Locate the executor_runner binary
 If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:
 
 ```output
 build-arm64/executor_runner
 ```
-You will use executor_runner in the later sections on your Arm64 target as standalone binary used to execute and profile ExecuTorch models directly from the command line.
-This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration.
+You’ll use `executor_runner` in later sections to execute and profile ExecuTorch models directly from the command line on your Arm64 target. This standalone binary lets you run models using the XNNPACK backend with KleidiAI acceleration, making it easy to benchmark and analyze performance on Arm devices.
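A quick way to confirm the binary really targets AArch64, without copying it to the device, is to inspect its ELF header (this is what the `file` command does). The sketch below is illustrative only; the binary path is a hypothetical location from the build step, and the `e_machine` codes are taken from the ELF specification (62 = x86-64, 183 = AArch64).

```python
import struct

# e_machine values from the ELF specification: 62 = x86-64, 183 = AArch64.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}

def elf_machine(header: bytes) -> str:
    """Decode the target architecture from the first 20 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # The 16-byte e_ident block is followed by e_type (2 bytes),
    # then e_machine as a little-endian 16-bit value at offset 18.
    (e_machine,) = struct.unpack_from("<H", header, 18)
    return ELF_MACHINES.get(e_machine, f"unknown({e_machine})")

# Usage (hypothetical path from the build step above):
#   with open("build-arm64/executor_runner", "rb") as f:
#       print(elf_machine(f.read(20)))
```

If this reports `x86_64` for your `executor_runner`, the CMake configuration picked up the host compiler instead of the cross toolchain.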

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md

Lines changed: 11 additions & 11 deletions
@@ -1,19 +1,19 @@
 ---
-title: KleidiAI micro-kernels support in ExecuTorch
+title: Accelerate ExecuTorch operators with KleidiAI micro-kernels
 weight: 4
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.
+## Understand how KleidiAI micro-kernels integrate with ExecuTorch
 
-Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
+ExecuTorch uses XNNPACK as its main CPU backend to run and optimize operators like convolutions, matrix multiplications, and fully connected layers.
 
-These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
+KleidiAI SME (Scalable Matrix Extension) micro-kernels are integrated into XNNPACK to boost performance on supported Arm platforms. These micro-kernels accelerate operators that use specific data types and quantization settings in ExecuTorch models.
 
-When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path.
+When an operator matches a supported configuration, ExecuTorch automatically uses the KleidiAI-optimized path for faster execution. If an operator is not supported by KleidiAI, ExecuTorch falls back to the standard XNNPACK implementation. This ensures your models always run correctly, even if they do not use KleidiAI acceleration.
 
-Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models.
+## Identify operators that can benefit from KleidiAI acceleration
 
 In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration:
 - XNNFullyConnected – Fully connected (dense) layers
@@ -23,15 +23,15 @@ In ExecuTorch v1.0.0, the following operator types are implemented through the X
 However, not all instances of these operators are accelerated by KleidiAI.
 
 Acceleration eligibility depends on several operator attributes and backend support, including:
-- Data types (e.g., float32, int8, int4)
-- Quantization schemes (e.g., symmetric/asymmetric, per-tensor/per-channel)
+- Data types (for example, float32, int8, int4)
+- Quantization schemes (for example, symmetric/asymmetric, per-tensor/per-channel)
 - Tensor memory layout and alignment
 - Kernel dimensions and stride settings
 
 The following section provides detailed information on which operator configurations can benefit from KleidiAI acceleration, along with their corresponding data type and quantization support.
 
 
-### XNNFullyConnected
+## XNNFullyConnected
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
 | --- | --- | --- | --- |
@@ -42,14 +42,14 @@ The following section provides detailed information on which operator configurat
 | qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 |
 
 
-### XNNConv2d
+## XNNConv2d
 | XNNPACK GEMM Variant | Input DataType | Filter DataType | Output DataType |
 | --- | --- | --- | --- |
 | pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |
 | pqs8_qc8w_gemm | Asymmetric INT8 quantization (NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization (NHWC) |
 
 
-### XNNBatchMatrixMultiply
+## XNNBatchMatrixMultiply
 | XNNPACK GEMM Variant | Input A DataType | Input B DataType | Output DataType |
 | --- | --- | --- | --- |
 | pf32_gemm | FP32 | FP32 | FP32 |
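To make the quantization terminology in these tables concrete, here is a minimal pure-Python sketch of symmetric per-channel INT8 weight quantization, the weight scheme listed for `pqs8_qc8w_gemm`. It is illustrative only; XNNPACK and KleidiAI use their own packed data layouts and rounding rules internally.

```python
def quantize_per_channel_int8(weights):
    """Symmetric per-channel INT8 quantization: one scale per output
    channel (row), with the zero-point fixed at 0."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row) or 1.0  # avoid divide-by-zero on all-zero rows
        scale = max_abs / 127.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize_per_channel(quantized, scales):
    """Recover approximate float values from INT8 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

A per-tensor scheme would instead share a single scale across all rows; the per-channel form usually preserves more accuracy when channel magnitudes differ widely.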

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md

Lines changed: 10 additions & 12 deletions
@@ -6,18 +6,16 @@ weight: 5
 layout: learningpathall
 ---
 
+## Overview
+
 In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
 
 To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
 
 These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels.
 
-### Define a Simple Linear Benchmark Model
-
-The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer.
-This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels.
-
-By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
+## Define a linear benchmark model with PyTorch for ExecuTorch
+The goal is to create a minimal PyTorch model containing a single `torch.nn.Linear` layer. This lets you generate operator nodes that map directly to KleidiAI-accelerated GEMM kernels and quickly compare how each GEMM implementation performs on Arm-based hardware. By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.
 
 
 ```python
@@ -38,7 +36,7 @@ class DemoLinearModel(torch.nn.Module):
 ```
 This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.
 
-### Export FP16/FP32 model for pf16_gemm and pf32_gemm
+## Export FP16 and FP32 models for pf16_gemm and pf32_gemm variants
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
 | --- | --- | --- | --- |
@@ -89,7 +87,7 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm")
 
 ```
 
-### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm
+## Export INT8 quantized models for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variants
 INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy.
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
@@ -152,7 +150,7 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm");
 
 ```
 
-### Export INT4 quantized model for qp8_f32_qb4w_gemm
+## Export INT4 quantized model for qp8_f32_qb4w_gemm variant
 This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels.
 
 | XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
@@ -214,7 +212,7 @@ These ETRecord files are essential for subsequent model inspection and performan
 {{%/notice%}}
 
 
-### Run the Complete Benchmark Model Export Script
+## Run the benchmark model export script for ExecuTorch
 Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4).
 This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format.
 
@@ -224,7 +222,7 @@ chmod +x export-linear-model.py
 python3 ./export-linear-model.py
 ```
 
-### Verify the Generated Files
+## Verify exported ExecuTorch and KleidiAI model files
 After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory:
 
 ```bash
@@ -240,4 +238,4 @@ linear_model_qp8_f32_qb4w_gemm.pte
 linear_model_qp8_f32_qc8w_gemm.etrecord
 linear_model_qp8_f32_qc8w_gemm.pte
 ```
-At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels.
+You now have a complete set of benchmark models exported for multiple GEMM variants and quantization levels. You’re ready to move on and measure performance using ExecuTorch and KleidiAI micro-kernels on Arm-based hardware.
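As a reference for what each benchmark model computes, a fully connected layer is the GEMM y = x·Wᵀ + b. The pure-Python sketch below is illustrative only (it is not part of the export script) and mirrors the 256×256 layer at toy size so the arithmetic is easy to follow.

```python
def linear(x, weight, bias):
    """Fully connected layer: x is [batch][in], weight is [out][in],
    bias is [out]; returns [batch][out] = x @ weight^T + bias."""
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]

x = [[1.0, 2.0, 3.0]]          # batch of 1, 3 input features
weight = [[1.0, 0.0, 0.0],     # 2 output features, 3 inputs each
          [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]
print(linear(x, weight, bias))  # [[1.5, 4.5]]
```

The GEMM variants in the tables above all compute this same product; they differ only in the data types and quantization applied to `x`, `weight`, and the output.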
