Commit ed894c9

Refactor documentation: improve clarity and consistency in cross-compilation, ExecuTorch integration, and benchmarking sections
1 parent 9960589 commit ed894c9

File tree

6 files changed: +23 −18 lines changed

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

 ## Overview

-In this section, youll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled. Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
+In this section, you'll cross-compile ExecuTorch for an Arm64 (AArch64) target with XNNPACK and KleidiAI support. Cross-compiling builds all binaries and libraries for your Arm device, even if your development system uses x86_64. This process lets you run and test ExecuTorch on Arm hardware, taking advantage of Arm-optimized performance features.

 ## Install the cross-compilation toolchain
 On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, which is a fast build backend commonly used by CMake:
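The toolchain installation mentioned in this hunk's context line could look like the following on an Ubuntu host; the package names are assumed for Debian/Ubuntu and are not taken from this commit, so adjust them for your distribution:

```shell
# Assumed Debian/Ubuntu package names for the GNU AArch64 cross toolchain,
# Ninja, and CMake; verify against your distribution's package index.
sudo apt-get update
sudo apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build cmake

# Sanity check: the cross-compiler should report an aarch64 target triple.
aarch64-linux-gnu-gcc --dumpmachine
```

This is a setup fragment, not part of the commit; the Learning Path itself provides the authoritative commands.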
@@ -75,7 +75,7 @@ cmake --build . -j$(nproc)
 ```
 CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.

-## Locate the executor_runner Binary
+## Locate the executor_runner binary
 If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:

 ```output

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md

Lines changed: 5 additions & 5 deletions
@@ -5,15 +5,15 @@ weight: 4
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.
+## Understand how KleidiAI micro-kernels integrate with ExecuTorch

-Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
+ExecuTorch uses XNNPACK as its main CPU backend to run and optimize operators like convolutions, matrix multiplications, and fully connected layers.

-These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
+KleidiAI SME (Scalable Matrix Extension) micro-kernels are integrated into XNNPACK to boost performance on supported Arm platforms. These micro-kernels accelerate operators that use specific data types and quantization settings in ExecuTorch models.

-When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path.
+When an operator matches a supported configuration, ExecuTorch automatically uses the KleidiAI-optimized path for faster execution. If an operator is not supported by KleidiAI, ExecuTorch falls back to the standard XNNPACK implementation. This ensures your models always run correctly, even if they do not use KleidiAI acceleration.

-Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models.
+## Supported operator types in ExecuTorch v1.0.0

 In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration:
 - XNNFullyConnected – Fully connected (dense) layers
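The dispatch-with-fallback behavior this hunk describes can be modeled as a small conceptual sketch. All names below are illustrative only and do not correspond to real ExecuTorch or XNNPACK symbols:

```python
# Conceptual model of operator dispatch: KleidiAI path when a configuration
# is supported, standard XNNPACK fallback otherwise. The supported set here
# is hypothetical, chosen only to illustrate the decision logic.
KLEIDIAI_SUPPORTED = {
    ("fully_connected", "int8"),
    ("conv2d_pointwise", "int8"),
    ("batch_matmul", "fp16"),
}

def dispatch(op: str, dtype: str) -> str:
    """Return which kernel path a given operator configuration would take."""
    if (op, dtype) in KLEIDIAI_SUPPORTED:
        return "kleidiai"   # optimized micro-kernel path
    return "xnnpack"        # fallback keeps every model functionally correct

print(dispatch("fully_connected", "int8"))  # -> kleidiai
print(dispatch("conv2d", "fp32"))           # -> xnnpack
```

The point of the sketch is that fallback is per-operator: a single model can mix KleidiAI-accelerated and plain XNNPACK nodes.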

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ weight: 6
 layout: learningpathall
 ---

+## Understand Conv2d benchmark variants and KleidiAI acceleration
+
 In the previous section, you saw that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
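The reason matrix-multiplication micro-kernels apply to pointwise Conv2d is that a 1×1 convolution is exactly a matmul over flattened spatial positions. A minimal pure-Python sketch (no PyTorch dependency; shapes and data are made up for illustration) checks this equivalence:

```python
# Show that a 1x1 convolution over [C_in][H][W] with weights [C_out][C_in]
# equals a (C_out x C_in) @ (C_in x H*W) matrix multiply, reshaped back.

def conv1x1(x, w):
    """Direct 1x1 convolution. x: [C_in][H][W], w: [C_out][C_in]."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w[co][ci] * x[ci][i][j] for ci in range(c_in))
              for j in range(wd)] for i in range(h)]
            for co in range(len(w))]

def conv1x1_as_matmul(x, w):
    """Flatten HxW into one axis, do a plain matmul, reshape back."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    flat = [[x[ci][i][j] for i in range(h) for j in range(wd)]
            for ci in range(c_in)]                      # C_in x (H*W)
    prod = [[sum(w[co][ci] * flat[ci][p] for ci in range(c_in))
             for p in range(h * wd)] for co in range(len(w))]
    return [[[prod[co][i * wd + j] for j in range(wd)] for i in range(h)]
            for co in range(len(w))]

x = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 channels, 2x2
w = [[1.0, 0.5], [2.0, -1.0]]                             # 2 out, 2 in
assert conv1x1(x, w) == conv1x1_as_matmul(x, w)
```

This equivalence is why a GEMM micro-kernel can serve the pointwise Conv2d operator directly.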

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md

Lines changed: 2 additions & 1 deletion
@@ -6,6 +6,7 @@ weight: 7
 layout: learningpathall
 ---

+## Learn how batch matrix multiply accelerates deep learning on Arm

 The batch matrix multiply operator (`torch.bmm`) is commonly used for efficient matrix operations in deep learning models. When running on Arm systems with XNNPACK, this operator is lowered to a general matrix multiplication (GEMM) implementation. If your input shapes and data types match supported patterns, XNNPACK can automatically dispatch these operations to KleidiAI micro-kernels, which are optimized for Arm hardware.
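As a reference for what `torch.bmm` computes before it is lowered to GEMM, here is a minimal pure-Python sketch (illustrative only, with made-up data):

```python
def bmm(a, b):
    """Batch matrix multiply: a is [B][M][K], b is [B][K][N] -> [B][M][N].
    Each batch entry is an independent GEMM, which is why one GEMM
    micro-kernel (such as KleidiAI's) can serve the whole operator."""
    out = []
    for ab, bb in zip(a, b):
        m, k, n = len(ab), len(bb), len(bb[0])
        out.append([[sum(ab[i][p] * bb[p][j] for p in range(k))
                     for j in range(n)] for i in range(m)])
    return out

a = [[[1, 2], [3, 4]]]   # batch of 1, 2x2 matrices
b = [[[5, 6], [7, 8]]]
print(bmm(a, b))         # -> [[[19, 22], [43, 50]]]
```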

@@ -78,7 +79,7 @@ When exporting models, the **generate_etrecord** option is enabled to produce th
 These ETRecord files are essential for subsequent model analysis and performance evaluation.
 {{%/notice%}}

-### Run the complete benchmark model script
+## Run the complete benchmark model script
 Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script.
 This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation:

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md

Lines changed: 4 additions & 2 deletions
@@ -1,12 +1,14 @@
 ---
-title: Analyzing ETRecord and ETDump
+title: Analyze ETRecord and ETDump
 weight: 9

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-You will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and when eligible it was accelerated by KleidiAI micro-kernels.
+## Overview
+
+In this section, you will use the ExecuTorch Inspector to correlate runtime events from the .etdump with the lowered graph and backend mapping from the .etrecord. This lets you confirm that a node was delegated to XNNPACK and, when eligible, accelerated by KleidiAI micro-kernels.

 The Inspector analyzes the runtime data from the ETDump file and maps it to the corresponding operators in the Edge Dialect Graph.
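The correlation step this hunk describes typically looks like the sketch below, based on the ExecuTorch devtools documentation. The module path and constructor arguments may differ between ExecuTorch versions, and the file paths are placeholders for artifacts produced in earlier sections:

```python
# Sketch: correlate runtime events (.etdump) with the lowered graph (.etrecord)
# using the ExecuTorch Inspector. Paths are placeholders; run where both
# artifacts exist. Guarded so the sketch degrades gracefully off-target.
try:
    from executorch.devtools import Inspector

    inspector = Inspector(
        etdump_path="etdump.etdp",   # runtime profiling data from executor_runner
        etrecord="model.etrecord",   # ahead-of-time graph and backend mapping
    )
    # Per-operator timing mapped back to the Edge Dialect graph; delegated
    # nodes show which backend (e.g. XNNPACK) executed them.
    inspector.print_data_tabular()
except ImportError:
    print("executorch is not installed on this host; run this on your analysis setup")
```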

content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md

Lines changed: 8 additions & 8 deletions
@@ -1,19 +1,19 @@
 ---
-title: Benchmark a KleidiAI Micro-kernel in ExecuTorch
+title: Benchmark a KleidiAI micro-kernel in ExecuTorch

 minutes_to_complete: 30

-who_is_this_for: This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 (AArch64) platforms supporting SME/SME2 instructions.
+who_is_this_for: This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 platforms supporting SME/SME2 instructions.

 learning_objectives:
-    - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled including SME/SME2 instructions
-    - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions
-    - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data
-    - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior
+    - Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions
+    - Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions
+    - Use the executor_runner tool to run kernel workloads and collect ETDump profiling data
+    - Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior

 prerequisites:
-    - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space.
-    - An Arm64 target system with support for SME or SME2. Refer to [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support).
+    - An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space
+    - An Arm64 target system with support for SME or SME2 - see the Learning Path [Devices with native SME2 support](https://learn.arm.com/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started/#devices-with-native-sme2-support)

 author: Qixiang Xu
