Merge pull request #2572 from pareenaverma/content_review

pareenaverma · web-flow · commit 53aabada65e0 · 2025-11-20T19:56:55.000-05:00
Tech review of ET KleidiAI profiling LP
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/01-env-setup.md
@@ -10,45 +10,50 @@ layout: learningpathall
 ### Python Environment Setup 
 
 Before building ExecuTorch, it is highly recommended to create an isolated Python environment.
-This prevents dependency conflicts with your system Python installation and ensures a clean build environment.
+This prevents dependency conflicts with your system Python installation and ensures that all required build and runtime dependencies remain consistent across runs.
 
 ```bash 
-cd $WORKSPACE
+sudo apt update
+sudo apt install -y python3 python3.12-dev python3-venv build-essential cmake
 python3 -m venv pyenv
 source pyenv/bin/activate
 
 ```
-All subsequent steps should be executed within this Python virtual environment.
+Once activated, all subsequent steps should be executed within this Python virtual environment.
 
 ### Download the ExecuTorch Source Code
 
 Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.
 
-```bash 
+```bash
+export WORKSPACE=$HOME
 cd $WORKSPACE
 git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git
 
 ```
 
-   > **Note:**  
-   > The instructions in this guide are based on **ExecuTorch v1.0.0**.  
-   > Commands or configuration options may differ in later releases.
+  {{% notice Note %}}
+  The instructions in this guide are based on ExecuTorch v1.0.0. Commands or configuration options may differ in later releases.
+  {{% /notice %}}
 
 ### Build and Install the ExecuTorch Python Components
 
-Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled.
+Next, you’ll build the ExecuTorch Python bindings and install them into your active virtual environment.
+This process compiles the C++ runtime, links hardware-optimized backends such as KleidiAI and XNNPACK, and enables optional developer utilities for debugging and profiling.
 
+Run the following command from your ExecuTorch workspace:
 ```bash 
 cd $WORKSPACE/executorch
 CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh
 
 ```
+This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.
 
-This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.
-
-After installation completes successfully, you can verify the environment by running:
+### Verify the Installation
+After the build completes successfully, verify that ExecuTorch was installed into your current Python environment:
 
 ```bash 
 python -c "import executorch; print('Executorch build and install successfully.')"
 ```
 
+If the output confirms success, you’re ready to begin cross-compilation and profiling preparation for KleidiAI micro-kernels.
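The verification step above can be made slightly more informative. The following is a minimal sketch, not part of the learning path itself: it reports whether the interpreter is running inside a virtual environment and whether the `executorch` package can be located. The helper names `in_virtualenv` and `module_available` are hypothetical and use only the Python standard library.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def module_available(name: str) -> bool:
    """Return True when the top-level module `name` can be located."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("executorch importable:", module_available("executorch"))
```

If the second line prints `False` while the first prints `True`, the install script most likely ran against a different interpreter than the one in the activated environment.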
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/02-cross-compile.md
@@ -1,19 +1,26 @@
 ---
-title: Cross-Compile ExecuTorch for the Aarch64 platform
+title: Cross-Compile ExecuTorch for the AArch64 platform
 weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
 
-This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled.
-All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc).
+In this section, you’ll cross-compile ExecuTorch for an AArch64 (Arm64) target platform with both XNNPACK and KleidiAI support enabled.
+Cross-compiling ensures that all binaries and libraries are built for your Arm target hardware, even when your development host is an x86_64 machine.
 
+### Install the Cross-Compilation Toolchain
+On your x86_64 Linux host, install the GNU Arm cross-compilation toolchain along with Ninja, a fast build backend commonly used by CMake:
+```bash
+sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu ninja-build -y
+```
 
 ### Run CMake Configuration 
 
-Use CMake to configure the ExecuTorch build for Aarch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration: 
+Use CMake to configure the ExecuTorch build for the AArch64 target.
+
+The command below enables all key runtime extensions, developer tools, and optimized backends including XNNPACK and KleidiAI.
 
 ```bash 
 
@@ -61,18 +68,19 @@ cmake -GNinja \
 
 
 ### Build ExecuTorch 
+Once CMake configuration completes successfully, compile the ExecuTorch runtime and its associated developer tools:
 
 ```bash 
 cmake --build . -j$(nproc)
-
 ```
+CMake invokes Ninja to perform the actual build, generating both static libraries and executables for the AArch64 target.
 
-If the build completes successfully, you should find the executor_runner binary under the directory:
+### Locate the executor_runner Binary
+If the build completes successfully, you should see the main benchmarking and profiling utility, executor_runner, under:
 
-```bash
+```output
 build-arm64/executor_runner
-
 ```
-
+In later sections, you will use executor_runner on your Arm64 target as a standalone binary to execute and profile ExecuTorch models directly from the command line.
 This binary can be used to run ExecuTorch models on the ARM64 target device using the XNNPACK backend with KleidiAI acceleration.
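Because the build runs on an x86_64 host, it can be worth confirming that the resulting binary really targets AArch64 before copying it to the device. The sketch below is an illustration only, assuming the binary is a standard ELF file: it reads the ELF header and compares the `e_machine` field against the AArch64 value (183). The function names `elf_machine` and `is_aarch64_binary` are hypothetical.

```python
import struct

EM_AARCH64 = 183  # e_machine value assigned to AArch64 in the ELF specification


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA) encodes byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, directly after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return machine


def is_aarch64_binary(path: str) -> bool:
    """Return True when the ELF file at `path` targets AArch64."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == EM_AARCH64
```

Calling `is_aarch64_binary("build-arm64/executor_runner")` from the build directory's parent should return `True` for a correctly cross-compiled runner. The `file` command reports the same information if you prefer a shell check.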
 
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/03-executorch-node-kai-kernel.md
@@ -5,9 +5,9 @@ weight: 4
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization.
+ExecuTorch uses XNNPACK as its primary CPU backend to execute and optimize operators such as convolutions, matrix multiplications, and fully connected layers.
 
-Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
+Within this architecture, a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.
 
 These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.
 
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/04-create-fc-model.md
@@ -6,14 +6,16 @@ weight: 5
 layout: learningpathall
 ---
 
-In the previous section, we discussed that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
+In the previous section, you saw that the Fully Connected operator supports multiple GEMM (General Matrix Multiplication) variants.
 
-To evaluate the performance of these variants across different hardware platforms, we will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
+To evaluate the performance of these variants across different hardware platforms, you will construct a series of benchmark models that utilize the Fully Connected operator with different GEMM implementations for comparative analysis.
 
+These models will be used later with executor_runner to measure throughput, latency, and ETDump traces for various KleidiAI micro-kernels.
 
-### Fully connected benchmark model
+### Define a Simple Linear Benchmark Model
 
-In the following example model, we use simple model to generate nodes that can be accelerated by Kleidiai. 
+The goal is to create a minimal PyTorch model containing a single torch.nn.Linear layer.
+This allows you to generate operator nodes that can be directly mapped to KleidiAI-accelerated GEMM kernels.
 
 By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
 
@@ -34,8 +36,9 @@ class DemoLinearModel(torch.nn.Module):
         return (torch.randn(1, 256, dtype=dtype),)
 
 ```
+This model creates a single 256×256 linear layer, which can easily be exported in different data types (FP32, FP16, INT8, INT4) to match KleidiAI’s GEMM variants.
 
-### Export FP16/FP32 model for pf16_gemm/pf32_gemm Variants
+### Export FP16/FP32 model for pf16_gemm and pf32_gemm 
 
 | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType                      |
 | ------------------  | ---------------------------- | --------------------------------------- | ---------------------------- |
@@ -86,15 +89,16 @@ export_executorch_model(torch.float32,"linear_model_pf32_gemm")
 
 ```
 
-### Export int8 quantized model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variant
+### Export INT8 Quantized Model for pqs8_qc8w_gemm and qp8_f32_qc8w_gemm
+INT8 quantized GEMMs are designed to reduce memory footprint and improve performance while maintaining acceptable accuracy.
 
 | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType                      |
 | ------------------  | ---------------------------- | --------------------------------------- | ---------------------------- |
 | qp8_f32_qc8w_gemm | Asymmetric INT8 per-row quantization | Per-channel symmetric INT8 quantization | FP32                         |
 | pqs8_qc8w_gemm    | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
 
 
-The following code demonstrates how to quantized a model that leverages the pqs8_qc8w_gemm/qp8_f32_qc8w_gemm  variant to accelerate computation:
+The following code demonstrates how to quantize a model that leverages the pqs8_qc8w_gemm and qp8_f32_qc8w_gemm variants to accelerate computation:
 
 ```python 
 
@@ -148,7 +152,9 @@ export_int8_quantize_model(True,"linear_model_qp8_f32_qc8w_gemm");
 
 ```
 
-### Export int4 quantized model for qp8_f32_qb4w_gemm variant
+### Export INT4 quantized model for qp8_f32_qb4w_gemm 
+This final variant represents KleidiAI’s INT4 path, accelerated by SME2 micro-kernels.
+
 | XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType                      |
 | ------------------  | ---------------------------- | --------------------------------------- | ---------------------------- |
 | qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32                         |
@@ -200,17 +206,26 @@ def export_int4_quantize_model(dynamic: bool, model_name: str):
     etrecord.save(etr_file)
 
 export_int4_quantize_model(False,"linear_model_qp8_f32_qb4w_gemm");
-
-
 ```
 
-**NOTE:**
-
+{{%notice Note%}}
 When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file.
 These ETRecord files are essential for subsequent model inspection and performance analysis using the ExecuTorch Inspector API.
+{{%/notice%}}
 
 
-After running this script, both the PTE model file and the etrecord file are generated.
+### Run the Complete Benchmark Model Export Script
+Instead of manually executing each code block explained above, you can download and run the full example script that builds and exports all linear-layer benchmark models (FP16, FP32, INT8, and INT4).
+This script automatically performs quantization, partitioning, lowering, and export to ExecuTorch format.
+
+```bash
+wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/arm-learning-paths/refs/heads/main/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-linear-model.py
+chmod +x export-linear-model.py
+python3 ./export-linear-model.py
+```
+
+### Verify the Generated Files
+After successful execution, you should see both .pte (ExecuTorch model) and .etrecord (profiling metadata) files in the model/ directory:
 
 ``` bash 
 $ ls model/ -1
@@ -225,5 +240,4 @@ linear_model_qp8_f32_qb4w_gemm.pte
 linear_model_qp8_f32_qc8w_gemm.etrecord
 linear_model_qp8_f32_qc8w_gemm.pte
 ```
-
-The complete source code is available [here](../export-linear-model.py).
+At this point, you have a suite of benchmark models exported for multiple GEMM variants and quantization levels.
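The directory listing can also be checked programmatically. This is a minimal sketch, assuming the exported files live in a single directory such as `model/` and that each `.pte` file should have an `.etrecord` file with the same stem. The helper name `missing_etrecords` is hypothetical.

```python
from pathlib import Path


def missing_etrecords(model_dir: str) -> list[str]:
    """Return the names of .pte files that lack a sibling .etrecord file."""
    directory = Path(model_dir)
    return sorted(
        pte.name
        for pte in directory.glob("*.pte")
        # with_suffix swaps ".pte" for ".etrecord" while keeping the stem.
        if not pte.with_suffix(".etrecord").exists()
    )
```

An empty list from `missing_etrecords("model")` means every exported model has the ETRecord that the Inspector API needs later on.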
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/05-create-conv2d-model.md
@@ -6,22 +6,22 @@ weight: 6
 layout: learningpathall
 ---
 
-In the previous section, we discussed that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
+In the previous section, you saw that both INT8-quantized Conv2d and pointwise (1×1) Conv2d operators can be accelerated using KleidiAI’s matrix-multiplication micro-kernels.
 
 
 | XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType                      |
 | ------------------  | ---------------------------- | --------------------------------------- | ---------------------------- |
 | pqs8_qc8w_gemm | Asymmetric INT8 quantization(NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) |
 | pf32_gemm    | FP32                         | FP32, pointwise (1×1)                   | FP32                         |
 
-To evaluate the performance of Conv2d operators across multiple hardware platforms, we create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.
+To evaluate the performance of Conv2d operators across multiple hardware platforms, you will create a set of benchmark models that utilize different GEMM implementation variants within the convolution operators for systematic comparative analysis.
 
 
-### INT8-quantized Conv2d benchmark model
+### INT8-Quantized Conv2d benchmark model
 
 The following example defines a simple model to generate INT8-quantized Conv2d nodes that can be accelerated by KleidiAI.
 
-By adjusting some of the model’s input parameters, we can also simulate the behavior of nodes that appear in real-world models.
+By adjusting some of the model’s input parameters, you can also simulate the behavior of nodes that appear in real-world models.
 
 
 ```python
@@ -100,7 +100,7 @@ export_int8_quantize_conv2d_model("qint8_conv2d_pqs8_qc8w_gemm");
 
 ### PointwiseConv2d benchmark model
 
-In the following example model, we use simple model to generate pointwise Conv2d nodes that can be accelerated by Kleidiai. 
+In the following example, you will use a simple model to generate pointwise Conv2d nodes that can be accelerated by KleidiAI.
 
 As before, input parameters can be adjusted to simulate real-world model behavior.
 
@@ -158,10 +158,21 @@ export_pointwise_model("pointwise_conv2d_pf32_gemm")
 
 ```
 
-**NOTES:** 
-
+{{%notice Note%}}
 When exporting models, the generate_etrecord option is enabled to produce the .etrecord file alongside the .pte model file.
 These ETRecord files are essential for subsequent model analysis and performance evaluation.
+{{%/notice%}}
+
+
+### Run the Complete Benchmark Model Script
+Rather than executing each block by hand, download and run the full export script. It will generate both Conv2d variants, run quantization (INT8) where applicable, partition to XNNPACK, lower, and export to ExecuTorch .pte together with .etrecord metadata.
+
+```bash
+wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py
+chmod +x export-conv2d.py
+python3 ./export-conv2d.py
+```
+### Validate Outputs
 
 After running this script, both the PTE model file and the etrecord file are generated.
 
@@ -173,4 +184,3 @@ pointwise_conv2d_pf32_gemm.etrecord
 pointwise_conv2d_pf32_gemm.pte
 ```
 
-The complete source code is available [here](../export-conv2d.py).
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/06-create-matrix-mul-model.md
@@ -6,9 +6,9 @@ weight: 7
 layout: learningpathall
 ---
 
-In the previous section, we discussed that the Batch Matrix Multiply operator supports multiple GEMM (General Matrix Multiplication) variants.
+The Batch Matrix Multiply operator (torch.bmm) under XNNPACK lowers to GEMM and, when shapes and dtypes match supported patterns, can dispatch to KleidiAI micro-kernels on Arm. 
 
-To evaluate the performance of these variants across different hardware platforms, we construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis.
+To evaluate the performance of these variants across different hardware platforms, you will construct a set of benchmark models that utilize the batch matrix multiply operator with different GEMM implementations for comparative analysis.
 
 
 ### Matrix multiply benchmark model
@@ -72,11 +72,22 @@ export_mutrix_mul_model(torch.float32,"matrix_mul_pf32_gemm")
 
 ```
 
-**NOTE:** 
-
+{{%notice Note%}}
 When exporting models, the **generate_etrecord** option is enabled to produce the .etrecord file alongside the .pte model file.
 These ETRecord files are essential for subsequent model analysis and performance evaluation.
+{{%/notice%}}
+
+### Run the Complete Benchmark Model Script
+Instead of executing each export block manually, you can download and run the full matrix-multiply benchmark script.
+This script automatically builds and exports both FP16 and FP32 models, performing all necessary partitioning, lowering, and ETRecord generation.
+
+```bash
+wget https://raw.githubusercontent.com/pareenaverma/arm-learning-paths/refs/heads/content_review/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-matrix-mul.py
+chmod +x export-matrix-mul.py
+python3 ./export-matrix-mul.py
+```
 
+### Verify the output
 
 After running this script, both the PTE model file and the etrecord file are generated.
 
@@ -87,5 +98,6 @@ model/matrix_mul_pf16_gemm.pte
 model/matrix_mul_pf32_gemm.etrecord
 model/matrix_mul_pf32_gemm.pte
 ```
+These files are the inputs for upcoming executor_runner benchmarks, where you’ll measure and compare KleidiAI micro-kernel performance.
 
 The complete source code is available [here](../export-matrix-mul.py).
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/07-run-model.md
@@ -1,25 +1,35 @@
 ---
-title: Run model and generate the etdump
+title: Run model and generate the ETDump
 weight: 8
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-After generating the model, we can now run it on an ARM64 platform using the following command:
+### Copy artifacts to your Arm64 target
+From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device:
+
+```bash
+scp $WORKSPACE/build-arm64/executor_runner <arm_user>@<arm_host>:~/bench/
+scp -r model/ <arm_user>@<arm_host>:~/bench/
+```
+
+### Run a model and emit ETDump
+Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte).
+The flags below tell executor_runner where to write the ETDump and how many times to execute.
 
 ```bash 
-cd $WORKSPACE 
-/build-arm64/executor_runner -etdump_path model/linear_model_f32.etdump -model_path model/linear_model_f32.pte -num_executions=1 -cpu_threads 1
+cd ~/bench
+./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1
 
 ```
 
 You can adjust the number of execution threads and the number of times the model is invoked.
 
 
-You should see output similar to the example below.
+You should see logs like:
 
-```bash
+```output
 D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace
 D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0
 D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED
@@ -42,6 +52,6 @@ OutputX 0: tensor(sizes=[1, 256], [
 I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'.
 
 ```
+If execution succeeds, an ETDump file is created next to your model. You will load the .etdump in the next section and analyze which operators dispatched to KleidiAI and how each micro-kernel performed.
 
-If the execution is successful, an etdump file will also be generated.
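To profile every exported model rather than one at a time, the command line shown in this file can be generated per model. The sketch below only builds argument lists and does not launch anything. It reuses the flags that appear in the command above (`-etdump_path`, `-model_path`, `-num_executions`, `-cpu_threads`). The helper names `runner_command` and `commands_for`, and the choice to write each ETDump next to its model, are assumptions for illustration.

```python
from pathlib import Path


def runner_command(model: Path, runner: str = "./executor_runner",
                   executions: int = 1, threads: int = 1) -> list[str]:
    """Build the executor_runner argument list for one .pte model."""
    # Write the ETDump next to the model, using the same file stem.
    etdump = model.with_suffix(".etdump")
    return [
        runner,
        "-etdump_path", str(etdump),
        "-model_path", str(model),
        f"-num_executions={executions}",
        "-cpu_threads", str(threads),
    ]


def commands_for(model_dir: str) -> list[list[str]]:
    """Build one command per .pte file in `model_dir`, in sorted order."""
    return [runner_command(p) for p in sorted(Path(model_dir).glob("*.pte"))]
```

On the Arm device, each list returned by `commands_for("model")` could be passed to `subprocess.run(cmd, check=True)` from the directory that holds `executor_runner`.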
 
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/08-analyze-etdump.md
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/_index.md
diff --git a/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py b/content/learning-paths/mobile-graphics-and-gaming/measure-kleidiai-kernel-performance-on-executorch/export-conv2d.py