From 2a38798a1c8276eff1f479e9e6a0fd1813136bc9 Mon Sep 17 00:00:00 2001
From: Adrian Lundell
Date: Mon, 17 Feb 2025 13:55:58 +0100
Subject: [PATCH] Arm backend: Add ethos_u_minimal_example jupyter notebook

This example provides a good base understanding of all steps needed to
lower and run a PyTorch module through the EthosUBackend. It also works
as a starting point for creating scripts for lowering a custom network.

Change-Id: I1a70dec8f81dd7be6c78c5c3a8b62cd1b12ac9c1
---
 examples/arm/README.md                     |  10 +
 examples/arm/ethos_u_minimal_example.ipynb | 284 +++++++++++++++++++++
 2 files changed, 294 insertions(+)
 create mode 100644 examples/arm/ethos_u_minimal_example.ipynb

diff --git a/examples/arm/README.md b/examples/arm/README.md
index c74eda7ae2b..8762e7ccdd1 100644
--- a/examples/arm/README.md
+++ b/examples/arm/README.md
@@ -32,6 +32,16 @@ $ source executorch/examples/arm/ethos-u-scratch/setup_path.sh
 $ executorch/examples/arm/run.sh --model_name=mv2 --target=ethos-u85-128 [--scratch-dir=same-optional-scratch-dir-as-before]
 ```
 
+### Ethos-U minimal example
+
+See the Jupyter notebook `ethos_u_minimal_example.ipynb` for an explained minimal example of the full flow for running a
+PyTorch module on the EthosUDelegate. The notebook runs directly in some IDEs such as VS Code; otherwise, it can be run
+in your browser using
+```
+pip install jupyter
+jupyter notebook ethos_u_minimal_example.ipynb
+```
+
 ### Online Tutorial
 
 We also have a [tutorial](https://pytorch.org/executorch/stable/executorch-arm-delegate-tutorial.html) explaining the steps performed in these
diff --git a/examples/arm/ethos_u_minimal_example.ipynb b/examples/arm/ethos_u_minimal_example.ipynb
new file mode 100644
index 00000000000..d73695e9d48
--- /dev/null
+++ b/examples/arm/ethos_u_minimal_example.ipynb
@@ -0,0 +1,284 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copyright 2025 Arm Limited and/or its affiliates.\n",
+    "#\n",
+    "# This source code is licensed under the BSD-style license found in the\n",
+    "# LICENSE file in the root directory of this source tree."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Ethos-U delegate flow example\n",
+    "\n",
+    "This guide demonstrates the full flow for running a module on Arm Ethos-U using ExecuTorch. \n",
+    "It has been tested on Linux x86_64 and macOS aarch64. If something is not working for you, please raise a GitHub issue and tag Arm.\n",
+    "\n",
+    "Before you begin:\n",
+    "1. In a clean virtual environment with a compatible Python version, install ExecuTorch using `./install_executorch.sh`\n",
+    "2. Install the Arm cross-compilation toolchain and simulators using `examples/arm/setup.sh --i-agree-to-the-contained-eula`\n",
+    "3. Add the Arm cross-compilation toolchain and simulators to PATH using `examples/arm/ethos-u-scratch/setup_path.sh`\n",
+    "\n",
+    "All commands should be executed from the base `executorch` folder.\n",
+    "\n",
+    "*Some scripts in this notebook produce long output logs: configuring the 'Customizing Notebook Layout' settings to enable 'Output: scrolling' and setting 'Output: Text Line Limit' makes this more manageable.*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## AOT Flow\n",
+    "\n",
+    "The first step is creating the PyTorch module and exporting it. Exporting converts the Python code in the module into a graph structure. \n",
+    "The result is still runnable Python code, which can be displayed by printing the `graph_module` of the exported program."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "class Add(torch.nn.Module):\n",
+    "    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:\n",
+    "        return x + y\n",
+    "\n",
+    "example_inputs = (torch.ones(1, 1, 1, 1), torch.ones(1, 1, 1, 1))\n",
+    "\n",
+    "model = Add()\n",
+    "model = model.eval()\n",
+    "exported_program = torch.export.export_for_training(model, example_inputs)\n",
+    "graph_module = exported_program.module()\n",
+    "\n",
+    "_ = graph_module.print_readable()"
+   ]
+  },
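+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since the exported `graph_module` is runnable, it can be called directly to sanity-check the export. The following cell is an optional minimal check, not part of the required flow: it runs both the eager module and the exported graph module on the example inputs and asserts that they agree."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sanity check: the exported graph module should behave like the eager module\n",
+    "eager_output = model(*example_inputs)\n",
+    "exported_output = graph_module(*example_inputs)\n",
+    "assert torch.allclose(eager_output, exported_output), \"Export changed the module output\"\n",
+    "print(f\"exported_output: {exported_output}\")"
+   ]
+  },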
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To run on Ethos-U, the `graph_module` must be quantized using the `arm_quantizer`. Quantization can be done in multiple ways and can be customized for different parts of the graph; shown here is the recommended path for the EthosUBackend. Quantization also requires calibrating the module with example inputs.\n",
+    "\n",
+    "Printing the module again shows that quantization wraps the node in quantization/dequantization nodes which contain the computed quantization parameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from executorch.backends.arm.arm_backend import ArmCompileSpecBuilder\n",
+    "from executorch.backends.arm.quantizer.arm_quantizer import (\n",
+    "    EthosUQuantizer,\n",
+    "    get_symmetric_quantization_config,\n",
+    ")\n",
+    "from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e\n",
+    "\n",
+    "target = \"ethos-u55-128\"\n",
+    "\n",
+    "# Create a compilation spec describing the target for configuring the quantizer.\n",
+    "# Some args are used by the Arm Vela graph compiler later in the example. Refer to the Arm Vela documentation for an\n",
+    "# explanation of its flags: https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md\n",
+    "spec_builder = ArmCompileSpecBuilder().ethosu_compile_spec(\n",
+    "    target,\n",
+    "    system_config=\"Ethos_U55_High_End_Embedded\",\n",
+    "    memory_mode=\"Shared_Sram\",\n",
+    "    extra_flags=\"--output-format=raw --debug-force-regor\",\n",
+    ")\n",
+    "compile_spec = spec_builder.build()\n",
+    "\n",
+    "# Create and configure the quantizer to use a symmetric quantization config globally on all nodes\n",
+    "quantizer = EthosUQuantizer(compile_spec)\n",
+    "operator_config = get_symmetric_quantization_config(is_per_channel=False)\n",
+    "quantizer.set_global(operator_config)\n",
+    "\n",
+    "# Post-training quantization\n",
+    "quantized_graph_module = prepare_pt2e(graph_module, quantizer)\n",
+    "quantized_graph_module(*example_inputs)  # Calibrate the graph module with the example input\n",
+    "quantized_graph_module = convert_pt2e(quantized_graph_module)\n",
+    "\n",
+    "_ = quantized_graph_module.print_readable()\n",
+    "\n",
+    "# Create a new exported program using the quantized_graph_module\n",
+    "quantized_exported_program = torch.export.export_for_training(quantized_graph_module, example_inputs)"
+   ]
+  },
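+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional check of quantization accuracy, the converted module can still be run in Python and compared against the floating-point module. The cell below is a minimal sketch of such a comparison: for this trivial addition of ones the outputs should agree closely, while real networks will show a small quantization error."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: compare the FP32 and quantized outputs on the example inputs\n",
+    "fp32_output = model(*example_inputs)\n",
+    "int8_output = quantized_graph_module(*example_inputs)\n",
+    "print(f\"FP32 output:      {fp32_output}\")\n",
+    "print(f\"Quantized output: {int8_output}\")"
+   ]
+  },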
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess \n", + "import os \n", + "\n", + "# Setup paths\n", + "cwd_dir = os.getcwd()\n", + "et_dir = os.path.join(cwd_dir, \"..\", \"..\")\n", + "et_dir = os.path.abspath(et_dir)\n", + "script_dir = os.path.join(et_dir, \"backends\", \"arm\", \"scripts\")\n", + "\n", + "# Run build_quantized_ops_aot_lib.sh\n", + "subprocess.run(os.path.join(script_dir, \"build_quantized_ops_aot_lib.sh\"), shell=True, cwd=et_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The lowering in the EthosUBackend happens in five steps:\n", + "\n", + "1. **Lowering to core Aten operator set**: Transform module to use a subset of operators applicable to edge devices. \n", + "2. **Partitioning**: Find subgraphs which are supported for running on Ethos-U\n", + "3. **Lowering to TOSA compatible operator set**: Perform transforms to make the Ethos-U subgraph(s) compatible with TOSA \n", + "4. **Serialization to TOSA**: Compiles the graph module into a TOSA graph \n", + "5. **Compilation to NPU**: Compiles the TOSA graph into an EthosU command stream using the Arm Vela graph compiler. This makes use of the `compile_spec` created earlier.\n", + "Step 5 also prints a Network summary for each processed subgraph.\n", + "\n", + "All of this happens behind the scenes in `to_edge_transform_and_lower`. Printing the graph module shows that what is left in the graph is two quantization nodes for `x` and `y` going into an `executorch_call_delegate` node, followed by a dequantization node." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from executorch.backends.arm.ethosu_partitioner import EthosUPartitioner\n", + "from executorch.exir import (\n", + " EdgeCompileConfig,\n", + " ExecutorchBackendConfig,\n", + " to_edge_transform_and_lower,\n", + ")\n", + "from executorch.extension.export_util.utils import save_pte_program\n", + "import platform \n", + "\n", + "# Create partitioner from compile spec \n", + "partitioner = EthosUPartitioner(compile_spec)\n", + "\n", + "# Lower the exported program to the Ethos-U backend\n", + "edge_program_manager = to_edge_transform_and_lower(\n", + " quantized_exported_program,\n", + " partitioner=[partitioner],\n", + " compile_config=EdgeCompileConfig(\n", + " _check_ir_validity=False,\n", + " ),\n", + " )\n", + "\n", + "# Load quantization ops library\n", + "os_aot_lib_names = {\"Darwin\" : \"libquantized_ops_aot_lib.dylib\", \n", + " \"Linux\" : \"libquantized_ops_aot_lib.so\", \n", + " \"Windows\": \"libquantized_ops_aot_lib.dll\"}\n", + "aot_lib_name = os_aot_lib_names[platform.system()]\n", + "\n", + "libquantized_ops_aot_lib_path = os.path.join(et_dir, \"cmake-out-aot-lib\", \"kernels\", \"quantized\", aot_lib_name)\n", + "torch.ops.load_library(libquantized_ops_aot_lib_path)\n", + "\n", + "# Convert edge program to executorch\n", + "executorch_program_manager = edge_program_manager.to_executorch(\n", + " config=ExecutorchBackendConfig(extract_delegate_segments=False)\n", + " )\n", + "\n", + "executorch_program_manager.exported_program().module().print_readable()\n", + "\n", + "# Save pte file\n", + "pte_base_name = \"simple_example\"\n", + "pte_name = pte_base_name + \".pte\"\n", + "pte_path = os.path.join(cwd_dir, pte_name)\n", + "save_pte_program(executorch_program_manager, pte_name)\n", + "assert os.path.exists(pte_path), \"Build failed; no .pte-file found\"" + ] + 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Build executor runtime\n",
+    "\n",
+    "After the AOT compilation flow is done, the runtime can be cross-compiled and linked to the produced .pte-file using the Arm cross-compilation toolchain. This is done in three steps:\n",
+    "1. Build the executorch library and EthosUDelegate.\n",
+    "2. Build any external kernels required. In this example this is not needed as the graph is fully delegated, but it is included for completeness.\n",
+    "3. Build and link the `arm_executor_runner`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build executorch\n",
+    "subprocess.run(os.path.join(script_dir, \"build_executorch.sh\"), shell=True, cwd=et_dir)\n",
+    "\n",
+    "# Build portable kernels\n",
+    "subprocess.run(os.path.join(script_dir, \"build_portable_kernels.sh\"), shell=True, cwd=et_dir)\n",
+    "\n",
+    "# Build the executorch runner\n",
+    "args = f\"--pte={pte_path} --target={target}\"\n",
+    "subprocess.run(os.path.join(script_dir, \"build_executorch_runner.sh\") + \" \" + args, shell=True, cwd=et_dir)\n",
+    "\n",
+    "elf_path = os.path.join(cwd_dir, pte_base_name, \"cmake-out\", \"arm_executor_runner\")\n",
+    "assert os.path.exists(elf_path), \"Build failed; no .elf-file found\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Run on simulated model\n",
+    "\n",
+    "We can finally use the `backends/arm/scripts/run_fvp.sh` utility script to run the .elf-file on simulated Arm hardware. This script runs the model with an input of ones, so the expected result of the addition should be close to 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "args = f\"--elf={elf_path} --target={target}\"\n",
+    "subprocess.run(os.path.join(script_dir, \"run_fvp.sh\") + \" \" + args, shell=True, cwd=et_dir)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.15"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}