Update executorch-arm-delegate-tutorial.md with CS320 FVP

varunchariArm · web-flow · commit ffad82455f91 · 2024-10-29T20:28:22.000-07:00
Differential Revision: D65067462 Pull Request resolved: pytorch#6230
diff --git a/docs/source/executorch-arm-delegate-tutorial.md b/docs/source/executorch-arm-delegate-tutorial.md
@@ -13,7 +13,7 @@
 
 :::{grid-item-card}  What you will learn in this tutorial:
 :class-card: card-prerequisites
-In this tutorial you will learn how to export a simple PyTorch model for ExecuTorch Arm Ethos-u backend delegate and run it on a Corstone-300 FVP Simulator.
+In this tutorial you will learn how to export a simple PyTorch model for the ExecuTorch Arm Ethos-U backend delegate and run it on Corstone FVP simulators.
 :::
 
 ::::
@@ -34,9 +34,9 @@ Let's make sure you have everything you need before we get started.
 
 To successfully complete this tutorial, you will need a Linux-based host machine with Arm aarch64 or x86_64 processor architecture.
 
-The target device will be an embedded platform with an Arm Cortex-M55 CPU and Ethos-U55 NPU (ML processor). This tutorial will show you how to run PyTorch models on both.
+The target device will be an embedded platform with an Arm Cortex-M CPU and an Ethos-U NPU (ML processor). This tutorial will show you how to run PyTorch models on both.
 
-We will be using a [Fixed Virtual Platform (FVP)](https://www.arm.com/products/development-tools/simulation/fixed-virtual-platforms), simulating a [Corstone-300](https://developer.arm.com/Processors/Corstone-300)(cs300) system. Since we will be using the FVP (think of it as virtual hardware), we won't be requiring any real embedded hardware for this tutorial.
+We will be using [Fixed Virtual Platforms (FVPs)](https://www.arm.com/products/development-tools/simulation/fixed-virtual-platforms) that simulate the [Corstone-300](https://developer.arm.com/Processors/Corstone-300) (cs300) and [Corstone-320](https://developer.arm.com/Processors/Corstone-320) (cs320) systems. Since an FVP is effectively virtual hardware, no real embedded hardware is required for this tutorial.
 
 ### Software
 
@@ -64,19 +64,19 @@ uname -m
 
 Next we will walk through the steps performed by the `setup.sh` script to better understand the development setup.
 
-### Download and Set Up the Corstone-300 FVP
+### Download and Set Up the Corstone-300 and Corstone-320 FVPs
 
-Fixed Virtual Platforms (FVPs) are pre-configured, functionally accurate simulations of popular system configurations. Here in this tutorial, we are interested in the Corstone-300 system. We can download this from the Arm website.
+Fixed Virtual Platforms (FVPs) are pre-configured, functionally accurate simulations of popular system configurations. In this tutorial, we are interested in the Corstone-300 and Corstone-320 systems. We can download these from the Arm website.
 
 ```{note}
  By downloading and running the FVP software, you will be agreeing to the FVP [End-user license agreement (EULA)](https://developer.arm.com/downloads/-/arm-ecosystem-fvps/eula).
 ```
 
-To download, we can either download `Corstone-300 Ecosystem FVP` from [here](https://developer.arm.com/downloads/-/arm-ecosystem-fvps). or `setup.sh` script will does that for you under `setup_fvp` function.
+You can either download the `Corstone-300 Ecosystem FVP` and `Corstone-320 Ecosystem FVP` from [here](https://developer.arm.com/downloads/-/arm-ecosystem-fvps), or let the `setup.sh` script do it for you in its `setup_fvp` function.
 
 ### Download and Install the Arm GNU AArch32 Bare-Metal Toolchain
 
-Similar to the FVP, we would also need a tool-chain to cross-compile ExecuTorch runtime, executor-runner bare-metal application, as well as the rest of the bare-metal stack for Cortex-M55 CPU available on the Corstone-300 platform.
+As with the FVPs, we also need a toolchain to cross-compile the ExecuTorch runtime, the executor-runner bare-metal application, and the rest of the bare-metal stack for the Cortex-M55 and Cortex-M85 CPUs available on the Corstone-300 and Corstone-320 platforms, respectively.
 
 These toolchains are available [here](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads). We will be using GCC 12.3 targeting `arm-none-eabi` here for our tutorial. Just like FVP, `setup.sh` script will down the toolchain for you. See `setup_toolchain` function.
 
@@ -103,10 +103,14 @@ At the end of the setup, if everything goes well, your top level devlopement dir
 │   ├── fetch_externals.py
 │   └── [...]
 ├── ethos-u-vela
-├── FVP
+├── FVP-corstone300
 │   ├── FVP_Corstone_SSE-300.sh
 │   └── [...]
+├── FVP-corstone320
+│   ├── FVP_Corstone_SSE-320.sh
+│   └── [...]
 ├── FVP_cs300.tgz
+├── FVP_cs320.tgz
 ├── gcc.tar.xz
 └── reference_model
 ```
@@ -239,8 +243,7 @@ cmake -DCMAKE_BUILD_TYPE=Release \
 -Bcmake-out-aot-lib \
     "${et_root_dir}"
 
-n=$(nproc)
-cmake --build cmake-out-aot-lib -j"$((n - 5))" -- quantized_ops_aot_lib
+cmake --build cmake-out-aot-lib --parallel -- quantized_ops_aot_lib
 ```
 
 After the `quantized_ops_aot_lib` build, we can run the following script to generate the `.pte` file
@@ -257,7 +260,7 @@ At the end of this, we should have three different `.pte` files.
 - The second one contains the [AddModule](#addmodule), with Arm Ethos-U backend delegate enabled.
 - The third one contains the [quantized MV2Model](#mv2module), with the Arm Ethos-U backend delegate enabled as well.
 
-Now let's try to run these `.pte` files on a Corstone-300 platform in a bare-metal environment.
+Now let's try to run these `.pte` files on the Corstone-300 and Corstone-320 platforms in a bare-metal environment.
 
 ## Getting a Bare-Metal Executable
 
@@ -269,9 +272,13 @@ The block diagram below demonstrates, at the high level, how the various build a
 
 ![](./arm-delegate-runtime-build.svg)
 
+```{tip}
+The `generate_pte_file` function in the `run.sh` script produces the `.pte` files for the models passed through the `--model_name` input argument.
+```
+
 ### Generating ExecuTorch Libraries
 
-ExecuTorch's CMake build system produces a set of build pieces which are critical for us to include and run the ExecuTorch runtime with-in the bare-metal environment we have for Corstone-300 from Ethos-U SDK.
+ExecuTorch's CMake build system produces a set of build pieces that we must include in order to run the ExecuTorch runtime within the bare-metal environment that the Ethos-U SDK provides for the Corstone FVPs.
 
 [This](./runtime-build-and-cross-compilation.md) document provides a detailed overview of each individual build piece. For running either variant of the `.pte` file, we will need a core set of libraries. Here is a list,
 
@@ -283,133 +290,106 @@ To run a `.pte` file with the Arm backend delegate call instructions, we will ne
 
 - `libexecutorch_delegate_ethos_u.a`
 
-
-These libraries are generated in `build_executorch` function of the `run.sh` script.
+These libraries are generated in the `build_executorch` and `build_quantization_aot_lib` functions of the `run.sh` script.
 
 In this function, `EXECUTORCH_SELECT_OPS_LIST` will decide the number of portable operators included in the build and are available at runtime. It must match with `.pte` file's requirements, otherwise you will get `Missing Operator` error at runtime.
 
 For example, there  in the command line above, to run SoftmaxModule, we only included the softmax CPU operator. Similarly, to run AddModule in a non-delegated manner you will need add op and so on. As you might have already realized, for the delegated operators, which will be executed by the Arm backend delegate, we do not need to include those operators in this list. This is only for *non-delegated* operators.
 
+```{tip}
+The `run.sh` script accepts a `--portable_kernels` option, which takes a comma-separated list of portable kernels to include in the build.
+```
+
 ### Building the executor_runner Bare-Metal Application
 
 The SDK dir is the same one prepared [earlier](#setup-the-arm-ethos-u-software-development). And, we will be passing the `.pte` file (any one of them) generated above.
 
-Note, you have to generate a new `executor-runner` binary if you want to change the model or the `.pte` file. This constraint is from the constrained bare-metal runtime environment we have for Corstone-300 platform.
+Note that you have to generate a new `executor-runner` binary whenever you change the model or the `.pte` file. This limitation comes from the constrained bare-metal runtime environment we have for the Corstone-300 and Corstone-320 platforms.
 
 This is performed by the `build_executorch_runner` function in `run.sh`.
 
-## Running on Corstone-300 FVP Platform
+```{tip}
+The `run.sh` script accepts a `--target` option, which selects a specific target: Corstone-300 (`ethos-u55-128`) or Corstone-320 (`ethos-u85-128`).
+```
+
+## Running on Corstone FVP Platforms
 
-Once the elf is prepared, regardless of the `.pte` file variant is used to generate the bare metal elf, you can run in with following command,
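+Once the ELF is prepared, it can be run on the FVP in the same way regardless of which `.pte` file variant was used to generate it. The command below runs the [MV2Model](#mv2module) on the Corstone-320 FVP. It expects `num_macs` to be set beforehand to the MAC count of the chosen target (for example, 128 for `ethos-u85-128`).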
 
 ```bash
 ethos_u_build_dir=examples/arm/executor_runner/
 
 elf=$(find ${ethos_u_build_dir} -name "arm_executor_runner")
 
-FVP_Corstone_SSE-300_Ethos-U55                          \
-    -C ethosu.num_macs=128                              \
-    -C mps3_board.visualisation.disable-visualisation=1 \
-    -C mps3_board.telnetterminal0.start_telnet=0        \
-    -C mps3_board.uart0.out_file='-'                    \
+FVP_Corstone_SSE-320_Ethos-U85                          \
+    -C mps4_board.subsystem.cpu0.CFGITCMSZ=11           \
+    -C mps4_board.subsystem.ethosu.num_macs=${num_macs} \
+    -C mps4_board.visualisation.disable-visualisation=1 \
+    -C vis_hdlcd.disable_visualisation=1                \
+    -C mps4_board.telnetterminal0.start_telnet=0        \
+    -C mps4_board.uart0.out_file='-'                    \
+    -C mps4_board.uart0.shutdown_on_eot=1               \
     -a "${elf}"                                         \
-    --timelimit 10 # seconds - after which sim will kill itself
+    --timelimit 120 || true # seconds, after which the simulator terminates itself
 ```
 
 If successful, the simulator should produce something like the following on the shell,
 
 ```console
-    Ethos-U rev 136b7d75 --- Apr 12 2023 13:44:01
-    (C) COPYRIGHT 2019-2023 Arm Limited
-    ALL RIGHTS RESERVED
-
-I executorch:runner.cpp:64] Model PTE file loaded. Size: 960 bytes.
-I executorch:runner.cpp:70] Model buffer loaded, has 1 methods
-I executorch:runner.cpp:78] Running method forward
-I executorch:runner.cpp:95] Setting up planned buffer 0, size 32.
-I executorch:runner.cpp:110] Method loaded.
-I executorch:runner.cpp:112] Preparing inputs...
-I executorch:runner.cpp:114] Input prepared.
-I executorch:runner.cpp:116] Starting the model execution...
-I executorch:runner.cpp:121] Model executed successfully.
-I executorch:runner.cpp:125] 1 outputs:
-Output[0][0]: 0.500000
-Output[0][1]: 0.500000
-Output[0][2]: 0.500000
-Output[0][3]: 0.500000
-Application exit code: 0.
-
-EXITTHESIM
-
-Info: Simulation is stopping. Reason: CPU time has been exceeded.
-```
-
-Here in this example, we ran the `executor_runner` binary with the `softmax.pte` file generated for the [SoftmaxModule](#softmaxmodule), we do see the expected results generated from the baremetal binary running on the Corstone-300 virtual hardware on FVP simulator.
-
-If you rerun the same FVP command with the delegated `.pte` file for the [AddModule](#addmodule), i.e. `add_arm_delegate.pte` - you may get something like following, again the expected results. Pay attention to the messages printed with prefix `ArmBackend::`, they indicate that the backend was sucecssfully initialized and the `add` operator from our AddModule in the `.pte` was exexuted on the Ethos-U55 NPU.
-
-```console
-    Ethos-U rev 136b7d75 --- Apr 12 2023 13:44:01
-    (C) COPYRIGHT 2019-2023 Arm Limited
-    ALL RIGHTS RESERVED
-
-I executorch:runner.cpp:64] Model PTE file loaded. Size: 2208 bytes.
-I executorch:runner.cpp:70] Model buffer loaded, has 1 methods
-I executorch:runner.cpp:78] Running method forward
-I executorch:runner.cpp:95] Setting up planned buffer 0, size 64.
-I executorch:ArmBackendEthosU.cpp:51] ArmBackend::init 0x11000050
-I executorch:runner.cpp:110] Method loaded.
-I executorch:runner.cpp:112] Preparing inputs...
-I executorch:runner.cpp:114] Input prepared.
-I executorch:runner.cpp:116] Starting the model execution...
-I executorch:ArmBackendEthosU.cpp:103] ArmBackend::execute 0x11000050
-I executorch:runner.cpp:121] Model executed successfully.
-I executorch:runner.cpp:125] 1 outputs:
-Output[0][0]: 2
-Output[0][1]: 2
-Output[0][2]: 2
-Output[0][3]: 2
-Output[0][4]: 2
-Application exit code: 0.
-
-EXITTHESIM
-
-Info: Simulation is stopping. Reason: CPU time has been exceeded.
-```
-
-Similarily we can get the following output for running the [MV2Model](#mv2module)
-
-```
-    Ethos-U rev 136b7d75 --- Apr 12 2023 13:44:01
-    (C) COPYRIGHT 2019-2023 Arm Limited
-    ALL RIGHTS RESERVED
-
-I executorch:arm_executor_runner.cpp:60] Model in 0x70000000 $
-I executorch:arm_executor_runner.cpp:66] Model PTE file loaded. Size: 4556832 bytes.
-I executorch:arm_executor_runner.cpp:77] Model buffer loaded, has 1 methods
-I executorch:arm_executor_runner.cpp:85] Running method forward
-I executorch:arm_executor_runner.cpp:109] Setting up planned buffer 0, size 752640.
-I executorch:ArmBackendEthosU.cpp:49] ArmBackend::init 0x70000060
-I executorch:arm_executor_runner.cpp:130] Method loaded.
-I executorch:arm_executor_runner.cpp:132] Preparing inputs...
-I executorch:arm_executor_runner.cpp:141] Input prepared.
-I executorch:arm_executor_runner.cpp:143] Starting the model execution...
-I executorch:ArmBackendEthosU.cpp:87] ArmBackend::execute 0x70000060
-I executorch:ArmBackendEthosU.cpp:234] Tensor input 0 will be permuted
+I [executorch:arm_executor_runner.cpp:364] Model in 0x70000000 $
+I [executorch:arm_executor_runner.cpp:366] Model PTE file loaded. Size: 4425968 bytes.
+I [executorch:arm_executor_runner.cpp:376] Model buffer loaded, has 1 methods
+I [executorch:arm_executor_runner.cpp:384] Running method forward
+I [executorch:arm_executor_runner.cpp:395] Setup Method allocator pool. Size: 62914560 bytes.
+I [executorch:arm_executor_runner.cpp:412] Setting up planned buffer 0, size 752640.
+I [executorch:ArmBackendEthosU.cpp:79] ArmBackend::init 0x70000070
+I [executorch:arm_executor_runner.cpp:445] Method loaded.
+I [executorch:arm_executor_runner.cpp:447] Preparing inputs...
+I [executorch:arm_executor_runner.cpp:461] Input prepared.
+I [executorch:arm_executor_runner.cpp:463] Starting the model execution...
+I [executorch:ArmBackendEthosU.cpp:118] ArmBackend::execute 0x70000070
+I [executorch:ArmBackendEthosU.cpp:298] Tensor input/output 0 will be permuted
+I [executorch:arm_perf_monitor.cpp:120] NPU Inferences : 1
+I [executorch:arm_perf_monitor.cpp:121] Profiler report, CPU cycles per operator:
+I [executorch:arm_perf_monitor.cpp:125] ethos-u : cycle_cnt : 1498202 cycles
+I [executorch:arm_perf_monitor.cpp:132] Operator(s) total: 1498202 CPU cycles
+I [executorch:arm_perf_monitor.cpp:138] Inference runtime: 6925114 CPU cycles total
+I [executorch:arm_perf_monitor.cpp:140] NOTE: CPU cycle values and ratio calculations require FPGA and identical CPU/NPU frequency
+I [executorch:arm_perf_monitor.cpp:149] Inference CPU ratio: 99.99 %
+I [executorch:arm_perf_monitor.cpp:153] Inference NPU ratio: 0.01 %
+I [executorch:arm_perf_monitor.cpp:162] cpu_wait_for_npu_cntr : 729 CPU cycles
+I [executorch:arm_perf_monitor.cpp:167] Ethos-U PMU report:
+I [executorch:arm_perf_monitor.cpp:168] ethosu_pmu_cycle_cntr : 5920305
+I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr0 : 359921
+I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr1 : 0
+I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr2 : 0
+I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr3 : 503
+I [executorch:arm_perf_monitor.cpp:178] Ethos-U PMU Events:[ETHOSU_PMU_EXT0_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_EXT1_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_EXT0_WR_DATA_BEAT_WRITTEN, ETHOSU_PMU_NPU_IDLE]
+I [executorch:arm_executor_runner.cpp:470] model_pte_loaded_size:     4425968 bytes.
+I [executorch:arm_executor_runner.cpp:484] method_allocator_used:     1355722 / 62914560  free: 61558838 ( used: 2 % ) 
+I [executorch:arm_executor_runner.cpp:491] method_allocator_planned:  752640 bytes
+I [executorch:arm_executor_runner.cpp:493] method_allocator_loaded:   966 bytes
+I [executorch:arm_executor_runner.cpp:494] method_allocator_input:    602116 bytes
+I [executorch:arm_executor_runner.cpp:495] method_allocator_executor: 0 bytes
+I [executorch:arm_executor_runner.cpp:498] temp_allocator_used:       0 / 1048576 free: 1048576 ( used: 0 % ) 
 I executorch:arm_executor_runner.cpp:152] Model executed successfully.
 I executorch:arm_executor_runner.cpp:156] 1 outputs:
-Output[0][0]: -0.639322
-Output[0][1]: 0.169232
-Output[0][2]: -0.451286
+Output[0][0]: -0.749744
+Output[0][1]: -0.019224
+Output[0][2]: 0.134570
 ...(Skipped)
-Output[0][996]: 0.150429
-Output[0][997]: -0.488894
-Output[0][998]: 0.037607
-Output[0][999]: 1.203430
+Output[0][996]: -0.230691
+Output[0][997]: -0.634399
+Output[0][998]: -0.115345
+Output[0][999]: 1.576386
 I executorch:arm_executor_runner.cpp:177] Program complete, exiting.
 I executorch:arm_executor_runner.cpp:179]
 ```
 
+```{note}
+The `run.sh` script provides options to select a particular FVP target, choose the models to build, and select portable kernels. Run it with the `--help` argument to see them all.
+```
+
 ## Takeaways
 Through this tutorial we've learnt how to use the ExecuTorch software to both export a standard model from PyTorch and to run it on the compact and fully functioned ExecuTorch runtime, enabling a smooth path for offloading models from PyTorch to Arm based platforms.
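
The updated tutorial pairs each `--target` value with a different FVP binary: `FVP_Corstone_SSE-300_Ethos-U55` in the removed command and `FVP_Corstone_SSE-320_Ethos-U85` in the added one. A minimal sketch of that mapping follows. The helper names `fvp_for_target` and `macs_for_target` are hypothetical and are not taken from `run.sh`, and reading the trailing number of the target name as the MAC count is an assumption based on the `ethos-u55-128` and `ethos-u85-128` examples above.

```bash
# Hypothetical helpers, not part of run.sh: they only illustrate how a
# --target value could select the FVP binary and the num_macs setting.

# Map a target name to the FVP binary named in this tutorial.
fvp_for_target() {
    case "$1" in
        ethos-u55-*) echo "FVP_Corstone_SSE-300_Ethos-U55" ;;
        ethos-u85-*) echo "FVP_Corstone_SSE-320_Ethos-U85" ;;
        *) echo "unsupported target: $1" >&2; return 1 ;;
    esac
}

# Assumption: the text after the last '-' in the target name is the MAC count.
macs_for_target() {
    echo "${1##*-}"
}

target="ethos-u85-128"
fvp="$(fvp_for_target "${target}")"
num_macs="$(macs_for_target "${target}")"
echo "${fvp} num_macs=${num_macs}"
```

With `num_macs` derived this way, the Corstone-320 command in the tutorial no longer depends on a variable that the snippet itself never sets.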