Arm backend: Document Ethos-U memory modes and add Ethos-U porting guide #14144
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -72,17 +72,116 @@ with open("mv2_arm_ethos_u55.pte", "wb") as file: | |
| edge_program_manager.write_to_file(file) | ||
| ``` | ||
|
|
||
| ### Ethos-U memory modes | ||
| The Ethos-U NPU provides two distinct memory interfaces: | ||
| - One interface for **low-latency, high-bandwidth memory** | ||
| Typically on-chip memory such as **SRAM**. | ||
| - One interface for **higher-latency, lower-bandwidth memory** | ||
| Typically external (off-chip) memory such as **Flash** or **DRAM**. | ||
|
|
||
| On all Ethos-U NPUs (Ethos-U55, Ethos-U65, Ethos-U85), the low-latency interface is usually the SRAM of the SoC. | ||
| The external memory type depends on the SoC: | ||
| - On a low-power microcontroller, the external memory is usually Flash. | ||
|
Contributor
microcontroller |
||
| - On systems with a Cortex-A processor and a rich operating system, the external memory is typically DRAM. | ||
|
|
||
| When running an inference, the Ethos-U compiler and Ethos-U driver make use of three logical memory regions: | ||
| - Ethos-U scratch buffer - a contiguous block of memory used by the NPU to store the intermediate tensors produced and consumed during inference. | ||
| - Neural Network - a contiguous block of memory holding constant data such as weights, biases, and quantization parameters required to run an inference. | ||
| - Ethos-U fast scratch buffer - a contiguous block of memory, assumed to reside in on-chip memory in order to hide the higher latency/lower bandwidth of external memory. Only applicable to Ethos-U65 and Ethos-U85 on systems | ||
| with Cortex-A, where the external memory is assumed to be DRAM. | ||
|
|
||
| The placement of the scratch buffer and the Neural Network determines the memory mode to be used in the Ethos-U | ||
| compile specification. We support three different placements of the scratch buffer and the Neural Network. | ||
|
|
||
| #### 1. Sram-Only Memory Mode | ||
| - Ethos-U scratch buffer resides in the SRAM. | ||
| - Neural Network resides in the SRAM. | ||
| - Ethos-U fast scratch buffer is not used. | ||
| - Characteristics: | ||
| - Provides the best performance since all the memory traffic passes via the low-latency/high-bandwidth memory. | ||
| - The performance uplift is especially noticeable for workloads that would otherwise be memory-bound on the external interface. | ||
| - Available on Ethos-U55, Ethos-U65 and Ethos-U85. | ||
| - Limitations: | ||
| - Embedded SoCs often have limited SRAM and NNs are becoming larger. This memory mode may be unsuitable for a system running a big model relative to the amount of SRAM available on the SoC. | ||
| Below, you can see a visual representation of the placement of the two logical memory regions for the Sram_Only configuration. | ||
|
|
||
|  | ||
|
|
||
| #### 2. Shared-Sram Memory Mode | ||
| - Ethos-U scratch buffer resides in the SRAM. | ||
| - Neural Network resides in the External memory. | ||
| - Ethos-U fast scratch buffer is not used. | ||
| - Characteristics: | ||
| - Intermediate tensors are stored in the SRAM, leveraging its low latency and high bandwidth. | ||
| - The Ethos-U compiler can prefetch weights from the external memory to the SRAM ahead of time so that when the NPU needs the data, it is already available in the on-chip memory. | ||
| - In this mode, the external interface is read-only and the on-chip memory interface is read/write. | ||
| - Shared-Sram offers a good balance between performance and low SRAM usage. | ||
| - Available on Ethos-U55, Ethos-U65 and Ethos-U85. | ||
| - Limitations: | ||
| - You need to have enough space in the SRAM to hold the peak intermediate tensor. | ||
| Below, you can see a visual representation of the placement of the two logical memory regions for the Shared_Sram configuration. | ||
|
|
||
|  | ||
|
|
||
| #### 3. Dedicated-Sram Memory Mode | ||
| - Ethos-U scratch buffer resides in the External memory. | ||
| - Neural Network resides in the External memory. | ||
| - Ethos-U fast scratch buffer resides in the on-chip memory. | ||
| - Characteristics: | ||
| - Used when the peak intermediate tensor is too big to fit into the on-chip memory. | ||
| - Enables silicon acceleration of large models. | ||
| - The NPU stores the results from the intermediate computations in the external memory. | ||
| - The dedicated SRAM acts as a software-managed cache, improving performance by prefetching frequently accessed tensors to the on-chip memory. | ||
| - Available on Ethos-U65 and Ethos-U85. | ||
| - Limitations: | ||
| - The SRAM space must be dedicated exclusively to the Ethos-U (the host processor should not access it). | ||
|
Contributor
nit: space after Ethos-U |
||
| - Not available on Ethos-U55. | ||
| Below, you can see a visual representation of the placement of the logical memory regions for the Dedicated_Sram configuration. | ||
|
|
||
|  | ||
|
|
||
| Here is a table comparing the three memory modes: | ||
|
|
||
| | Memory Mode | Ethos-U Scratch Buffer Placement | Neural Network Placement | When to Use | Trade-off | | ||
| |--------------------|----------------------------------|----------------------------|------------ |---------------------------------------------------------------------------| | ||
| | **SRAM-Only** | On-chip SRAM | On-chip SRAM | When the ML model, the Ethos-U scratch buffer and the wider software stack fit within the SRAM of the SoC | Limited by SRAM size; often not feasible for larger NNs | | ||
| | **Shared-SRAM** | On-chip SRAM | External memory (Flash/DRAM) | Most common mode on Cortex-M and Ethos-U systems; balances good performance and SRAM usage | Requires enough SRAM to hold the largest intermediate tensor | | ||
| | **Dedicated-SRAM** | External memory | External memory (Flash/DRAM) | Most common mode for Cortex-A and Ethos-U systems. For very large models where the peak intermediates cannot fit in SRAM | Requires high-bandwidth external memory to deliver good performance | | ||
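
The comparison above can be read as a simple decision procedure. The following is a minimal sketch of that reading, not part of ExecuTorch or the Ethos-U compiler: `choose_memory_mode`, `MODE_AVAILABILITY`, and the NPU family keys are hypothetical names introduced for illustration. It assumes the model size and peak scratch size are already known, and it ignores the SRAM needed by the rest of the software stack. In practice the peak scratch size is reported by the Ethos-U compiler and can itself depend on the selected memory mode.

```python
from typing import Optional

# NPUs on which each memory mode is available, as listed in the sections above.
# The family keys ("ethos-u55", ...) are illustrative, not compiler target strings.
MODE_AVAILABILITY = {
    "Sram_Only": {"ethos-u55", "ethos-u65", "ethos-u85"},
    "Shared_Sram": {"ethos-u55", "ethos-u65", "ethos-u85"},
    "Dedicated_Sram": {"ethos-u65", "ethos-u85"},
}


def choose_memory_mode(
    npu: str, sram_kib: float, model_kib: float, peak_scratch_kib: float
) -> Optional[str]:
    """Return the first memory mode whose SRAM requirement fits, or None.

    Hypothetical helper: it only compares sizes and NPU availability.
    """
    npu = npu.lower()
    # Sram_Only: both the model and the scratch buffer live in SRAM.
    if model_kib + peak_scratch_kib <= sram_kib:
        return "Sram_Only"
    # Shared_Sram: only the scratch buffer lives in SRAM.
    if peak_scratch_kib <= sram_kib:
        return "Shared_Sram"
    # Dedicated_Sram: model and scratch buffer both live in external memory.
    if npu in MODE_AVAILABILITY["Dedicated_Sram"]:
        return "Dedicated_Sram"
    # Nothing fits, e.g. an Ethos-U55 whose scratch buffer exceeds the SRAM.
    return None
```

For example, `choose_memory_mode("ethos-u55", 2048, 4096, 1024)` selects `Shared_Sram`, because the model does not fit in SRAM alongside the scratch buffer but the scratch buffer alone does.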
|
|
||
|
|
||
| The memory modes are defined within the [vela.ini file](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/ethosu/config_files/Arm/vela.ini?ref_type=heads). When you install | ||
| ExecuTorch for the Ethos-U backend, you automatically install the compiler containing the vela.ini file so you can directly create a compile specification with these memory modes. | ||
|
|
||
| #### Interpreting the output from the Ethos-U compiler regarding the memory footprint | ||
| As part of the `to_edge_transform_and_lower` step, you will see memory footprint information such as: | ||
|
|
||
| ``` | ||
| Total SRAM used 2467.27 KiB | ||
| Total Off-chip Flash used 12.20 KiB | ||
| ``` | ||
| The `Total SRAM used` value indicates the peak SRAM utilization the NPU needs in order to perform an inference. In the snippet above, the Ethos-U compiler requires 2467.27 KiB of SRAM in order to schedule the inference. | ||
| Therefore, from an application standpoint, you need to ensure you have at least 2467.27 KiB of SRAM on the SoC to run this model. The Ethos-U compiler provides a scheduling algorithm that can | ||
| lower the peak SRAM usage within reasonable limits. To enable it, add the `--optimise Size` or `--arena-cache-size` CLI options to the compile spec. You can read more about the options of the | ||
| Ethos-U compiler in the documentation [here](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md#optimise). If the peak SRAM usage remains too high in | ||
| Shared-SRAM memory mode, you need to use the Dedicated-SRAM mode in order to store the Neural Network and the Ethos-U scratch buffer in the external memory. | ||
|
Contributor
all-caps SRAM? |
||
| The main advantage of the Dedicated_Sram memory mode is that you can run large models and still benefit from the low latency and high bandwidth of the SRAM, which is used as a cache. | ||
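
To check the reported footprint against an SRAM budget automatically, the compiler output can be parsed. This is a minimal sketch that assumes the log lines have the `Total <region> used <number> KiB` shape shown in the snippet above; `parse_footprint` and `fits_in_sram` are hypothetical helper names, not part of ExecuTorch or the Ethos-U compiler.

```python
import re

# Matches lines such as "Total SRAM used    2467.27 KiB" (assumed log format).
_FOOTPRINT_RE = re.compile(
    r"^Total (?P<region>.+?) used\s+(?P<kib>\d+(?:\.\d+)?) KiB\s*$"
)


def parse_footprint(log_text: str) -> dict:
    """Map each reported memory region to its size in KiB."""
    footprint = {}
    for line in log_text.splitlines():
        match = _FOOTPRINT_RE.match(line.strip())
        if match:
            footprint[match.group("region")] = float(match.group("kib"))
    return footprint


def fits_in_sram(log_text: str, sram_budget_kib: float) -> bool:
    """Return True if the reported peak SRAM usage fits in the given budget."""
    return parse_footprint(log_text)["SRAM"] <= sram_budget_kib


example_log = """
Total SRAM used                               2467.27 KiB
Total Off-chip Flash used                       12.20 KiB
"""
```

With the example above, `fits_in_sram(example_log, 2048)` is false, which is the situation where reducing the peak SRAM usage or switching memory mode becomes necessary.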
|
|
||
| It is important to highlight that when you specify a memory mode in the compile spec, the runtime application is expected to place the scratch buffer and the NN in the matching memory locations. | ||
| In other words, when you specify, for example, the Shared-SRAM memory mode, the runtime application logic should place the Ethos-U scratch buffer in the on-chip memory and the NN in the external memory for optimal performance. | ||
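
The coupling between compile-time memory mode and runtime placement can be written down as a lookup table. The sketch below restates the placements from the three memory mode sections; `PLACEMENT`, `check_placement`, and the region and location labels are hypothetical names chosen for illustration. A real application makes these placements in its linker script and runtime code.

```python
# Expected placement of each logical region per memory mode, as described above.
# "sram" means on-chip SRAM, "external" means Flash or DRAM, None means unused.
PLACEMENT = {
    "Sram_Only": {"scratch": "sram", "model": "sram", "fast_scratch": None},
    "Shared_Sram": {"scratch": "sram", "model": "external", "fast_scratch": None},
    "Dedicated_Sram": {
        "scratch": "external",
        "model": "external",
        "fast_scratch": "sram",
    },
}


def check_placement(memory_mode: str, actual: dict) -> list:
    """Return the regions whose actual placement differs from the expected one."""
    expected = PLACEMENT[memory_mode]
    return [
        region
        for region, where in expected.items()
        if actual.get(region) != where
    ]
```

For example, `check_placement("Shared_Sram", {"scratch": "external", "model": "external"})` reports `["scratch"]`, flagging a scratch buffer that was placed in external memory although the model was compiled for Shared_Sram.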
|
Contributor
SRAM, Ethos-U |
||
|
|
||
| You can see how this coupling between the memory mode and the runtime application is done in the [Ethos-U porting guide](../../examples/arm/ethos-u-porting-guide.md). | ||
|
|
||
| ### Partitioner API | ||
|
|
||
| `EthosUPartitioner` tries to partition as much of the model as possible. It will never delegate unsupported operators, but a user can pass additional checks to the constructor to avoid partitioning additional operators. To do this, subclass `OperatorSupportBase` and implement the function `is_node_supported`. A few such checks exist in `executorch.exir.backend.operator_support`: | ||
|
|
||
| - `DontPartition`: Don't partition operators based on operator type. | ||
| - `DontPartitionModule`: Don't partition operators based on which python module the operator comes from. | ||
| - `DontPartitionName`: Don't partition opertors based on the operator name. | ||
| - `DontPartitionName`: Don't partition operators based on the operator name. | ||
|
|
||
| ### Quantization | ||
|
|
||
| A fully integer model is required for using the Arm Ethos-U backend. As discussed above, you can quantize floating point models with the the `EthosUQuantizer`. Quantizers are backend specific, which means the `EthosUQuantizer` is configured to quantize models correctly for the target. | ||
| A fully integer model is required for using the Arm Ethos-U backend. As discussed above, you can quantize floating point models with the `EthosUQuantizer`. Quantizers are backend specific, which means the `EthosUQuantizer` is configured to quantize models correctly for the target. | ||
|
|
||
| ## Runtime Integration | ||
|
|
||
|
|
||
nit: missing space after "NPUs"