Provide a brief overview/description of the backend. At a high level, what does it do? Consider linking to top-level vendor documentation for the target hardware family and/or framework (Core ML, XNNPACK, etc.).
## Features
List the high-level features of the backend, such as general operator and hardware support.
## Target Requirements
What hardware and software is required to run the backend on a specific device? For example, does it require specific iOS or Android OS versions? If it's an NPU, what hardware models are supported?
## Development Requirements
What software and hardware is needed to create a .PTE file targeting this backend? Are there any additional dependencies that need to be installed that are not included with the ExecuTorch pip package? How does the user install them?
## Using *Backend Name*
This section describes the steps users need to take in order to generate a .PTE targeting this backend. Include a full code sample for exporting and lowering a model to this backend. Make sure relevant imports for the backend partitioner are included.
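As a rough illustration of the shape such a sample might take (the import path and `ExamplePartitioner` below are placeholders, not a real API — substitute the actual partitioner for your backend):

```python
import torch
from executorch.exir import to_edge_transform_and_lower

# Placeholder import -- replace with the real partitioner for your backend.
from executorch.backends.example.example_partitioner import ExamplePartitioner

# Any eval-mode nn.Module with example inputs.
model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()).eval()
sample_inputs = (torch.randn(1, 16),)

# Export the model, then delegate supported subgraphs to the backend.
exported_program = torch.export.export(model, sample_inputs)
executorch_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[ExamplePartitioner()],
).to_executorch()

# Serialize the program to a .pte file.
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```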
### Partitioner API
What options, if any, does the partitioner take? Are there any other export-time configurations that can be applied? Document each option.
### Quantization
What quantization schemes does this backend support? Consider including the following, as appropriate.
- What operators are supported?
- Number of bits?
- Static vs dynamic activations?
- Weight only vs activations + weights?
- Symmetric vs asymmetric weights?
- Per-tensor, per-channel, group/blockwise?
Include a code snippet demonstrating how to perform quantization for this backend. Document, or link to, a description of the parameters that the user can specify.
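For illustration, a skeleton snippet following the generic PT2E quantization flow is shown below; `ExampleQuantizer` and its import path are placeholders for the backend's real quantizer, and the available configuration parameters are backend-specific:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Placeholder import -- replace with the quantizer shipped with your backend.
from executorch.backends.example.example_quantizer import ExampleQuantizer

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()).eval()
sample_inputs = (torch.randn(1, 16),)

quantizer = ExampleQuantizer()  # Configure backend-specific options here.

# Capture the model, insert observers, calibrate, and convert.
training_gm = torch.export.export_for_training(model, sample_inputs).module()
prepared_model = prepare_pt2e(training_gm, quantizer)
prepared_model(*sample_inputs)  # Calibrate with representative inputs.
quantized_model = convert_pt2e(prepared_model)
```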
## Runtime Integration
This section should tell the user all of the steps they need to take to run a .PTE file on-device targeting the given backend.
- What CMake targets should they link to?
- How is this backend compiled from source?
- Is the backend bundled by default in iOS and/or Android pre-built libraries?
---

`docs/source/backends-xnnpack.md`
# XNNPACK Backend
The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile CPUs. [XNNPACK](https://github.com/google/XNNPACK/tree/master) is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs.
## Features
- Wide operator support on Arm and x86 CPUs, available on any modern mobile phone.
- Support for a wide variety of quantization schemes and quantized operators.
- Supports fp32 and fp16 activations.
- Supports 8-bit quantization.
## Target Requirements
## Development Requirements
The XNNPACK delegate does not introduce any development system requirements beyond those required by the core ExecuTorch runtime.
----
## Using the XNNPACK Backend
To target the XNNPACK backend during the export and lowering process, pass an instance of the `XnnpackPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision.
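A condensed sketch of that flow (assuming torchvision is installed; the partitioner import path and export APIs are the standard ones):

```python
import torch
import torchvision.models as models

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Load MobileNet V2 in eval mode with example inputs.
mobilenet_v2 = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Export the model, then delegate supported subgraphs to XNNPACK.
exported_program = torch.export.export(mobilenet_v2, sample_inputs)
executorch_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

# Serialize the program to a .pte file.
with open("mv2_xnnpack.pte", "wb") as f:
    f.write(executorch_program.buffer)
```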
### Partitioner API

The XNNPACK partitioner API allows for configuration of the model delegation to XNNPACK. Options include:
- `per_op_mode`: If true, emit individual delegate calls for every operator. This is an advanced option intended to reduce memory overhead in some contexts at the cost of a small amount of runtime overhead. Defaults to false.
- `verbose`: If true, print additional information during lowering.
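For example, assuming these options are exposed as keyword arguments on the `XnnpackPartitioner` constructor:

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Emit one delegate call per operator and print extra information during lowering.
partitioner = XnnpackPartitioner(per_op_mode=True, verbose=True)
```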
### Testing the Model
After generating the XNNPACK-delegated .pte, the model can be tested from Python using the ExecuTorch runtime python bindings. This can be used to sanity check the model and evaluate numerical accuracy. See [Testing the Model](using-executorch-export.md#testing-the-model) for more information.
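A minimal sketch of such a test, assuming the `executorch.runtime` Python API and a `mv2_xnnpack.pte` file produced as in the example above:

```python
import torch
from executorch.runtime import Runtime

runtime = Runtime.get()

# Load the delegated program and run its forward method on a sample input.
program = runtime.load_program("mv2_xnnpack.pte")
method = program.load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])
print(outputs)
```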
----
## Quantization
The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK Library.
### Supported Quantization Schemes
The XNNPACK delegate supports the following quantization schemes:
- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
- Supports both static and dynamic activations.
- Supports per-channel and per-tensor schemes.
- Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators.
Weight-only quantization is not currently supported on XNNPACK.
### 8-bit Quantization using the PT2E Flow
To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model:
1) Create an instance of the `XNNPACKQuantizer` class and set the quantization parameters.
2) Use `torch.export.export_for_training` to prepare for quantization.
3) Call `prepare_pt2e` to prepare the model for quantization.
4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
5) Call `convert_pt2e` to quantize the model.
6) Export and lower the model using the standard flow.
The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques.
For static, post-training quantization (PTQ), the post-`prepare_pt2e` model should be run with a representative set of samples, which are used to determine the quantization parameters.

```python
import torch
import torchvision.models as models

from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import get_symmetric_quantization_config

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

quantizer = XNNPACKQuantizer() # (1) Create the quantizer
quantizer.set_global(get_symmetric_quantization_config()) # and set quantization parameters

training_gm = torch.export.export_for_training(model, sample_inputs).module() # (2)
prepared_model = prepare_pt2e(training_gm, quantizer) # (3) Prepare for quantization

for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs
    prepared_model(cal_sample) # (4) Calibrate

quantized_model = convert_pt2e(prepared_model) # (5) Quantize

# (6) Export and lower `quantized_model` using the standard flow shown above.
```

Here, the `XNNPACKQuantizer` is configured for symmetric quantization, indicating that the quantized zero point is set to zero with `qmin = -127` and `qmax = 127`. `get_symmetric_quantization_config()` can be configured with the following arguments:

- `is_per_channel`: quantize weights across channels.
- `is_qat`: use quantization-aware training.
- `is_dynamic`: use dynamic quantization.

The quantizer can also be scoped more narrowly than `set_global`, using `set_object_type` (by module or functional op type) and `set_module_name` (by fully qualified module name).
See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.