
Commit df0e7a0

committed: up
1 parent 5a4c839 commit df0e7a0

File tree: 5 files changed (+86, -48 lines changed)

docs/source/backend-template.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -32,6 +32,8 @@ What quantization schemes does this backend support? Consider including the foll
 - Symmetric vs asymmetric weights?
 - Per-tensor, per-channel, group/blockwise?
 
+If using a PT2E quantizer, document how to initialize the quantizer and all relevant configs and options.
+
 Include a code snippet demonstrating how to perform quantization for this backend. Document, or link to, a description of the parameters that the user can specify.
 
 ## Runtime Integration
```
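For a concrete illustration of the kind of snippet this template asks for, here is a minimal sketch using the XNNPACK backend's PT2E quantizer. The import paths (`executorch.backends.xnnpack.quantizer.xnnpack_quantizer`), the `get_symmetric_quantization_config` options, and the toy model and calibration data are assumptions for illustration, not requirements of the template.

```python
import torch
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Toy model and data, standing in for the user's real model and calibration set.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
sample_inputs = (torch.randn(1, 8),)
calibration_data = [torch.randn(1, 8) for _ in range(4)]

# Initialize the backend-specific quantizer and choose a config
# (here: 8-bit symmetric weights, per-channel, static activations).
quantizer = XNNPACKQuantizer()
quantizer.set_global(
    get_symmetric_quantization_config(is_per_channel=True, is_dynamic=False)
)

# Export, prepare, calibrate, and convert.
exported = torch.export.export(model, sample_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
for sample in calibration_data:
    prepared(sample)
quantized = convert_pt2e(prepared)
```

The exact quantizer class, config helpers, and supported options are precisely what the backend page should document or link to.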

docs/source/backends-coreml.md

Lines changed: 3 additions & 2 deletions

```diff
@@ -170,7 +170,8 @@ quantized_model = convert_pt2e(prepared_model)
 
 Note that static quantization requires exporting the model for iOS17 or later.
 
-See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.
+
 
 ----
 
```
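Since static quantization requires an iOS17 or later deployment target, a lowering sketch along the following lines may help connect the note above to the export flow. It assumes the `CoreMLPartitioner` and `CoreMLBackend.generate_compile_specs` APIs from the ExecuTorch CoreML backend plus coremltools' `ct.target.iOS17`; the import paths are assumptions, and `quantized_model` and `sample_inputs` are placeholders for the quantization example above.

```python
import coremltools as ct
import torch

from executorch.backends.apple.coreml.compiler import CoreMLBackend
from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.exir import to_edge_transform_and_lower

# Target iOS17 or later so that statically quantized ops are available.
compile_specs = CoreMLBackend.generate_compile_specs(
    minimum_deployment_target=ct.target.iOS17,
)
partitioner = CoreMLPartitioner(compile_specs=compile_specs)

# quantized_model and sample_inputs come from the convert_pt2e example above.
et_program = to_edge_transform_and_lower(
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[partitioner],
).to_executorch()
```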

```diff
@@ -220,7 +221,7 @@ This happens because the model is in FP16, but CoreML interprets some of the arg
 2. coremltools/converters/mil/backend/mil/load.py", line 499, in export
    raise RuntimeError("BlobWriter not loaded")
 
-If you're using Python 3.13, try reducing your python version to Python 3.12. coremltools does not support Python 3.13, see this [issue](https://github.com/apple/coremltools/issues/2487).
+If you're using Python 3.13, try downgrading to Python 3.12; coremltools does not support Python 3.13. See this [issue](https://github.com/apple/coremltools/issues/2487).
 
 ### At runtime
 1. [ETCoreMLModelCompiler.mm:55] [Core ML] Failed to compile model, error = Error Domain=com.apple.mlassetio Code=1 "Failed to parse the model specification. Error: Unable to parse ML Program: at unknown location: Unknown opset 'CoreML7'." UserInfo={NSLocalizedDescription=Failed to par$
```

docs/source/backends-xnnpack.md

Lines changed: 37 additions & 1 deletion

````diff
@@ -117,7 +117,43 @@ et_program = to_edge_transform_and_lower( # (6)
 ).to_executorch()
 ```
 
-See [PyTorch 2 Export Post Training Quantization](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html) for more information.
+See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.
+
+### LLM quantization with quantize_
+
+The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, which require more advanced quantization schemes. Because quantize_ is not backend-aware, it is important to use a config that is compatible with CPU/XNNPACK:
+
+* Quantize embeddings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity)
+* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity)
+
+Below is a simple example; a more detailed tutorial, including accuracy evaluation on popular LLM benchmarks, can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch).
+
+```python
+import torch
+
+from torchao.quantization.granularity import PerGroup, PerAxis
+from torchao.quantization.quant_api import (
+    IntxWeightOnlyConfig,
+    Int8DynamicActivationIntxWeightConfig,
+    quantize_,
+)
+
+# Quantize embeddings with 8-bit weights, per channel
+embedding_config = IntxWeightOnlyConfig(
+    weight_dtype=torch.int8,
+    granularity=PerAxis(0),
+)
+quantize_(
+    eager_model,
+    embedding_config,
+    lambda m, fqn: isinstance(m, torch.nn.Embedding),
+)
+
+# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
+linear_config = Int8DynamicActivationIntxWeightConfig(
+    weight_dtype=torch.int4,
+    weight_granularity=PerGroup(32),
+)
+quantize_(eager_model, linear_config)
+```
 
 ----
 
````
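Once quantize_ has rewritten the eager model in place, export and lowering proceed as for any other model. Below is a minimal sketch, assuming the XnnpackPartitioner import path from the ExecuTorch XNNPACK backend and the to_edge_transform_and_lower API used elsewhere on this page; example_inputs is a placeholder for the model's real example inputs.

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Placeholder inputs; a real LLM would use its own example token ids.
example_inputs = (torch.randint(0, 128, (1, 16)),)

# eager_model was quantized in place by the quantize_ calls above.
exported = torch.export.export(eager_model, example_inputs)
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```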

docs/source/index.md

Lines changed: 1 addition & 2 deletions

```diff
@@ -39,6 +39,7 @@ ExecuTorch provides support for:
 - [Runtime Integration](using-executorch-runtime-integration)
 - [Troubleshooting](using-executorch-troubleshooting)
 - [Building from Source](using-executorch-building-from-source)
+- [Quantization](quantization-overview)
 - [FAQs](using-executorch-faqs)
 #### Examples
 - [Android Demo Apps](https://github.com/pytorch-labs/executorch-examples/tree/main/dl3/android/DeepLabV3Demo#executorch-android-demo-app)
@@ -80,8 +81,6 @@ ExecuTorch provides support for:
 - [Runtime Python API Reference](runtime-python-api-reference)
 - [API Life Cycle](api-life-cycle)
 - [Javadoc](https://pytorch.org/executorch/main/javadoc/)
-#### Quantization
-- [Overview](quantization-overview)
 #### Kernel Library
 - [Overview](kernel-library-overview)
 - [Custom ATen Kernel](kernel-library-custom-aten-kernel)
```
Lines changed: 43 additions & 43 deletions

````diff
@@ -1,72 +1,72 @@
-The current quantization overview page is a bit sparse: https://pytorch.org/executorch/main/quantization-overview.html. I'd like to update it as follows:
-
-Move under Usage/ since it's the only page under Quantization/ currently.
-Split out information intended for backend authors (info about writing a quantizer, for example). Focus on user-facing APIs.
-Document backend-invariant quantization flows (embeddings, ao kernels, etc.). Include info (and example) on composable quantizer.
-Document PT2E and quantize_ flows.
-Cover the general, high level approach to quantizing different types of models.
-CV models
-Transformers / language models
-Talk briefly about options for evaluating quantized model accuracy (running in eager mode vs pybindings vs on-device, for example)
------
-
-# Quantizing ExecuTorch Models
-
-ExecuTorch uses [torchao](https://github.com/pytorch/ao) for quantization. In general, ExecuTorch quantization is backend specific, and we allow each backned to define exactly how model quantization is done based on the capability of the underlying hardware.
+# Quantization Overview
 
+Quantization is a technique that reduces the precision of numbers used in a model's computations and stored weights, typically from 32-bit floats to 8-bit integers. This reduces the model's memory footprint, speeds up inference, and lowers power consumption, often with minimal loss in accuracy.
 
-Each backend defines its own PT2E quantizers.
+Quantization is especially important for deploying models on edge devices such as wearables, embedded systems, and microcontrollers, which often have limited compute, memory, and battery capacity. By quantizing models, we can make them significantly more efficient and suitable for these resource-constrained environments.
 
-PT2E quantization happens after model export, but before lowering to a backend.
 
+# Quantization in ExecuTorch
+ExecuTorch uses [torchao](https://github.com/pytorch/ao/tree/main/torchao) as its quantization library. This integration allows ExecuTorch to leverage PyTorch-native tools for preparing, calibrating, and converting quantized models.
 
-* [XNNPACK quantization example](backends-xnnpack.md#quantization)
-* [CoreML quantization example](backends-coreml.md#quantization)
 
+Quantization in ExecuTorch is backend-specific. Each backend defines how models should be quantized based on its hardware capabilities. Most ExecuTorch backends use the torchao [PT2E quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) flow, which works on models exported with torch.export and enables quantization that is tailored for each backend.
 
-```
+The PT2E quantization workflow has three main steps:
 
-```
+1. Create a backend-specific quantizer.
+2. Prepare, calibrate, convert, and evaluate the quantized model in PyTorch.
+3. Lower the model to the target backend.
 
+## 1. Create a Backend-Specific Quantizer
 
+Each backend provides its own quantizer (e.g., XNNPACKQuantizer, CoreMLQuantizer) that defines how quantization should be applied to a model in a way that is compatible with the target hardware.
+These quantizers usually support configs that allow users to specify quantization options such as:
 
+* Precision (e.g., 8-bit or 4-bit)
+* Quantization type (e.g., dynamic, static, or weight-only quantization)
+* Granularity (e.g., per-tensor, per-channel)
 
+Not all quantization options are supported by all backends. Consult the backend-specific guides for supported quantization modes and configurations, and for how to initialize the backend-specific PT2E quantizer:
 
-# Quantization Overview
-Quantization is a process that reduces the precision of computations and lowers memory footprint in the model. To learn more, please visit the [ExecuTorch concepts page](concepts.md#quantization). This is particularly useful for edge devices including wearables, embedded devices and microcontrollers, which typically have limited resources such as processing power, memory, and battery life. By using quantization, we can make our models more efficient and enable them to run effectively on these devices.
+* [XNNPACK quantization](backends-xnnpack.md#quantization)
+* [CoreML quantization](backends-coreml.md#quantization)
 
-In terms of flow, quantization happens early in the ExecuTorch stack:
 
-![ExecuTorch Entry Points](_static/img/executorch-entry-points.png)
 
-A more detailed workflow can be found in the [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial).
+## 2. Quantize and evaluate the model
 
-Quantization is usually tied to execution backends that have quantized operators implemented. Thus each backend is opinionated about how the model should be quantized, expressed in a backend specific ``Quantizer`` class. ``Quantizer`` provides API for modeling users in terms of how they want their model to be quantized and also passes on the user intention to quantization workflow.
+After the backend-specific quantizer is defined, the PT2E quantization flow is the same for all backends. A generic example is provided below; backend-specific examples are given in the backend documentation:
 
-Backend developers will need to implement their own ``Quantizer`` to express how different operators or operator patterns are quantized in their backend. This is accomplished via [Annotation API](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html) provided by quantization workflow. Since ``Quantizer`` is also user facing, it will expose specific APIs for modeling users to configure how they want the model to be quantized. Each backend should provide their own API documentation for their ``Quantizer``.
+```python
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
 
-Modeling users will use the ``Quantizer`` specific to their target backend to quantize their model, e.g. ``XNNPACKQuantizer``.
+training_gm = torch.export.export(model, sample_inputs).module()
 
-For an example quantization flow with ``XNNPACKQuantizer``, more documentation and tutorials, please see ``Performing Quantization`` section in [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial).
+# Prepare the model for quantization using the backend-specific quantizer instance
+prepared_model = prepare_pt2e(training_gm, quantizer)
 
-## Source Quantization: Int8DynActInt4WeightQuantizer
 
-In addition to export based quantization (described above), ExecuTorch wants to highlight source based quantizations, accomplished via [torchao](https://github.com/pytorch/ao). Unlike export based quantization, source based quantization directly modifies the model prior to export. One specific example is `Int8DynActInt4WeightQuantizer`.
+# Calibrate the model on representative data
+for sample in calibration_data:
+    prepared_model(sample)
 
-This scheme represents 4-bit weight quantization with 8-bit dynamic quantization of activation during inference.
+# Convert the calibrated model to a quantized model
+quantized_model = convert_pt2e(prepared_model)
+```
 
-Imported with ``from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer``, this class uses a quantization instance constructed with a specified dtype precision and groupsize, to mutate a provided ``nn.Module``.
+The quantized_model is a PyTorch model like any other and can be evaluated on different tasks for accuracy.
+Task-specific benchmarks are the recommended way to evaluate a quantized model, but as a crude alternative you can compare its outputs with those of the original model using a generic error metric like SQNR:
 
+```python
+from torchao.quantization.utils import compute_error
+out_reference = model(sample)
+out_quantized = quantized_model(sample)
+sqnr = compute_error(out_reference, out_quantized)  # SQNR error
 ```
-# Source Quant
-from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer
 
-model = Int8DynActInt4WeightQuantizer(precision=torch_dtype, groupsize=group_size).quantize(model)
+Note that on-device numerics can differ from those in PyTorch even for unquantized models, so accuracy evaluation can also be done with pybindings or on device.
 
-# Export to ExecuTorch
-from executorch.exir import to_edge
-from torch.export import export
 
-exported_model = export(model, ...)
-et_program = to_edge(exported_model, ...).to_executorch(...)
-```
+## 3. Lower the model
+
+The final step is to lower the quantized_model to the desired backend, just as you would an unquantized one. See the backend-specific pages for lowering information.
````
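To tie the evaluation note and step 3 together, here is a hedged sketch of lowering the converted model and sanity-checking it through the Python pybindings. In it, backend_partitioner is a placeholder for the partitioner of whichever backend you are targeting, and the pybindings import path (executorch.extension.pybindings.portable_lib) is an assumption that can vary by build; quantized_model and sample_inputs come from step 2.

```python
import torch

from executorch.exir import to_edge_transform_and_lower
from executorch.extension.pybindings.portable_lib import _load_for_executorch_from_buffer

# quantized_model and sample_inputs come from step 2; backend_partitioner stands in
# for the target backend's partitioner (e.g., XNNPACK or CoreML), per its docs page.
exported = torch.export.export(quantized_model, sample_inputs)
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[backend_partitioner],
).to_executorch()

# Run the lowered program through pybindings and compare against eager PyTorch.
et_module = _load_for_executorch_from_buffer(et_program.buffer)
et_out = et_module.forward(list(sample_inputs))
ref_out = quantized_model(*sample_inputs)
```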
