|
1 | | -The current quantization overview page is a bit sparse: https://pytorch.org/executorch/main/quantization-overview.html. I'd like to update it as follows: |
2 | | - |
3 | | -Move under Usage/ since it's the only page under Quantization/ currently. |
4 | | -Split out information intended for backend authors (info about writing a quantizer, for example). Focus on user-facing APIs. |
5 | | -Document backend-invariant quantization flows (embeddings, ao kernels, etc.). Include info (and example) on composable quantizer. |
6 | | -Document PT2E and quantize_ flows. |
7 | | -Cover the general, high level approach to quantizing different types of models. |
8 | | -CV models |
9 | | -Transformers / language models |
10 | | -Talk briefly about options for evaluating quantized model accuracy (running in eager mode vs pybindings vs on-device, for example) |
11 | | ------ |
12 | | - |
13 | | -# Quantizing ExecuTorch Models |
14 | | - |
15 | | -ExecuTorch uses [torchao](https://github.com/pytorch/ao) for quantization. In general, ExecuTorch quantization is backend specific, and we allow each backned to define exactly how model quantization is done based on the capability of the underlying hardware. |
| 1 | +# Quantization Overview |
16 | 2 |
|
| 3 | +Quantization is a technique that reduces the precision of numbers used in a model’s computations and stored weights—typically from 32-bit floats to 8-bit integers. This reduces the model’s memory footprint, speeds up inference, and lowers power consumption, often with minimal loss in accuracy. |
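| | +
| | +For intuition, here is a minimal sketch of one common scheme, affine (scale and zero-point) quantization; the scale and zero-point values are illustrative only:
| | +
| | +```python
| | +import torch
| | +
| | +x = torch.tensor([0.1, -0.4, 1.3])       # original fp32 values
| | +scale, zero_point = 0.01, 0              # illustrative quantization parameters
| | +q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)
| | +x_dequant = (q.to(torch.float32) - zero_point) * scale  # approximate reconstruction
| | +```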
17 | 4 |
|
18 | | -Each backend defines its own PT2E quantizers. |
| 5 | +Quantization is especially important for deploying models on edge devices such as wearables, embedded systems, and microcontrollers, which often have limited compute, memory, and battery capacity. By quantizing models, we can make them significantly more efficient and suitable for these resource-constrained environments. |
19 | 6 |
|
20 | | -PT2E quantization happens after model export, but before lowering to a backend. |
21 | 7 |
|
| 8 | +# Quantization in ExecuTorch |
| 9 | +ExecuTorch uses [torchao](https://github.com/pytorch/ao/tree/main/torchao) as its quantization library. This integration allows ExecuTorch to leverage PyTorch-native tools for preparing, calibrating, and converting quantized models. |
22 | 10 |
|
23 | | -* [XNNPACK quantization example](backends-xnnpack.md#quantization) |
24 | | -* [CoreML quantization example](backends-coreml.md#quantization) |
25 | 11 |
|
| 12 | +Quantization in ExecuTorch is backend-specific. Each backend defines how models should be quantized based on its hardware capabilities. Most ExecuTorch backends use the torchao [PT2E quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) flow, which works on models exported with torch.export and enables quantization that is tailored for each backend. |
26 | 13 |
|
27 | | -``` |
| 14 | +The PT2E quantization workflow has three main steps: |
28 | 15 |
|
29 | | -``` |
| 16 | +1. Create a backend-specific quantizer. |
| 17 | +2. Prepare, calibrate, convert, and evaluate the quantized model in PyTorch.
| 18 | +3. Lower the model to the target backend.
30 | 19 |
|
| 20 | +## 1. Create a Backend-Specific Quantizer |
31 | 21 |
|
| 22 | +Each backend provides its own quantizer (e.g., XNNPACKQuantizer, CoreMLQuantizer) that defines how quantization should be applied to a model in a way that is compatible with the target hardware. |
| 23 | +These quantizers usually support configs that allow users to specify quantization options such as: |
32 | 24 |
|
| 25 | +* Precision (e.g., 8-bit or 4-bit) |
| 26 | +* Quantization type (e.g., dynamic, static, or weight-only quantization) |
| 27 | +* Granularity (e.g., per-tensor, per-channel) |
33 | 28 |
|
| 29 | +Not all quantization options are supported by all backends. Consult the backend-specific guides for supported quantization modes and configuration options, and for how to initialize the backend-specific PT2E quantizer:
34 | 30 |
|
35 | | -# Quantization Overview |
36 | | -Quantization is a process that reduces the precision of computations and lowers memory footprint in the model. To learn more, please visit the [ExecuTorch concepts page](concepts.md#quantization). This is particularly useful for edge devices including wearables, embedded devices and microcontrollers, which typically have limited resources such as processing power, memory, and battery life. By using quantization, we can make our models more efficient and enable them to run effectively on these devices. |
| 31 | +* [XNNPACK quantization](backends-xnnpack.md#quantization) |
| 32 | +* [CoreML quantization](backends-coreml.md#quantization) |
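| | +
| | +As an illustration, here is a minimal sketch of constructing an XNNPACK quantizer with a static, per-channel 8-bit configuration. The import path and helper names are assumptions based on the XNNPACK backend at the time of writing; consult its guide for the exact API.
| | +
| | +```python
| | +from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
| | +    XNNPACKQuantizer,
| | +    get_symmetric_quantization_config,
| | +)
| | +
| | +# Static quantization (activations calibrated), 8-bit, per-channel weights
| | +quantizer = XNNPACKQuantizer()
| | +quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True, is_dynamic=False))
| | +```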
37 | 33 |
|
38 | | -In terms of flow, quantization happens early in the ExecuTorch stack: |
39 | 34 |
|
40 | | - |
41 | 35 |
|
42 | | -A more detailed workflow can be found in the [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial). |
| 36 | +## 2. Quantize and Evaluate the Model
43 | 37 |
|
44 | | -Quantization is usually tied to execution backends that have quantized operators implemented. Thus each backend is opinionated about how the model should be quantized, expressed in a backend specific ``Quantizer`` class. ``Quantizer`` provides API for modeling users in terms of how they want their model to be quantized and also passes on the user intention to quantization workflow. |
| 38 | +After the backend-specific quantizer is defined, the PT2E quantization flow is the same for all backends. A generic example is provided below; backend-specific examples can be found in the backend documentation:
45 | 39 |
|
46 | | -Backend developers will need to implement their own ``Quantizer`` to express how different operators or operator patterns are quantized in their backend. This is accomplished via [Annotation API](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html) provided by quantization workflow. Since ``Quantizer`` is also user facing, it will expose specific APIs for modeling users to configure how they want the model to be quantized. Each backend should provide their own API documentation for their ``Quantizer``. |
| 40 | +```python |
| 41 | +from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e |
47 | 42 |
|
48 | | -Modeling users will use the ``Quantizer`` specific to their target backend to quantize their model, e.g. ``XNNPACKQuantizer``. |
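| | +# Export the model; sample_inputs is a tuple of example inputs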
| 43 | +training_gm = torch.export.export(model, sample_inputs).module() |
49 | 44 |
|
50 | | -For an example quantization flow with ``XNNPACKQuantizer``, more documentation and tutorials, please see ``Performing Quantization`` section in [ExecuTorch tutorial](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial). |
| 45 | +# Prepare the model for quantization using the backend-specific quantizer instance |
| 46 | +prepared_model = prepare_pt2e(training_gm, quantizer) |
51 | 47 |
|
52 | | -## Source Quantization: Int8DynActInt4WeightQuantizer |
53 | 48 |
|
54 | | -In addition to export based quantization (described above), ExecuTorch wants to highlight source based quantizations, accomplished via [torchao](https://github.com/pytorch/ao). Unlike export based quantization, source based quantization directly modifies the model prior to export. One specific example is `Int8DynActInt4WeightQuantizer`. |
| 49 | +# Calibrate the model on representative data |
| 50 | +for sample in calibration_data: |
| 51 | + prepared_model(sample) |
55 | 52 |
|
56 | | -This scheme represents 4-bit weight quantization with 8-bit dynamic quantization of activation during inference. |
| 53 | +# Convert the calibrated model to a quantized model |
| 54 | +quantized_model = convert_pt2e(prepared_model) |
| 55 | +``` |
57 | 56 |
|
58 | | -Imported with ``from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer``, this class uses a quantization instance constructed with a specified dtype precision and groupsize, to mutate a provided ``nn.Module``. |
| 57 | +The quantized model is a PyTorch model like any other and can be evaluated for accuracy on different tasks.
| 58 | +Task-specific benchmarks are the recommended way to evaluate a quantized model, but as a crude alternative you can compare its outputs with those of the original model using a generic error metric like SQNR:
59 | 59 |
|
| 60 | +```python |
| 61 | +from torchao.quantization.utils import compute_error |
| 62 | +out_reference = model(sample) |
| 63 | +out_quantized = quantized_model(sample) |
| 64 | +sqnr = compute_error(out_reference, out_quantized)  # SQNR; higher means closer to the reference
60 | 65 | ``` |
61 | | -# Source Quant |
62 | | -from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer |
63 | 66 |
|
64 | | -model = Int8DynActInt4WeightQuantizer(precision=torch_dtype, groupsize=group_size).quantize(model) |
| 67 | +Note that on-device numerics can differ from those in PyTorch even for unquantized models, so accuracy evaluation can also be done with the ExecuTorch pybindings or directly on device.
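| | +
| | +For example, once the model has been lowered to a .pte file (step 3 below), a minimal sketch of running it through the pybindings might look like the following. The file name and example_input are placeholders; see the runtime Python API documentation for details.
| | +
| | +```python
| | +from executorch.runtime import Runtime
| | +
| | +# Load the lowered program and run its forward method via the Python bindings
| | +runtime = Runtime.get()
| | +program = runtime.load_program("model.pte")
| | +method = program.load_method("forward")
| | +outputs = method.execute([example_input])  # compare against the eager PyTorch outputs
| | +```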
65 | 68 |
|
66 | | -# Export to ExecuTorch |
67 | | -from executorch.exir import to_edge |
68 | | -from torch.export import export |
69 | 69 |
|
70 | | -exported_model = export(model, ...) |
71 | | -et_program = to_edge(exported_model, ...).to_executorch(...) |
72 | | -``` |
| 70 | +## 3. Lower the Model
| 71 | +
| 72 | +The final step is to lower the quantized model to the target backend, just as you would an unquantized model. See the backend-specific pages for lowering information.
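| | +
| | +As an illustration, a minimal sketch of lowering to the XNNPACK backend is shown below. The partitioner import and output file name are assumptions; follow the XNNPACK backend guide for the authoritative flow.
| | +
| | +```python
| | +import torch
| | +from executorch.exir import to_edge_transform_and_lower
| | +from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
| | +
| | +# Re-export the quantized model, lower it to XNNPACK, and serialize a .pte file
| | +exported = torch.export.export(quantized_model, sample_inputs)
| | +et_program = to_edge_transform_and_lower(exported, partitioner=[XnnpackPartitioner()]).to_executorch()
| | +
| | +with open("model.pte", "wb") as f:
| | +    f.write(et_program.buffer)
| | +```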