
Commit 174a6a6

update quant overview page
1 parent 9d599c9 commit 174a6a6

File tree

1 file changed (+34 -0 lines)

docs/source/quantization-overview.md

Lines changed: 34 additions & 0 deletions
@@ -1,3 +1,37 @@
The current quantization overview page is a bit sparse: https://pytorch.org/executorch/main/quantization-overview.html. I'd like to update it as follows:

* Move it under Usage/, since it's the only page under Quantization/ currently.
* Split out information intended for backend authors (info about writing a quantizer, for example). Focus on user-facing APIs.
* Document backend-invariant quantization flows (embeddings, ao kernels, etc.). Include info (and an example) on the composable quantizer (see the sketch after this list).
* Document the PT2E and quantize_ flows.
* Cover the general, high-level approach to quantizing different types of models:
  * CV models
  * Transformers / language models
* Talk briefly about options for evaluating quantized model accuracy (running in eager mode vs. pybindings vs. on-device, for example).
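
To keep the composable-quantizer and quantize_ items above from staying abstract, here is a minimal sketch of both user-facing entry points. The import paths, the int8_weight_only config, and the pairing of an embedding quantizer with the XNNPACK quantizer are assumptions based on recent torchao/PyTorch/ExecuTorch releases and may not match what the finished page documents.

```python
# Minimal sketch; import paths and config names are assumptions and move between releases.
import torch

# (1) Backend-invariant eager flow: torchao's quantize_ rewrites supported modules in place.
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Embedding(1000, 64), torch.nn.Linear(64, 32))
quantize_(model, int8_weight_only())  # int8 weight-only quantization of the Linear layer

# (2) Composable PT2E quantizer: combine per-component quantizers into a single object
# that can be passed to prepare_pt2e exactly like one backend's quantizer.
from torch.ao.quantization.quantizer.composable_quantizer import ComposableQuantizer
from torch.ao.quantization.quantizer.embedding_quantizer import EmbeddingQuantizer
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

xnnpack_quantizer = XNNPACKQuantizer()
xnnpack_quantizer.set_global(get_symmetric_quantization_config())
composed_quantizer = ComposableQuantizer([EmbeddingQuantizer(), xnnpack_quantizer])
```

The composed quantizer would then go through the same prepare_pt2e/convert_pt2e steps sketched in the draft section below.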
-----
# Quantizing ExecuTorch Models
ExecuTorch uses [torchao](https://github.com/pytorch/ao) for quantization. In general, ExecuTorch quantization is backend specific, and each backend defines exactly how models are quantized based on the capabilities of the underlying hardware.
Each backend defines its own PT2E quantizers.
PT2E quantization happens after model export, but before lowering to a backend.
* XNNPACK quantization example
* CoreML quantization example
* Vulkan quantization example
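
Until the backend-specific examples above are written, the block below is a minimal sketch of the shared PT2E flow using the XNNPACK quantizer as an illustration. `MyModel`, `calibration_data`, and the exact import paths are placeholders and assumptions; the per-backend pages remain the authoritative reference.

```python
# Minimal PT2E sketch (assumed import paths; MyModel and calibration_data are placeholders).
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import to_edge_transform_and_lower
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# 1. Export the model so the quantizer can annotate its graph.
exported = torch.export.export(model, example_inputs)

# 2. Insert observers according to the backend's quantizer configuration.
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(exported.module(), quantizer)

# 3. Calibrate with representative inputs to collect statistics.
for batch in calibration_data:
    prepared(*batch)

# 4. Replace observers with quantize/dequantize ops.
converted = convert_pt2e(prepared)

# 5. Re-export, lower to the backend, and serialize.
executorch_program = to_edge_transform_and_lower(
    torch.export.export(converted, example_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
```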
# Quantization Overview
Quantization is a process that reduces the precision of computations and lowers memory footprint in the model. To learn more, please visit the [ExecuTorch concepts page](concepts.md#quantization). This is particularly useful for edge devices including wearables, embedded devices and microcontrollers, which typically have limited resources such as processing power, memory, and battery life. By using quantization, we can make our models more efficient and enable them to run effectively on these devices.