@@ -19,7 +19,7 @@ utilize PyTorch 2 Export Quantization flow and lower the quantized model into the
The PyTorch 2 Export Quantization flow uses torch.export to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
-TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
+TorchInductor is the compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
The quantization flow mainly includes three steps:
@@ -28,9 +28,9 @@ The quantization flow mainly includes three steps:
performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``.

-During Step3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton fusion. oneDNN libray contains
-highly-optimized kernels for quantized Conv/GEMM. Furthermore, oneDNN supports extra operator fusion on these operators, like quantized linear with eltwise activation function(ReLU) and binary operation(add, inplace sum).
-For other operators that does not call oneDNN or fallback to ATen implementation, triton would be responsible to generate kernels on our GPUs, like operators `quantize` and `dequantize`.
+During Step 3, Inductor decides which kernels each operator is dispatched to. There are two kinds of kernels from which the Intel GPU benefits: oneDNN kernels and Triton kernels. The `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
+highly-optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, such as quantized linear fused with an eltwise activation function (ReLU) or a binary operation (add, in-place sum).
+Besides the oneDNN kernels, Triton is responsible for generating kernels for the remaining operators on the GPU, such as `quantize` and `dequantize`. The Triton kernels are optimized by the `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`_.
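To make the three steps and the lowering concrete, a minimal end-to-end sketch is shown below. It relies on the ``XPUInductorQuantizer`` described later in this tutorial; the exact import path, the ``get_default_xpu_inductor_quantization_config`` helper, and the torchvision model are assumptions for illustration, not code taken from this change.

::

    import torch
    import torchvision.models as models
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    # Assumed import path for the Intel GPU quantizer discussed later in this tutorial.
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    model = models.resnet18(weights=None).eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 224, 224, device="xpu"),)

    # Step 1: capture the FX graph with torch.export.
    exported_model = torch.export.export_for_training(model, example_inputs).module()

    # Step 2: prepare, calibrate, and convert with the backend-specific quantizer.
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)              # calibration with representative data
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower into Inductor; quantized Conv/GEMM dispatch to oneDNN kernels,
    # while the remaining operators (e.g. quantize/dequantize) get Triton kernels.
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)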
The high-level architecture of this flow could look like this:
@@ -64,7 +64,7 @@ The high-level architecture of this flow could look like this:
@@ -117,7 +120,7 @@ Next, we will have the FX Module to be quantized.
2. Apply Quantization
^^^^^^^^^^^^^^^^^^^^^^^
-After we capture the FX Module to be quantized, we will import the Backend Quantizer for X86 CPU and configure how to
+After we capture the FX Module to be quantized, we will import the Backend Quantizer for Intel GPU and configure how to
quantize the model.
::
@@ -127,11 +130,66 @@ quantize the model.
.. note::
-   The default quantization configuration in ``XPUInductorQuantizer`` uses signed 8-bits for both activations and weights. The tensor is per-tensor quantized, while weight is per-channel quantized.
+   The default quantization configuration in ``XPUInductorQuantizer`` uses signed 8-bit integers for both activation and weight. The activation is per-tensor quantized, while the weight is signed 8-bit per-channel quantized.
+   Besides the default quantization configuration, we also support signed 8-bit symmetric quantized activation, which has the potential
+   to provide better performance.
-After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
-``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.
+::
+
+    from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
+    from torch.ao.quantization.quantizer.quantizer import QuantizationSpec
+    from torch.ao.quantization.quantizer.xnnpack_quantizer_utils import QuantizationConfig
+    from typing import Any, Optional, TYPE_CHECKING
+    if TYPE_CHECKING:
+        from torch.ao.quantization.qconfig import _ObserverOrFakeQuantizeConstructor
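The hunk is cut off after these imports. As an illustration only, the following is a hedged sketch of how they might be combined into a configuration with symmetric, signed 8-bit activation, matching the note above; the helper name ``get_xpu_inductor_symm_quantization_config`` and the specific observer choices are assumptions, not the code actually added by this change.

::

    import torch
    from typing import Any
    from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
    from torch.ao.quantization.quantizer.quantizer import QuantizationSpec
    from torch.ao.quantization.quantizer.xnnpack_quantizer_utils import QuantizationConfig

    # Hypothetical helper: one way to build a config with symmetric int8 activation
    # and per-channel int8 weight from the classes imported above.
    def get_xpu_inductor_symm_quantization_config() -> QuantizationConfig:
        extra_args: dict[str, Any] = {"eps": 2**-12}
        act_quantization_spec = QuantizationSpec(
            dtype=torch.int8,
            quant_min=-128,
            quant_max=127,
            qscheme=torch.per_tensor_symmetric,   # symmetric activation
            is_dynamic=False,
            observer_or_fake_quant_ctr=HistogramObserver.with_args(**extra_args),
        )
        weight_quantization_spec = QuantizationSpec(
            dtype=torch.int8,
            quant_min=-128,
            quant_max=127,
            qscheme=torch.per_channel_symmetric,  # per-channel weight
            ch_axis=0,                            # output-channel axis of Conv/Linear weight
            is_dynamic=False,
            observer_or_fake_quant_ctr=PerChannelMinMaxObserver.with_args(**extra_args),
        )
        return QuantizationConfig(
            input_activation=act_quantization_spec,
            output_activation=act_quantization_spec,
            weight=weight_quantization_spec,
            bias=None,
            is_qat=False,
        )

Such a configuration could then be passed to ``quantizer.set_global(...)`` in place of the default one.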