@@ -23,12 +23,12 @@ The quantization flow mainly includes three steps:

 - Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
 - Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
-  performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
+  performing the prepared model's calibration, and converting the prepared model into the quantized model.
 - Step 3: Lower the quantized model into inductor with the API ``torch.compile``.

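As a rough end-to-end sketch of the three steps above (assuming the PT2E quantization APIs and the ``XPUInductorQuantizer`` described elsewhere in this tutorial, on a PyTorch build with XPU support), the flow could look like:

.. code-block:: python

    import torch
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    # A toy eager model standing in for a real network.
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 32, 32, device="xpu"),)

    # Step 1: capture the FX Graph with the torch export mechanism.
    exported_model = torch.export.export_for_training(model, example_inputs).module()

    # Step 2: define the backend-specific quantizer, insert observers,
    # run calibration, and convert to the quantized model.
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    with torch.no_grad():
        prepared_model(*example_inputs)  # calibration with representative data
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower the quantized model into Inductor via torch.compile
    # (typically with Inductor freezing enabled, e.g. TORCHINDUCTOR_FREEZING=1).
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)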
-During Step 3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>` contains
-highly-optimized quantized Cong/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, like quantized linear with eltwise activation function(ReLU) and binary operation(add, inplace sum).
-Besides oneDNN kernels, triton would be responsible for generating kernels on our GPUs, like operators `quantize` and `dequantize`. The triton kernels are optimized by `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`
+During Step 3, Inductor decides which kernels each operation is dispatched to. There are two kinds of kernels from which the Intel GPU benefits: oneDNN kernels and Triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
+highly-optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, such as fusing quantized linear with an eltwise activation function (ReLU) or a binary operation (add, in-place sum).
+Besides the oneDNN kernels, Triton is responsible for generating the remaining kernels on Intel GPUs, such as the `quantize` and `dequantize` operators. The Triton kernels are optimized by the `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`_.
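One way to observe this split (continuing the sketch above; the exact generated op names are an assumption and vary across PyTorch versions) is to dump the code Inductor generates:

.. code-block:: python

    import torch

    # Print Inductor's generated wrapper code for the compiled model
    # (equivalently, run with the environment variable TORCH_LOGS="output_code").
    torch._logging.set_logs(output_code=True)

    with torch.no_grad():
        compiled_model = torch.compile(converted_model)
        compiled_model(*example_inputs)

    # In the dumped code, quantized Conv/GEMM operations appear as oneDNN calls
    # (with any fused ReLU/add folded in), while standalone quantize/dequantize
    # steps show up as Triton-generated kernels.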


 The high-level architecture of this flow could look like this:
@@ -178,6 +178,7 @@ Besides the default quant configuration (asymmetric quantized activation), we al
 Then, we can set the quantization configuration to the quantizer.
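For example (a minimal sketch: ``get_default_xpu_inductor_quantization_config`` gives the default asymmetric configuration, and a custom configuration such as the symmetric-activation one defined above can be passed instead):

.. code-block:: python

    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    quantizer = XPUInductorQuantizer()
    # set_global applies one quantization configuration to the whole model;
    # pass a custom (e.g. symmetric-activation) config here to override the default.
    quantizer.set_global(get_default_xpu_inductor_quantization_config())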