
Commit 1b3cf01

refine
1 parent 63a63cc commit 1b3cf01

File tree

1 file changed (+6 −5 lines)


prototype_source/pt2e_quant_xpu_inductor.rst

Lines changed: 6 additions & 5 deletions
@@ -1,7 +1,7 @@
 PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
 ==================================================================
 
-**Author**: `Yan Zhiwei <https://github.com/ZhiweiYan-96>`_, `Wang Eikan <https://github.com/EikanWang>`_, `Liu River <https://github.com/riverliuintel>`_, `Cui Yifeng <https://github.com/CuiYifeng>`_
+**Author**: `Yan Zhiwei <https://github.com/ZhiweiYan-96>`_, `Wang Eikan <https://github.com/EikanWang>`_, `Zhang, Liangang <https://github.com/liangan1>`_, `Liu River <https://github.com/riverliuintel>`_, `Cui Yifeng <https://github.com/CuiYifeng>`_
 
 Prerequisites
 ---------------
@@ -23,12 +23,12 @@ The quantization flow mainly includes three steps:
 
 - Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
 - Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
-  performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
+  performing the prepared model's calibration, and converting the prepared model into the quantized model.
 - Step 3: Lower the quantized model into inductor with the API ``torch.compile``.
 
-During Step 3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>` contains
-highly-optimized quantized Cong/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, like quantized linear with eltwise activation function(ReLU) and binary operation(add, inplace sum).
-Besides oneDNN kernels, triton would be responsible for generating kernels on our GPUs, like operators `quantize` and `dequantize`. The triton kernels are optimized by `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`
+During Step 3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
+highly-optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, like quantized linear with eltwise activation function(ReLU) and binary operation(add, inplace sum).
+Besides oneDNN kernels, triton would be responsible for generating kernels on our GPUs, like operators `quantize` and `dequantize`. The triton kernels are optimized by `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`_
 
 
 The high-level architecture of this flow could look like this:
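The three-step flow described in this hunk maps onto a short runnable sketch. This is a minimal sketch, not the tutorial's full example: it assumes a PyTorch build with XPU support (where ``XPUInductorQuantizer`` and ``get_default_xpu_inductor_quantization_config`` ship under ``torch.ao.quantization.quantizer.xpu_inductor_quantizer``), an available Intel GPU, and torchvision; the exact export entry point can differ between releases.

::

    import torch
    import torchvision.models as models
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    model = models.resnet18().eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 224, 224, device="xpu"),)

    # Step 1: capture the FX graph through the torch export mechanism.
    exported_model = torch.export.export_for_training(model, example_inputs).module()

    # Step 2: define the backend-specific quantizer, insert observers,
    # calibrate, and convert the prepared model into the quantized model.
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    with torch.no_grad():
        prepared_model(*example_inputs)  # calibration; use representative data here
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower into Inductor, which dispatches to oneDNN and Triton kernels.
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)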
@@ -178,6 +178,7 @@ Besides the default quant configuration (asymmetric quantized activation), we al
 Then, we can set the quantization configuration to the quantizer.
 
 ::
+
     quantizer = XPUInductorQuantizer()
     quantizer.set_global(get_xpu_inductor_symm_quantization_config())
 
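Continuing the sketch above (same assumptions), the symmetric-config quantizer from this hunk drops into ``prepare_pt2e`` exactly like the default one; only the quantization config differs. ``get_xpu_inductor_symm_quantization_config`` is the helper the tutorial defines earlier in this file, not a core PyTorch API.

::

    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_xpu_inductor_symm_quantization_config())

    # Same prepare/calibrate/convert/compile flow as with the default config.
    prepared_model = prepare_pt2e(exported_model, quantizer)
    with torch.no_grad():
        prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)
    optimized_model = torch.compile(converted_model)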