
Commit 6bfa4d7

refine
1 parent 19a568a commit 6bfa4d7

File tree

1 file changed: +70 -11 lines changed


prototype_source/pt2e_quant_xpu_inductor.rst

Lines changed: 70 additions & 11 deletions
@@ -1,7 +1,7 @@
 PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
 ==================================================================
 
-** Author**: `Yan, Zhiwei`, `Wang, Eikan`, `Liu River`, `Cui, Yifeng`
+**Author**: `Yan Zhiwei <https://github.com/ZhiweiYan-96>`, `Wang Eikan <https://github.com/EikanWang>`, `Liu River <https://github.com/riverliuintel>`, `Cui Yifeng <https://github.com/CuiYifeng>`
 
 
 Prerequisites
@@ -19,7 +19,7 @@ utilize PyTorch 2 Export Quantization flow and lower the quantized model into the
 
 The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
 This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
-TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
+TorchInductor is the compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.
 
 The quantization flow mainly includes three steps:
 
@@ -28,9 +28,9 @@ The quantization flow mainly includes three steps:
   performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
 - Step 3: Lower the quantized model into inductor with the API ``torch.compile``.
 
-During Step3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton fusion. oneDNN libray contains
-highly-optimized kernels for quantized Conv/GEMM. Furthermore, oneDNN supports extra operator fusion on these operators, like quantized linear with eltwise activation function(ReLU) and binary operation(add, inplace sum).
-For other operators that does not call oneDNN or fallback to ATen implementation, triton would be responsible to generate kernels on our GPUs, like operators `quantize` and `dequantize`.
+During Step 3, Inductor decides which kernels each operator is dispatched to. There are two kinds of kernels from which the Intel GPU benefits: oneDNN kernels and Triton kernels. The `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>` contains
+highly-optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, such as quantized linear with an eltwise activation function (ReLU) or a binary operation (add, in-place sum).
+Besides the oneDNN kernels, Triton is responsible for generating kernels for the remaining operators on the GPU, such as `quantize` and `dequantize`. The Triton kernels are optimized by the `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`.
 
 
 The high-level architecture of this flow could look like this:
@@ -64,7 +64,7 @@ The high-level architecture of this flow could look like this:
                           Inductor
                              |
    —--------------------------------------------------------
-   |          oneDNN Kernels           Triton Kernels       |
+   |     oneDNN Kernels      ATen Ops      Triton Kernels   |
    —--------------------------------------------------------
 
 
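To see which kernels Inductor actually picks during Step 3, the generated wrapper code can be dumped. A minimal sketch, assuming ``converted_model`` and ``example_inputs`` from the later sections of this tutorial; the ``output_code`` artifact of ``torch._logging.set_logs`` (equivalently the ``TORCH_LOGS="output_code"`` environment variable) prints the code Inductor emits:

::

    import torch

    # Print the Inductor-generated code; the dump typically shows the oneDNN
    # quantized Conv/GEMM calls alongside Triton kernels for quantize/dequantize.
    torch._logging.set_logs(output_code=True)

    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)
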
@@ -75,7 +75,10 @@ Post Training Quantization
 Static quantization is the only method we support currently. QAT and dynamic quantization will be available in later versions.
 
 Please install the dependency packages through the Intel GPU channel as follows:
-`pip install torchvision pytorch-triton-xpu --index-url https://download.pytorch.org/whl/nightly/xpu`
+
+::
+
+    pip install torchvision pytorch-triton-xpu --index-url https://download.pytorch.org/whl/nightly/xpu
 
 
 1. Capture FX Graph
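
The body of this section falls outside the diff context. A minimal sketch of the capture step, assuming ``torchvision``'s ``resnet18`` as the example model, the ``xpu`` device, and ``torch.export.export_for_training`` as the capture API:

::

    import torch
    import torchvision.models as models

    # Build an eager-mode FP32 model on the Intel GPU ("xpu") device.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 224, 224, device="xpu"),)

    with torch.no_grad():
        # Capture the model into an FX GraphModule to be quantized.
        exported_model = torch.export.export_for_training(model, example_inputs).module()
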
@@ -117,7 +120,7 @@ Next, we will have the FX Module to be quantized.
 2. Apply Quantization
 ^^^^^^^^^^^^^^^^^^^^^^^
 
-After we capture the FX Module to be quantized, we will import the Backend Quantizer for X86 CPU and configure how to
+After we capture the FX Module to be quantized, we will import the Backend Quantizer for Intel GPU and configure how to
 quantize the model.
 
 ::
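
The contents of the literal block opened above are outside the diff context. A minimal sketch of the default-configuration path, assuming the prototype names ``XPUInductorQuantizer`` and ``get_default_xpu_inductor_quantization_config`` in ``torch.ao.quantization.quantizer.xpu_inductor_quantizer``:

::

    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    # Apply the default configuration globally (the symmetric variant is shown below).
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
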
@@ -127,11 +130,66 @@ quantize the model.
 
 .. note::
 
-   The default quantization configuration in ``XPUInductorQuantizer`` uses signed 8-bits for both activations and weights. The tensor is per-tensor quantized, while weight is per-channel quantized.
+   The default quantization configuration in ``XPUInductorQuantizer`` uses signed 8-bit integers for both activations and weights. The activation is per-tensor quantized, while the weight is signed 8-bit per-channel quantized.
 
+   Besides the default quantization configuration, we also support signed 8-bit symmetric quantized activation, which has the potential
+   to provide better performance.
 
-After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
-``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.
+::
+
+    import torch
+    from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
+    from torch.ao.quantization.quantizer.quantizer import QuantizationSpec
+    from torch.ao.quantization.quantizer.xnnpack_quantizer_utils import QuantizationConfig
+    from typing import Any, Optional, TYPE_CHECKING
+
+    if TYPE_CHECKING:
+        from torch.ao.quantization.qconfig import _ObserverOrFakeQuantizeConstructor
+
+    def get_xpu_inductor_symm_quantization_config():
+        extra_args: dict[str, Any] = {"eps": 2**-12}
+        act_observer_or_fake_quant_ctr = HistogramObserver
+        act_quantization_spec = QuantizationSpec(
+            dtype=torch.int8,
+            quant_min=-128,
+            quant_max=127,
+            qscheme=torch.per_tensor_symmetric,
+            is_dynamic=False,
+            observer_or_fake_quant_ctr=act_observer_or_fake_quant_ctr.with_args(
+                **extra_args
+            ),
+        )
+
+        weight_observer_or_fake_quant_ctr: _ObserverOrFakeQuantizeConstructor = (
+            PerChannelMinMaxObserver
+        )
+
+        weight_quantization_spec = QuantizationSpec(
+            dtype=torch.int8,
+            quant_min=-128,
+            quant_max=127,
+            qscheme=torch.per_channel_symmetric,
+            ch_axis=0,  # 0 corresponds to the conv weight shape (oc, ic, kh, kw)
+            is_dynamic=False,
+            observer_or_fake_quant_ctr=weight_observer_or_fake_quant_ctr.with_args(
+                **extra_args
+            ),
+        )
+
+        bias_quantization_spec = None  # will use placeholder observer by default
+        quantization_config = QuantizationConfig(
+            act_quantization_spec,
+            act_quantization_spec,
+            weight_quantization_spec,
+            bias_quantization_spec,
+            False,
+        )
+        return quantization_config
+
+Then, the user can set the quantization configuration on the quantizer.
+
+::
+
+    quantizer = XPUInductorQuantizer()
+    quantizer.set_global(get_xpu_inductor_symm_quantization_config())
+
+After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
+``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.
 
 ::
@@ -200,6 +258,7 @@ script within the BFloat16 Autocast context.
     # Running some benchmark
     optimized_model(*example_inputs)
 
+
 Put all these code snippets together, and we will have the toy example code.
 Please note that since the Inductor ``freeze`` feature is not turned on by default yet, run your example code with ``TORCHINDUCTOR_FREEZING=1``.
