This tutorial introduces ``XPUInductorQuantizer``, which aims to serve quantized model inference on Intel GPUs. The tutorial covers how it
utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into Inductor.

The PyTorch 2 Export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.

The quantization flow mainly includes three steps:

- Step 1: Capture the FX graph from the eager model based on the torch export mechanism.
- Step 2: Apply the quantization flow based on the captured FX graph, including defining the backend-specific quantizer, generating the prepared model with observers, performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into Inductor with the API ``torch.compile``, as sketched below.
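
As a reference, below is a minimal end-to-end sketch of the three steps. The toy model, input shape, and freezing flag are illustrative assumptions rather than part of this tutorial's example code, and the exact capture API (``torch.export.export_for_training``) and quantizer module path can differ across PyTorch versions.

.. code-block:: python

    # Minimal sketch of the three-step flow on an XPU device (illustrative
    # model and shapes; assumes a PyTorch build with Intel GPU/XPU support).
    import torch
    import torch._inductor.config as inductor_config
    from torch.export import export_for_training
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_default_xpu_inductor_quantization_config,
    )

    model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval().to("xpu")
    example_inputs = (torch.randn(1, 3, 224, 224, device="xpu"),)

    # Step 1: capture the model into an ATen graph via torch.export.
    exported_model = export_for_training(model, example_inputs).module()

    # Step 2: prepare with the XPU quantizer, calibrate, then convert.
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_default_xpu_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    with torch.no_grad():
        prepared_model(*example_inputs)  # calibration with representative data
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower the quantized model into Inductor with torch.compile.
    # Freezing lets Inductor fold quantized weights and apply fusion passes.
    inductor_config.freezing = True
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)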

During Step 3, Inductor decides which kernels each operator is dispatched to. There are two kinds of kernels from which the Intel GPU benefits: oneDNN kernels and Triton kernels. The `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains
highly optimized quantized Conv/GEMM kernels for both CPU and GPU. Furthermore, oneDNN supports extra operator fusion on these operators, such as quantized linear with an eltwise activation function (ReLU) or a binary operation (add, in-place sum).
Besides the oneDNN kernels, Triton is responsible for generating kernels for the remaining operators on the GPU, such as ``quantize`` and ``dequantize``. The Triton kernels are optimized by the `Intel XPU Backend for Triton <https://github.com/intel/intel-xpu-backend-for-triton>`_.
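
To check which kernels Inductor actually picked (oneDNN kernels for the quantized Conv/GEMM versus generated Triton kernels for operators such as ``quantize`` and ``dequantize``), one option is to dump the code Inductor generates. The logging call below relies on the standard ``TORCH_LOGS`` facility and is a suggestion, not part of this tutorial's original example.

.. code-block:: python

    # Sketch: print the wrapper/Triton code Inductor generates, so the oneDNN
    # extern kernel calls and the Triton kernels become visible. Equivalent to
    # running the script with the environment variable TORCH_LOGS="output_code".
    import torch._logging

    torch._logging.set_logs(output_code=True)
    # Run the compiled model once afterwards (e.g. optimized_model(*example_inputs))
    # and inspect the printed code.
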
The high-level architecture of this flow could look like this:

Post Training Quantization
----------------------------

Static quantization is the only method we support currently. QAT and dynamic quantization will be available in later versions.

The dependency packages are recommended to be installed through the Intel GPU channel as follows:

The default quantization configuration in ``XPUInductorQuantizer`` uses signed 8 bits for both activations and weights. The activation is per-tensor quantized, while the weight is signed 8-bit per-channel quantized.

Besides the default quantization configuration (asymmetric quantized activation), we also support signed 8-bit symmetric quantized activation, which has the potential to provide better performance.
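
A minimal sketch of such a symmetric configuration follows, assuming the ``QuantizationSpec``/``QuantizationConfig`` classes from the PT2E quantizer utilities; the helper name, observer choices, and module paths are assumptions rather than this tutorial's exact recipe.

.. code-block:: python

    # Sketch (assumed helper and module paths): a signed int8 symmetric
    # activation spec plus the usual per-channel symmetric weight spec,
    # installed as the global config of XPUInductorQuantizer.
    import torch
    from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
    from torch.ao.quantization.quantizer import QuantizationSpec
    from torch.ao.quantization.quantizer.xnnpack_quantizer_utils import QuantizationConfig
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import XPUInductorQuantizer


    def get_symmetric_quantization_config() -> QuantizationConfig:
        act_spec = QuantizationSpec(
            dtype=torch.int8,
            quant_min=-128,
            quant_max=127,
            qscheme=torch.per_tensor_symmetric,  # symmetric instead of the default affine
            is_dynamic=False,
            observer_or_fake_quant_ctr=HistogramObserver.with_args(eps=2**-12),
        )
        weight_spec = QuantizationSpec(
            dtype=torch.int8,
            quant_min=-128,
            quant_max=127,
            qscheme=torch.per_channel_symmetric,
            ch_axis=0,
            is_dynamic=False,
            observer_or_fake_quant_ctr=PerChannelMinMaxObserver.with_args(eps=2**-12),
        )
        # input activation spec, output activation spec, weight spec, bias spec
        return QuantizationConfig(act_spec, act_spec, weight_spec, None)


    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_symmetric_quantization_config())

The prepare, convert, and ``torch.compile`` steps are then the same as in the default-configuration flow above.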