Commit 847288a: Add GPU int8 docs (#3481)

Intel® Extension for PyTorch\* optimizations for quantization [GPU]
===================================================================

Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This tutorial illustrates the workflow of quantization on Intel GPUs.

The overall usage follows the API defined in official PyTorch, so only small modifications are required, such as moving the model and data to the GPU with `to('xpu')`. We highly recommend using TorchScript for quantizing models. With a graph model created via TorchScript, optimizations such as operator fusion (e.g. `conv_relu`) are enabled automatically, which delivers the best performance for int8 workloads.

## Imperative Mode
```python
import torch
import intel_extension_for_pytorch

# Define model
model = Model().to("xpu")
model.eval()
modelImpe = torch.quantization.QuantWrapper(model)

# Define QConfig
qconfig = torch.quantization.QConfig(
    activation=torch.quantization.observer.MinMaxObserver.with_args(qscheme=torch.per_tensor_symmetric),
    weight=torch.quantization.default_weight_observer)  # weight could also be per-channel

modelImpe.qconfig = qconfig

# Prepare model for inserting observer
torch.quantization.prepare(modelImpe, inplace=True)

# Calibration to obtain statistics for observer
for data in calib_dataset:
    modelImpe(data)

# Convert model to create a quantized module
torch.quantization.convert(modelImpe, inplace=True)

# Inference
modelImpe(inference_data)
```

Imperative mode usage follows official PyTorch; more details can be found in the [PyTorch documentation](https://pytorch.org/docs/1.9.1/quantization.html).

Defining the quantization config (QConfig) for the model is the first stage of quantization. Per-tensor quantization is supported for activation quantization, while both per-tensor and per-channel quantization are supported for weights. Weights can be quantized to the `int8` data type only. For activation quantization, both symmetric and asymmetric schemes are supported, and both `uint8` and `int8` data types are supported.

If the best performance is desired, we recommend using the `symmetric+int8` combination. Other configurations may have lower performance due to the existence of `zero_point`.
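
As an illustration of these options, below is a minimal sketch of two QConfig variants: the recommended symmetric `int8` combination with a per-channel weight observer, and an asymmetric `uint8` alternative. The variable names are placeholders, and the observer choices shown are just one possible combination.

```python
import torch

# Recommended combination: symmetric per-tensor int8 activations and
# per-channel symmetric int8 weights (one possible configuration).
qconfig_int8_sym = torch.quantization.QConfig(
    activation=torch.quantization.observer.MinMaxObserver.with_args(
        qscheme=torch.per_tensor_symmetric,
        dtype=torch.qint8,
    ),
    weight=torch.quantization.observer.PerChannelMinMaxObserver.with_args(
        qscheme=torch.per_channel_symmetric,
        dtype=torch.qint8,
    ),
)

# Asymmetric uint8 activations are also supported, but may be slower
# because of the non-zero zero_point.
qconfig_uint8_asym = torch.quantization.QConfig(
    activation=torch.quantization.observer.MinMaxObserver.with_args(
        qscheme=torch.per_tensor_affine,
        dtype=torch.quint8,
    ),
    weight=torch.quantization.default_weight_observer,
)
```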

After defining a QConfig, the `prepare` function is used to insert observers into the model. The observers are responsible for collecting the statistics needed for quantization. A calibration stage is required for the observers to collect this information.

After calibration, the `convert` function quantizes the weights in the module and swaps the FP32 modules for quantized ones. An int8 module is then created, and it can be used directly for inference.

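A quick way to sanity-check the conversion is to print the module tree and confirm that the float submodules have been swapped. This is a minimal sketch relying only on standard PyTorch module printing, not on any Intel-specific API.

```python
# After torch.quantization.convert(...), the wrapped model should report
# quantized submodules in place of the original FP32 ones.
for name, module in modelImpe.named_modules():
    print(name, type(module).__name__)
```
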
## TorchScript Mode
```python
import torch
import intel_extension_for_pytorch
from torch.quantization.quantize_jit import (
    convert_jit,
    prepare_jit,
)

# Define model
model = Model().to("xpu")
model.eval()

# Generate a ScriptModule
modelJit = torch.jit.trace(model, example_input)  # or torch.jit.script(model)

# Define QConfig
qconfig = torch.quantization.QConfig(
    activation=torch.quantization.observer.MinMaxObserver.with_args(
        qscheme=qscheme,
        reduce_range=False,
        dtype=dtype
    ),
    weight=torch.quantization.default_weight_observer
)

# Prepare model for inserting observer
modelJit = prepare_jit(modelJit, {'': qconfig}, inplace=True)

# Calibration
for data in calib_dataset:
    modelJit(data)

# Convert model to quantized one
modelJit = convert_jit(modelJit)

# Warmup to fully trigger fusion patterns
for i in range(5):
    modelJit(warmup_data)

# Inference
modelJit(inference_data)

# Debug: dump the inference graph
print(modelJit.graph_for(inference_data))
```

For a TorchScript module, we need to define a QConfig, use `prepare_jit` to insert observers, and use `convert_jit` to replace the FP32 modules with quantized ones.

Before `prepare_jit`, create a ScriptModule using `torch.jit.script` or `torch.jit.trace`. `torch.jit.trace` is recommended because it can capture the whole graph in most scenarios.
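
As a sketch of why this choice matters, the hypothetical `Gate` module below (illustration only, run on CPU for simplicity) shows that `torch.jit.trace` records just the path taken by the example input, while `torch.jit.script` preserves data-dependent control flow:

```python
import torch

class Gate(torch.nn.Module):
    def forward(self, x):
        # Data-dependent branch: tracing records only one side of it.
        if x.sum() > 0:
            return x + 1
        return x - 1

m = Gate().eval()
traced = torch.jit.trace(m, torch.ones(2, 2))   # records the "x + 1" branch (emits a TracerWarning)
scripted = torch.jit.script(m)                  # keeps both branches

x = -torch.ones(2, 2)
print(traced(x))    # all 0s: replays the traced branch regardless of the input
print(scripted(x))  # all -2s: follows the actual control flow
```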

Fusion ops like conv_unary, conv_binary, and linear_unary (e.g. `conv_relu`, `conv_sum_relu`) are enabled automatically after model conversion (`convert_jit`). A warmup stage is required to bring the fusion into effect. With the benefit of fusion, a ScriptModule can deliver better performance than eager mode. Hence, we recommend using a ScriptModule for performance reasons.

`modelJit.graph_for(input)` is useful to dump the inference graph and other graph-related information for performance analysis.
