
Commit 55fc501

lidanqing-intel and Wojciech Uss authored
Update QAT INT8 1.8 doc (#24127) (#24248)
* update local data preprocess doc
* update for 1.8 QAT
* update benchmark data

Co-authored-by: Wojciech Uss <[email protected]>
test=release/1.8 test=document_fix
1 parent 43c5626 · commit 55fc501

File tree

2 files changed: +77 −48 lines changed

paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md

Lines changed: 38 additions & 21 deletions
````diff
@@ -8,7 +8,6 @@ Follow PaddlePaddle [installation instruction](https://github.com/PaddlePaddle/m
 
 ```bash
 cmake .. -DWITH_TESTING=ON -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_MKL=ON -DWITH_MKLDNN=ON -DWITH_INFERENCE_API_TEST=ON -DON_INFER=ON
-
 ```
 
 Note: MKL-DNN and MKL are required.
````
````diff
@@ -64,14 +63,32 @@ We provide the results of accuracy and performance measured on Intel(R) Xeon(R)
 
 * ## Prepare dataset
 
-Run the following commands to download and preprocess the ILSVRC2012 Validation dataset.
+* Download and preprocess the full ILSVRC2012 Validation dataset.
 
 ```bash
-cd /PATH/TO/PADDLE/build
-python ../paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py
 ```
 
-Then the ILSVRC2012 Validation dataset will be preprocessed and saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`
+Then the ILSVRC2012 Validation dataset binary file is saved by default in `$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin`.
+
+* Prepare a user local dataset.
+
+```bash
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py --local --data_dir=/PATH/TO/USER/DATASET --output_file=/PATH/TO/OUTPUT/BINARY
+```
+
+Available options in the above command and their descriptions are as follows (a full invocation is sketched after this diff):
+- **No parameters set:** The script will download the ILSVRC2012_img_val data from the server and convert it into a binary file.
+- **local:** Once set, the script will process user local data.
+- **data_dir:** Path to the user local dataset. Default value: None.
+- **label_list:** Path to the image_label list file. Default value: `val_list.txt`.
+- **output_file:** Path to the generated binary file. Default value: `imagenet_small.bin`.
+- **data_dim:** The length and width of the preprocessed image. Default value: 224.
+
+The preprocessed user dataset binary file is saved by default in `imagenet_small.bin`.
 
 * ## Commands to reproduce image classification benchmark
 
````
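For reference, a full local-preprocessing invocation combining the options documented in the hunk above might look like this. This is an editorial sketch, not part of the commit; every path below is a placeholder:

```bash
cd /PATH/TO/PADDLE
# Preprocess a local ImageNet-style validation set into a single binary file.
# --local, --data_dir, --label_list, --output_file and --data_dim are the
# options described above; the concrete paths are hypothetical.
python paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py \
    --local \
    --data_dir=/data/ILSVRC2012_img_val \
    --label_list=/data/ILSVRC2012_img_val/val_list.txt \
    --output_file=$HOME/imagenet_user.bin \
    --data_dim=224
```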

````diff
@@ -108,30 +125,30 @@ MODEL_NAME=googlenet, mobilenetv1, mobilenetv2, resnet101, resnet50, vgg16, vgg1
 
 * ## Prepare dataset
 
-* Run the following commands to download and preprocess the Pascal VOC2007 test set.
+* Download and preprocess the full Pascal VOC2007 test set.
 
 ```bash
-cd /PATH/TO/PADDLE/build
-python ../paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --choice=VOC_test_2007
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py
 ```
 
-Then the Pascal VOC2007 test set will be preprocessed and saved by default in `$HOME/.cache/paddle/dataset/pascalvoc/pascalvoc_full.bin`
+The Pascal VOC2007 test set binary file is saved by default in `$HOME/.cache/paddle/dataset/pascalvoc/pascalvoc_full.bin`.
 
-* Run the following commands to prepare your own dataset.
+* Prepare a user local dataset.
 
 ```bash
-cd /PATH/TO/PADDLE/build
-python ../paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --choice=local \
-    --data_dir=./third_party/inference_demo/int8v2/pascalvoc_small \
-    --img_annotation_list=test_100.txt \
-    --label_file=label_list \
-    --output_file=pascalvoc_small.bin \
-    --resize_h=300 \
-    --resize_w=300 \
-    --mean_value=[127.5, 127.5, 127.5] \
-    --ap_version=11point \
+cd /PATH/TO/PADDLE
+python paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --local --data_dir=/PATH/TO/USER/DATASET --img_annotation_list=/PATH/TO/ANNOTATION/LIST --label_file=/PATH/TO/LABEL/FILE --output_file=/PATH/TO/OUTPUT/FILE
 ```
 
-Then the user dataset will be preprocessed and saved by default in `/PATH/TO/PADDLE/build/third_party/inference_demo/int8v2/pascalvoc_small/pascalvoc_small.bin`
+Available options in the above command and their descriptions are as follows (a full invocation is sketched after this diff):
+- **No parameters set:** The script will download the full Pascal VOC2007 test dataset, preprocess it, and convert it into a binary file.
+- **local:** Once set, the script will process user local data.
+- **data_dir:** Path to the user local dataset. Default value: None.
+- **img_annotation_list:** Path to the img_annotation list file. Default value: `test_100.txt`.
+- **label_file:** Path to the labels list. Default value: `label_list`.
+- **output_file:** Path to the generated binary file. Default value: `pascalvoc_small.bin`.
+
+The preprocessed user dataset binary file is saved by default in `pascalvoc_small.bin`.
 
 * ## Commands to reproduce object detection benchmark
 
````
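Similarly, a full local-preprocessing invocation for a Pascal VOC-style dataset might look like this (an editorial sketch; all paths are placeholders):

```bash
cd /PATH/TO/PADDLE
# Preprocess a local Pascal VOC-style test set into a single binary file;
# the flag names come from the options above, the paths are hypothetical.
python paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py \
    --local \
    --data_dir=/data/pascalvoc \
    --img_annotation_list=/data/pascalvoc/test_100.txt \
    --label_file=/data/pascalvoc/label_list \
    --output_file=$HOME/pascalvoc_user.bin
```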

python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md

Lines changed: 39 additions & 27 deletions
````diff
@@ -8,10 +8,12 @@ In **Release 1.6**, a new approach was introduced, called QAT2, which adds suppo
 
 In **Release 1.7**, support for the [Ernie (NLP) QAT trained model](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn) was added to the QAT2.
 
+In **Release 1.8**, further optimizations were added to the QAT2: an INT8 `matmul` kernel, in-place execution of activation and `elementwise_add` operators, and broader support for the quantization-aware strategy from PaddleSlim.
+
 In this document we focus on the QAT2 approach only.
 
 ## 0. Prerequisites
-* PaddlePaddle in version 1.7.1 or higher is required. For instructions on how to install it see the [installation document](https://www.paddlepaddle.org.cn/install/quick).
+* PaddlePaddle in version 1.8 or higher is required. For instructions on how to install it see the [installation document](https://www.paddlepaddle.org.cn/install/quick).
 
 * MKL-DNN and MKL are required. The highest performance gain can be observed on CPU servers supporting AVX512 instructions.
 * INT8 accuracy is best on CPU servers supporting the AVX512 VNNI extension (e.g. CLX class Intel processors). A Linux server supports AVX512 VNNI instructions if the output of the command `lscpu` contains the `avx512_vnni` entry in the `Flags` section. AVX512 VNNI support on Windows can be checked using the [`coreinfo`](https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo) tool.
````
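The Linux check described in the last bullet can be scripted; a minimal sketch, assuming only that `lscpu` lists the flag as stated above:

```bash
# Print whether this CPU advertises the AVX512 VNNI instructions.
if lscpu | grep -qw avx512_vnni; then
    echo "avx512_vnni: supported"
else
    echo "avx512_vnni: not supported"
fi
```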
````diff
@@ -30,11 +32,18 @@ A QAT model can be transformed into an INT8 quantized model if it contains enoug
 
 ### Gathering scales
 
-The information about the quantization scales is being collected from three types of operators:
+The information about the quantization scales is collected from two sources:
+
+1. the `out_threshold` attribute of quantizable operators - it contains a single-value quantization scale for the operator's output,
+2. fake quantize/dequantize operators - they imitate quantization from FP32 into INT8, or dequantization in the reverse direction, but keep the quantized tensor values as floats.
+
+There are three types of fake quantize/dequantize operators:
 
-* `fake_quantize_moving_average_abs_max` - imitates INT8 quantization of FP32 tensors, but keeps quantized output values as floats; it is used before a quantized operator (e.g. `conv2d`) to gather scale information for the op's input.
-* `fake_dequantize_max_abs` - imitates dequantization of INT8 tensors back into floats; it is used after a quantized operator, and contains the scale used for the op's weights dequantization.
-* `fake_quantize_dequantize_moving_average_abs_max` - imitates immediate quantization and dequantization; it can be used after a quantized operator to get the scale value for the op's output.
+* `fake_quantize_moving_average_abs_max` and `fake_quantize_range_abs_max` - used before a quantized operator (e.g. `conv2d`) to gather single-value scale information for the op's input,
+* `fake_dequantize_max_abs` and `fake_channel_wise_dequantize_max_abs` - used after quantized operators; they contain the scales used for dequantizing the operators' weights; the first one holds a single-value scale for the whole weights tensor, whereas the second one holds a vector of scales, one for each output channel of the weights,
+* `fake_quantize_dequantize_moving_average_abs_max` - used after a quantized operator to get the scale value for the op's output; imitates immediate quantization and dequantization.
+
+Scale values gathered from the fake quantize/dequantize operators take precedence over the scales collected from the `out_threshold` attributes.
 
 Notes:
 
````
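To make "imitates quantization but keeps the values as floats" concrete, here is an editorial sketch assuming the standard abs_max scheme (not code from the repository). A tensor x with scale s is sent on a float-valued round trip through the INT8 grid:

```latex
\tilde{x}_i = \mathrm{round}\left(\frac{127\, x_i}{s}\right) \cdot \frac{s}{127},
\qquad s = \max_i |x_i|
```

The values of the result lie on the INT8 grid but remain floats, which is why the graph still executes in FP32 while the scales can be read off the fake operators.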

````diff
@@ -43,7 +52,7 @@ Notes:
 and we want to quantize the `conv2d` op, then after applying FP32 optimizations the sequence will become
 ```... → input1 → conv2d → output3 → ...```
 and the quantization scales have to be collected for the `input1` and `output3` tensors in the QAT model.
-2. Quantization of the following operators is supported: `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `pool2d`, `reshape2`, `transpose2`, `concat`.
+2. Quantization of the following operators is supported: `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `concat`.
 3. The longer the sequence of consecutive quantizable operators in the model, the bigger the performance boost that can be achieved through quantization:
 ```... → conv2d → conv2d → pool2d → conv2d → conv2d → ...```
 Quantizing a single operator separated from other quantizable operators can give no performance benefit or even slow down the inference:
````
````diff
@@ -55,7 +64,7 @@ All the `fake_quantize_*` and `fake_dequantize_*` operators are being removed fr
 
 ### Dequantizing weights
 
-Weights of `conv2d`, `depthwise_conv2d` and `mul` operators are assumed to be fake-quantized (with integer values in the `int8` range, but kept as `float`s) in QAT models. Here, the information about the scale from `fake_dequantize_max_abs` operators is used to fake-dequantize the weights back to the full float range of values. At this point the model becomes an unoptimized clean FP32 inference model.
+Weights of `conv2d`, `depthwise_conv2d` and `mul` operators are assumed to be fake-quantized (with integer values in the `int8` range, but kept as `float`s) in QAT models. Here, the information about the scale from `fake_dequantize_max_abs` and `fake_channel_wise_dequantize_max_abs` operators is used to fake-dequantize the weights back to the full float range of values. At this point the model becomes an unoptimized clean FP32 inference model.
 
 ### Optimizing FP32 graph
 
````
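Under the same abs_max assumptions as above, a sketch of what the weight fake-dequantization described in this hunk computes (our illustration; s_oc denotes the scale collected for output channel oc in the channel-wise case, and the exact rounding constant may differ in the implementation):

```latex
w_{oc,\,j} = \tilde{w}_{oc,\,j} \cdot \frac{s_{oc}}{127}
```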

````diff
@@ -71,7 +80,7 @@ The basic datatype used during INT8 inference is signed INT8, with possible valu
 
 ### Propagation of scales
 
-Some of the operators (e.g. `reshape2`, `transpose2`, `pool2d` with max pooling) transform the data without changing the quantization scale. For this reason we propagate the quantization scale values through these operators without any modifications. We also propagate the quantization scales through the `scale` operator, updating the quantization scale accordingly. This approach lets us minimize the number of `fake_quantize` and `fake_dequantize` operators in the graph, because the information about the scales required for the quantization process to succeed spreads between quantized operators.
+Some of the operators (e.g. `reshape2`, `transpose2`, `pool2d` with max pooling) transform the data without changing the quantization scale. For this reason we propagate the quantization scale values through these operators without any modifications. We also propagate the quantization scales through the `scale` operator, updating the quantization scale accordingly. This approach lets us minimize the number of fake quantize/dequantize operators in the graph, because the information about the scales required for the quantization process to succeed spreads between quantized operators.
 
 ### Applying quantization passes
 
````
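A sketch of the `scale`-operator update mentioned in the hunk above (our reading; assume the operator computes y = a·x with zero bias, and that a quantization scale is the tensor's maximum absolute value):

```latex
s_{\mathrm{out}} = |a| \cdot s_{\mathrm{in}},
\qquad \text{since} \qquad
\max_i |a\, x_i| = |a| \cdot \max_i |x_i|
```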

````diff
@@ -153,25 +162,25 @@ Image classification models performance was measured using a single thread. The
 
 >**Intel(R) Xeon(R) Gold 6271**
 
-| Model | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
-| :----------: | :-------------: | :-----------------: | :---------------: |
-| MobileNet-V1 | 74.36 | 210.68 | 2.83 |
-| MobileNet-V2 | 89.59 | 186.55 | 2.08 |
-| ResNet101 | 7.21 | 26.41 | 3.67 |
-| ResNet50 | 13.23 | 48.89 | 3.70 |
-| VGG16 | 3.49 | 10.11 | 2.90 |
-| VGG19 | 2.84 | 8.69 | 3.06 |
+| Model | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
+| :----------: | :-------------: | :-----------------: | :---------------: |
+| MobileNet-V1 | 77.00 | 210.76 | 2.74 |
+| MobileNet-V2 | 88.43 | 182.47 | 2.06 |
+| ResNet101 | 7.20 | 25.88 | 3.60 |
+| ResNet50 | 13.26 | 47.44 | 3.58 |
+| VGG16 | 3.48 | 10.11 | 2.90 |
+| VGG19 | 2.83 | 8.77 | 3.10 |
 
 >**Intel(R) Xeon(R) Gold 6148**
 
 | Model | FP32 (images/s) | INT8 QAT (images/s) | Ratio (INT8/FP32) |
 | :----------: | :-------------: | :-----------------: | :---------------: |
-| MobileNet-V1 | 75.23 | 111.15 | 1.48 |
-| MobileNet-V2 | 86.65 | 127.21 | 1.47 |
-| ResNet101 | 6.61 | 10.60 | 1.60 |
-| ResNet50 | 12.42 | 19.74 | 1.59 |
-| VGG16 | 3.31 | 4.74 | 1.43 |
-| VGG19 | 2.68 | 3.91 | 1.46 |
+| MobileNet-V1 | 75.23 | 103.63 | 1.38 |
+| MobileNet-V2 | 86.65 | 128.14 | 1.48 |
+| ResNet101 | 6.61 | 10.79 | 1.63 |
+| ResNet50 | 12.42 | 19.65 | 1.58 |
+| VGG16 | 3.31 | 4.74 | 1.43 |
+| VGG19 | 2.68 | 3.91 | 1.46 |
 
 Notes:
 
````

````diff
@@ -200,16 +209,16 @@ Notes:
 
 | Model | Threads | FP32 Latency (ms) | QAT INT8 Latency (ms) | Ratio (FP32/INT8) |
 | :---: | :--------: | :---------------: | :-------------------: | :---------------: |
-| Ernie | 1 thread | 256.11 | 93.80 | 2.73 |
-| Ernie | 20 threads | 30.06 | 16.88 | 1.78 |
+| Ernie | 1 thread | 236.72 | 83.70 | 2.82 |
+| Ernie | 20 threads | 27.40 | 15.01 | 1.83 |
 
 
 >**Intel(R) Xeon(R) Gold 6148**
 
 | Model | Threads | FP32 Latency (ms) | QAT INT8 Latency (ms) | Ratio (FP32/INT8) |
 | :---: | :--------: | :---------------: | :-------------------: | :---------------: |
-| Ernie | 1 thread | 254.20 | 169.54 | 1.50 |
-| Ernie | 20 threads | 30.99 | 21.81 | 1.42 |
+| Ernie | 1 thread | 248.42 | 169.30 | 1.46 |
+| Ernie | 20 threads | 28.92 | 20.83 | 1.39 |
 
 ## 6. How to reproduce the results
 
````

````diff
@@ -261,7 +270,10 @@ You can use the `qat2_int8_image_classification_comparison.py` script to reprodu
 
 * `--qat_model` - a path to a QAT model that will be transformed into an INT8 model.
 * `--fp32_model` - a path to an FP32 model whose accuracy will be measured and compared to the accuracy of the INT8 model.
-* `--quantized_ops` - a comma-separated list of names of operators to be quantized. The list depends on which operators have quantization scales provided in the model. Also, it may be more optimal in terms of performance to choose only certain types of operators for quantization. For the Image Classification models mentioned above the list comprises the `conv2d` and `pool2d` operators.
+* `--quantized_ops` - a comma-separated list of names of operators to be quantized. When deciding which operators to put on the list, the following must be considered:
+  * Only operators which support quantization will be taken into account.
+  * All the quantizable operators from the list which are present in the model must have quantization scales provided in the model. Otherwise, the quantization procedure will fail with a message saying which variable is missing a quantization scale.
+  * Sometimes it may be suboptimal to quantize all quantizable operators in the model (cf. *Notes* in the **Gathering scales** section above). To find the optimal configuration for this option, the user can run the benchmark a few times with different lists of quantized operators and compare the results. For the Image Classification models mentioned above the list comprises the `conv2d` and `pool2d` operators. An example invocation is sketched after this diff.
 * `--infer_data` - a path to the validation dataset.
 
 ```bash
````
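To show how these options combine, a hypothetical run of the accuracy comparison (an editorial sketch: only the flag names come from the documentation above; the script location and all model/data paths are placeholders):

```bash
cd /PATH/TO/PADDLE
# Compare FP32 vs. QAT INT8 accuracy, quantizing only conv2d and pool2d ops.
# The model and dataset paths below are placeholders.
python python/paddle/fluid/contrib/slim/tests/qat2_int8_image_classification_comparison.py \
    --qat_model=/PATH/TO/QAT/MODEL \
    --fp32_model=/PATH/TO/FP32/MODEL \
    --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin \
    --quantized_ops="conv2d,pool2d"
```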
