Commit 809869e (parent 90349ee): editing QUANTIZED_OP.md, +49 −29 lines

# Quantized models in different frameworks (Editing)

We keep a floating-point `scale` and an integer `zero-point` for quantized models, and some frameworks also provide `Quantize/Dequantize` operators with floating-point input or output. This section summarizes the data types each framework uses; the SNPS-Caffe implementation follows the target framework's floating-point precision.

|            | TFLite | ONNX | Caffe2 | SNPS Caffe |
| ---------- | ------ | ---- | ------ | ---------- |
| scale      | double | float | float | |
| fp         | [template](https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/lite/kernels/internal/reference/dequantize.h#L41) | [float](https://github.com/onnx/onnx/blob/master/docs/Operators.md#outputs-29) | float | float |
| round half | away zero | toward even | toward even | |
| std::      | [round](https://github.com/tensorflow/tensorflow/blob/b58b895a5f64663b88177b1935d39c09fb6278ae/tensorflow/lite/kernels/internal/cppmath.h#L36) | rint | [nearbyint](https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/caffe2/operators/quantized/int8_utils.h#L51) | |

fp generally denotes the data type of
* the input tensor for `Quantize`,
* the output tensor for `Dequantize`, and
* the intermediate tensor if a specific operator uses floating-point registers for computation or for handling the output scale.
  * e.g., ONNXruntime generally handles the `input_scale`-to-`output_scale` transformation with [MlasRequantizeOutput](https://github.com/microsoft/onnxruntime/blob/8d737f977056444a307f1b7f0bcd402fba62d790/onnxruntime/core/mlas/lib/quantize.cpp#L357)(int Input, int Output, float scale), which uses an intermediate floating-point representation: `float` (a sketch follows this list).
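
For illustration, a minimal per-element sketch of such a float-intermediate requantization; `RequantizeViaFloat` is a hypothetical helper, not the actual MLAS routine (which is vectorized and handles more cases):
```cpp=
#include <algorithm>
#include <cmath>
#include <cstdint>

// Requantize an int32 accumulator to uint8 through a float intermediate.
// std::nearbyintf rounds half to even under the default FP environment.
std::uint8_t RequantizeViaFloat(std::int32_t acc, float scale, std::uint8_t zero_point) {
  float scaled = static_cast<float>(acc) * scale;  // the float intermediate
  int q = static_cast<int>(std::nearbyintf(scaled)) + zero_point;
  return static_cast<std::uint8_t>(std::clamp(q, 0, 255));  // saturate to uint8
}
```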
## Quick Look-Up for Implementations in SNPS Caffe

We support implementations from different frameworks, selected via the layer parameter `quantize_method` when their results fail bit-exactness. You can also refer to [FEATURES.md](https://github.com/foss-for-synopsys-dwc-arc-processors/synopsys-caffe/blob/development/FEATURES.md#custom-quantization-related) for other quantization-related parameters.

We denote the TFLite/ONNXruntime/Caffe2 implementations by **t**/**o**/**c**. Since some quantized operators produce bit-exact results across frameworks, we don't elaborate the specific implementation for every entry.

| `operator` \ `quantize_method` | TFLite | ONNX | Caffe2 |
| ------------------------------ | ------ | ----- | ------ |
| AveragePooling  | **t** | **o** | **c** |
| Bias            |       |       | **c** |
| Convolution     | **t** | **o** | **c** |
| EltwiseSum      | **t** | **c** | **c** |
| InnerProduct    | **t** | **o** |       |
| Power*          | **t** | **o** | **c** |
| Concat*         |       |       |       |
| ResizeBilinear* |       |       |       |

#### Notes
1. Our model zoo doesn't cover all quantized operators across the frameworks. An entry is left empty if the `(framework, operator)` combination has not been seen yet.
   * A quantized bias_layer only occurs in ONNX (which does not support FC+Bias fusion yet).
2. Only the Quantize and Dequantize operators are mapped to Power_layer.
3. For the ResizeBilinear/Concat layers, we use Dequantize+Quantize to implement the affine transformation (see the sketch below).
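
A sketch of that Dequantize+Quantize rescaling; the helper name is hypothetical, and the rounding follows the target framework per the table above (shown here with round-half-to-even):
```cpp=
#include <algorithm>
#include <cmath>
#include <cstdint>

// Map a uint8 value quantized with (in_scale, in_zp) onto (out_scale, out_zp).
std::uint8_t DequantizeQuantize(std::uint8_t x, float in_scale, int in_zp,
                                float out_scale, int out_zp) {
  float real = (static_cast<int>(x) - in_zp) * in_scale;                // Dequantize
  int q = static_cast<int>(std::nearbyintf(real / out_scale)) + out_zp; // Quantize
  return static_cast<std::uint8_t>(std::clamp(q, 0, 255));
}
```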
## Quantized Convolutions
`output_multiplier` = `input_scale` * `weight_scale` / `output_scale`.
The quantized multiplier is calculated as follows (the `shift` is a power-of-two normalizer):
```cpp=
output_multiplier = <double>input_scale * <double>weight_scale / <double>output_scale;
quantized_multiplier = std::round(std::frexp(output_multiplier, &shift) * (1 << 31));
// or, for channel-wise quantization:
// output_multiplier[ch] = <double>input_scale * <double>weight_scale[ch] / <double>output_scale;
// quantized_multiplier[ch] = std::round(std::frexp(output_multiplier[ch], &shift[ch]) * (1 << 31));
```
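
As a quick sanity check, the decomposition can be computed standalone; this is a hypothetical snippet with made-up scales, not SNPS-Caffe source:
```cpp=
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  // Example scales (made-up values for illustration).
  double input_scale = 0.5, weight_scale = 0.004, output_scale = 0.25;
  double output_multiplier = input_scale * weight_scale / output_scale;  // 0.008
  int shift;
  // std::frexp: output_multiplier == frac * 2^shift, with frac in [0.5, 1).
  double frac = std::frexp(output_multiplier, &shift);  // frac = 0.512, shift = -6
  std::int32_t quantized_multiplier =
      static_cast<std::int32_t>(std::round(frac * (1ll << 31)));  // 1099511628
  std::printf("frac=%f shift=%d qm=%d\n", frac, shift,
              static_cast<int>(quantized_multiplier));
  // So acc * output_multiplier ≈ acc * quantized_multiplier / 2^31 * 2^shift.
}
```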

For convolutions, TFLite transforms to DepthwiseConv if `group` = `in_ch` = `out_ch`.
Then, different implementations are derived in SNPS-Caffe to match TFLite:

| Scales \ group | 1  | Depthwise | Pointwise* |
| -------------- | -- | --------- | ---------- |
| PerTensor      | D2 | F2        | F2*        |
| PerChannel     | D1 | D2        | D1*        |

Two kinds of rounding are used to approximate the affine transformation (from `input_scale` to `output_scale`, using the quantized multiplier):
1. The first splits it into two steps, denoted by **2-steps-rounding** (both sketched below):
   * [SaturatingRoundingDoublingHighMul](https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L340), and
   * [RoundingDivideByPOT](https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L368)
2. The second implements `rounding half toward positive infinity` in a single step, denoted by **1-step-rounding**.

The labels in the table above combine these choices: **D**/**F** for a double-/single-precision `output_multiplier`, and **2**/**1** for 2-steps/1-step rounding.
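
For reference, simplified scalar versions of the two gemmlowp helpers look roughly like this (condensed from the linked sources, which also define saturating and vectorized variants):
```cpp=
#include <cstdint>
#include <limits>

// Rounding doubling high multiply: (a * b * 2) / 2^32 with rounding,
// saturating the single overflow case INT32_MIN * INT32_MIN.
std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a, std::int32_t b) {
  bool overflow = (a == b) && (a == std::numeric_limits<std::int32_t>::min());
  std::int64_t ab = static_cast<std::int64_t>(a) * b;
  std::int64_t nudge = (ab >= 0) ? (1 << 30) : (1 - (1 << 30));
  std::int32_t high = static_cast<std::int32_t>((ab + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : high;
}

// Divide by 2^exponent, rounding half away from zero.
std::int32_t RoundingDivideByPOT(std::int32_t x, int exponent) {
  const std::int32_t mask = static_cast<std::int32_t>((1ll << exponent) - 1);
  const std::int32_t remainder = x & mask;
  const std::int32_t threshold = (mask >> 1) + ((x < 0) ? 1 : 0);
  return (x >> exponent) + ((remainder > threshold) ? 1 : 0);
}
```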

#### **D2** (Double Precision + 2-Steps-Rounding)
```cpp=
scaled_acc = SaturatingRoundingDoublingHighMul(<int>acc, <int>quantized_multiplier);
out_acc = RoundingDivideByPOT(scaled_acc, shift);
// The approximate result is out_acc = scaled_acc / (1 << 31) / (1 << shift),
// with a rounding at each step.
```

#### **F2** (Single Precision + 2-Steps-Rounding)
Use **`<float>`** to calculate the `output_multiplier`, then apply the 2-steps-rounding of **D2**.

#### **D1** (Double Precision + 1-Step-Rounding)
Calculate the per-channel `output_multiplier` in double precision (the channel-wise variant shown above), but use a simpler, single rounding to calculate the approximate result:
```cpp=
scaled_acc = <int>acc * <int>quantized_multiplier;
out_acc = (scaled_acc + (1 << (31 + shift - 1))) >> (31 + shift);
// That is, it rounds (only once) half toward positive infinity.
```
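
To see how the two schemes can diverge, here is a toy standalone comparison (made-up accumulator and multiplier, positive values only, using the divide-by-2^(31+shift) convention from the comments above):
```cpp=
#include <cstdint>
#include <cstdio>

int main() {
  const std::int64_t prod =
      static_cast<std::int64_t>(123) * 1099511628;  // acc * quantized_multiplier
  const int shift = 6;  // overall divisor is 2^(31 + shift)

  // 1-step-rounding: one round-half-up, as in D1.
  std::int64_t one_step = (prod + (1ll << (31 + shift - 1))) >> (31 + shift);

  // 2-steps-rounding: high-mul rounding (divide by 2^31, positive nudge),
  // then RoundingDivideByPOT (divide by 2^shift), as in D2.
  std::int64_t high = (prod + (1ll << 30)) >> 31;
  std::int64_t two_steps = (high + (1ll << (shift - 1))) >> shift;

  // The results usually agree but can differ by one LSB, which is why
  // `quantize_method` distinguishes the implementations.
  std::printf("1-step=%lld 2-steps=%lld\n",
              static_cast<long long>(one_step), static_cast<long long>(two_steps));
}
```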

#### **Pointwise Convolution***
When matching for bit-exactness, the combinations marked with * in the table above were found by brute force.

### ONNX runtime
It casts `<int>acc` to `<float>`, multiplies by `<float>output_multiplier`, then requantizes the result.

### Caffe2
It uses single-precision scales; the computation is the same as **F2** above.
