Commit 7d8de8a (parent 7e21bc6): add brief description about quantized Convolution

1 file changed: QUANTIZED_OP.md (+75, -0 lines)
# Quantized models in different frameworks
Quantized models keep a floating-point `scale` and an integer `zero-point`, and some frameworks also provide `Quantize`/`Dequantize` operators. This section summarizes the data types each framework uses.
| | TFLite | ONNX | Caffe2 |
| ----- | ------ | ----- | ------ |
| scale | double | float | float |
| fp | [template](https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/lite/kernels/internal/reference/dequantize.h#L41) | [float](https://github.com/onnx/onnx/blob/master/docs/Operators.md#outputs-29) | float |
| round half | away zero | toward even | toward even |
| std:: | [round](https://github.com/tensorflow/tensorflow/blob/b58b895a5f64663b88177b1935d39c09fb6278ae/tensorflow/lite/kernels/internal/cppmath.h#L36) | rint | [nearbyint](https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/caffe2/operators/quantized/int8_utils.h#L51) |
`fp` generally denotes the data type of

* the input tensor for `Quantize`, and
* the output tensor for `Dequantize`.
Some operators (in some frameworks) introduce an intermediate floating-point representation; it is usually `float`. `double` has not been observed so far.
ONNX Runtime generally handles `output_scale` with [MlasRequantizeOutput](https://github.com/microsoft/onnxruntime/blob/8d737f977056444a307f1b7f0bcd402fba62d790/onnxruntime/core/mlas/lib/quantize.cpp#L357)(int Input, int Output, float scale), which uses `float` as the intermediate floating-point representation.
## Quantized Convolutions
`output_multiplier` = `input_scale` * `weight_scale` / `output_scale`
Recall that TFLite uses `<double>`, while ONNX Runtime and Caffe2 use `<float>` for scales.
### TFLite
The quantized multiplier is calculated as follows (`shift` is the power-of-two exponent with which `std::frexp` normalizes `output_multiplier` into [0.5, 1)):
```cpp
double output_multiplier =
    (double)input_scale * (double)weight_scale / (double)output_scale;
int shift;
quantized_multiplier = std::round(std::frexp(output_multiplier, &shift) * (1LL << 31));
```
For convolutions, TFLite transforms to DepthwiseConv when `group` = `in_ch` = `out_ch`.
Then, different rounding schemes are derived in SNPS-Caffe to match TFLite:
| Scales \ group | 1 | Depthwise | Pointwise* |
| --------- | ------ | --------- | ---------- |
| PerTensor | A1 | A2 | A1* |
| PerChannel | B1 | B2 | B2* |
Two kinds of rounding are used to multiply quantized numbers:
* [SaturatingRoundingDoublingHighMul](https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L340)
* [RoundingDivideByPOT](https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L368)
#### **A1** (Double Precision + Double Roundings)
```cpp
scaled_acc = SaturatingRoundingDoublingHighMul((int32_t)acc, quantized_multiplier);
out_acc = RoundingDivideByPOT(scaled_acc, shift);
// Approximate result: out_acc ≈ acc * quantized_multiplier / (1<<31) / (1<<shift),
// with rounding applied at both steps.
```
#### **A2** (Single Precision + Double Roundings)
Use **`<float>`** to calculate output_multiplier, then apply **A1**.
#### **B1** (Double Precision + Single Rounding)
Calculate the `output_multiplier` per channel:
```cpp
output_multiplier[ch] =
    (double)input_scale * (double)weight_scale[ch] / (double)output_scale;
```
But it uses a simpler rounding to compute the approximate result:
```cpp
scaled_acc = (int64_t)acc * quantized_multiplier;
out_acc = (scaled_acc + (1LL << (31 + shift - 1))) >> (31 + shift);
// i.e., it rounds only once, half toward positive infinity
```
#### **B2** (Double Precision + Double Roundings)
The per-channel `output_multiplier` is calculated as in **B1**, but the roundings of **A1** are applied.
#### **Pointwise Convolution***
When matching TFLite's results bit-exactly, the combination `PerTensor-A1` / `PerChannel-B2` was found by brute force.
### ONNX Runtime

It casts `<int>acc` to `<float>`, multiplies by the `<float>` `output_multiplier`, and requantizes the result.
### Caffe2
It uses single-precision scales; the computation is the same as **A2** above.
