# Quantized models in different frameworks
Quantized models keep a floating-point `scale` and an integer `zero-point`, and some frameworks also provide `Quantize`/`Dequantize` operators. This section summarizes the data types each framework uses.
|            | TFLite | ONNX | Caffe2 |
| ---------- | ------ | ---- | ------ |
| scale      | double | float | float |
| fp         | [template](https://github.com/tensorflow/tensorflow/blob/5dcfc51118817f27fad5246812d83e5dccdc5f72/tensorflow/lite/kernels/internal/reference/dequantize.h#L41) | [float](https://github.com/onnx/onnx/blob/master/docs/Operators.md#outputs-29) | float |
| round half | away from zero | toward even | toward even |
| std::      | [round](https://github.com/tensorflow/tensorflow/blob/b58b895a5f64663b88177b1935d39c09fb6278ae/tensorflow/lite/kernels/internal/cppmath.h#L36) | rint | [nearbyint](https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/caffe2/operators/quantized/int8_utils.h#L51) |

`fp` generally denotes the data type of
 * the input tensor for `Quantize`, and
 * the output tensor for `Dequantize`.

Some operators (in some frameworks) use an intermediate floating-point representation; it is usually `float`. We have not seen `double` used so far.
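
The "round half" row matters when a scaled value lands exactly between two integers. A quick self-contained illustration of the two behaviors in plain C++ (no framework code involved):
```cpp=
#include <cmath>
#include <cstdio>

int main() {
  // std::round: ties away from zero (TFLite's convention)
  std::printf("%.0f %.0f\n", std::round(2.5), std::round(-2.5));          // 3 -3
  // std::nearbyint: uses the current rounding mode, which defaults to
  // ties-to-even (the ONNX Runtime / Caffe2 convention)
  std::printf("%.0f %.0f\n", std::nearbyint(2.5), std::nearbyint(-2.5));  // 2 -2
  return 0;
}
```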

ONNX Runtime generally handles `output_scale` with [MlasRequantizeOutput](https://github.com/microsoft/onnxruntime/blob/8d737f977056444a307f1b7f0bcd402fba62d790/onnxruntime/core/mlas/lib/quantize.cpp#L357)`(int Input, int Output, float scale)`, which uses `float` as the intermediate floating-point representation.
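
A minimal sketch of that requantization path, assuming an `int32` accumulator, a per-tensor scale, and a `uint8` output with a zero point. The function name, the zero-point handling, and the clamping below are illustrative, not ONNX Runtime's actual internals:
```cpp=
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative only: scale the int32 accumulator in float (the intermediate
// representation), round ties-to-even, add the zero point, clamp to uint8.
uint8_t RequantizeOne(int32_t acc, float scale, int32_t zero_point) {
  float scaled = static_cast<float>(acc) * scale;
  int32_t rounded = static_cast<int32_t>(std::nearbyintf(scaled));
  return static_cast<uint8_t>(std::clamp(rounded + zero_point, 0, 255));
}
```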

## Quantized Convolutions
`output_multiplier` = `input_scale` * `weight_scale` / `output_scale`

Recall that TFLite uses `double` for scales, while ONNX Runtime and Caffe2 use `float`.
### TFLite
The quantized multiplier is calculated as follows (`shift` is the power-of-two exponent that normalizes `output_multiplier` into [0.5, 1)):
```cpp=
output_multiplier = (double)input_scale * (double)weight_scale / (double)output_scale;
quantized_multiplier = std::round(std::frexp(output_multiplier, &shift) * (1ll << 31));
```
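
As a worked example with illustrative numbers (not taken from any real model):
```cpp=
// input_scale = 0.5, weight_scale = 0.4, output_scale = 0.8
// => output_multiplier = 0.5 * 0.4 / 0.8 = 0.25
// std::frexp(0.25, &shift) returns 0.5 and sets shift = -1  (0.25 = 0.5 * 2^-1)
// => quantized_multiplier = std::round(0.5 * 2^31) = 1 << 30
```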

For convolutions, TFLite transforms to DepthwiseConv if `group` = `in_ch` = `out_ch`.
Then, different roundings are derived in SNPS-Caffe to match TFLite:

| Scales \ group | 1  | Depthwise | Pointwise* |
| -------------- | -- | --------- | ---------- |
| PerTensor      | A1 | A2        | A1*        |
| PerChannel     | B1 | B2        | B2*        |

Two kinds of rounding are used to multiply quantized numbers (sketched below):
* [SaturatingRoundingDoublingHighMul](https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L340)
* [RoundingDivideByPOT](https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h#L368)
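
These are real gemmlowp functions (linked above); a scalar sketch of what they compute, simplified from the linked code:
```cpp=
#include <cstdint>
#include <limits>

// High 32 bits of 2*a*b, i.e. round(a * b / 2^31), saturated;
// the nudge makes ties round away from zero.
std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a, std::int32_t b) {
  bool overflow = (a == b) && (a == std::numeric_limits<std::int32_t>::min());
  std::int64_t ab_64 = static_cast<std::int64_t>(a) * b;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 = static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

// x / 2^exponent, rounded to nearest with ties away from zero.
std::int32_t RoundingDivideByPOT(std::int32_t x, int exponent) {
  const std::int32_t mask = (1ll << exponent) - 1;
  const std::int32_t remainder = x & mask;
  const std::int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
  return (x >> exponent) + (remainder > threshold ? 1 : 0);
}
```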

#### **A1** (Double Precision + Double Roundings)
```cpp=
scaled_acc = SaturatingRoundingDoublingHighMul((int32_t)acc, (int32_t)quantized_multiplier);
out_acc = RoundingDivideByPOT(scaled_acc, shift);
// Approximate result: out_acc = acc * quantized_multiplier / (1 << 31) / (1 << shift),
// with one rounding at each of the two steps (hence "Double Roundings").
```

#### **A2** (Single Precision + Double Roundings)
Use `float` to calculate `output_multiplier`, then apply **A1**.

#### **B1** (Double Precision + Single Rounding)
Calculate the `output_multiplier` per channel:
```cpp=
output_multiplier[ch] = (double)input_scale * (double)weight_scale[ch] / (double)output_scale;
```
But it uses a simpler rounding to calculate the approximate result:
```cpp=
scaled_acc = (int64_t)acc * quantized_multiplier;
out_acc = (scaled_acc + (1ll << (31 + shift - 1))) >> (31 + shift);
// i.e. it rounds only once, half toward positive infinity
```

#### **B2** (Double Precision + Double Roundings)
The per-channel `output_multiplier` is calculated as in **B1**,
but the roundings of **A1** are applied.

#### **Pointwise Convolution***
When matching TFLite results bit-exactly, the combination of `PerTensor-A1` and `PerChannel-B2` was found by brute force.

### ONNX Runtime
It casts the `int` accumulator to `float`, multiplies by the `float` `output_multiplier`, and requantizes the result (as in the `MlasRequantizeOutput` sketch above).

### Caffe2
It uses single-precision scales; the computation is the same as **A2** above.