# Quantized models in different frameworks
Quantized models keep a floating-point `scale` and an integer `zero-point`, and some also have `Quantize/Dequantize` operators whose inputs or outputs are floating-point. This section briefs the data types each framework uses; in the SNPS-Caffe implementation we use the target floating-point precision.
| | TFLite | ONNXruntime | Caffe2 |
| --- | --- | --- | --- |
| round half | away zero | toward even | toward even |
| `std::` | [round](https://github.com/tensorflow/tensorflow/blob/b58b895a5f64663b88177b1935d39c09fb6278ae/tensorflow/lite/kernels/internal/cppmath.h#L36) | rint | [nearbyint](https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/caffe2/operators/quantized/int8_utils.h#L51) |
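To make the distinction concrete, here is a minimal standalone C++ snippet (ours, not taken from any framework's source) showing how the two rounding modes in the table disagree on a tie value:

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

// Contrast the two "round half" behaviors on the tie value 2.5:
//   - std::round     rounds halves away from zero       -> 3
//   - std::nearbyint / std::rint round halves to even
//     under the default FE_TONEAREST rounding mode      -> 2
int main() {
  std::fesetround(FE_TONEAREST);  // the default mode; set explicitly for clarity
  std::printf("std::round(2.5)     = %.0f\n", std::round(2.5));      // 3
  std::printf("std::nearbyint(2.5) = %.0f\n", std::nearbyint(2.5));  // 2
  std::printf("std::rint(2.5)      = %.0f\n", std::rint(2.5));       // 2
  return 0;
}
```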
`fp` generally denotes the data type of
* the input tensor for `Quantize`, and
* the output tensor for `Dequantize`,
* the intermediate tensor type, when specific operators use floating-point registers for computation or for handling `output_scale`.
  * e.g., ONNXruntime generally handles the `input_scale`-to-`output_scale` transformation with [MlasRequantizeOutput](https://github.com/microsoft/onnxruntime/blob/8d737f977056444a307f1b7f0bcd402fba62d790/onnxruntime/core/mlas/lib/quantize.cpp#L357)(int Input, int Output, float scale), which uses an intermediate floating-point representation, `float`.
Some operators (in some frameworks) fall back to an intermediate floating-point representation; it is usually `float`, and we have not seen `double` used so far.
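Below is a simplified scalar sketch of this requantization pattern. The function name and the per-tensor-scale assumption are ours; the real `MlasRequantizeOutput` is a vectorized routine that also supports per-channel scales:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch: map an int32 accumulator onto the int8 output scale through an
// intermediate `float`. Here `scale` plays the role of the combined
// input_scale/output_scale ratio passed to the requantization routine.
int8_t RequantizeValue(int32_t acc, float scale, int32_t zero_point) {
  float scaled = static_cast<float>(acc) * scale;  // intermediate float
  float rounded = std::nearbyintf(scaled);         // round half to even
  int32_t q = static_cast<int32_t>(rounded) + zero_point;
  q = std::min<int32_t>(q, 127);                   // saturate to int8 range
  q = std::max<int32_t>(q, -128);
  return static_cast<int8_t>(q);
}
```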
## Quick Look-Up for Implementations in SNPS Caffe
We support the implementations from different frameworks, selected via the layer parameter `quantize_method` when their results are not bit-exact with one another. You can also refer to [FEATURES.md](https://github.com/foss-for-synopsys-dwc-arc-processors/synopsys-caffe/blob/development/FEATURES.md#custom-quantization-related) for other quantization-related parameters.
We denote the TFLite/ONNXruntime/Caffe2 implementations by **t**/**o**/**c**. Since some quantized operators yield bit-exact results across the frameworks, we don't always spell out which specific implementation is used.
1. Our model zoo doesn't cover all quantized operators across the frameworks. An entry is left empty if the `(framework, operator)` combination has not been seen yet.
   * Quantized bias_layer only occurs in ONNX (FC+Bias fusion is not supported yet).
2. Only `Quantize` and `Dequantize` operators are mapped to Power_layer.
3. For ResizeBilinear/Concat layers, we use Dequantize+Quantize to implement the affine transformation (see the sketch below).
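The sketch below illustrates that Dequantize+Quantize affine remapping for a single `uint8` value; the function name and signature are illustrative, not taken from the SNPS-Caffe source:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Move a quantized value from one quantization (s_in, z_in) to another
// (s_out, z_out) by going through the real-valued domain.
uint8_t DequantizeQuantize(uint8_t q_in, float s_in, int32_t z_in,
                           float s_out, int32_t z_out) {
  // Dequantize: recover the real value represented by q_in.
  float x = s_in * static_cast<float>(static_cast<int32_t>(q_in) - z_in);
  // Quantize: map the real value onto the output scale/zero-point.
  int32_t q = static_cast<int32_t>(std::nearbyintf(x / s_out)) + z_out;
  q = std::min<int32_t>(std::max<int32_t>(q, 0), 255);  // saturate to uint8
  return static_cast<uint8_t>(q);
}
```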