A uniform "fake" quantization method supports an arbitrary number of bits (>= 2) used to represent weights and activations. During the forward pass, the method performs differentiable sampling of the continuous signal (for example, activations or weights), simulating inference with integer arithmetic.
Quantization is parametrized by the clamping range and the number of quantization levels. The sampling formula is the following:
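The forward pass of such a fake quantizer can be sketched in a few lines of NumPy (an illustrative sketch, not NNCF's actual kernel; `input_low`/`input_high` denote the clamping range):

```python
import numpy as np

def fake_quantize(x, input_low, input_high, levels):
    """Uniform 'fake' quantization: clamp to the range, snap to a uniform
    integer grid of `levels` points, and map back to floating point."""
    s = (levels - 1) / (input_high - input_low)   # grid points per unit of range
    x_clamped = np.clip(x, input_low, input_high)
    return np.round((x_clamped - input_low) * s) / s + input_low
```

For example, with `input_low=0.0`, `input_high=1.0` and `levels=3` the representable values are exactly `{0.0, 0.5, 1.0}`, and any input is clamped and rounded onto that grid.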
During training, we optimize the scale parameter that represents the range $[input\_low, input\_high]$ of the original signal using gradient descent:
In the formula above, $level\_low$ and $level\_high$ represent the range of the discrete signal.

- For weights:

  $level\_low=-2^{bits-1}+1$, $level\_high=2^{bits-1}-1$, $levels=2^{bits}-1$ (255 for 8 bits)

- For unsigned activations:

  $level\_low=0$, $level\_high=2^{bits}-1$, $levels=2^{bits}$ (256 for 8 bits)

- For signed activations:

  $level\_low=-2^{bits-1}$, $level\_high=2^{bits-1}-1$, $levels=2^{bits}$ (256 for 8 bits)
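The level bounds above can be computed with a small helper (an illustrative function, not part of the NNCF API):

```python
def quantization_levels(bits, kind):
    """Integer grid bounds for the three cases listed above."""
    if kind == "weights":
        # symmetric grid; the most negative level is dropped to keep it symmetric
        return -2**(bits - 1) + 1, 2**(bits - 1) - 1, 2**bits - 1
    if kind == "unsigned_activations":
        return 0, 2**bits - 1, 2**bits
    if kind == "signed_activations":
        return -2**(bits - 1), 2**(bits - 1) - 1, 2**bits
    raise ValueError(f"unknown kind: {kind}")
```

For 8 bits this yields the familiar `(-127, 127, 255)` for weights and `(0, 255, 256)` / `(-128, 127, 256)` for unsigned / signed activations.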
For all the cases listed above, the common quantization formula is simplified after substitution of $input\_low$, $input\_high$ and $levels$:
Use the num_init_samples parameter from the initializer group to initialize the values of scale and to determine, from statistics collected over the given number of samples, whether each activation should be signed or unsigned.
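For example, the initializer group in the NNCF JSON config could look as follows (a hypothetical fragment; 256 is an arbitrary sample count chosen for illustration):

```json
{
    "compression": {
        "algorithm": "quantization",
        "initializer": {
            "range": {
                "num_init_samples": 256
            }
        }
    }
}
```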
During training, we optimize the input_low and input_range parameters using gradient descent:
For better accuracy, the floating-point zero should lie within the quantization range and be mapped exactly onto an integer quantization level (a "quant"), without rounding. Therefore, the following scheme is applied to the ranges of weight and activation quantizers before the actual quantization:
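One way to realize this zero-alignment is sketched below (an illustrative sketch under the assumption that the step size is kept fixed while the grid is shifted; this is not necessarily NNCF's exact procedure):

```python
def adjust_range(input_low, input_high, levels):
    """Nudge the quantization range so that 0.0 maps exactly onto an
    integer level, keeping the step size unchanged."""
    # 1) make sure zero is inside the range
    input_low = min(input_low, 0.0)
    input_high = max(input_high, 0.0)
    # 2) shift the grid so that zero coincides with an integer level
    s = (input_high - input_low) / (levels - 1)
    zero_point = round(-input_low / s)   # integer level that zero maps to
    input_low = -zero_point * s
    return input_low, input_low + s * (levels - 1)
```

After adjustment, quantizing `0.0` with the returned range produces exactly `0.0`, with no rounding error at zero.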
You can use the num_init_samples parameter from the initializer group to initialize the values of input_low and input_range from statistics collected over the given number of samples.
NNCF allows quantizing models for best results on a given Intel hardware type when executed using the OpenVINO runtime. To achieve this, the quantizer setup should be performed with the following considerations in mind:
- every operation that can accept quantized inputs on a given HW (i.e. can be executed using quantized input values) should have its inputs quantized in NNCF
- the quantized inputs should be quantized with a configuration that is supported on a given HW for a given operation (e.g. per-tensor vs per-channel quantization, or 8 bits vs. 4 bits)
- for operations that are agnostic to quantization, the execution should handle quantized tensors rather than full-precision tensors.
- certain operation sequences will be runtime-optimized to execute in a single kernel call ("fused"), and additional quantizer insertion/quantization simulation within such operation sequences will be detrimental to overall performance
These requirements are fulfilled by the quantizer propagation algorithm. The algorithm first searches the internal NNCF representation of the model's control flow graph for predefined "fusible" patterns and applies the fusing to that internal representation. Next, each operation in the graph that corresponds to an input-quantizable operation on the target hardware is assigned a single quantizer for each of its quantizable activation inputs, with a set of possible quantizer configurations (those feasible on the target HW) attached to it. The quantizers are then "propagated" against the data flow in the model's control flow graph as far as possible, potentially merging with other quantizers. Once all quantizers have reached a standstill in their propagation, each has a final (possibly reduced) set of possible configurations, from which a single one is chosen either manually or by a precision initialization algorithm (which accepts the potential quantizer locations and their associated configuration sets). The resulting configuration is then applied as the final quantizer setup.
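The propagation step can be illustrated with a toy graph model (purely a sketch with hypothetical node and op names, not NNCF's implementation): quantizers placed at the activation inputs of quantizable ops move upward through quantization-agnostic ops, and their config sets are intersected when two quantizers meet.

```python
# Ops through which a quantizer may propagate unchanged (illustrative set)
AGNOSTIC = {"reshape", "transpose", "max_pool"}

def propagate(graph, quantizers):
    """graph: {node: (op_type, [input_nodes])};
    quantizers: {node: set_of_possible_configs}, keyed by the node whose
    output is quantized. Moves each quantizer up through agnostic ops,
    intersecting config sets when two quantizers merge."""
    moved = True
    while moved:
        moved = False
        for node, configs in list(quantizers.items()):
            op, inputs = graph[node]
            if op in AGNOSTIC and len(inputs) == 1:
                parent = inputs[0]
                del quantizers[node]
                if parent in quantizers:
                    quantizers[parent] &= configs   # merge: keep common configs
                else:
                    quantizers[parent] = configs
                moved = True
    return quantizers
```

In this toy model, a quantizer sitting after a `reshape` migrates to the `reshape`'s input, and if a quantizer already exists there, only the configurations supported by both survive, mirroring the "possibly reduced" config sets described above.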
Note that this algorithm applies to activation quantization only - the weight quantizers do not require propagation. However, the possible configurations of weight quantizers themselves are also sourced from the HW config file definitions.
The HW to target for a given quantization algorithm run can be specified in NNCF config using the global "target_device" option.
The default corresponds to CPU-friendly quantization.
"TRIAL" corresponds to a configuration that uses the general quantizer propagation algorithm, but does not use any HW-specific information about quantizability of given operation types or possible quantizer configs for associated inputs or operation weights.
Instead it uses a default, basic 8-bit symmetric per-tensor quantization configuration for each quantizer, and quantizes inputs of a certain default operation set, which at the moment is defined internally in NNCF.
The quantization configuration in the "target_device": "TRIAL" case may be overridden using the regular "activations" and "weights" sections in the quantization compression algorithm sub-config, see below.
For all target HW types, parts of the model graph can be marked as non-quantizable by using the "ignored_scopes" field - inputs and weights of matching nodes in the NNCF internal graph representation will not be quantized, and the downstream quantizers will not propagate upwards through such nodes.
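For example, scopes can be excluded via the config as follows (the scope names here are hypothetical; the `{re}` prefix marks a regular expression):

```json
{
    "target_device": "ANY",
    "compression": {
        "algorithm": "quantization",
        "ignored_scopes": [
            "MyModel/Linear[classifier]",
            "{re}.*attention_scores.*"
        ]
    }
}
```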
In our implementation, we use a slightly transformed formula. It is equivalent, up to the order of floating-point operations, to the simplified symmetric formula and to the asymmetric one. The small differences are the addition of a small positive number eps to prevent division by zero, and taking the absolute value of the range, since it might become negative during the backward pass:
For asymmetric:
For symmetric:
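Both variants can be sketched in NumPy as follows (illustrative only; the exact `eps` value and the symmetric-to-asymmetric reduction shown here are assumptions, not NNCF's actual kernels):

```python
import numpy as np

EPS = 1e-16  # small positive constant; the exact value is an assumption

def asymmetric_fq(x, input_low, input_range, levels):
    # abs() guards against a range that went negative during training;
    # EPS prevents division by zero
    rng = np.abs(input_range) + EPS
    s = (levels - 1) / rng
    x_c = np.clip(x, input_low, input_low + rng)
    return np.round((x_c - input_low) * s) / s + input_low

def symmetric_fq(x, scale, levels, level_low, level_high):
    # symmetric case expressed through the same transformed formula:
    # input_high = |scale|, input_low = |scale| * level_low / level_high
    rng = np.abs(scale) + EPS
    input_low = rng * level_low / level_high
    return asymmetric_fq(x, input_low, rng - input_low, levels)
```

Note that passing a negative `input_range` yields the same result as its absolute value, which is exactly the robustness property the transformed formula is meant to provide.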
The most common case of applying quantization is 8-bit uniform quantization.
The forward quantization formula contains two non-differentiable operations: clamping and rounding. To enable gradient-based optimization of the quantization parameters during QAT, NNCF defines custom surrogate gradients using a Straight-Through Estimator (STE) for rounding and piecewise-defined surrogate gradients for the clamp boundaries.
This approach is a form of learned-range fake quantization — it is related to Learned Step Size Quantization (LSQ), but uses a different parameterization (input_low, input_range) instead of (step size, zero point), and omits LSQ's gradient scaling factor.
In this section, the behavior of the quantizer is analyzed element-wise: $x$ denotes a single element of the input tensor. The input tensor is partitioned element-wise into three regions based on the quantization range:
- Below range: $x < input\_low$
- In range: $input\_low \le x \le input\_low + input\_range$
- Above range: $x > input\_low + input\_range$
Define the quantization step $s = \frac{input\_range}{levels - 1}$ and, for an input element $x$, the output $y = input\_low + s \cdot round\left(\frac{clamp(x) - input\_low}{s}\right)$, where $clamp(x)$ restricts $x$ to $[input\_low, input\_low + input\_range]$. Let $g$ denote the upstream gradient for that element.

The upstream gradient is passed through unchanged when $x$ is in range: the STE treats rounding as the identity, so $\frac{\partial y}{\partial x} = 1$ there. Below and above the range the clamp is flat, so the gradient with respect to $x$ is zero.

The gradient with respect to $input\_low$ is:

- $0$ for in-range elements, where $\frac{\partial y}{\partial input\_low} = 1 + s \cdot \left(-\frac{1}{s}\right) = 0$ under the STE;
- $g$ for below-range elements, where the clamped output is exactly $input\_low$;
- $g$ for above-range elements, where the clamped output is $input\_low + input\_range$ and $input\_range$ is held fixed.

For the gradient with respect to $input\_range$:

- For in-range elements, with $q = round\left(\frac{x - input\_low}{s}\right)$:
  $\frac{\partial y}{\partial input\_range} = \frac{1}{levels - 1}\left(q - \frac{x - input\_low}{s}\right)$,
  where $\frac{1}{levels - 1}$ is the derivative of $s$ with respect to $input\_range$. To re-express this in terms of the forward outputs, multiply numerator and denominator by $s$ and use $s(levels - 1) = input\_range$, giving $\frac{\partial y}{\partial input\_range} = \frac{y - x}{input\_range}$. This gradient nudges $input\_range$ in the direction that shrinks the per-element quantization error $y - x$.
- For below-range elements, the output does not depend on $input\_range$, so the gradient is $0$.
- For above-range elements, the clamped output is $input\_low + input\_range$, so the gradient is $g$.

Per-element gradients are summed (reduced over the broadcast dimensions) to match the shape of $input\_low$ and $input\_range$.

In-range term. Under the STE, shifting $input\_low$ moves both the grid origin and the rounding argument by compensating amounts, so only the change of the reconstructed value $y$ reaches the range parameters.

Below- and above-range terms. Outside the range, the clamped output is either $input\_low$ or $input\_low + input\_range$, so the upstream gradient flows directly into the corresponding range parameters.

Note: In symmetric quantization mode, $input\_low$ and $input\_range$ are both derived from the single learned scale parameter, so their gradients are combined into a gradient for scale via the chain rule.
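The forward/backward behavior described in this section can be sketched in NumPy (an illustrative element-wise sketch assuming the asymmetric `(input_low, input_range)` parameterization; these are not NNCF's actual kernels):

```python
import numpy as np

def fq_forward(x, input_low, input_range, levels):
    """Fake-quantize forward pass: clamp, round onto the grid, reconstruct."""
    s = (levels - 1) / input_range
    x_c = np.clip(x, input_low, input_low + input_range)
    return np.round((x_c - input_low) * s) / s + input_low

def fq_backward(g, x, input_low, input_range, levels):
    """Surrogate gradients: STE through round, piecewise through clamp."""
    y = fq_forward(x, input_low, input_range, levels)
    below = x < input_low
    above = x > input_low + input_range
    inside = ~(below | above)
    grad_x = g * inside                      # pass-through in range, zero outside
    grad_low = np.sum(g * (below | above))   # dy/d(input_low) = 1 outside, 0 inside
    grad_range = np.sum(g * np.where(        # (y - x)/range inside, 1 above, 0 below
        inside, (y - x) / input_range, above.astype(float)))
    return grad_x, grad_low, grad_range
```

The in-range `(y - x) / input_range` term is exactly the quantization-error gradient derived above, while the out-of-range branches route the upstream gradient straight into the range parameters.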
NOTE
There is a known issue with AVX2 and AVX512 CPU devices. It appears in 8-bit matrix calculations when tensor elements are close to the maximum quantized value or saturated. AVX2 and AVX512 use a 16-bit register to accumulate the results of operations on such tensors; when the tensors are saturated, this buffer overflows, which leads to accuracy degradation. For more details on the overflow issue please refer here.
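The overflow can be reproduced directly with NumPy by forcing 16-bit accumulation (a plain demonstration of the arithmetic, not of the actual AVX kernels):

```python
import numpy as np

# Saturated int8 values: each product 127 * 127 = 16129 still fits in int16,
# but summing just three of them (48387) exceeds the int16 maximum (32767)
# and wraps around, corrupting the result.
a = np.array([127, 127, 127], dtype=np.int8)
b = np.array([127, 127, 127], dtype=np.int8)
acc16 = np.sum(a.astype(np.int16) * b.astype(np.int16), dtype=np.int16)
acc32 = np.sum(a.astype(np.int32) * b.astype(np.int32), dtype=np.int32)
# acc16 has wrapped around, while acc32 holds the true dot product
```

Restricting weights to 7 effective bits (max magnitude 63) roughly halves each product and leaves headroom in the 16-bit accumulator, which is the motivation for the fix described next.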
To work around this issue, NNCF by default quantizes all weight tensors to 8 bits but effectively uses only 7 of them.
This regime is active when target_device=TargetDevice.CPU or target_device=TargetDevice.ANY is set. The fix potentially requires longer fine-tuning.
To control the application of the overflow fix, the nncf.AdvancedQuantizationParameters(overflow_fix=OverflowFix.ENABLE) config option is provided.