Quantization can be applied to a model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
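As a rough illustration of the memory savings above, weights can be quantized to 8-bit directly when loading a model. The snippet below is a minimal sketch using the `optimum-intel` OpenVINO integration; the `gpt2` checkpoint and the output directory are only illustrative:

```python
from optimum.intel import OVModelForCausalLM

# Weights of Linear, Convolutional and Embedding layers are stored in 8 bit,
# so the memory footprint is roughly x4 smaller than the fp32 model.
# "gpt2" is only an illustrative checkpoint.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)
model.save_pretrained("gpt2-int8-ov")
```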
Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.
## Full quantization
When applying post-training full quantization, both the weights and the activations are quantized.
To apply quantization on the activations, an additional calibration step is needed, which consists of feeding a `calibration_dataset` to the network in order to estimate the activation quantization parameters.
Here is how to apply full quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The directory where the quantized model will be saved
save_dir = "ptq_model"

quantizer = OVQuantizer.from_pretrained(model)

# Apply full quantization and export the resulting quantized model to OpenVINO IR format
# `calibration_dataset` is the dataset you prepared for calibration (one way to build it is shown below)
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
```
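The `calibration_dataset` used above can be any dataset of model inputs you prepare yourself. As one possible way to build it, `OVQuantizer` exposes a `get_calibration_dataset()` helper; the snippet below is a sketch assuming the GLUE/SST-2 data the checkpoint above was fine-tuned on:

```python
from functools import partial

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Take a few hundred samples from the training split for calibration
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
```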
The `quantize()` method applies post-training quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
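Once exported, the quantized model can be loaded back with the corresponding `OVModelFor*` class and used like any other Transformers model; a minimal sketch for the sequence classification example above:

```python
from optimum.intel import OVModelForSequenceClassification
from transformers import pipeline

# Load the quantized OpenVINO IR model saved in `save_dir`
model = OVModelForSequenceClassification.from_pretrained(save_dir)
cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(cls_pipe("This movie was surprisingly good!"))
```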
### Speech-to-text Models Quantization
For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md).
## Mixed Quantization
Mixed quantization is a technique that combines weight-only quantization with full quantization. During mixed quantization we separately quantize:
1. weights of weighted layers to one precision, and
2. activations (and possibly weights, if some were skipped in the first step) of other supported layers to another precision.
By default, weights of all weighted layers are quantized in the first step. In the second step, activations of both weighted and non-weighted layers are quantized. If some layers are instructed to be ignored in the first step via the `weight_quantization_config.ignored_scope` parameter, both weights and activations of these layers are quantized to the precision given in the `full_quantization_config`.
When running this kind of optimization through the Python API, `OVMixedQuantizationConfig` should be used. In this case, the precision for the first step is provided with the `weight_quantization_config` argument and the precision for the second step with the `full_quantization_config` argument. For example:
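A minimal sketch of such a configuration is shown below. The 4-bit/8-bit split, the `wikitext2` calibration dataset and the `gpt2` checkpoint are only illustrative choices; the full-quantization step still needs calibration data, supplied here through the config's `dataset` argument:

```python
from optimum.intel import (
    OVMixedQuantizationConfig,
    OVModelForCausalLM,
    OVQuantizationConfig,
    OVWeightQuantizationConfig,
)

# Step 1: weights of weighted layers are compressed to 4 bit.
# Step 2: activations (and any weights skipped in step 1) are quantized to 8 bit,
# using calibration data ("wikitext2" is only an illustrative choice).
quantization_config = OVMixedQuantizationConfig(
    weight_quantization_config=OVWeightQuantizationConfig(bits=4),
    full_quantization_config=OVQuantizationConfig(dataset="wikitext2", num_samples=128),
)

# The checkpoint is illustrative; any supported causal LM can be used.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, quantization_config=quantization_config)
```

If `ignored_scope` is set on the `weight_quantization_config`, the layers it skips fall through to the second step and are quantized with the `full_quantization_config` precision, as described above.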