Quantization can be applied to a model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
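As a rough illustration of the memory savings above, weights can be quantized to 8-bit directly when loading a model. The snippet below is a minimal sketch using the `optimum-intel` OpenVINO integration; the `gpt2` checkpoint and the output directory are only illustrative:

```python
from optimum.intel import OVModelForCausalLM

# Weights of Linear, Convolutional and Embedding layers are stored in 8 bit,
# so the memory footprint is roughly x4 smaller than the fp32 model.
# "gpt2" is only an illustrative checkpoint.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)
model.save_pretrained("gpt2-int8-ov")
```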
Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.
## Full quantization
When applying post-training full quantization, both the weights and the activations are quantized.
To apply quantization on the activations, an additional calibration step is needed, which consists of feeding a `calibration_dataset` to the network in order to estimate the activation quantization parameters.
Here is how to apply full quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The directory where the quantized model will be saved
save_dir = "ptq_model"

quantizer = OVQuantizer.from_pretrained(model)

# Apply full quantization and export the resulting quantized model to OpenVINO IR format
# `calibration_dataset` is the dataset you prepared for calibration (one way to build it is shown below)
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
```
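The `calibration_dataset` used above can be any dataset of model inputs you prepare yourself. As one possible way to build it, `OVQuantizer` exposes a `get_calibration_dataset()` helper; the snippet below is a sketch assuming the GLUE/SST-2 data the checkpoint above was fine-tuned on:

```python
from functools import partial

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Take a few hundred samples from the training split for calibration
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
```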
The `quantize()` method applies post-training quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
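Once exported, the quantized model can be loaded back with the corresponding `OVModelFor*` class and used like any other Transformers model; a minimal sketch for the sequence classification example above:

```python
from optimum.intel import OVModelForSequenceClassification
from transformers import pipeline

# Load the quantized OpenVINO IR model saved in `save_dir`
model = OVModelForSequenceClassification.from_pretrained(save_dir)
cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(cls_pipe("This movie was surprisingly good!"))
```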
### Speech-to-text Models Quantization
For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md).
## Mixed Quantization
Mixed quantization is a technique that combines weight-only quantization with full quantization. During mixed quantization we separately quantize:
1. weights of weighted layers to one precision, and
2. activations (and possibly weights, if some were skipped in the first step) of other supported layers to another precision.
By default, weights of all weighted layers are quantized in the first step. In the second step, activations of both weighted and non-weighted layers are quantized. If some layers are instructed to be ignored in the first step via the `weight_quantization_config.ignored_scope` parameter, both weights and activations of these layers are quantized to the precision given in the `full_quantization_config`.
When running this kind of optimization through the Python API, `OVMixedQuantizationConfig` should be used. In this case, the precision for the first step is provided with the `weight_quantization_config` argument and the precision for the second step with the `full_quantization_config` argument. For example:
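A minimal sketch of such a configuration is shown below. The 4-bit/8-bit split, the `wikitext2` calibration dataset and the `gpt2` checkpoint are only illustrative choices; the full-quantization step still needs calibration data, supplied here through the config's `dataset` argument:

```python
from optimum.intel import (
    OVMixedQuantizationConfig,
    OVModelForCausalLM,
    OVQuantizationConfig,
    OVWeightQuantizationConfig,
)

# Step 1: weights of weighted layers are compressed to 4 bit.
# Step 2: activations (and any weights skipped in step 1) are quantized to 8 bit,
# using calibration data ("wikitext2" is only an illustrative choice).
quantization_config = OVMixedQuantizationConfig(
    weight_quantization_config=OVWeightQuantizationConfig(bits=4),
    full_quantization_config=OVQuantizationConfig(dataset="wikitext2", num_samples=128),
)

# The checkpoint is illustrative; any supported causal LM can be used.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, quantization_config=quantization_config)
```

If `ignored_scope` is set on the `weight_quantization_config`, the layers it skips fall through to the second step and are quantized with the `full_quantization_config` precision, as described above.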