
Commit 6957a77

Add OpenVINO Optimization Support Matrix table (#1219)
* Add table
* Update table
* Try button to copy
* Try advanced copying
* Update
* Update
* Back to working copy
* Fix
* Revert
* Update
* Add first row
* Fill the rest of rows
* Replace model ids with real ones. Add mixed quantization description
* Add green check marks
* Update dash symbols
* Remove blue color. Fix table description.
* Tweak
* Add vertical alignment
* Update Python commands
1 parent beff91c commit 6957a77

1 file changed: +254 −7 lines changed

docs/source/openvino/optimization.mdx

Lines changed: 254 additions & 7 deletions
@@ -21,7 +21,225 @@ limitations under the License.

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types like 8-bit or 4-bit.

## Optimization Support Matrix

Click on a &#x2705; to copy the command/code for the corresponding optimization case.

<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
table {
  border-collapse: collapse;
  width: 100%;
  font-family: Arial, sans-serif;
}
th, td {
  border: 1px solid #cccccc;
  text-align: center;
  vertical-align: middle;
  padding: 8px;
}
th {
  background-color: #FFD56B;
}
</style>
</head>
<body>

<table>
  <thead>
    <tr>
      <th rowspan="3">Task<br>(OV Model Class)</th>
      <th colspan="4">Weight-only Quantization</th>
      <th colspan="2" rowspan="2">Hybrid Quantization</th>
      <th colspan="2" rowspan="2">Full Quantization</th>
      <th colspan="2" rowspan="2">Mixed Quantization</th>
    </tr>
    <tr>
      <th colspan="2">Data-free</th>
      <th colspan="2">Data-aware</th>
    </tr>
    <tr>
      <th>CLI</th>
      <th>Python</th>
      <th>CLI</th>
      <th>Python</th>
      <th>CLI</th>
      <th>Python</th>
      <th>CLI</th>
      <th>Python</th>
      <th>CLI</th>
      <th>Python</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>text-generation<br>(OVModelForCausalLM)</td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --dataset wikitext2 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td>&#8211;</td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVWeightQuantizationConfig(bits=4, quant_method=\'hybrid\', dataset=\'wikitext2\')).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode int8 --dataset wikitext2 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode nf4_f8e4m3 --dataset wikitext2 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'nf4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'wikitext2\'))).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
    </tr>
    <tr>
      <td>image-text-to-text<br>(OVModelForVisualCausalLM)</td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --weight-format int4 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForVisualCausalLM.from_pretrained(\'OpenGVLab/InternVL2-1B\', trust_remote_code=True, quantization_config=OVWeightQuantizationConfig(bits=4)).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --weight-format int4 --dataset contextual ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForVisualCausalLM.from_pretrained(\'OpenGVLab/InternVL2-1B\', trust_remote_code=True, quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'contextual\', trust_remote_code=True)).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
    </tr>
    <tr>
      <td>text-to-image<br>(OVStableDiffusionPipeline)</td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --weight-format int8 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVStableDiffusionPipeline.from_pretrained(\'dreamlike-art/dreamlike-anime-1.0\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --weight-format int8 --dataset conceptual_captions ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVStableDiffusionPipeline.from_pretrained(\'dreamlike-art/dreamlike-anime-1.0\', quantization_config=OVWeightQuantizationConfig(bits=8, quant_method=\'hybrid\', dataset=\'conceptual_captions\')).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --quant-mode int8 --dataset conceptual_captions ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVStableDiffusionPipeline.from_pretrained(\'dreamlike-art/dreamlike-anime-1.0\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'conceptual_captions\')).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td>&#8211;</td>
      <td>&#8211;</td>
    </tr>
    <tr>
      <td>automatic-speech-recognition<br>(OVModelForSpeechSeq2Seq)</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td>&#8211;</td>
      <td><button onclick="navigator.clipboard.writeText('optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 10 ./save_dir')">&#x2705;</button></td>
      <td><button onclick="navigator.clipboard.writeText('OVModelForSpeechSeq2Seq.from_pretrained(\'openai/whisper-large-v3-turbo\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'librispeech\', num_samples=10)).save_pretrained(\'save_dir\')')">&#x2705;</button></td>
      <td>&#8211;</td>
      <td>&#8211;</td>
    </tr>
  </tbody>
</table>

</body>
</html>
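
The Python one-liners in the table assume the relevant classes have already been imported from `optimum.intel`. As a minimal sketch, the full-quantization entry for automatic-speech-recognition expands to:

```python
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

# Calibrate on 10 librispeech samples, quantizing weights and activations to 8-bit.
model = OVModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3-turbo",
    quantization_config=OVQuantizationConfig(bits=8, dataset="librispeech", num_samples=10),
)
model.save_pretrained("save_dir")
```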

## Weight-only Quantization

Quantization can be applied on the model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be 4x smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach 8x, but is closer to 6x in practice.
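
As a minimal sketch (reusing the model id from the support matrix above), 8-bit weight-only quantization can be applied when loading the model, and the compressed copy saved for later use:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# 8-bit weight-only quantization: roughly 4x smaller than the fp32 model.
model = OVModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained("save_dir")
```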

@@ -118,12 +336,12 @@ quantization_config = OVWeightQuantizationConfig(

Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.

## Full Quantization

When applying post-training full quantization, both the weights and the activations are quantized.
To apply quantization on the activations, an additional calibration step is needed, which consists in feeding a `calibration_dataset` to the network in order to estimate the quantization parameters of the activations.

Here is how to apply full quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:

```python
from transformers import AutoTokenizer

@@ -137,7 +355,7 @@ save_dir = "ptq_model"

quantizer = OVQuantizer.from_pretrained(model)

# Apply full quantization and export the resulting quantized model to OpenVINO IR format
ov_config = OVConfig(quantization_config=OVQuantizationConfig())
quantizer.quantize(ov_config=ov_config, calibration_dataset=calibration_dataset, save_directory=save_dir)
# Save the tokenizer

@@ -152,7 +370,7 @@ from functools import partial

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Create the calibration dataset used to perform full quantization
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",

@@ -163,7 +381,7 @@ calibration_dataset = quantizer.get_calibration_dataset(

```

The `quantize()` method applies post-training quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
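
The saved model can be reloaded like any other OpenVINO model; a minimal sketch, assuming the quantized checkpoint above is a sequence-classification model saved (together with its tokenizer) to `ptq_model`:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

# Load the quantized OpenVINO IR (XML + BIN) produced by quantize().
model = OVModelForSequenceClassification.from_pretrained("ptq_model")
tokenizer = AutoTokenizer.from_pretrained("ptq_model")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was a pleasant surprise."))
```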

### Speech-to-text Models Quantization
@@ -209,3 +427,32 @@ model = OVStableDiffusionPipeline.from_pretrained(

For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md).

## Mixed Quantization

Mixed quantization is a technique that combines weight-only quantization with full quantization. During mixed quantization we separately quantize:
1. the weights of weighted layers to one precision, and
2. the activations (and possibly the weights, if some were skipped in the first step) of other supported layers to another precision.

By default, the weights of all weighted layers are quantized in the first step, and the activations of both weighted and non-weighted layers are quantized in the second step. If some layers are instructed to be ignored in the first step via the `weight_quantization_config.ignored_scope` parameter, both the weights and the activations of these layers are quantized to the precision given in `full_quantization_config`.

When running this kind of optimization through the Python API, `OVMixedQuantizationConfig` should be used. In this case the precision for the first step is provided with the `weight_quantization_config` argument and the precision for the second step with the `full_quantization_config` argument. For example:

```python
from optimum.intel import OVMixedQuantizationConfig, OVModelForCausalLM, OVQuantizationConfig, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
    quantization_config=OVMixedQuantizationConfig(
        weight_quantization_config=OVWeightQuantizationConfig(bits=4, dtype='nf4'),
        full_quantization_config=OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2'),
    ),
)
```
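
To exclude specific layers from the first step, an `ignored_scope` can be passed to the weight-quantization config; a sketch, under the assumption that `ignored_scope` accepts a pattern dictionary, with a purely illustrative regex:

```python
from optimum.intel import OVMixedQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

# Layers matched by ignored_scope are skipped during weight quantization; their
# weights and activations are then quantized per full_quantization_config.
quantization_config = OVMixedQuantizationConfig(
    weight_quantization_config=OVWeightQuantizationConfig(
        bits=4,
        dtype="nf4",
        ignored_scope={"patterns": ["^.*lm_head.*$"]},  # illustrative pattern
    ),
    full_quantization_config=OVQuantizationConfig(dtype="f8e4m3", dataset="wikitext2"),
)
```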

To apply mixed quantization through the CLI, the `--quant-mode` argument should be used. For example:

```bash
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode nf4_f8e4m3 --dataset wikitext2 ./save_dir
```

Don't forget to provide a dataset, since it is required for the calibration procedure during full quantization.
