Commit 0a9a10b

[OV] Update optimization docs page with information about VLMs (#1007)
* Update optimization page
* Update cli usage
* Add more export options to export page
* Update
* Adopt for narrower text fields
* Update docs/source/openvino/export.mdx
* Update docs/source/openvino/export.mdx
  Co-authored-by: Alexander Kozlov <[email protected]>
* Update docs/source/openvino/optimization.mdx
  Co-authored-by: Alexander Kozlov <[email protected]>
* Update docs/source/openvino/export.mdx
  Co-authored-by: Alexander Kozlov <[email protected]>
* Update docs/source/openvino/export.mdx
  Co-authored-by: Alexander Kozlov <[email protected]>
* Update docs/source/openvino/export.mdx
  Co-authored-by: Alexander Kozlov <[email protected]>

---------

Co-authored-by: Alexander Kozlov <[email protected]>
1 parent 6a3b1ba commit 0a9a10b

File tree: 2 files changed (+125 -48 lines changed)

* docs/source/openvino/export.mdx
* docs/source/openvino/optimization.mdx

docs/source/openvino/export.mdx

Lines changed: 83 additions & 37 deletions
@@ -30,70 +30,108 @@ optimum-cli export openvino --model local_llama --task text-generation-with-past
 Check out the help for more options:
 
 ```bash
-optimum-cli export openvino --help
-
-usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code] [--weight-format {fp32,fp16,int8,int4}]
-      [--library {transformers,diffusers,timm,sentence_transformers}] [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
-      [--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--sensitivity-metric SENSITIVITY_METRIC] [--num-samples NUM_SAMPLES]
-      [--disable-stateful] [--disable-convert-tokenizer]
+usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code]
+      [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}]
+      [--library {transformers,diffusers,timm,sentence_transformers,open_clip}]
+      [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
+      [--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq]
+      [--scale-estimation] [--gptq] [--sensitivity-metric SENSITIVITY_METRIC]
+      [--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer]
       output
 
 optional arguments:
   -h, --help show this help message and exit
 
 Required arguments:
-  --model MODEL Model ID on huggingface.co or path on disk to load model from.
-
+  -m MODEL, --model MODEL
+      Model ID on huggingface.co or path on disk to load model from.
   output Path indicating the directory where to store the generated OV model.
 
 Optional arguments:
-  --task TASK The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['image-segmentation',
-      'feature-extraction', 'mask-generation', 'audio-classification', 'conversational', 'stable-diffusion-xl', 'question-answering', 'sentence-similarity', 'text2text-generation',
-      'masked-im', 'automatic-speech-recognition', 'fill-mask', 'image-to-text', 'text-generation', 'zero-shot-object-detection', 'multiple-choice', 'object-detection', 'stable-
-      diffusion', 'audio-xvector', 'text-to-audio', 'zero-shot-image-classification', 'token-classification', 'image-classification', 'depth-estimation', 'image-to-image', 'audio-
-      frame-classification', 'semantic-segmentation', 'text-classification']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.
-  --framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local checkpoints original framework or what is available in the environment.
-  --trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it
-      will execute on your local machine arbitrary code present in the model repository.
-  --weight-format {fp32,fp16,int8,int4}
+  --task TASK The task to export the model for. If not specified, the task will be auto-inferred based on
+      the model. Available tasks depend on the model, but are among: ['fill-mask', 'masked-im',
+      'audio-classification', 'automatic-speech-recognition', 'text-to-audio', 'image-text-to-text',
+      'depth-estimation', 'image-to-image', 'text-generation', 'text-to-image', 'mask-generation',
+      'audio-frame-classification', 'sentence-similarity', 'image-classification', 'multiple-
+      choice', 'text-classification', 'text2text-generation', 'token-classification', 'feature-
+      extraction', 'zero-shot-image-classification', 'zero-shot-object-detection', 'object-
+      detection', 'inpainting', 'question-answering', 'semantic-segmentation', 'image-segmentation',
+      'audio-xvector', 'image-to-text']. For decoder models, use `xxx-with-past` to export the model
+      using past key values in the decoder.
+  --framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local
+      checkpoint's original framework or what is available in the environment.
+  --trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should
+      only be set for repositories you trust and in which you have read the code, as it will execute
+      on your local machine arbitrary code present in the model repository.
+  --weight-format {fp32,fp16,int8,int4,mxfp4,nf4}
       The weight format of the exported model.
-  --library {transformers,diffusers,timm,sentence_transformers}
-      The library used to load the model before export. If not provided, will attempt to infer the local checkpoints library.
+  --library {transformers,diffusers,timm,sentence_transformers,open_clip}
+      The library used to load the model before export. If not provided, will attempt to infer the
+      local checkpoint's library
   --cache_dir CACHE_DIR
-      The path to a directory in which the downloaded model should be cached if the standard cache should not be used.
+      The path to a directory in which the downloaded model should be cached if the standard cache
+      should not be used.
   --pad-token-id PAD_TOKEN_ID
-      This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
-  --ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while
-      20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 1.0.
+      This is needed by some models, for some tasks. If not provided, will attempt to use the
+      tokenizer to guess it.
+  --ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit
+      quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20% will be
+      quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size
+      and inference latency. Default value is 1.0.
   --sym Whether to apply symmetric quantization
   --group-size GROUP_SIZE
-      The group size to use for int4 quantization. Recommended value is 128 and -1 will results in per-column quantization.
-  --dataset DATASET The dataset used for data-aware compression or quantization with NNCF. You can use the one from the list ['wikitext2','c4','c4-new'] for language models or
-      ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models.
-  --all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8.
-  --awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ,
-      please also provide a dataset argument. Note: it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
-  --scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale
-      estimation. Please note, that applying scale estimation takes additional memory and time.
+      The group size to use for quantization. Recommended value is 128 and -1 uses per-column
+      quantization.
+  --dataset DATASET The dataset used for data-aware compression or quantization with NNCF. You can use the one
+      from the list ['wikitext2','c4','c4-new'] for language models or
+      ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for
+      diffusion models.
+  --all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an
+      weight compression is applied, they are compressed to INT8.
+  --awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but
+      requires additional time for tuning weights on a calibration dataset. To run AWQ, please also
+      provide a dataset argument. Note: it is possible that there will be no matching patterns in the
+      model to apply AWQ, in such case it will be skipped.
+  --scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between
+      the original and compressed layers. Providing a dataset is required to run scale estimation.
+      Please note, that applying scale estimation takes additional memory and time.
+  --gptq Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise
+      fashion to minimize the difference between activations of a compressed and original layer.
+      Please note, that applying GPTQ takes additional memory and time.
   --sensitivity-metric SENSITIVITY_METRIC
-      The sensitivity metric for assigning quantization precision to layers. Can be one of the following: ['weight_quantization_error', 'hessian_input_activation',
+      The sensitivity metric for assigning quantization precision to layers. It can be one of the
+      following: ['weight_quantization_error', 'hessian_input_activation',
       'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
   --num-samples NUM_SAMPLES
       The maximum number of samples to take from the dataset for quantization.
-  --disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. In stateful models all kv-cache
-      inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference
-      performance. Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects kv-cache inputs
-      and outputs in the model.
+  --disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models
+      are produced by default when this key is not used. In stateful models all kv-cache inputs and
+      outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-
+      stateful option is used, it may result in sub-optimal inference performance. Use it when you
+      intentionally want to use a stateless model, for example, to be compatible with existing
+      OpenVINO native inference code that expects KV-cache inputs and outputs in the model.
   --disable-convert-tokenizer
       Do not add converted tokenizer and detokenizer OpenVINO models.
 ```
 
-You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`.
 
+Export with INT8 weights compression:
 ```bash
 optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/
 ```
 
+Export with INT4 weights compression:
+```bash
+optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 ov_model/
+```
+
+Export with INT4 weights compression and a data-aware AWQ and Scale Estimation algorithms:
+```bash
+optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B \
+    --weight-format int4 --awq --scale-estimation --dataset wikitext2 ov_model/
+```
+
 For more information on the quantization parameters checkout the [documentation](inference#weight-only-quantization)
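Once exported with one of the commands above, the resulting directory can be loaded back through the Python API. A minimal sketch, assuming the `ov_model/` output directory from the examples and that the tokenizer files were saved alongside the model (the exporter's default):

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# Load the OpenVINO model produced by `optimum-cli export openvino ... ov_model/`
model = OVModelForCausalLM.from_pretrained("ov_model")
tokenizer = AutoTokenizer.from_pretrained("ov_model")

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```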
@@ -130,6 +168,14 @@ To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI
 optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/
 ```
 
+You can also apply hybrid quantization during model export. For example:
+```bash
+optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 \
+    --weight-format int8 --dataset conceptual_captions ov_sdxl/
+```
+
+For more information about hybrid quantization, take a look at this jupyter [notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb).
+
 ## When loading your model
 
 You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model.
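A minimal sketch of that on-the-fly conversion, reusing the Stable Diffusion XL checkpoint from the CLI example above (the output directory name is illustrative):

```python
from optimum.intel import OVStableDiffusionXLPipeline

# export=True converts the PyTorch checkpoint to the OpenVINO format while loading
pipeline = OVStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", export=True
)
pipeline.save_pretrained("ov_sdxl")
```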

docs/source/openvino/optimization.mdx

Lines changed: 42 additions & 11 deletions
@@ -30,29 +30,37 @@ Quantization can be applied on the model's Linear, Convolutional and Embedding l
 
 #### 8-bit
 
-For the 8-bit weight quantization you can set `load_in_8bit=True` to load your model's weights in 8-bit:
+For the 8-bit weight quantization you can provide `quantization_config` equal to `OVWeightQuantizationConfig(bits=8)` to load your model's weights in 8-bit:
 
 ```python
-from optimum.intel import OVModelForCausalLM
+from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
 
 model_id = "helenai/gpt2-ov"
-model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
+quantization_config = OVWeightQuantizationConfig(bits=8)
+model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
 
 # Saves the int8 model that will be x4 smaller than its fp32 counterpart
 model.save_pretrained(saving_directory)
 ```
 
+Weights of language models inside vision-language pipelines can be quantized in a similar way:
+```python
+model = OVModelForVisualCausalLM.from_pretrained(
+    "llava-hf/llava-v1.6-mistral-7b-hf",
+    quantization_config=quantization_config
+)
+```
+
 <Tip warning={true}>
 
-If not specified, `load_in_8bit` will be set to `True` by default when models larger than 1 billion parameters are exported to the OpenVINO format (with `export=True`). You can disable it with `load_in_8bit=False`.
+If quantization_config is not provided, model will be exported in 8 bits by default when it has more than 1 billion parameters. You can disable it with `load_in_8bit=False`.
 
 </Tip>
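As a sketch of the behaviour described in that tip, the default 8-bit compression can be turned off explicitly when exporting a large model on the fly (the Llama checkpoint is reused from the export examples on this page):

```python
from optimum.intel import OVModelForCausalLM

# Export on the fly and keep full-precision weights by disabling the default 8-bit compression
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # any checkpoint above ~1B parameters would otherwise be compressed
    export=True,
    load_in_8bit=False,
)
```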
 
-You can also provide a `quantization_config` instead to specify additional optimization parameters.
 
 #### 4-bit
 
-For the 4-bit weight quantization, you need a `quantization_config` to define the optimization parameters, for example:
+4-bit weight quantization can be achieved in a similar way:
 
 ```python
 from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
@@ -61,10 +69,24 @@ quantization_config = OVWeightQuantizationConfig(bits=4)
 model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
 ```
 
+Or for vision-language pipelines:
+```python
+model = OVModelForVisualCausalLM.from_pretrained(
+    "llava-hf/llava-v1.6-mistral-7b-hf",
+    quantization_config=quantization_config
+)
+```
+
 You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
 
 ```python
-quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="wikitext2")
+quantization_config = OVWeightQuantizationConfig(
+    bits=4,
+    sym=False,
+    ratio=0.8,
+    quant_method="awq",
+    dataset="wikitext2"
+)
 ```
 
 By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) you can add `sym=True`.
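The vision-language snippets above assume `OVModelForVisualCausalLM` has already been imported; a self-contained sketch of the 4-bit case, reusing the model ID and configuration values from this section (the output directory name is illustrative):

```python
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8)
model = OVModelForVisualCausalLM.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,
)
# Save the compressed model so it can be reloaded without repeating the compression step
model.save_pretrained("llava_int4_ov")
```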
@@ -76,12 +98,21 @@ For 4-bit quantization you can also specify the following arguments in the quant
 Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency.
 
 Quality of 4-bit weight compressed model can further be improved by employing one of the following data-dependent methods:
-* AWQ which stands for Activation Aware Quantization is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Please note that it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
-* Scale Estimation is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
+* **AWQ** which stands for Activation Aware Quantization is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Please note that it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
+* **Scale Estimation** is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
+* **GPTQ** optimizes compressed weights in a layer-wise fashion to minimize the difference between activations of a compressed and original layer.
 
-AWQ and Scale Estimation algorithms can be applied together or separately. For that, provide corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example:
+Data-aware algorithms can be applied together or separately. For that, provide corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example:
 ```python
-quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, quant_method="awq", scale_estimation=True, dataset="wikitext2")
+quantization_config = OVWeightQuantizationConfig(
+    bits=4,
+    sym=False,
+    ratio=0.8,
+    quant_method="awq",
+    scale_estimation=True,
+    gptq=True,
+    dataset="wikitext2"
+)
 ```
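In practice such a configuration is passed to `from_pretrained` and the compressed model is saved afterwards; a minimal sketch, reusing the Llama checkpoint from the export examples (the output directory name is illustrative):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.8,
    quant_method="awq",
    scale_estimation=True,
    gptq=True,
    dataset="wikitext2",
)
# Weight compression (including the data-aware steps) runs while the model is loaded/exported
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=quantization_config
)
model.save_pretrained("llama3_int4_ov")
```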
 
 ### Static quantization
