Commit f6a7b83

Add new features from NNCF 2.14 (#1021)
* Add LoRA
* Add backup precision
* Fixes
* Add 'auto' dataset option
* Update docs
* Update help code block
* Update help code block
* Rename lora to lora_correction
* Tweak description
* Address comments
* Update minimal NNCF version in requirements
1 parent 16c27ca commit f6a7b83

8 files changed: +163 −49 lines changed

docs/source/openvino/export.mdx

Lines changed: 29 additions & 16 deletions
@@ -29,13 +29,14 @@ optimum-cli export openvino --model local_llama --task text-generation-with-past
 
 Check out the help for more options:
 
-```bash
+```text
 usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code]
                                    [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}]
                                    [--library {transformers,diffusers,timm,sentence_transformers,open_clip}]
                                    [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
-                                   [--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq]
-                                   [--scale-estimation] [--gptq] [--sensitivity-metric SENSITIVITY_METRIC]
+                                   [--group-size GROUP_SIZE] [--backup-precision {none,int8_sym,int8_asym}]
+                                   [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--gptq]
+                                   [--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC]
                                    [--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer]
                                    output
 
@@ -49,15 +50,15 @@ Required arguments:
 
 Optional arguments:
   --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on
-                        the model. Available tasks depend on the model, but are among: ['fill-mask', 'masked-im',
-                        'audio-classification', 'automatic-speech-recognition', 'text-to-audio', 'image-text-to-text',
-                        'depth-estimation', 'image-to-image', 'text-generation', 'text-to-image', 'mask-generation',
-                        'audio-frame-classification', 'sentence-similarity', 'image-classification', 'multiple-
-                        choice', 'text-classification', 'text2text-generation', 'token-classification', 'feature-
-                        extraction', 'zero-shot-image-classification', 'zero-shot-object-detection', 'object-
-                        detection', 'inpainting', 'question-answering', 'semantic-segmentation', 'image-segmentation',
-                        'audio-xvector', 'image-to-text']. For decoder models, use `xxx-with-past` to export the model
-                        using past key values in the decoder.
+                        the model. Available tasks depend on the model, but are among: ['image-to-image',
+                        'image-segmentation', 'inpainting', 'sentence-similarity', 'text-to-audio', 'image-to-text',
+                        'automatic-speech-recognition', 'token-classification', 'text-to-image', 'audio-classification',
+                        'feature-extraction', 'semantic-segmentation', 'masked-im', 'audio-xvector',
+                        'audio-frame-classification', 'text2text-generation', 'multiple-choice', 'depth-estimation',
+                        'image-classification', 'fill-mask', 'zero-shot-object-detection', 'object-detection',
+                        'question-answering', 'zero-shot-image-classification', 'mask-generation', 'text-generation',
+                        'text-classification']. For decoder models, use 'xxx-with-past' to export the model using past
+                        key values in the decoder.
   --framework {pt,tf}   The framework to use for the export. If not provided, will attempt to use the local
                        checkpoint's original framework or what is available in the environment.
   --trust-remote-code   Allows to use custom code for the modeling hosted in the model repository. This option should
@@ -82,10 +83,18 @@ Optional arguments:
   --group-size GROUP_SIZE
                        The group size to use for quantization. Recommended value is 128 and -1 uses per-column
                        quantization.
-  --dataset DATASET     The dataset used for data-aware compression or quantization with NNCF. You can use the one
-                        from the list ['wikitext2','c4','c4-new'] for language models or
-                        ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for
-                        diffusion models.
+  --backup-precision {none,int8_sym,int8_asym}
+                        Defines a backup precision for mixed-precision weight compression. Only valid for int4 weight
+                        format. If not provided, backup precision is int8_asym. 'none' stands for original floating-
+                        point precision of the model weights, in this case weights are retained in their original
+                        precision without any quantization. 'int8_sym' stands for 8-bit integer symmetric quantization
+                        without zero point. 'int8_asym' stands for 8-bit integer asymmetric quantization with zero
+                        points per each quantization group.
+  --dataset DATASET     The dataset used for data-aware compression or quantization with NNCF. For language models you
+                        can use the one from the list ['auto','wikitext2','c4','c4-new']. With 'auto' the dataset will
+                        be collected from model's generations. For diffusion models it should be one of
+                        ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit']. For
+                        visual language models the dataset must be set to 'contextual'.
   --all-layers          Whether embeddings and last MatMul layers should be compressed to INT4. If not provided and
                        weight compression is applied, they are compressed to INT8.
   --awq                 Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but
@@ -98,6 +107,10 @@ Optional arguments:
   --gptq                Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise
                        fashion to minimize the difference between activations of a compressed and original layer.
                        Please note that applying GPTQ takes additional memory and time.
+  --lora-correction     Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm introduces
+                        low-rank adaptation layers in the model that can recover accuracy after weight compression at
+                        some cost of inference latency. Please note that applying LoRA Correction algorithm takes
+                        additional memory and time.
   --sensitivity-metric SENSITIVITY_METRIC
                        The sensitivity metric for assigning quantization precision to layers. It can be one of the
                        following: ['weight_quantization_error', 'hessian_input_activation',
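
Taken together, the new options extend the export command's 4-bit compression path. A hypothetical invocation combining them (the output directory name is a placeholder; the model and task reuse the example at the top of this file):

```bash
optimum-cli export openvino \
  --model local_llama \
  --task text-generation-with-past \
  --weight-format int4 \
  --backup-precision int8_sym \
  --dataset auto \
  --lora-correction \
  llama_int4_ov
```

Here 'auto' satisfies the dataset requirement of the data-aware LoRA Correction algorithm by collecting calibration samples from the model's own generations.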

docs/source/openvino/optimization.mdx

Lines changed: 3 additions & 0 deletions
@@ -101,6 +101,7 @@ Quality of 4-bit weight compressed model can further be improved by employing on
 * **AWQ** which stands for Activation Aware Quantization is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Please note that it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
 * **Scale Estimation** is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
 * **GPTQ** optimizes compressed weights in a layer-wise fashion to minimize the difference between activations of a compressed and original layer.
+* **LoRA Correction** mitigates quantization noise introduced during weight compression by leveraging low-rank adaptation.
 
 Data-aware algorithms can be applied together or separately. For that, provide corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example:
 ```python
@@ -115,6 +116,8 @@ quantization_config = OVWeightQuantizationConfig(
 )
 ```
 
+Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.
+
 ### Static quantization
 
 When applying post-training static quantization, both the weights and the activations are quantized.
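
For context, a minimal sketch of the documented Python path, assuming `OVWeightQuantizationConfig` accepts the new `lora_correction` argument the same way the CLI wires it (the model ID and dataset choice below are illustrative):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# 4-bit weight compression with LoRA Correction. A dataset is required
# because LoRA Correction is a data-aware algorithm; per the note above,
# don't combine it with gptq=True.
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    dataset="wikitext2",
    lora_correction=True,
)

# "local_llama" is a placeholder model ID.
model = OVModelForCausalLM.from_pretrained(
    "local_llama",
    export=True,
    quantization_config=quantization_config,
)
```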

optimum/commands/export/openvino.py

Lines changed: 35 additions & 5 deletions
@@ -117,14 +117,30 @@ def parse_args_openvino(parser: "ArgumentParser"):
         default=None,
         help=("The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization."),
     )
+    optional_group.add_argument(
+        "--backup-precision",
+        type=str,
+        choices=["none", "int8_sym", "int8_asym"],
+        default=None,
+        help=(
+            "Defines a backup precision for mixed-precision weight compression. Only valid for int4 weight format. "
+            "If not provided, backup precision is int8_asym. 'none' stands for original floating-point precision of "
+            "the model weights, in this case weights are retained in their original precision without any "
+            "quantization. 'int8_sym' stands for 8-bit integer symmetric quantization without zero point. 'int8_asym' "
+            "stands for 8-bit integer asymmetric quantization with zero points per each quantization group."
+        ),
+    )
     optional_group.add_argument(
         "--dataset",
         type=str,
         default=None,
         help=(
             "The dataset used for data-aware compression or quantization with NNCF. "
-            "You can use the one from the list ['wikitext2','c4','c4-new'] for language models "
-            "or ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models."
+            "For language models you can use the one from the list ['auto','wikitext2','c4','c4-new']. With 'auto' the "
+            "dataset will be collected from model's generations. "
+            "For diffusion models it should be one of ['conceptual_captions',"
+            "'laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit']. "
+            "For visual language models the dataset must be set to 'contextual'."
         ),
     )
     optional_group.add_argument(
@@ -143,7 +159,7 @@ def parse_args_openvino(parser: "ArgumentParser"):
         help=(
             "Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires "
             "additional time for tuning weights on a calibration dataset. To run AWQ, please also provide a dataset "
-            "argument. Note: it's possible that there will be no matching patterns in the model to apply AWQ, in such "
+            "argument. Note: it is possible that there will be no matching patterns in the model to apply AWQ, in such "
             "case it will be skipped."
         ),
     )
@@ -167,12 +183,22 @@ def parse_args_openvino(parser: "ArgumentParser"):
             "applying GPTQ takes additional memory and time."
         ),
     )
+    optional_group.add_argument(
+        "--lora-correction",
+        action="store_true",
+        default=None,
+        help=(
+            "Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm introduces low-rank "
+            "adaptation layers in the model that can recover accuracy after weight compression at some cost of "
+            "inference latency. Please note that applying LoRA Correction algorithm takes additional memory and time."
+        ),
+    )
     optional_group.add_argument(
         "--sensitivity-metric",
         type=str,
         default=None,
         help=(
-            "The sensitivity metric for assigning quantization precision to layers. Can be one of the following: "
+            "The sensitivity metric for assigning quantization precision to layers. It can be one of the following: "
             "['weight_quantization_error', 'hessian_input_activation', 'mean_activation_variance', "
             "'max_activation_variance', 'mean_activation_magnitude']."
         ),
@@ -191,7 +217,7 @@ def parse_args_openvino(parser: "ArgumentParser"):
             "In stateful models all kv-cache inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. "
             "If --disable-stateful option is used, it may result in sub-optimal inference performance. "
             "Use it when you intentionally want to use a stateless model, for example, to be compatible with existing "
-            "OpenVINO native inference code that expects kv-cache inputs and outputs in the model."
+            "OpenVINO native inference code that expects KV-cache inputs and outputs in the model."
         ),
     )
     optional_group.add_argument(
@@ -215,7 +241,9 @@ def no_compression_parameter_provided(args):
                 args.awq,
                 args.scale_estimation,
                 args.gptq,
+                args.lora_correction,
                 args.sensitivity_metric,
+                args.backup_precision,
             )
         )
     )
@@ -287,7 +315,9 @@ def run(self):
                 "sensitivity_metric": self.args.sensitivity_metric,
                 "scale_estimation": self.args.scale_estimation,
                 "gptq": self.args.gptq,
+                "lora_correction": self.args.lora_correction,
                 "weight_format": self.args.weight_format,
+                "backup_precision": self.args.backup_precision,
             }
 
             if quantization_config.get("dataset", None) is not None:
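
One detail worth calling out: `--lora-correction` uses `action="store_true"` with `default=None` rather than the usual `False`, so `no_compression_parameter_provided` can tell an omitted flag apart from one the user actually passed. A standalone sketch of that pattern (the names mirror this file, but the snippet is illustrative, not the actual implementation):

```python
import argparse

parser = argparse.ArgumentParser()
# default=None (not False) keeps an omitted flag distinguishable
# from one the user explicitly passed on the command line.
parser.add_argument("--lora-correction", action="store_true", default=None)
parser.add_argument("--backup-precision", type=str, default=None)

def no_compression_parameter_provided(args: argparse.Namespace) -> bool:
    # True only when every compression-related value is still None.
    return all(v is None for v in (args.lora_correction, args.backup_precision))

args = parser.parse_args([])                     # nothing passed
assert no_compression_parameter_provided(args)   # -> True

args = parser.parse_args(["--lora-correction"])  # flag passed
assert args.lora_correction is True
assert not no_compression_parameter_provided(args)
```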
