For more information about hybrid quantization, take a look at this Jupyter [notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb).
## When loading your model
You can also load your PyTorch checkpoint and convert it to the OpenVINO format on the fly by setting `export=True` when loading your model.
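For instance, a minimal sketch (here `"gpt2"` stands in for any supported checkpoint):

```python
from optimum.intel import OVModelForCausalLM

# export=True converts the PyTorch checkpoint to the OpenVINO format on the fly
model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
```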
<Tip>

If `quantization_config` is not provided, models larger than 1 billion parameters are exported in 8 bits by default (`load_in_8bit=True`). You can disable it with `load_in_8bit=False`.

</Tip>
You can also provide a `quantization_config` instead to specify additional optimization parameters.
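For instance, 8-bit weight quantization can be requested explicitly (a sketch; `model_id` stands for any supported checkpoint):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
```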
#### 4-bit
4-bit weight quantization can be achieved in a similar way:
```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# model_id can be any supported PyTorch checkpoint, e.g. "gpt2"
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=OVWeightQuantizationConfig(bits=4))
```
By default, the quantization scheme is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization); to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization), add `sym=True` to the quantization config.
For 4-bit quantization you can also specify the `group_size` and `ratio` arguments in the quantization config. Smaller `group_size` and `ratio` values usually improve accuracy at the cost of model size and inference latency.
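A rough sketch of how these options combine (the values shown are illustrative, not recommendations):

```python
from optimum.intel import OVWeightQuantizationConfig

# Symmetric 4-bit scheme, groups of 128 weights, 80% of layers in int4 (the rest in int8)
quantization_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=0.8)
```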
The quality of a 4-bit weight-compressed model can be further improved by employing one of the following data-dependent methods:
* **AWQ**, which stands for Activation-aware Weight Quantization, is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Note that the model may contain no matching patterns to apply AWQ, in which case it will be skipped.
* **Scale Estimation** is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
* **GPTQ** optimizes compressed weights in a layer-wise fashion to minimize the difference between activations of a compressed and original layer.
These data-aware algorithms can be applied together or separately. For that, provide the corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example:
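The sketch below assumes a Hub checkpoint id in `model_id` and uses `wikitext2` as the calibration dataset; adjust both to your model:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    quant_method="awq",      # tune weights with AWQ
    scale_estimation=True,   # tune quantization scales with Scale Estimation
    dataset="wikitext2",     # calibration data required by the data-aware methods
)
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=quantization_config)
```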