Note that quantization is currently only supported for CPUs (only CPU backends are available), so we will not be utilizing GPUs / CUDA in this example.
To load a quantized model hosted locally or on the 🤗 hub, you can do as follows:
```python
from optimum.intel import INCModelForSequenceClassification

# model_id can be a local path or the Hub id of a quantized model,
# for example one of the INT8 models hosted under the Intel organization.
model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-static"
model = INCModelForSequenceClassification.from_pretrained(model_id)
```
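If helpful, here is a minimal usage sketch (not part of the original snippet; the task and input sentence are illustrative): the loaded model behaves like a regular `transformers` model, so it can be plugged straight into a `pipeline`.

```python
from transformers import AutoTokenizer, pipeline

# Reuse the same model_id for the tokenizer and run a quick sanity check.
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was surprisingly good!"))
```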
You can load many more quantized models hosted on the hub under the Intel organization [`here`](https://huggingface.co/Intel).
For more details on the supported compression techniques, please refer to the [documentation](https://huggingface.co/docs/optimum-intel/en/neural_compressor/optimization).
## OpenVINO
Below are examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai) framework to accelerate inference.
It is also possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI, for example:
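A typical invocation looks like the following (the model name and output directory are placeholders):

```bash
optimum-cli export openvino --model gpt2 ov_model
```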
You can also apply 8-bit weight-only quantization when exporting your model: the model's linear, embedding and convolution weights will be quantized to INT8, while the activations will be kept in floating-point precision.
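Assuming a recent `optimum-cli` version, the weight-only variant adds the `--weight-format` option (model name and output directory are again placeholders):

```bash
optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
```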
Quantization in hybrid mode can be applied to a Stable Diffusion pipeline during model export. This involves applying hybrid post-training quantization to the UNet model and weight-only quantization to the rest of the pipeline components. In hybrid mode, weights in MatMul and Embedding layers are quantized, as well as activations of other layers.
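A sketch of what this could look like from the CLI, assuming your `optimum-cli` version supports passing a calibration dataset with `--dataset` (the model and dataset names are illustrative):

```bash
optimum-cli export openvino --model stabilityai/stable-diffusion-2-1 --dataset conceptual_captions --weight-format int8 ov_model
```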
To apply quantization on both weights and activations, you can find more information in the [documentation](https://huggingface.co/docs/optimum-intel/en/openvino/optimization).
#### Inference:
To load a model and run inference with OpenVINO Runtime, you can just replace your `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.
```diff
- from transformers import AutoModelForSeq2SeqLM
+ from optimum.intel import OVModelForSeq2SeqLM
  from transformers import AutoTokenizer, pipeline

  model_id = "echarlaix/t5-small-openvino"  # an OpenVINO model hosted on the Hub
- model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+ model = OVModelForSeq2SeqLM.from_pretrained(model_id)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
  results = pipe("He never went out without a book under his arm, and he often came back with two.")

  [{'translation_text': "Il n'est jamais sorti sans un livre sous son bras, et il est souvent revenu avec deux."}]
```
If you want to load a PyTorch checkpoint, set `export=True` to convert your model to the OpenVINO IR.
```python
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("gpt2", export=True)
model.save_pretrained("./ov_model")
```
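As a quick sanity check (a sketch rather than part of the original example; the prompt is illustrative), the saved IR can be reloaded from disk and used with a regular `transformers` pipeline:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForCausalLM

# Load the exported OpenVINO model back from the local directory.
model = OVModelForCausalLM.from_pretrained("./ov_model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("OpenVINO makes inference", max_new_tokens=20)[0]["generated_text"])
```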
#### Quantization:
Post-training static quantization can also be applied; it introduces an additional calibration step where data is fed through the network in order to compute the activations' quantization parameters. Here is an example of how to apply static quantization on a Whisper model, using the [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) dataset for the calibration step.
```python
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig
```
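A minimal sketch of the remaining steps, assuming a recent optimum-intel release where `from_pretrained` accepts a `quantization_config` and `OVQuantizationConfig` exposes `dataset`, `processor` and `num_samples` options (the model id, sample count and output directory are illustrative):

```python
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

model_id = "openai/whisper-tiny"

# Assumed options: a LibriSpeech calibration preset, the processor used for
# feature extraction, and the number of calibration samples.
quantization_config = OVQuantizationConfig(
    dataset="librispeech",
    processor=model_id,
    num_samples=32,
)

model = OVModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("./whisper_int8")
```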