Commit acf1647

OpenVINOQuantizer
1 parent f0ab805 commit acf1647

1 file changed: 62 additions, 72 deletions
@@ -1,4 +1,4 @@
-PyTorch 2 Export Quantization with NNCF quantization and OpenVINO runtime
+PyTorch 2 Export Quantization with OpenVINO backend
 ===========================================================================

 **Author**: dlyakhov, asuslov, aamir, # TODO: add required authors
@@ -9,13 +9,13 @@ Introduction
 This tutorial introduces the steps for utilizing the `Neural Network Compression Framework (nncf) <https://github.com/openvinotoolkit/nncf/tree/develop>`_ to generate a quantized model customized
 for the `OpenVINO torch.compile backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ and explains how to lower the quantized model into the `OpenVINO <https://docs.openvino.ai/2024/index.html>`_ representation.

-The pytorch 2 export quantization flow uses the torch.export to capture the model into a graph and perform quantization transformations on top of the ATen graph.
+The PyTorch 2 export quantization flow uses torch.export to capture the model into a graph and performs quantization transformations on top of the ATen graph.
 This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
 OpenVINO is the new backend that compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.

-The quantization flow mainly includes three steps:
+The quantization flow mainly includes four steps:

-- Step 1: OpenVINO and NNCF installation.
+- Step 1: Install OpenVINO and NNCF.
 - Step 2: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
 - Step 3: Apply the Quantization flow based on the captured FX Graph.
 - Step 4: Lower the quantized model into OpenVINO representation with the API ``torch.compile``.
@@ -33,9 +33,14 @@ The high-level architecture of this flow could look like this:
     |
     FX Graph in ATen
     |
-    |
+    |             OpenVINOQuantizer
+    |            /
 —--------------------------------------------------------
-|                    nncf.quantize                      |
+|                     prepare_pt2e                      |
+|                          |                            |
+|                      Calibrate                        |
+|                          |                            |
+|                     convert_pt2e                      |
 —--------------------------------------------------------
     |
     Quantized Model
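Taken together, the new flow the diagram depicts can be condensed into a short end-to-end sketch. This is a minimal illustration, not part of the patch: it assumes the imports the patch introduces below and uses torchvision's resnet18 as a stand-in model; each step is detailed in the following sections.

.. code-block:: python

    import torch
    import torchvision.models as models
    import openvino.torch  # makes the "openvino" backend available to torch.compile
    from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
    from torch.ao.quantization.quantizer.openvino_quantizer import OpenVINOQuantizer
    from nncf.torch import disable_patching

    model = models.resnet18().eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    with torch.no_grad(), disable_patching():
        # Step 2: capture the FX Graph in ATen
        exported_model = torch.export.export(model, example_inputs).module()
        # Step 3: prepare -> calibrate -> convert
        prepared_model = prepare_pt2e(exported_model, OpenVINOQuantizer())
        prepared_model(*example_inputs)  # one calibration pass with dummy data
        quantized_model = convert_pt2e(prepared_model)
        # Step 4: lower into the OpenVINO representation
        optimized_model = torch.compile(quantized_model, backend="openvino")
        optimized_model(*example_inputs)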
@@ -69,10 +74,13 @@ We will start by performing the necessary imports, capturing the FX Graph from t
 
 .. code-block:: python

-    import torch
-    import torchvision.models as models
     import copy
     import openvino.torch
+    import torch
+    import torchvision.models as models
+    from torch.ao.quantization.quantize_pt2e import convert_pt2e
+    from torch.ao.quantization.quantize_pt2e import prepare_pt2e
+    from torch.ao.quantization.quantizer.openvino_quantizer import OpenVINOQuantizer

     import nncf
     from nncf.torch import disable_patching
@@ -90,109 +98,90 @@ We will start by performing the necessary imports, capturing the FX Graph from t
     example_inputs = (x,)

     # Capture the FX Graph to be quantized
-    with torch.no_grad():
-        with disable_patching():
-            exported_model = torch.export.export(model, example_inputs).module()
+    with torch.no_grad(), disable_patching():
+        exported_model = torch.export.export(model, example_inputs).module()


-Next, we will have the FX Module to be quantized.

 3. Apply Quantization
 ^^^^^^^^^^^^^^^^^^^^^^^

-Before the quantization, we need to create an instance of the nncf.Dataset class that represents the calibration dataset.
-The ``nncf.Dataset`` class can be a wrapper over the framework dataset object that is used for model training or validation
-The class constructor receives the dataset object and an optional transformation function.
+After capturing the FX Module to be quantized, we will import the OpenVINOQuantizer.

-The transformation function is a function that takes a sample from the dataset and returns data that can be passed to the model for inference.
-For example, this function can take a tuple of a data tensor and labels tensor and return the former while ignoring the latter.
-The transformation function is used to avoid modifying the dataset code to make it compatible with the quantization API.
-The function is applied to each sample from the dataset before passing it to the model for inference.
-The following code snippet shows how to create an instance of the ``nncf.Dataset`` class:

 .. code-block:: python

-    calibration_loader = torch.utils.data.DataLoader([example_inputs])
+    quantizer = OpenVINOQuantizer()

-    def transform_fn(data_item):
-        # In the transformation function,
-        # user can separate labels and input data
-        # from the given data item:
-        # images, _ = data_item
-        return data_item
+``OpenVINOQuantizer`` has several optional parameters that allow tuning the quantization process to get a more accurate model.
+Below is the list of essential parameters and their descriptions:

-    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)

-If there is no framework dataset object, you can create your own entity that implements the Iterable interface in Python,
-for example, the list of images, and returns data samples feasible for inference. In this case, a transformation function is not required.
+* ``preset`` - defines the quantization scheme for the model. Two types of presets are available:

-Once the dataset is ready and the model object is instantiated, you can apply 8-bit quantization to it.
+  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations

-.. code-block:: python
+  * ``MIXED`` - weights are quantized with symmetric quantization and the activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

-    with disable_patching():
-        quantized_model = nncf.quantize(exported_model, calibration_dataset)
+  .. code-block:: python

-``nncf.quantize()`` function has several optional parameters that allow tuning the quantization process to get a more accurate model.
-Below is the list of parameters and their description:
+      OpenVINOQuantizer(preset=nncf.QuantizationPreset.MIXED)

 * ``model_type`` - used to specify quantization scheme required for specific type of the model. Transformer is the only supported special quantization scheme to preserve accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). None is default, i.e. no specific scheme is defined.
-  .. code-block:: python

-      nncf.quantize(model, dataset, model_type=nncf.ModelType.Transformer)
+  .. code-block:: python

-* ``preset`` - defines quantization scheme for the model. Two types of presets are available:
+      OpenVINOQuantizer(model_type=nncf.ModelType.Transformer)

-  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations
+* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. For example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:

-  * ``MIXED`` - weights are quantized with symmetric quantization and the activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.
+  .. code-block:: python

-  .. code-block:: python
+      # Exclude by layer name:
+      names = ['layer_1', 'layer_2', 'layer_3']
+      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(names=names))

-      nncf.quantize(model, dataset, preset=nncf.QuantizationPreset.MIXED)
+      # Exclude by layer type:
+      types = ['Conv2d', 'Linear']
+      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(types=types))

-* ``fast_bias_correction`` - when set to False, enables a more accurate bias (error) correction algorithm that can be used to improve the accuracy of the model. True is used by default to minimize quantization time.
+      # Exclude by regular expression:
+      regex = '.*layer_.*'
+      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(patterns=regex))

-  .. code-block:: python
+      # Exclude by subgraphs:
+      # In this case, all nodes along all simple paths in the graph
+      # from input to output nodes will be excluded from the quantization process.
+      subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
+      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))

-      nncf.quantize(model, dataset, fast_bias_correction=False)

-* ``subset_size`` - defines the number of samples from the calibration dataset that will be used to estimate quantization parameters of activations. The default value is 300.
+* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.

-  .. code-block:: python
+  .. code-block:: python

-      nncf.quantize(model, dataset, subset_size=1000)
+      OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)

-* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. For example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:

-  .. code-block:: python
+After importing the backend-specific Quantizer, we will prepare the model for post-training quantization.
+``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators and inserts observers in appropriate places in the model.

-      #Exclude by layer name:
-      names = ['layer_1', 'layer_2', 'layer_3']
-      nncf.quantize(model, dataset, ignored_scope=nncf.IgnoredScope(names=names))
+.. code-block:: python

-      #Exclude by layer type:
-      types = ['Conv2d', 'Linear']
-      nncf.quantize(model, dataset, ignored_scope=nncf.IgnoredScope(types=types))
+    prepared_model = prepare_pt2e(exported_model, quantizer)

-      #Exclude by regular expression:
-      regex = '.*layer_.*'
-      nncf.quantize(model, dataset, ignored_scope=nncf.IgnoredScope(patterns=regex))
+Now, we will calibrate the ``prepared_model`` after the observers are inserted in the model.

-      #Exclude by subgraphs:
-      # In this case, all nodes along all simple paths in the graph
-      # from input to output nodes will be excluded from the quantization process.
-      subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
-      nncf.quantize(model, dataset, ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))
+.. code-block:: python

+    # We use the dummy data as an example here
+    prepared_model(*example_inputs)

-* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.
+Finally, we will convert the calibrated model to a quantized model. ``convert_pt2e`` takes a calibrated model and produces a quantized model.

 .. code-block:: python

-      nncf.quantize(model, dataset, target_device=nncf.TargetDevice.CPU)
-
-* ``advanced_parameters`` - used to specify advanced quantization parameters for fine-tuning the quantization algorithm. Defined by nncf.quantization.advanced_parameters NNCF submodule. None is default.
+    quantized_model = convert_pt2e(prepared_model)

 After these steps, we finished running the quantization flow, and we will get the quantized model.
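A note on the calibration step above: the single dummy-data forward pass stands in for a real calibration pass. A minimal sketch of calibrating over representative samples follows; the random inputs and the count of 300 (echoing NNCF's former ``subset_size`` default) are illustrative only.

.. code-block:: python

    # Illustrative stand-in for a real calibration set; real samples would
    # come from a representative validation dataset.
    calibration_samples = [torch.randn(1, 3, 224, 224) for _ in range(300)]

    with torch.no_grad():
        for sample in calibration_samples:
            # Each forward pass lets the observers inserted by prepare_pt2e
            # record the activation statistics that convert_pt2e later uses
            # to derive the quantization parameters.
            prepared_model(sample)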

@@ -204,18 +193,19 @@ After that the FX Graph can utilize OpenVINO optimizations using `torch.compile(
 
 .. code-block:: python

-    with torch.no_grad():
+    with torch.no_grad(), disable_patching():
         optimized_model = torch.compile(quantized_model, backend="openvino")

         # Running some benchmark
         optimized_model(*example_inputs)

+
 The optimized model is using low-level kernels designed specifically for Intel CPU.
 This should significantly speed up inference time in comparison with the eager model.

 Conclusion
 ------------

-With this tutorial, we introduce how to use torch.compile with the OpenVINO backend with models quantized via ``nncf.quantize``.
-For further information, please visit `complete example on renset18 model <https://github.com/openvinotoolkit/nncf/tree/v2.14.0/examples/post_training_quantization/torch_fx/resnet18>`_.
+With this tutorial, we introduced how to use torch.compile with the OpenVINO backend and the OpenVINO quantizer.
+For further information, please visit the `OpenVINO deployment via torch.compile documentation <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.
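As a supplement to the ``# Running some benchmark`` comment in the lowering step, a simple wall-clock comparison can check the claimed speed-up on a given machine. This is a sketch only: the warm-up and iteration counts are arbitrary, and the names follow the tutorial's code.

.. code-block:: python

    import time

    def latency_ms(fn, inputs, warmup=10, iters=100):
        # Warm-up iterations trigger torch.compile's one-time compilation,
        # keeping it out of the timed loop.
        with torch.no_grad():
            for _ in range(warmup):
                fn(*inputs)
            start = time.perf_counter()
            for _ in range(iters):
                fn(*inputs)
        return (time.perf_counter() - start) / iters * 1000

    print(f"eager:     {latency_ms(model, example_inputs):.2f} ms")
    print(f"optimized: {latency_ms(optimized_model, example_inputs):.2f} ms")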
