This tutorial introduces the steps for utilizing the `Neural Network Compression Framework (nncf) <https://github.com/openvinotoolkit/nncf/tree/develop>`_ to generate a quantized model customized
for the `OpenVINO torch.compile backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ and explains how to lower the quantized model into the `OpenVINO <https://docs.openvino.ai/2024/index.html>`_ representation.
The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
OpenVINO is the new backend that compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.
The quantization flow mainly includes four steps:
- Step 1: Install OpenVINO and NNCF.
- Step 2: Capture the FX Graph from the eager model using the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 3: Apply the quantization flow to the captured FX Graph.
- Step 4: Lower the quantized model into the OpenVINO representation with the ``torch.compile`` API.
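
To make Steps 1 and 2 concrete, the sketch below captures a model with ``torch.export``. The torchvision ResNet-18, the input shape, and the variable names are illustrative assumptions only, and the exact export entry point recommended for quantization may differ between PyTorch versions.

.. code-block:: python

    # Step 1 (assumed pip-based setup): pip install openvino nncf
    import torch
    from torch.export import export
    import torchvision.models as models

    # Illustrative example model and input; replace with your own.
    model = models.resnet18(weights=None).eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Step 2: capture the model into an ATen-level FX graph.
    exported_program = export(model, example_inputs)
    fx_model = exported_program.module()
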
* ``preset`` - defines quantization scheme for the model. Two types of presets are available:

  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations
  * ``MIXED`` - weights are quantized with symmetric quantization and the activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

* ``model_type`` - used to specify a quantization scheme required for a specific type of model. ``Transformer`` is the only supported special quantization scheme, used to preserve accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). ``None`` is the default, i.e. no specific scheme is defined.
* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. For example, you may want to exclude the last layer of the model from quantization. One illustrative way to pass it is shown in the sketch after this list.
* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.
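
As an illustration of how these options fit together, the sketch below constructs a quantizer with some of the parameters described above. The quantizer class and import path (``OpenVINOQuantizer`` from ``nncf.experimental.torch.fx``) and the exact keyword arguments it accepts are assumptions here, so please verify them against the NNCF documentation for your installed version; the layer name passed to ``ignored_scope`` is purely hypothetical.

.. code-block:: python

    import nncf
    # Assumed import path for the OpenVINO-specific quantizer; check the NNCF
    # documentation for your version.
    from nncf.experimental.torch.fx import OpenVINOQuantizer

    quantizer = OpenVINOQuantizer(
        preset=nncf.QuantizationPreset.MIXED,           # symmetric weights, asymmetric activations
        model_type=None,                                # nncf.ModelType.TRANSFORMER for BERT-like models
        ignored_scope=nncf.IgnoredScope(names=["fc"]),  # "fc" is a hypothetical layer name to skip
        target_device=nncf.TargetDevice.CPU,
    )
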
After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.
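
A minimal sketch of this step is shown below, assuming the ``fx_model`` captured earlier and the ``quantizer`` constructed above; ``calibration_loader`` is a placeholder for your own calibration data.

.. code-block:: python

    import torch
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

    # Fold BatchNorm into Conv and insert observers according to the quantizer.
    prepared_model = prepare_pt2e(fx_model, quantizer)

    # Calibration: run a few representative batches so that the observers can
    # collect activation statistics. `calibration_loader` is a placeholder name.
    with torch.no_grad():
        for images in calibration_loader:
            prepared_model(images)
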
Finally, we will convert the calibrated model to a quantized model: ``convert_pt2e`` takes a calibrated model and produces a quantized model.
.. code-block:: python

    quantized_model = convert_pt2e(prepared_model)
After these steps, the quantization flow is complete and we obtain the quantized model.
After that, the FX Graph can utilize OpenVINO optimizations using `torch.compile() with the OpenVINO backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.
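
For example, a minimal sketch of lowering and running the quantized model, assuming the ``quantized_model`` and ``example_inputs`` from the previous steps and that the ``openvino`` package is installed so the ``"openvino"`` backend is registered:

.. code-block:: python

    import torch

    # Compile the quantized FX model with the OpenVINO backend.
    compiled_model = torch.compile(quantized_model, backend="openvino")

    # The first call triggers compilation; subsequent calls reuse the optimized model.
    with torch.no_grad():
        result = compiled_model(*example_inputs)
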
The optimized model uses low-level kernels designed specifically for Intel CPUs.
This should significantly speed up inference time in comparison with the eager model.
Conclusion
------------
With this tutorial, we introduced how to use ``torch.compile`` with the OpenVINO backend and the OpenVINO quantizer.
For further information, please visit the `OpenVINO deployment via torch.compile documentation <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.