Changes from 2 commits (27 commits in total):
f0ab805  WIP (daniil-lyakhov, Nov 21, 2024)
acf1647  OpenVINOQuantizer (daniil-lyakhov, Jan 28, 2025)
5b1c99a  Apply suggestions from code review (daniil-lyakhov, Feb 7, 2025)
b2eaa82  Comments (daniil-lyakhov, Feb 7, 2025)
810899a  NNCF API docs (daniil-lyakhov, Feb 20, 2025)
82a47a5  Comments (daniil-lyakhov, Feb 24, 2025)
26f044b  fold_quantize=False (daniil-lyakhov, Feb 24, 2025)
75d3549  Update prototype_source/openvino_quantizer.rst (daniil-lyakhov, Apr 11, 2025)
e8e94d3  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 12, 2025)
f09a85f  Spelling / comments (daniil-lyakhov, Apr 14, 2025)
2c766e7  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 15, 2025)
b424f92  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 15, 2025)
f3137be  prototype_index.rst is updated (daniil-lyakhov, Apr 16, 2025)
b7d2781  Apply suggestions from code review (daniil-lyakhov, Apr 16, 2025)
bb3c2f8  Merge remote-tracking branch 'origin/main' into dl/fx/openvino_quantizer (daniil-lyakhov, Apr 22, 2025)
c093c76  Update prototype_source/openvino_quantizer.rst (daniil-lyakhov, Apr 22, 2025)
ccc02d6  Remove Docs Survey Banner (#3340) (sekyondaMeta, Apr 22, 2025)
090823f  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 22, 2025)
71695c7  Fix code snippet format issue in inductor_windows (#3339) (ZhaoqiongZ, Apr 22, 2025)
35c68ea  Add a note that foreach feature is a prototype (#3341) (svekars, Apr 22, 2025)
a5632da  Updating tutorials for 2.7. (#3338) (AlannaBurke, Apr 23, 2025)
0a422c2  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 23, 2025)
7fc877b  Adjust torch.compile() best practices (#3336) (punkeel, Apr 28, 2025)
bdeca26  fix index format (#3343) (ZhaoqiongZ, Apr 28, 2025)
1988e26  fix a typo in optimization_tutorial.py (#3333) (partev, Apr 28, 2025)
70d2154  fix a typo in zeroing_out_gradients.py (#3337) (partev, Apr 28, 2025)
7e97977  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 28, 2025)
New file: prototype_source/openvino_quantizer.rst (+211 lines)
PyTorch 2 Export Quantization with OpenVINO backend
===========================================================================

**Author**: dlyakhov, asuslov, aamir, # TODO: add required authors

Introduction
--------------

This tutorial introduces the steps required to use the `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf/tree/develop>`_ to generate a quantized model customized
for the `OpenVINO torch.compile backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ and explains how to lower the quantized model into the `OpenVINO <https://docs.openvino.ai/2024/index.html>`_ representation.

The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
OpenVINO is the backend that compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.

The quantization flow mainly includes three steps (installing OpenVINO and NNCF, covered first in the walkthrough below, is a prerequisite rather than part of the flow itself):

- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the quantization flow based on the captured FX Graph.
- Step 3: Lower the quantized model into OpenVINO representation with the ``torch.compile`` API.

Reviewer: I think the quantization flow itself does not include the installation step. It is just a prerequisite.
Author: Agree, fixed.

The high-level architecture of this flow could look like this:

::

float_model(Python)                          Example Input
     \                                            /
      \                                          /
----------------------------------------------------------
|                         export                         |
----------------------------------------------------------
                            |
                    FX Graph in ATen
                            |
                            |          OpenVINOQuantizer
                            |         /
----------------------------------------------------------
|                      prepare_pt2e                      |
|                            |                           |
|                        Calibrate                       |
|                            |                           |
|                      convert_pt2e                      |
----------------------------------------------------------
                            |
                     Quantized Model
                            |
----------------------------------------------------------
|            Lower into OpenVINO representation          |
----------------------------------------------------------
                            |
                      OpenVINO model

Post Training Quantization
----------------------------

Now, we will walk you through a step-by-step tutorial on how to use it with the `torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_
for post-training quantization.

1. OpenVINO and NNCF installation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OpenVINO and NNCF can be installed via the `pip distribution <https://docs.openvino.ai/2024/get-started/install-openvino.html>`_:

.. code-block:: bash

pip install -U pip
pip install openvino nncf
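
To confirm the environment, a quick version check can be run. This is a minimal sketch on our part, assuming recent OpenVINO and NNCF releases that expose ``openvino.get_version()`` and ``nncf.__version__``:

.. code-block:: python

import openvino as ov
import nncf

# Print the installed versions to verify the installation
print("OpenVINO version:", ov.get_version())
print("NNCF version:", nncf.__version__)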


2. Capture FX Graph
^^^^^^^^^^^^^^^^^^^^^

We will start by performing the necessary imports and then capture the FX Graph from the eager module.

.. code-block:: python

import copy
import openvino.torch
import torch
import torchvision.models as models
from torch.ao.quantization.quantize_pt2e import convert_pt2e
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from nncf.experimental.torch.fx import OpenVINOQuantizer

import nncf
import nncf.torch

Reviewer (suggested change): keep only ``import nncf`` and drop the ``disable_patching`` import.
Author: Unfortunately, that does not work. We can do ``import nncf.torch`` and then call ``nncf.torch.disable_patching``.
Author: ``import nncf.torch`` is introduced, please check.

# Create the Eager Model
model_name = "resnet18"
model = models.__dict__[model_name](pretrained=True)

# Set the model to eval mode
model = model.eval()

# Create the data, using the dummy data here as an example
traced_bs = 50
x = torch.randn(traced_bs, 3, 224, 224)

Reviewer: why do we need the memory format to be ``channels_last``?
Author (daniil-lyakhov, Feb 7, 2025): This is a copy-paste from the original tutorial; removed, thanks!

example_inputs = (x,)

# Capture the FX Graph to be quantized
with torch.no_grad(), nncf.torch.disable_patching():

Reviewer: is ``disable_patching()`` needed both during export and inference with ``torch.compile``?
Author: Unfortunately, yes: without it the export fails with an error and the performance of the compiled model is ruined.

exported_model = torch.export.export(model, example_inputs).module()
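
As an optional sanity check (our addition, not part of the original flow), the captured ATen-level graph of the resulting ``torch.fx.GraphModule`` can be printed before quantization:

.. code-block:: python

# Optional: inspect the captured ATen-level operations
print(exported_model.graph)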



3. Apply Quantization
^^^^^^^^^^^^^^^^^^^^^^^

After we capture the FX module to be quantized, we will create an ``OpenVINOQuantizer`` instance.


.. code-block:: python

quantizer = OpenVINOQuantizer()

``OpenVINOQuantizer`` has several optional parameters that allow tuning the quantization process to get a more accurate model.
Below is a list of the essential parameters and their descriptions:


* ``preset`` - defines the quantization scheme for the model. Two types of presets are available:

  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations

  * ``MIXED`` - weights are quantized with symmetric quantization and activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

.. code-block:: python

OpenVINOQuantizer(preset=nncf.QuantizationPreset.MIXED)

* ``model_type`` - used to specify the quantization scheme required for a specific type of model. ``Transformer`` is the only supported special quantization scheme, designed to preserve accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). ``None`` is the default, i.e. no specific scheme is defined.

.. code-block:: python

OpenVINOQuantizer(model_type=nncf.ModelType.Transformer)

* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy, for example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:

.. code-block:: python

# Exclude by layer name:
names = ['layer_1', 'layer_2', 'layer_3']
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(names=names))

# Exclude by layer type:
types = ['Conv2d', 'Linear']
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(types=types))

# Exclude by regular expression:
regex = '.*layer_.*'
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(patterns=regex))

# Exclude by subgraphs:
# In this case, all nodes along all simple paths in the graph
# from input to output nodes will be excluded from the quantization process.
subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))

Reviewer: Where can I find more information about OpenVINOQuantizer parameters?
Author: That's a good question; we don't have a dedicated page about the OpenVINOQuantizer yet. We have a dedicated page for ``nncf.quantize`` and its parameters, but the subset of parameters is not equivalent.
Author: I've added a link to the NNCF API docs, which should be updated with this PR: openvinotoolkit/nncf#3277

* ``target_device`` - defines the target device whose specifics will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.

.. code-block:: python

OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)
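
These parameters can also be combined in a single constructor call. A hypothetical configuration mixing the options above (the layer name is a placeholder, and we assume the parameters are independent and freely combinable):

.. code-block:: python

# Hypothetical combined configuration: MIXED preset targeting CPU,
# with one placeholder layer excluded from quantization by name
quantizer = OpenVINOQuantizer(
    preset=nncf.QuantizationPreset.MIXED,
    target_device=nncf.TargetDevice.CPU,
    ignored_scope=nncf.IgnoredScope(names=["layer_3"]),
)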


After we configure the backend-specific quantizer, we will prepare the model for post-training quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.

.. code-block:: python

prepared_model = prepare_pt2e(exported_model, quantizer)
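
As an optional check (our assumption, not part of the original tutorial), the observers inserted by ``prepare_pt2e`` are attached as submodules whose class names typically contain ``Observer``, as is the case for the common ``torch.ao`` observers, so they can be counted:

.. code-block:: python

# Optional: count observer submodules inserted by prepare_pt2e
# (assumes observer class names contain "Observer")
num_observers = sum(
    1 for m in prepared_model.modules() if "Observer" in type(m).__name__
)
print(f"Inserted observers: {num_observers}")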

Now, we will calibrate the ``prepared_model`` after the observers are inserted in the model.

.. code-block:: python

# We use the dummy data as an example here
prepared_model(*example_inputs)
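
In a real workflow, calibration should run a representative dataset through the prepared model rather than a single dummy batch. A minimal sketch, assuming a hypothetical ``calibration_loader`` that yields ``(images, labels)`` batches:

.. code-block:: python

# Hypothetical calibration loop; `calibration_loader` is a placeholder
# for your own torch.utils.data.DataLoader over representative data
with torch.no_grad():
    for images, _ in calibration_loader:
        prepared_model(images)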

Finally, we will convert the calibrated model to a quantized model: ``convert_pt2e`` takes a calibrated model and produces a quantized model.

.. code-block:: python

quantized_model = convert_pt2e(prepared_model)

After these steps, the quantization flow is finished and we have obtained the quantized model.


4. Lower into OpenVINO representation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After that, the FX Graph can utilize OpenVINO optimizations via the `torch.compile(..., backend="openvino") <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ functionality.

.. code-block:: python

with torch.no_grad(), nncf.torch.disable_patching():
optimized_model = torch.compile(quantized_model, backend="openvino")

# Run an example inference
optimized_model(*example_inputs)



The optimized model uses low-level kernels designed specifically for Intel CPUs.
This should significantly speed up inference time in comparison with the eager model.
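
To see the effect, a rough wall-clock comparison between the eager and the optimized model can be run. This is a minimal sketch rather than a rigorous benchmark; the first call to the compiled model is used only as a warm-up, since it triggers compilation:

.. code-block:: python

import time

def measure(model, inputs, runs=20):
    # Average wall-clock latency over several runs
    start = time.perf_counter()
    for _ in range(runs):
        model(*inputs)
    return (time.perf_counter() - start) / runs

with torch.no_grad(), nncf.torch.disable_patching():
    optimized_model(*example_inputs)  # warm-up / compilation
    print(f"eager:     {measure(model, example_inputs):.4f} s/iter")
    print(f"optimized: {measure(optimized_model, example_inputs):.4f} s/iter")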

Conclusion
------------

In this tutorial, we introduced how to use ``torch.compile`` with the OpenVINO backend and the OpenVINO quantizer.

Reviewer: I would suggest adding something like: "For more information about NNCF and the NNCF quantization flow for PyTorch models, please visit ..."
Author: Done, please check.

For further information, please visit the `OpenVINO deployment via torch.compile documentation <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.