PyTorch 2 Export Quantization for OpenVINO torch.compile Backend
===========================================================================

**Authors**: `Daniil Lyakhov <https://github.com/daniil-lyakhov>`_, `Alexander Suslov <https://github.com/alexsu52>`_, `Aamir Nazir <https://github.com/anzr299>`_

Prerequisites
--------------
- `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
- `How to Write a Quantizer for PyTorch 2 Export Quantization <https://pytorch.org/tutorials/prototype/pt2e_quantizer.html>`_


Introduction
--------------

This tutorial demonstrates how to use ``OpenVINOQuantizer`` from the `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf/tree/develop>`_ in the PyTorch 2 Export Quantization flow to generate a quantized model customized for the `OpenVINO torch.compile backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_, and explains how to lower the quantized model into the `OpenVINO <https://docs.openvino.ai/2024/index.html>`_ representation.



The PyTorch 2 Export Quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
The OpenVINO backend compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.

The quantization flow mainly includes four steps:

- Step 1: Install OpenVINO and NNCF.


- Step 2: Capture the FX Graph from the eager model using the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 3: Apply the PyTorch 2 Export Quantization flow with ``OpenVINOQuantizer`` to the captured FX Graph.
- Step 4: Lower the quantized model into the OpenVINO representation with `torch.compile <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.

The high-level architecture of this flow could look like this:

::

    float_model(Python)                          Example Input
          \                                            /
           \                                          /
    ---------------------------------------------------------
    |                        export                         |
    ---------------------------------------------------------
                                |
                        FX Graph in ATen
                                |            OpenVINOQuantizer
                                |                 /
    ---------------------------------------------------------
    |                     prepare_pt2e                      |
    |                           |                           |
    |                       Calibrate                       |
    |                           |                           |
    |                     convert_pt2e                      |
    ---------------------------------------------------------
                                |
                         Quantized Model
                                |
    ---------------------------------------------------------
    |         Lower into OpenVINO via torch.compile         |
    ---------------------------------------------------------
                                |
                         OpenVINO model

Post Training Quantization
----------------------------

Now, we will walk you through a step-by-step tutorial on how to use it for post-training quantization of the
`torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_.

1. OpenVINO and NNCF installation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OpenVINO and NNCF can be installed via the `pip distribution <https://docs.openvino.ai/2024/get-started/install-openvino.html>`_:

.. code-block:: bash

    pip install -U pip
    pip install openvino nncf
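
After installation, a quick sanity check can confirm that both packages are visible in the active Python environment. The sketch below relies only on the standard library ``importlib.metadata``; the ``installed_versions`` helper is a hypothetical name and is not part of OpenVINO or NNCF.

.. code-block:: python

    from importlib.metadata import PackageNotFoundError, version


    def installed_versions(packages=("openvino", "nncf", "torch")):
        """Map each distribution name to its installed version, or None if it is missing."""
        result = {}
        for name in packages:
            try:
                result[name] = version(name)
            except PackageNotFoundError:
                # The distribution is not installed in the current environment.
                result[name] = None
        return result


    if __name__ == "__main__":
        for name, ver in installed_versions().items():
            print(f"{name}: {ver if ver is not None else 'not installed'}")

A ``None`` entry usually means that ``pip`` installed the package into a different interpreter than the one running the tutorial.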


2. Capture FX Graph
^^^^^^^^^^^^^^^^^^^^^

We start by performing the necessary imports and capturing the FX Graph from the eager module.

.. code-block:: python

    import copy
    import openvino.torch
    import torch
    import torchvision.models as models
    from torch.ao.quantization.quantize_pt2e import convert_pt2e
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e

    import nncf.torch

    # Create the Eager Model
    model_name = "resnet18"
    model = models.__dict__[model_name](pretrained=True)

    # Set the model to eval mode
    model = model.eval()

    # Create the data, using the dummy data here as an example
    traced_bs = 50
    x = torch.randn(traced_bs, 3, 224, 224)
    example_inputs = (x,)

    # Capture the FX Graph to be quantized
    with torch.no_grad(), nncf.torch.disable_patching():
        exported_model = torch.export.export(model, example_inputs).module()



3. Apply Quantization
^^^^^^^^^^^^^^^^^^^^^^^

After capturing the FX module to be quantized, we import and instantiate ``OpenVINOQuantizer``.

.. code-block:: python

    from nncf.experimental.torch.fx import OpenVINOQuantizer

    quantizer = OpenVINOQuantizer()

``OpenVINOQuantizer`` has several optional parameters that allow tuning the quantization process to get a more accurate model.
Below are the essential parameters and their descriptions:


* ``preset`` - defines the quantization scheme for the model. Two types of presets are available:

  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations.

  * ``MIXED`` - weights are quantized with symmetric quantization and activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

  .. code-block:: python

      OpenVINOQuantizer(preset=nncf.QuantizationPreset.MIXED)

* ``model_type`` - specifies a quantization scheme required for a specific type of model. Transformer is the only supported special scheme; it preserves accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). The default is ``None``, i.e. no specific scheme is applied.

  .. code-block:: python

      OpenVINOQuantizer(model_type=nncf.ModelType.TRANSFORMER)

* ``ignored_scope`` - excludes some layers from the quantization process to preserve the model accuracy, for example, when you want to keep the last layer of the model in floating point. Below are some examples of how to use this parameter:

  .. code-block:: python

      # Exclude by layer name:
      names = ['layer_1', 'layer_2', 'layer_3']
      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(names=names))

      # Exclude by layer type:
      types = ['Conv2d', 'Linear']
      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(types=types))

      # Exclude by regular expression (``patterns`` expects a list of strings):
      regex = '.*layer_.*'
      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(patterns=[regex]))

      # Exclude by subgraphs:
      # In this case, all nodes along all simple paths in the graph
      # from input to output nodes will be excluded from the quantization process.
      subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
      OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))



* ``target_device`` - defines the target device whose specifics are taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.

  .. code-block:: python

      OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)
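
The subgraph rule in the ``ignored_scope`` examples excludes every node that lies on a simple path from the given inputs to the given outputs. In a directed acyclic graph, such as a traced model, a node lies on such a path exactly when it is reachable from some input and some output is reachable from it. The pure-Python sketch below illustrates that idea on a toy graph stored as an adjacency dictionary. The graph, the node names, and the ``nodes_between`` helper are hypothetical; this is not NNCF's internal implementation.

.. code-block:: python

    from collections import deque


    def _reachable(graph, starts):
        """Breadth-first search: return every node reachable from `starts` (inclusive)."""
        seen = set(starts)
        queue = deque(starts)
        while queue:
            node = queue.popleft()
            for succ in graph.get(node, ()):
                if succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
        return seen


    def nodes_between(graph, inputs, outputs):
        """Return the nodes on some path from any of `inputs` to any of `outputs` in a DAG."""
        # Reverse every edge so that we can walk backwards from the outputs.
        reverse = {}
        for src, dsts in graph.items():
            for dst in dsts:
                reverse.setdefault(dst, []).append(src)
        forward = _reachable(graph, inputs)
        backward = _reachable(reverse, outputs)
        # On an input-to-output path <=> reachable from an input AND able to reach an output.
        return forward & backward


    # Toy graph: two inputs merge into "relu"; "side_branch" never reaches "layer_3".
    toy_graph = {
        "layer_1": ["relu"],
        "layer_2": ["relu"],
        "relu": ["layer_3", "side_branch"],
        "layer_3": ["head"],
        "side_branch": [],
        "head": [],
    }
    print(sorted(nodes_between(toy_graph, inputs=["layer_1", "layer_2"], outputs=["layer_3"])))
    # prints ['layer_1', 'layer_2', 'layer_3', 'relu']

Nodes that branch off the region (``side_branch``) or lie after the outputs (``head``) are not part of the excluded subgraph.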

For further details on ``OpenVINOQuantizer``, please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.OpenVINOQuantizer>`_.
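
To make the difference between the ``PERFORMANCE`` and ``MIXED`` presets more concrete, the sketch below computes 8-bit quantization parameters for one activation range in both schemes. It uses textbook formulas: the symmetric scheme fixes the zero point at 0 and derives the scale from the largest absolute value, while the asymmetric scheme derives both scale and zero point from the min/max range. This is an illustration only. NNCF's exact choices of rounding, narrow range, and signedness may differ, and the helper names are hypothetical.

.. code-block:: python

    def symmetric_qparams(min_val, max_val, num_bits=8):
        """Symmetric signed quantization: zero point is 0, range is [-amax, amax]."""
        qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
        amax = max(abs(min_val), abs(max_val))
        scale = amax / qmax if amax > 0 else 1.0
        return scale, 0


    def asymmetric_qparams(min_val, max_val, num_bits=8):
        """Asymmetric unsigned quantization: [min_val, max_val] maps onto [0, 2**num_bits - 1]."""
        qmin, qmax = 0, 2 ** num_bits - 1  # 0..255 for 8 bits
        # Extend the range to include 0 so that zero is exactly representable.
        min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
        scale = (max_val - min_val) / (qmax - qmin) if max_val > min_val else 1.0
        zero_point = int(round(qmin - min_val / scale))
        return scale, max(qmin, min(qmax, zero_point))


    # A GELU-like activation range: mostly positive with a small negative tail.
    lo, hi = -0.17, 6.0
    sym_scale, _ = symmetric_qparams(lo, hi)
    asym_scale, asym_zp = asymmetric_qparams(lo, hi)
    print(f"symmetric  scale={sym_scale:.5f}")
    print(f"asymmetric scale={asym_scale:.5f} zero_point={asym_zp}")

For this range, the asymmetric scheme has roughly half the step size. The symmetric scheme reserves almost half of its codes for negative values that barely occur. This matches the intuition behind recommending ``MIXED`` for models with non-ReLU activations.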

With the backend-specific quantizer created, we prepare the model for post-training quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators and inserts observers at appropriate places in the model.

.. code-block:: python

    prepared_model = prepare_pt2e(exported_model, quantizer)
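
The BatchNorm folding mentioned above works because, in inference mode, BatchNorm is a per-channel affine transform that can be merged into the preceding convolution's weights and bias. The pure-Python sketch below applies the standard folding formula to a single output channel, treating the convolution as a dot product over a flattened input patch. It illustrates the arithmetic and is not the code that ``prepare_pt2e`` actually runs. The ``fold_bn_into_conv`` name and the toy numbers are hypothetical.

.. code-block:: python

    import math


    def fold_bn_into_conv(weights, bias, gamma, beta, mean, var, eps=1e-5):
        """Fold inference-mode BatchNorm into one conv output channel.

        BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta with y = w . x + b
        equals w' . x + b' where k = gamma / sqrt(var + eps),
        w' = w * k and b' = (b - mean) * k + beta.
        """
        k = gamma / math.sqrt(var + eps)
        folded_weights = [w * k for w in weights]
        folded_bias = (bias - mean) * k + beta
        return folded_weights, folded_bias


    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))


    # One output channel of a toy convolution seen as a dot product over a flattened patch.
    w, b = [0.5, -1.0, 2.0], 0.1
    gamma, beta, mean, var = 1.5, -0.2, 0.3, 0.8
    x = [1.0, 2.0, 3.0]

    unfused = gamma * (dot(w, x) + b - mean) / math.sqrt(var + 1e-5) + beta
    w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)
    fused = dot(w_f, x) + b_f
    print(unfused, fused)  # the two values agree up to floating-point rounding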

Next, we calibrate ``prepared_model``, which now contains the inserted observers, by running data through it.

.. code-block:: python

    # We use the dummy data as an example here
    prepared_model(*example_inputs)
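
During calibration, the inserted observers record statistics of the tensors that flow through the model. A common statistic is the running minimum and maximum. The sketch below mimics such an observer in pure Python over several calibration batches. The ``MinMaxObserverSketch`` class is hypothetical and does not reproduce the observers that ``OpenVINOQuantizer`` actually configures. It also shows why a single dummy batch, as in the snippet above, is only a placeholder: the recorded range depends on every batch that is seen.

.. code-block:: python

    class MinMaxObserverSketch:
        """Hypothetical observer that tracks the running min/max of every value it sees."""

        def __init__(self):
            self.min_val = float("inf")
            self.max_val = float("-inf")

        def __call__(self, batch):
            # Update the running range, then pass the data through unchanged.
            self.min_val = min(self.min_val, min(batch))
            self.max_val = max(self.max_val, max(batch))
            return batch


    observer = MinMaxObserverSketch()
    calibration_batches = [[0.1, 0.5, 2.0], [-0.3, 1.2], [0.0, 3.5, 0.7]]
    for batch in calibration_batches:
        observer(batch)

    print(observer.min_val, observer.max_val)  # -0.3 3.5

With real data, the equivalent step is to call ``prepared_model`` on each batch from a representative calibration dataset inside ``torch.no_grad()``.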

Finally, we convert the calibrated model to a quantized model. ``convert_pt2e`` takes a calibrated model and produces a quantized model.

.. code-block:: python

    quantized_model = convert_pt2e(prepared_model)

After these steps, the quantization flow is complete and ``quantized_model`` holds the quantized model.
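
In the converted model, the observers are replaced with quantize/dequantize operations that use the scale and zero point derived from the recorded ranges. The sketch below shows the arithmetic of one such pair for affine 8-bit quantization, using textbook formulas. It is illustrative only, and the function names are hypothetical. For the sample values inside the calibrated range, the reconstruction error stays within half a quantization step. Values outside the range are clipped.

.. code-block:: python

    def quantize(x, scale, zero_point, qmin=0, qmax=255):
        """Map a float to an integer code: clamp(round(x / scale) + zero_point)."""
        return max(qmin, min(qmax, round(x / scale) + zero_point))


    def dequantize(q, scale, zero_point):
        """Map an integer code back to a float: (q - zero_point) * scale."""
        return (q - zero_point) * scale


    # Quantization parameters for a calibrated range of [-0.3, 3.5] (asymmetric, 8-bit).
    min_val, max_val = -0.3, 3.5
    scale = (max_val - min_val) / 255
    zero_point = round(-min_val / scale)

    for x in [-0.3, 0.0, 1.234, 3.5, 10.0]:
        x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
        print(f"{x:>7.3f} -> {x_hat:.4f}")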


4. Lower into OpenVINO representation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After that, the quantized FX Graph can use OpenVINO optimizations through the `torch.compile(…, backend="openvino") <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ functionality.

.. code-block:: python

    with torch.no_grad(), nncf.torch.disable_patching():
        optimized_model = torch.compile(quantized_model, backend="openvino")

        # Running some benchmark
        optimized_model(*example_inputs)



The optimized model uses low-level kernels designed specifically for Intel CPUs.
This should significantly reduce inference time compared with the eager model.
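
The speed-up is best verified on your own hardware. The sketch below shows a minimal, framework-agnostic timing helper. ``measure_latency`` is a hypothetical name. You can point it at both the eager model and ``optimized_model`` (for example, ``lambda: optimized_model(*example_inputs)``) inside the same ``torch.no_grad()`` and ``nncf.torch.disable_patching()`` context used above. The warm-up calls are excluded from timing because the first calls to a ``torch.compile``-d model trigger compilation.

.. code-block:: python

    import statistics
    import time


    def measure_latency(fn, warmup=3, iters=20):
        """Call `fn` repeatedly and return the median wall-clock latency in milliseconds.

        Warm-up calls are not timed, so one-off costs such as graph compilation
        on the first invocation do not distort the result.
        """
        for _ in range(warmup):
            fn()
        timings = []
        for _ in range(iters):
            start = time.perf_counter()
            fn()
            timings.append((time.perf_counter() - start) * 1000.0)
        return statistics.median(timings)


    # Stand-in workload; in the tutorial, pass e.g. `lambda: optimized_model(*example_inputs)`.
    print(f"{measure_latency(lambda: sum(i * i for i in range(10_000))):.3f} ms")

The median is used instead of the mean so that occasional outliers, such as a background process stealing CPU time, have less influence on the reported number.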

5. Optional: Improve quantized model metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

NNCF implements advanced quantization algorithms like SmoothQuant and BiasCorrection, which help
improve the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
These advanced NNCF algorithms can be accessed via the NNCF ``quantize_pt2e`` API:

.. code-block:: python

    from nncf.experimental.torch.fx import quantize_pt2e

    calibration_loader = torch.utils.data.DataLoader(...)


    def transform_fn(data_item):
        images, _ = data_item
        return images


    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
    quantized_model = quantize_pt2e(
        exported_model, quantizer, calibration_dataset, smooth_quant=True, fast_bias_correction=False
    )


For further details, please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.quantize_pt2e>`_
and a complete `example on Resnet18 quantization <https://github.com/openvinotoolkit/nncf/blob/develop/examples/post_training_quantization/torch_fx/resnet18/README.md>`_.
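
SmoothQuant, which is enabled above with ``smooth_quant=True``, moves quantization difficulty from activations to weights. For each input channel ``j`` of a linear layer, it computes a factor ``s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)``. The activations are then divided by ``s`` and the matching weight rows multiplied by ``s``. The layer output stays mathematically unchanged while activation outliers shrink. The pure-Python sketch below demonstrates this equivalence for a tiny matrix product with ``alpha = 0.5``. It follows the formula from the SmoothQuant paper and is not NNCF's implementation, whose defaults and per-layer handling may differ. All names and numbers are hypothetical.

.. code-block:: python

    def matmul(a, b):
        """Plain-Python product of two row-major matrices given as lists of lists."""
        return [
            [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))
        ]


    def smooth_scales(x, w, alpha=0.5):
        """Per-input-channel factors s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
        scales = []
        for j in range(len(w)):
            act_max = max(abs(row[j]) for row in x)
            w_max = max(abs(v) for v in w[j])
            scales.append(act_max ** alpha / w_max ** (1 - alpha))
        return scales


    # X: 2 tokens x 3 input channels, with an outlier channel. W: 3 inputs x 2 outputs.
    x = [[0.5, 40.0, -1.0],
         [1.0, -35.0, 0.5]]
    w = [[0.2, -0.1],
         [0.05, 0.03],
         [-0.4, 0.3]]

    s = smooth_scales(x, w)
    x_smooth = [[v / s[j] for j, v in enumerate(row)] for row in x]  # X * diag(s)^-1
    w_smooth = [[v * s[j] for v in w[j]] for j in range(len(w))]     # diag(s) * W

    print(matmul(x, w))
    print(matmul(x_smooth, w_smooth))  # same product up to floating-point rounding
    print(max(abs(v) for row in x for v in row), max(abs(v) for row in x_smooth for v in row))

The last line shows the largest activation magnitude before and after smoothing. The outlier channel is scaled down, so an activation quantizer calibrated on the smoothed tensor wastes fewer codes on rare extreme values.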

Conclusion
------------

This tutorial showed how to use ``torch.compile`` with the OpenVINO backend together with ``OpenVINOQuantizer``.
For more details on NNCF and the NNCF Quantization Flow for PyTorch models, refer to the `NNCF Quantization Guide <https://docs.openvino.ai/2025/openvino-workflow/model-optimization-guide/quantizing-models-post-training/basic-quantization-flow.html>`_.
For additional information, check out the `OpenVINO Deployment via torch.compile Documentation <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.