Changes from 2 commits (27 commits in total):
f0ab805  WIP (daniil-lyakhov, Nov 21, 2024)
acf1647  OpenVINOQuantizer (daniil-lyakhov, Jan 28, 2025)
5b1c99a  Apply suggestions from code review (daniil-lyakhov, Feb 7, 2025)
b2eaa82  Comments (daniil-lyakhov, Feb 7, 2025)
810899a  NNCF API docs (daniil-lyakhov, Feb 20, 2025)
82a47a5  Comments (daniil-lyakhov, Feb 24, 2025)
26f044b  fold_quantize=False (daniil-lyakhov, Feb 24, 2025)
75d3549  Update prototype_source/openvino_quantizer.rst (daniil-lyakhov, Apr 11, 2025)
e8e94d3  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 12, 2025)
f09a85f  Spelling / comments (daniil-lyakhov, Apr 14, 2025)
2c766e7  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 15, 2025)
b424f92  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 15, 2025)
f3137be  prototype_index.rst is updated (daniil-lyakhov, Apr 16, 2025)
b7d2781  Apply suggestions from code review (daniil-lyakhov, Apr 16, 2025)
bb3c2f8  Merge remote-tracking branch 'origin/main' into dl/fx/openvino_quantizer (daniil-lyakhov, Apr 22, 2025)
c093c76  Update prototype_source/openvino_quantizer.rst (daniil-lyakhov, Apr 22, 2025)
ccc02d6  Remove Docs Survey Banner (#3340) (sekyondaMeta, Apr 22, 2025)
090823f  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 22, 2025)
71695c7  Fix code snippet format issue in inductor_windows (#3339) (ZhaoqiongZ, Apr 22, 2025)
35c68ea  Add a note that foreach feature is a prototype (#3341) (svekars, Apr 22, 2025)
a5632da  Updating tutorials for 2.7. (#3338) (AlannaBurke, Apr 23, 2025)
0a422c2  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 23, 2025)
7fc877b  Adjust torch.compile() best practices (#3336) (punkeel, Apr 28, 2025)
bdeca26  fix index format (#3343) (ZhaoqiongZ, Apr 28, 2025)
1988e26  fix a typo in optimization_tutorial.py (#3333) (partev, Apr 28, 2025)
70d2154  fix a typo in zeroing_out_gradients.py (#3337) (partev, Apr 28, 2025)
7e97977  Merge branch 'main' into dl/fx/openvino_quantizer (svekars, Apr 28, 2025)
New file: prototype_source/openvino_quantizer.rst (+211 lines)
PyTorch 2 Export Quantization with OpenVINO backend
===========================================================================

**Author**: dlyakhov, asuslov, aamir, # TODO: add required authors

Introduction
--------------

This tutorial introduces the steps required to use the `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf/tree/develop>`_ to generate a quantized model customized
for the `OpenVINO torch.compile backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ and explains how to lower the quantized model into the `OpenVINO <https://docs.openvino.ai/2024/index.html>`_ representation.

The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
OpenVINO is the backend that compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.

The quantization flow mainly includes three steps (installing OpenVINO and NNCF, covered first in the walkthrough below, is a prerequisite rather than part of the flow itself):

- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the quantization flow based on the captured FX Graph.
- Step 3: Lower the quantized model into OpenVINO representation with the ``torch.compile`` API.

Reviewer: I think the quantization flow itself does not include the installation step. It is just a prerequisite.
Author: Agree, fixed.

The high-level architecture of this flow could look like this:

::

float_model(Python)                          Example Input
     \                                            /
      \                                          /
----------------------------------------------------------
|                         export                         |
----------------------------------------------------------
                            |
                    FX Graph in ATen
                            |
                            |          OpenVINOQuantizer
                            |         /
----------------------------------------------------------
|                      prepare_pt2e                      |
|                            |                           |
|                        Calibrate                       |
|                            |                           |
|                      convert_pt2e                      |
----------------------------------------------------------
                            |
                     Quantized Model
                            |
----------------------------------------------------------
|            Lower into OpenVINO representation          |
----------------------------------------------------------
                            |
                      OpenVINO model

Post Training Quantization
----------------------------

Now, we will walk you through a step-by-step tutorial on how to use it with the `torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_
for post-training quantization.

1. OpenVINO and NNCF installation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OpenVINO and NNCF can be installed via the `pip distribution <https://docs.openvino.ai/2024/get-started/install-openvino.html>`_:

.. code-block:: bash

pip install -U pip
pip install openvino nncf
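
To confirm the environment, a quick version check can be run. This is a minimal sketch on our part, assuming recent OpenVINO and NNCF releases that expose ``openvino.get_version()`` and ``nncf.__version__``:

.. code-block:: python

import openvino as ov
import nncf

# Print the installed versions to verify the installation
print("OpenVINO version:", ov.get_version())
print("NNCF version:", nncf.__version__)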


2. Capture FX Graph
^^^^^^^^^^^^^^^^^^^^^

We will start by performing the necessary imports and then capture the FX Graph from the eager module.

.. code-block:: python

import copy
import openvino.torch
import torch
import torchvision.models as models
from torch.ao.quantization.quantize_pt2e import convert_pt2e
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from nncf.experimental.torch.fx import OpenVINOQuantizer

import nncf
import nncf.torch

Reviewer (suggested change): keep only ``import nncf`` and drop the ``disable_patching`` import.
Author: Unfortunately, that does not work. We can do ``import nncf.torch`` and then call ``nncf.torch.disable_patching``.
Author: ``import nncf.torch`` is introduced, please check.

# Create the Eager Model
model_name = "resnet18"
model = models.__dict__[model_name](pretrained=True)

# Set the model to eval mode
model = model.eval()

# Create the data, using the dummy data here as an example
traced_bs = 50
x = torch.randn(traced_bs, 3, 224, 224)

Reviewer: why do we need the memory format to be ``channels_last``?
Author (daniil-lyakhov, Feb 7, 2025): This is a copy-paste from the original tutorial; removed, thanks!

example_inputs = (x,)

# Capture the FX Graph to be quantized
with torch.no_grad(), nncf.torch.disable_patching():

Reviewer: is ``disable_patching()`` needed both during export and inference with ``torch.compile``?
Author: Unfortunately, yes: without it the export fails with an error and the performance of the compiled model is ruined.

exported_model = torch.export.export(model, example_inputs).module()
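
As an optional sanity check (our addition, not part of the original flow), the captured ATen-level graph of the resulting ``torch.fx.GraphModule`` can be printed before quantization:

.. code-block:: python

# Optional: inspect the captured ATen-level operations
print(exported_model.graph)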



3. Apply Quantization
^^^^^^^^^^^^^^^^^^^^^^^

After we capture the FX module to be quantized, we will create an ``OpenVINOQuantizer`` instance.


.. code-block:: python

quantizer = OpenVINOQuantizer()

``OpenVINOQuantizer`` has several optional parameters that allow tuning the quantization process to get a more accurate model.
Below is a list of the essential parameters and their descriptions:


* ``preset`` - defines the quantization scheme for the model. Two types of presets are available:

  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations

  * ``MIXED`` - weights are quantized with symmetric quantization and activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

.. code-block:: python

OpenVINOQuantizer(preset=nncf.QuantizationPreset.MIXED)

* ``model_type`` - used to specify the quantization scheme required for a specific type of model. ``Transformer`` is the only supported special quantization scheme, designed to preserve accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). ``None`` is the default, i.e. no specific scheme is defined.

.. code-block:: python

OpenVINOQuantizer(model_type=nncf.ModelType.Transformer)

* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy, for example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:

.. code-block:: python

# Exclude by layer name:
names = ['layer_1', 'layer_2', 'layer_3']
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(names=names))

# Exclude by layer type:
types = ['Conv2d', 'Linear']
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(types=types))

# Exclude by regular expression:
regex = '.*layer_.*'
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(patterns=regex))

# Exclude by subgraphs:
# In this case, all nodes along all simple paths in the graph
# from input to output nodes will be excluded from the quantization process.
subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))

Reviewer: Where can I find more information about OpenVINOQuantizer parameters?
Author: That's a good question; we don't have a dedicated page about the OpenVINOQuantizer yet. We have a dedicated page for ``nncf.quantize`` and its parameters, but the subset of parameters is not equivalent.
Author: I've added a link to the NNCF API docs, which should be updated with this PR: openvinotoolkit/nncf#3277

* ``target_device`` - defines the target device whose specifics will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.

.. code-block:: python

OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)
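
These parameters can also be combined in a single constructor call. A hypothetical configuration mixing the options above (the layer name is a placeholder, and we assume the parameters are independent and freely combinable):

.. code-block:: python

# Hypothetical combined configuration: MIXED preset targeting CPU,
# with one placeholder layer excluded from quantization by name
quantizer = OpenVINOQuantizer(
    preset=nncf.QuantizationPreset.MIXED,
    target_device=nncf.TargetDevice.CPU,
    ignored_scope=nncf.IgnoredScope(names=["layer_3"]),
)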


After we configure the backend-specific quantizer, we will prepare the model for post-training quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.

.. code-block:: python

prepared_model = prepare_pt2e(exported_model, quantizer)
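
As an optional check (our assumption, not part of the original tutorial), the observers inserted by ``prepare_pt2e`` are attached as submodules whose class names typically contain ``Observer``, as is the case for the common ``torch.ao`` observers, so they can be counted:

.. code-block:: python

# Optional: count observer submodules inserted by prepare_pt2e
# (assumes observer class names contain "Observer")
num_observers = sum(
    1 for m in prepared_model.modules() if "Observer" in type(m).__name__
)
print(f"Inserted observers: {num_observers}")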

Now, we will calibrate the ``prepared_model`` after the observers are inserted in the model.

.. code-block:: python

# We use the dummy data as an example here
prepared_model(*example_inputs)
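
In a real workflow, calibration should run a representative dataset through the prepared model rather than a single dummy batch. A minimal sketch, assuming a hypothetical ``calibration_loader`` that yields ``(images, labels)`` batches:

.. code-block:: python

# Hypothetical calibration loop; `calibration_loader` is a placeholder
# for your own torch.utils.data.DataLoader over representative data
with torch.no_grad():
    for images, _ in calibration_loader:
        prepared_model(images)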

Finally, we will convert the calibrated model to a quantized model: ``convert_pt2e`` takes a calibrated model and produces a quantized model.

.. code-block:: python

quantized_model = convert_pt2e(prepared_model)

After these steps, the quantization flow is finished and we have obtained the quantized model.


4. Lower into OpenVINO representation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After that, the FX Graph can utilize OpenVINO optimizations via the `torch.compile(..., backend="openvino") <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ functionality.

.. code-block:: python

with torch.no_grad(), nncf.torch.disable_patching():
optimized_model = torch.compile(quantized_model, backend="openvino")

# Run an example inference
optimized_model(*example_inputs)



The optimized model uses low-level kernels designed specifically for Intel CPUs.
This should significantly speed up inference time in comparison with the eager model.
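
To see the effect, a rough wall-clock comparison between the eager and the optimized model can be run. This is a minimal sketch rather than a rigorous benchmark; the first call to the compiled model is used only as a warm-up, since it triggers compilation:

.. code-block:: python

import time

def measure(model, inputs, runs=20):
    # Average wall-clock latency over several runs
    start = time.perf_counter()
    for _ in range(runs):
        model(*inputs)
    return (time.perf_counter() - start) / runs

with torch.no_grad(), nncf.torch.disable_patching():
    optimized_model(*example_inputs)  # warm-up / compilation
    print(f"eager:     {measure(model, example_inputs):.4f} s/iter")
    print(f"optimized: {measure(optimized_model, example_inputs):.4f} s/iter")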

Conclusion
------------

In this tutorial, we introduced how to use ``torch.compile`` with the OpenVINO backend and the OpenVINO quantizer.

Reviewer: I would suggest adding something like: "For more information about NNCF and the NNCF quantization flow for PyTorch models, please visit ..."
Author: Done, please check.

For further information, please visit the `OpenVINO deployment via torch.compile documentation <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.