diff --git a/recipes_source/intel_neural_compressor_for_pytorch.rst b/recipes_source/intel_neural_compressor_for_pytorch.rst
index 02ce3d7b378..ee569382343 100755
--- a/recipes_source/intel_neural_compressor_for_pytorch.rst
+++ b/recipes_source/intel_neural_compressor_for_pytorch.rst
@@ -4,37 +4,20 @@ Ease-of-use quantization for PyTorch with Intel® Neural Compressor
 Overview
 --------
-Most deep learning applications are using 32-bits of floating-point precision
-for inference. But low precision data types, especially int8, are getting more
-focus due to significant performance boost. One of the essential concerns on
-adopting low precision is how to easily mitigate the possible accuracy loss
-and reach predefined accuracy requirement.
+Most deep learning applications use 32-bit floating-point precision for inference. However, low-precision data types such as FP8 are attracting more attention because of the significant performance boost they offer. A key concern in adopting low precision is mitigating accuracy loss while still meeting predefined accuracy requirements.
 
-Intel® Neural Compressor aims to address the aforementioned concern by extending
-PyTorch with accuracy-driven automatic tuning strategies to help user quickly find
-out the best quantized model on Intel hardware, including Intel Deep Learning
-Boost (`Intel DL Boost `_)
-and Intel Advanced Matrix Extensions (`Intel AMX `_).
+Intel® Neural Compressor addresses this concern by extending PyTorch with accuracy-driven automatic tuning strategies that help users quickly find the best quantized model on Intel hardware.
 
-Intel® Neural Compressor has been released as an open-source project
-at `Github `_.
+Intel® Neural Compressor is an open-source project available on `GitHub `_.
 
 Features
 --------
-- **Ease-of-use Python API:** Intel® Neural Compressor provides simple frontend
-  Python APIs and utilities for users to do neural network compression with few
-  line code changes.
-  Typically, only 5 to 6 clauses are required to be added to the original code.
+- **Ease-of-use API:** Intel® Neural Compressor reuses the PyTorch ``prepare`` and ``convert`` APIs, so users can adopt it with minimal code changes.
 
-- **Quantization:** Intel® Neural Compressor supports accuracy-driven automatic
-  tuning process on post-training static quantization, post-training dynamic
-  quantization, and quantization-aware training on PyTorch fx graph mode and
-  eager model.
+- **Accuracy-driven Tuning:** Intel® Neural Compressor supports an accuracy-driven automatic tuning process through its ``autotune`` API.
 
-*This tutorial mainly focuses on the quantization part. As for how to use Intel®
-Neural Compressor to do pruning and distillation, please refer to corresponding
-documents in the Intel® Neural Compressor github repo.*
+- **Kinds of Quantization:** Intel® Neural Compressor supports a variety of quantization methods, including classic INT8 quantization, weight-only quantization, and the popular FP8 quantization. It also provides emulation of the latest research data types, such as MX data type emulation quantization. For more details, please refer to the `Supported Matrix `_.
 
 Getting Started
 ---------------
@@ -45,329 +28,134 @@ Installation
 
 .. code:: bash
 
     # install stable version from pip
-    pip install neural-compressor
+    pip install neural-compressor-pt
+..
 
-    # install nightly version from pip
-    pip install -i https://test.pypi.org/simple/ neural-compressor
+**Note**: Neural Compressor provides automatic accelerator detection, covering HPU, Intel GPU, CUDA, and CPU. To specify the target device, setting the ``INC_TARGET_DEVICE`` environment variable is suggested, e.g., ``export INC_TARGET_DEVICE=cpu``.
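+
+For example, a shell session might pin Neural Compressor to the CPU before launching a quantization script (``quantize_model.py`` is a hypothetical script name, shown only for illustration):
+
+.. code-block:: bash
+
+    # Force CPU execution even when an accelerator is present
+    export INC_TARGET_DEVICE=cpu
+    python quantize_model.py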
 
-    # install stable version from from conda
-    conda install neural-compressor -c conda-forge -c intel
-
-*Supported python versions are 3.6 or 3.7 or 3.8 or 3.9*
+
+Examples
+~~~~~~~~
+
+This section shows examples of several kinds of quantization with Intel® Neural Compressor.
 
-Usages
-~~~~~~
+FP8 Quantization
+^^^^^^^^^^^^^^^^
 
-Minor code changes are required for users to get started with Intel® Neural Compressor
-quantization API. Both PyTorch fx graph mode and eager mode are supported.
+**FP8 Quantization** is supported by the Intel® Gaudi® 2 and Gaudi® 3 AI Accelerators (HPU). To prepare the environment, please refer to the `Intel® Gaudi® Documentation `_.
 
-Intel® Neural Compressor takes a FP32 model and a yaml configuration file as inputs.
-To construct the quantization process, users can either specify the below settings via
-the yaml configuration file or python APIs:
+Run the example:
 
-1. Calibration Dataloader (Needed for static quantization)
-2. Evaluation Dataloader
-3. Evaluation Metric
+.. code-block:: python
 
-Intel® Neural Compressor supports some popular dataloaders and evaluation metrics. For
-how to configure them in yaml configuration file, user could refer to `Built-in Datasets
-`_.
+    # FP8 Quantization Example
+    from neural_compressor.torch.quantization import (
+        FP8Config,
+        prepare,
+        convert,
+    )
 
-If users want to use a self-developed dataloader or evaluation metric, Intel® Neural
-Compressor supports this by the registration of customized dataloader/metric using python code.
+    import torch
+    import torchvision.models as models
 
-For the yaml configuration file format please refer to `yaml template
-`_.
+    # Load a pre-trained ResNet18 model
+    model = models.resnet18()
 
-The code changes that are required for *Intel® Neural Compressor* are highlighted with
-comments in the line above.
+    # Configure FP8 quantization
+    qconfig = FP8Config(fp8_config="E4M3")
+    model = prepare(model, qconfig)
 
-Model
-^^^^^
+    # Perform calibration (replace with actual calibration data)
+    calibration_data = torch.randn(1, 3, 224, 224).to("hpu")
+    model(calibration_data)
 
-In this tutorial, the LeNet model is used to demonstrate how to deal with *Intel® Neural Compressor*.
+    # Convert the model to FP8
+    model = convert(model)
 
-.. code-block:: python3
+    # Perform inference
+    input_data = torch.randn(1, 3, 224, 224).to("hpu")
+    output = model(input_data).to("cpu")
+    print(output)
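+
+In practice, calibration should run over a small set of representative samples rather than random tensors. A minimal sketch, assuming a ``torch.utils.data.DataLoader`` named ``calib_loader`` (hypothetical) built from your own dataset, used in place of the random-tensor step between ``prepare`` and ``convert``:
+
+.. code-block:: python
+
+    # Feed a handful of representative batches through the prepared model
+    # so the observers can record tensor statistics.
+    for i, (images, _) in enumerate(calib_loader):
+        model(images.to("hpu"))
+        if i == 10:
+            break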
-
-    # main.py
-    import torch
-    import torch.nn as nn
-    import torch.nn.functional as F
-
-    # LeNet Model definition
-    class Net(nn.Module):
-        def __init__(self):
-            super(Net, self).__init__()
-            self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
-            self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
-            self.conv2_drop = nn.Dropout2d()
-            self.fc1 = nn.Linear(320, 50)
-            self.fc1_drop = nn.Dropout()
-            self.fc2 = nn.Linear(50, 10)
-
-        def forward(self, x):
-            x = F.relu(F.max_pool2d(self.conv1(x), 2))
-            x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
-            x = x.reshape(-1, 320)
-            x = F.relu(self.fc1(x))
-            x = self.fc1_drop(x)
-            x = self.fc2(x)
-            return F.log_softmax(x, dim=1)
-
-    model = Net()
-    model.load_state_dict(torch.load('./lenet_mnist_model.pth', weights_only=True))
-
-The pretrained model weight `lenet_mnist_model.pth` comes from
-`here `_.
-
-Accuracy driven quantization
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Intel® Neural Compressor supports accuracy-driven automatic tuning to generate the optimal
-int8 model which meets a predefined accuracy goal.
-
-Below is an example of how to quantize a simple network on PyTorch
-`FX graph mode `_ by auto-tuning.
-
-.. code-block:: yaml
-
-    # conf.yaml
-    model:
-        name: LeNet
-        framework: pytorch_fx
-
-    evaluation:
-        accuracy:
-            metric:
-                topk: 1
-
-    tuning:
-      accuracy_criterion:
-        relative: 0.01
-
-.. code-block:: python3
-
-    # main.py
-    model.eval()
-
-    from torchvision import datasets, transforms
-    test_loader = torch.utils.data.DataLoader(
-        datasets.MNIST('./data', train=False, download=True,
-                       transform=transforms.Compose([
-                           transforms.ToTensor(),
-                       ])),
-        batch_size=1)
-
-    # launch code for Intel® Neural Compressor
-    from neural_compressor.experimental import Quantization
-    quantizer = Quantization("./conf.yaml")
-    quantizer.model = model
-    quantizer.calib_dataloader = test_loader
-    quantizer.eval_dataloader = test_loader
-    q_model = quantizer()
-    q_model.save('./output')
-
-In the `conf.yaml` file, the built-in metric `top1` of Intel® Neural Compressor is specified as
-the evaluation method, and `1%` relative accuracy loss is set as the accuracy target for auto-tuning.
-Intel® Neural Compressor will traverse all possible quantization config combinations on per-op level
-to find out the optimal int8 model that reaches the predefined accuracy target.
-
-Besides those built-in metrics, Intel® Neural Compressor also supports customized metric through
-python code:
-
-.. code-block:: yaml
-
-    # conf.yaml
-    model:
-        name: LeNet
-        framework: pytorch_fx
-
-    tuning:
-      accuracy_criterion:
-        relative: 0.01
-
-.. code-block:: python3
-
-    # main.py
-    model.eval()
-
-    from torchvision import datasets, transforms
-    test_loader = torch.utils.data.DataLoader(
-        datasets.MNIST('./data', train=False, download=True,
-                       transform=transforms.Compose([
-                           transforms.ToTensor(),
-                       ])),
-        batch_size=1)
-
-    # define a customized metric
-    class Top1Metric(object):
-        def __init__(self):
-            self.correct = 0
-        def update(self, output, label):
-            pred = output.argmax(dim=1, keepdim=True)
-            self.correct += pred.eq(label.view_as(pred)).sum().item()
-        def reset(self):
-            self.correct = 0
-        def result(self):
-            return 100. * self.correct / len(test_loader.dataset)
-
-    # launch code for Intel® Neural Compressor
-    from neural_compressor.experimental import Quantization
-    quantizer = Quantization("./conf.yaml")
-    quantizer.model = model
-    quantizer.calib_dataloader = test_loader
-    quantizer.eval_dataloader = test_loader
-    quantizer.metric = Top1Metric()
-    q_model = quantizer()
-    q_model.save('./output')
-
-In the above example, a `class` which contains `update()` and `result()` function is implemented
-to record per mini-batch result and calculate final accuracy at the end.
-
-Quantization aware training
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Besides post-training static quantization and post-training dynamic quantization, Intel® Neural
-Compressor supports quantization-aware training with an accuracy-driven automatic tuning mechanism.
-
-Below is an example of how to do quantization aware training on a simple network on PyTorch
-`FX graph mode `_.
-
-.. code-block:: yaml
-
-    # conf.yaml
-    model:
-        name: LeNet
-        framework: pytorch_fx
-
-    quantization:
-        approach: quant_aware_training
-
-    evaluation:
-        accuracy:
-            metric:
-                topk: 1
-
-    tuning:
-      accuracy_criterion:
-        relative: 0.01
-
-.. code-block:: python3
-
-    # main.py
-    model.eval()
-
-    from torchvision import datasets, transforms
-    train_loader = torch.utils.data.DataLoader(
-        datasets.MNIST('./data', train=True, download=True,
-                       transform=transforms.Compose([
-                           transforms.ToTensor(),
-                           transforms.Normalize((0.1307,), (0.3081,))
-                       ])),
-        batch_size=64, shuffle=True)
-    test_loader = torch.utils.data.DataLoader(
-        datasets.MNIST('./data', train=False, download=True,
-                       transform=transforms.Compose([
-                           transforms.ToTensor(),
-                           transforms.Normalize((0.1307,), (0.3081,))
-                       ])),
-        batch_size=1)
-
-    import torch.optim as optim
-    optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.1)
-
-    def training_func(model):
-        model.train()
-        for epoch in range(1, 3):
-            for batch_idx, (data, target) in enumerate(train_loader):
-                optimizer.zero_grad()
-                output = model(data)
-                loss = F.nll_loss(output, target)
-                loss.backward()
-                optimizer.step()
-                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
-                    epoch, batch_idx * len(data), len(train_loader.dataset),
-                    100. * batch_idx / len(train_loader), loss.item()))
-
-    # launch code for Intel® Neural Compressor
-    from neural_compressor.experimental import Quantization
-    quantizer = Quantization("./conf.yaml")
-    quantizer.model = model
-    quantizer.q_func = training_func
-    quantizer.eval_dataloader = test_loader
-    q_model = quantizer()
-    q_model.save('./output')
-
-Performance only quantization
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Intel® Neural Compressor supports directly yielding int8 model with dummy dataset for the
-performance benchmarking purpose.
-
-Below is an example of how to quantize a simple network on PyTorch
-`FX graph mode `_ with a dummy dataset.
-
-.. code-block:: yaml
-
-    # conf.yaml
-    model:
-        name: lenet
-        framework: pytorch_fx
-
-.. code-block:: python3
-
-    # main.py
-    model.eval()
-
-    # launch code for Intel® Neural Compressor
-    from neural_compressor.experimental import Quantization, common
-    from neural_compressor.experimental.data.datasets.dummy_dataset import DummyDataset
-    quantizer = Quantization("./conf.yaml")
-    quantizer.model = model
-    quantizer.calib_dataloader = common.DataLoader(DummyDataset([(1, 1, 28, 28)]))
-    q_model = quantizer()
-    q_model.save('./output')
-
-Quantization outputs
-~~~~~~~~~~~~~~~~~~~~
-
-Users could know how many ops get quantized from log printed by Intel® Neural Compressor
-like below:
-
-::
-
-    2021-12-08 14:58:35 [INFO] |********Mixed Precision Statistics*******|
-    2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
-    2021-12-08 14:58:35 [INFO] | Op Type                |  Total |  INT8 |
-    2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
-    2021-12-08 14:58:35 [INFO] | quantize_per_tensor    |   2    |   2   |
-    2021-12-08 14:58:35 [INFO] | Conv2d                 |   2    |   2   |
-    2021-12-08 14:58:35 [INFO] | max_pool2d             |   1    |   1   |
-    2021-12-08 14:58:35 [INFO] | relu                   |   1    |   1   |
-    2021-12-08 14:58:35 [INFO] | dequantize             |   2    |   2   |
-    2021-12-08 14:58:35 [INFO] | LinearReLU             |   1    |   1   |
-    2021-12-08 14:58:35 [INFO] | Linear                 |   1    |   1   |
-    2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
-
-The quantized model will be generated under `./output` directory, in which there are two files:
-1. best_configure.yaml
-2. best_model_weights.pt
-
-The first file contains the quantization configurations of each op, the second file contains
-int8 weights and zero point and scale info of activations.
-
-Deployment
-~~~~~~~~~~
-
-Users could use the below code to load quantized model and then do inference or performance benchmark.
-
-.. code-block:: python3
-
-    from neural_compressor.utils.pytorch import load
-    int8_model = load('./output', model)
+..
+
+Weight-only Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+**Weight-only Quantization** is also supported on the Intel® Gaudi® 2 and Gaudi® 3 AI Accelerators. A quantized model can be loaded as shown below.
+
+.. code-block:: python
+
+    import torch
+    from neural_compressor.torch.quantization import load
+
+    # The model name comes from the HuggingFace Model Hub.
+    model_name = "TheBloke/Llama-2-7B-GPTQ"
+    model = load(
+        model_name_or_path=model_name,
+        format="huggingface",
+        device="hpu",
+        torch_dtype=torch.bfloat16,
+    )
+..
+
+**Note:** Intel Neural Compressor converts the model from the auto-gptq format to the HPU format on the first load and saves ``hpu_model.safetensors`` to the local cache directory for subsequent loads, so the first load may take a while.
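+
+Once loaded, the model can be used like any other PyTorch module. A minimal sketch, assuming the loaded model behaves like a standard ``transformers`` causal language model and that the ``transformers`` library is installed:
+
+.. code-block:: python
+
+    from transformers import AutoTokenizer
+
+    # Tokenize a prompt and run greedy generation on the HPU.
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    inputs = tokenizer("What is quantization?", return_tensors="pt").to("hpu")
+    outputs = model.generate(**inputs, max_new_tokens=32)
+    print(tokenizer.decode(outputs[0], skip_special_tokens=True))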
+
+Static Quantization with PT2E Backend
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The PT2E path uses ``torch.dynamo`` to capture the eager model into an FX graph model and then inserts observers and Q/DQ pairs into it. Finally, it uses ``torch.compile`` to perform pattern matching and replace the Q/DQ pairs with optimized quantized operators.
+
+There are four steps to perform W8A8 static quantization with the PT2E backend: ``export``, ``prepare``, ``convert``, and ``compile``.
+
+.. code-block:: python
+
+    import torch
+    from neural_compressor.torch.export import export
+    from neural_compressor.torch.quantization import StaticQuantConfig, prepare, convert
+
+    # Prepare the float model and example inputs for exporting the model.
+    # UserFloatModel is a placeholder for your own floating-point model.
+    model = UserFloatModel()
+    example_inputs = ...
+
+    # Export the eager model into an FX graph model
+    exported_model = export(model=model, example_inputs=example_inputs)
+    # Quantize the model
+    quant_config = StaticQuantConfig()
+    prepared_model = prepare(exported_model, quant_config=quant_config)
+    # Calibrate (run_fn is a placeholder for your own calibration function)
+    run_fn(prepared_model)
+    q_model = convert(prepared_model)
+    # Compile the quantized model and replace the Q/DQ pattern with Q-operators
+    from torch._inductor import config
+
+    config.freezing = True
+    opt_model = torch.compile(q_model)
+..
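+
+For illustration, the placeholders above could be filled in as follows. This is a minimal sketch, assuming a toy two-layer network and random calibration batches rather than a real dataset:
+
+.. code-block:: python
+
+    import torch
+
+    # A toy floating-point model standing in for UserFloatModel.
+    class UserFloatModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.fc1 = torch.nn.Linear(16, 32)
+            self.fc2 = torch.nn.Linear(32, 8)
+
+        def forward(self, x):
+            return self.fc2(torch.relu(self.fc1(x)))
+
+    # Example inputs used to trace the model during export.
+    example_inputs = (torch.randn(4, 16),)
+
+    # A stand-in calibration function: feed a few random batches through the
+    # prepared model so the observers can record tensor statistics.
+    def run_fn(prepared_model):
+        for _ in range(8):
+            prepared_model(torch.randn(4, 16))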
+
+Accuracy-driven Tuning
+^^^^^^^^^^^^^^^^^^^^^^
+
+To leverage accuracy-driven automatic tuning, a tuning space must be specified. The ``autotune`` API iterates over the tuning space, applies each configuration to the given high-precision model, and records and compares its evaluation result with the baseline. The tuning process stops when the exit policy is met.
+
+.. code-block:: python
+
+    from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune
+
+
+    # Placeholder: return the model's accuracy (or another quality metric) as a float.
+    def eval_fn(model) -> float:
+        return ...
+
+
+    # Try symmetric and asymmetric RTN quantization with two group sizes; stop
+    # after 10 trials or once the accuracy loss is within the tolerable threshold.
+    tune_config = TuningConfig(
+        config_set=RTNConfig(use_sym=[False, True], group_size=[32, 128]),
+        tolerable_loss=0.2,
+        max_trials=10,
+    )
+    q_model = autotune(model, tune_config=tune_config, eval_fn=eval_fn)
+..
 
 Tutorials
 ---------
 
-Please visit `Intel® Neural Compressor Github repo `_
-for more tutorials.
+More detailed tutorials are available in the official Intel® Neural Compressor `documentation `_.