docsrc/user_guide/mixed_precision.rst

Consider the following PyTorch model, which explicitly casts an intermediate layer to run in FP16.
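
A minimal sketch of such a model is shown below; the layer shapes and the exact placement of the casts are assumptions, chosen to be consistent with the ``linear1``/``linear2``/``linear3`` layers and the FP16 ``linear2`` discussed in the rest of this section.

.. code-block:: python

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10,10)
        self.linear2 = torch.nn.Linear(10,30)
        self.linear3 = torch.nn.Linear(30,40)

    def forward(self, x):
        x = self.linear1(x)
        x = x.to(torch.float16)   # cast the input of linear2 to FP16
        x = self.linear2(x)
        x = x.to(torch.float32)   # cast back so linear3 runs in FP32
        x = self.linear3(x)
        return x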


Before TensorRT 10.12, if we compile the above model using Torch-TensorRT with the following settings,
layer profiling logs indicate that all of the layers run in FP32. This is because older TensorRT picks
whichever kernels give the best performance for each layer, regardless of the casts in the model
(i.e., weak typing in old TensorRT).

.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs)

# Debug log info
# Layers:
# Name: __myl_AddResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye146_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_AddResMulSumAdd_0xcdd0085ad25f5f45ac5fafb72acbffd6, StreamId: 0, Metadata:


Since TensorRT 10.12, however, weak typing is deprecated and we must set ``use_explicit_typing=True``
to enable strong typing, which means users must specify the precision of the nodes in the model. In the
example above, the ``linear2`` layer is explicitly cast to FP16, so if we compile the model with the
following settings, ``linear2`` runs in FP16 while the other layers run in FP32, as shown in the
TensorRT logs below:

.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, use_explicit_typing=True)

# Debug log info
# Layers:
# Name: __myl_MulSumAddCas_myl0_0, LayerType: kgen, Inputs: [ { Name: linear1/addmm_constant_0 _ linear1/addmm_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,10], Format/Datatype: Float }, { Name: __mye112_dconst, Dimensions: [10,10], Format/Datatype: Float }, { Name: x, Dimensions: [10,1], Format/Datatype: Float }], Outputs: [ { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], TacticName: __myl_MulSumAddCas_0xacf8f5dd9be2f3e7bb09cdddeac6c936, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAddCas_myl0_1, LayerType: kgen, Inputs: [ { Name: __mye127_dconst, Dimensions: [10,30], Format/Datatype: Half }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantHalf, Dimensions: [1,30], Format/Datatype: Half }, { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], Outputs: [ { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], TacticName: __myl_ResMulSumAddCas_0x5a3b318b5a1c97b7d5110c0291481337, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye142_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_ResMulSumAdd_0x3fad91127c640fd6db771aa9cde67db0, StreamId: 0, Metadata:

Now the ``linear2`` layer runs in FP16, as shown in the logs above.

Autocast
---------------

Weak typing in TensorRT is deprecated, but mixed precision remains a good way to maximize performance.
Torch-TensorRT therefore provides a way to get mixed precision behavior similar to weak typing in old
TensorRT, called `Autocast`.

Before we dive into Torch-TensorRT Autocast, let's first take a look at PyTorch Autocast. PyTorch Autocast is
context-based, meaning it affects the precision of the operations executed inside the context. For example,
in PyTorch we can do the following:

.. code-block:: python

x = self.linear1(x)
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
    x = self.linear2(x)
x = self.linear3(x)

This runs ``linear2`` in FP16 while the other layers remain in FP32. Please refer to the `PyTorch Autocast documentation <https://docs.pytorch.org/docs/stable/amp.html#torch.autocast>`_ for more details.
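
As a standalone illustration (a sketch, not part of the original guide), a linear layer run under ``torch.autocast`` produces FP16 outputs, while the same layer outside the context stays in FP32:

.. code-block:: python

import torch

lin = torch.nn.Linear(10, 30).cuda()
x = torch.randn(1, 10, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y_amp = lin(x)  # linear/matmul ops are cast to FP16 inside the context
y_full = lin(x)     # outside the context, the layer runs in FP32

print(y_amp.dtype, y_full.dtype)  # torch.float16 torch.float32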

Unlike PyTorch Autocast, Torch-TensorRT Autocast is rule-based: it intelligently selects which nodes to keep
in FP32 to maintain model accuracy, while the rest of the nodes benefit from reduced precision. Torch-TensorRT
Autocast also lets users exclude specific nodes from Autocast, since some nodes may be more sensitive to
reduced precision. In addition, Torch-TensorRT Autocast can cooperate with PyTorch Autocast, so both can be
used in the same model; Torch-TensorRT Autocast respects the precision of the nodes within a PyTorch Autocast
context.

To enable Torch-TensorRT Autocast, set both ``enable_autocast=True`` and ``use_explicit_typing=True``.
On top of that, you can choose the reduced precision with ``autocast_low_precision_type`` and exclude
certain nodes or ops from Autocast with ``autocast_excluded_nodes`` or ``autocast_excluded_ops``.
For example:

.. code-block:: python

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10,10)
        self.linear2 = torch.nn.Linear(10,30)
        self.linear3 = torch.nn.Linear(30,40)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        x = self.linear3(x)
        return x

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    enable_autocast=True,
    use_explicit_typing=True,
    autocast_low_precision_type=torch.float16,
    autocast_excluded_nodes={"^linear2$"},
)

This configuration excludes ``linear2`` from Autocast, so ``linear2`` runs in FP32 while the other layers run in FP16.

In summary, there are now two ways in Torch-TensorRT to choose the precision of the nodes, as sketched below:

1. The user specifies the precisions (strong typing): ``use_explicit_typing=True`` and ``enable_autocast=False``
2. Autocast chooses the precisions (Autocast + strong typing): ``use_explicit_typing=True`` and ``enable_autocast=True``
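
A minimal sketch of the two configurations, reusing ``ep`` and ``inputs`` from the example above (the flag combinations mirror the summary; all other settings are left at their defaults):

.. code-block:: python

# 1. User-specified precision (strong typing only): the compiler honors the
#    dtypes and casts already present in the exported model.
trt_strong = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    use_explicit_typing=True,
    enable_autocast=False,
)

# 2. Autocast chooses precision (Autocast + strong typing): Torch-TensorRT
#    lowers eligible nodes to the reduced precision and keeps the rest in FP32.
trt_autocast = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    use_explicit_typing=True,
    enable_autocast=True,
    autocast_low_precision_type=torch.float16,
)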

FP32 Accumulation
-----------------
When ``use_fp32_acc=True`` is set, Torch-TensorRT will attempt to use FP32 accumulation for matmul layers.
inputs = [torch.randn((1, 10), dtype=torch.float16).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    use_fp32_acc=True,
    use_explicit_typing=True,  # Explicit typing must be enabled
)

# Debug log info
# Layers:
examples/dynamo/autocast_example.py

"""
.. _autocast_example:

An example of using Torch-TensorRT Autocast
============================================

This example demonstrates how to use Torch-TensorRT Autocast with PyTorch Autocast to compile a mixed precision model.
"""

import torch
import torch.nn as nn
import torch_tensorrt

# %% Mixed Precision Model
#
# We define a mixed precision model that consists of a few layers, a ``log`` operation, and an ``abs`` operation.
# Among them, the ``fc1``, ``log``, and ``abs`` operations are within a PyTorch Autocast context with ``dtype=torch.float16``.


class MixedPytorchAutocastModel(nn.Module):
    def __init__(self):
        super(MixedPytorchAutocastModel, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1
        )
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(
            in_channels=8, out_channels=16, kernel_size=3, stride=1, padding=1
        )
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(16 * 8 * 8, 10)

    def forward(self, x):
        out1 = self.conv1(x)
        out2 = self.relu1(out1)
        out3 = self.pool1(out2)
        out4 = self.conv2(out3)
        out5 = self.relu2(out4)
        out6 = self.pool2(out5)
        out7 = self.flatten(out6)
        with torch.autocast(x.device.type, enabled=True, dtype=torch.float16):
            out8 = self.fc1(out7)
            out9 = torch.log(
                torch.abs(out8) + 1
            )  # log is fp32 due to PyTorch Autocast requirements
        return x, out1, out2, out3, out4, out5, out6, out7, out8, out9


# %%
# We define the model, inputs, and calibration dataloader for Autocast, and then run the original PyTorch model to get the reference outputs.

model = MixedPytorchAutocastModel().cuda().eval()
inputs = (torch.randn((8, 3, 32, 32), dtype=torch.float32, device="cuda"),)
ep = torch.export.export(model, inputs)
calibration_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(*inputs), batch_size=2, shuffle=False
)

pytorch_outs = model(*inputs)

# %% Compile the model with Torch-TensorRT Autocast
#
# We compile the model with Torch-TensorRT Autocast by setting ``enable_autocast=True``, ``use_explicit_typing=True``, and
# ``autocast_low_precision_type=torch.bfloat16``. To illustrate, we exclude the ``conv1`` node, all nodes whose names
# contain ``relu``, and the ``torch.ops.aten.flatten.using_ints`` ATen op from Autocast. In addition, we also set
# ``autocast_max_output_threshold``, ``autocast_max_depth_of_reduction``, and ``autocast_calibration_dataloader``. Please refer to
# the documentation for more details.

trt_autocast_mod = torch_tensorrt.compile(
    ep.module(),
    arg_inputs=inputs,
    min_block_size=1,
    use_python_runtime=True,
    use_explicit_typing=True,
    enable_autocast=True,
    autocast_low_precision_type=torch.bfloat16,
    autocast_excluded_nodes={"^conv1$", "relu"},
    autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},
    autocast_max_output_threshold=512,
    autocast_max_depth_of_reduction=None,
    autocast_calibration_dataloader=calibration_dataloader,
)

autocast_outs = trt_autocast_mod(*inputs)

# %% Verify the outputs
#
# We verify that both the dtypes and values of the model outputs are correct.
# As expected, ``fc1`` is in FP16 because of PyTorch Autocast;
# ``pool1``, ``conv2``, and ``pool2`` are in BF16 because of Torch-TensorRT Autocast;
# the rest remain in FP32. Note that ``log`` stays in FP32 because of PyTorch Autocast requirements.

should_be_fp32 = [
    autocast_outs[0],
    autocast_outs[1],
    autocast_outs[2],
    autocast_outs[5],
    autocast_outs[7],
    autocast_outs[9],
]
should_be_fp16 = [
    autocast_outs[8],
]
should_be_bf16 = [autocast_outs[3], autocast_outs[4], autocast_outs[6]]

assert all(
    a.dtype == torch.float32 for a in should_be_fp32
), "Some Autocast outputs are not float32!"
assert all(
    a.dtype == torch.float16 for a in should_be_fp16
), "Some Autocast outputs are not float16!"
assert all(
    a.dtype == torch.bfloat16 for a in should_be_bf16
), "Some Autocast outputs are not bfloat16!"
for i, (a, w) in enumerate(zip(autocast_outs, pytorch_outs)):
    assert torch.allclose(
        a.to(torch.float32), w.to(torch.float32), atol=1e-2, rtol=1e-2
    ), f"Autocast and Pytorch outputs do not match! autocast_outs[{i}] = {a}, pytorch_outs[{i}] = {w}"
print("All dtypes and values match!")