docsrc/user_guide/mixed_precision.rst

Consider the following PyTorch model, which explicitly casts an intermediate layer to run in FP16.
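
A minimal sketch of such a model is shown below; the layer shapes and the exact placement of the casts are assumptions, chosen to be consistent with the ``linear1``/``linear2``/``linear3`` layers and the FP16 ``linear2`` discussed in the rest of this section.

.. code-block:: python

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10,10)
        self.linear2 = torch.nn.Linear(10,30)
        self.linear3 = torch.nn.Linear(30,40)

    def forward(self, x):
        x = self.linear1(x)
        x = x.to(torch.float16)   # cast the input of linear2 to FP16
        x = self.linear2(x)
        x = x.to(torch.float32)   # cast back so linear3 runs in FP32
        x = self.linear3(x)
        return x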


Before TensorRT 10.12, if we compile the above model using Torch-TensorRT with the following settings,
layer profiling logs indicate that all of the layers run in FP32. This is because older TensorRT picks
whichever kernels give the best performance for each layer, regardless of the casts in the model
(i.e., weak typing in old TensorRT).

.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs)

# Debug log info
# Layers:
# Name: __myl_AddResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye146_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_AddResMulSumAdd_0xcdd0085ad25f5f45ac5fafb72acbffd6, StreamId: 0, Metadata:


Since TensorRT 10.12, however, weak typing is deprecated and we must set ``use_explicit_typing=True``
to enable strong typing, which means users must specify the precision of the nodes in the model. In the
example above, the ``linear2`` layer is explicitly cast to FP16, so if we compile the model with the
following settings, ``linear2`` runs in FP16 while the other layers run in FP32, as shown in the
TensorRT logs below:

.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, use_explicit_typing=True)

# Debug log info
# Layers:
# Name: __myl_MulSumAddCas_myl0_0, LayerType: kgen, Inputs: [ { Name: linear1/addmm_constant_0 _ linear1/addmm_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,10], Format/Datatype: Float }, { Name: __mye112_dconst, Dimensions: [10,10], Format/Datatype: Float }, { Name: x, Dimensions: [10,1], Format/Datatype: Float }], Outputs: [ { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], TacticName: __myl_MulSumAddCas_0xacf8f5dd9be2f3e7bb09cdddeac6c936, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAddCas_myl0_1, LayerType: kgen, Inputs: [ { Name: __mye127_dconst, Dimensions: [10,30], Format/Datatype: Half }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantHalf, Dimensions: [1,30], Format/Datatype: Half }, { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], Outputs: [ { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], TacticName: __myl_ResMulSumAddCas_0x5a3b318b5a1c97b7d5110c0291481337, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye142_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_ResMulSumAdd_0x3fad91127c640fd6db771aa9cde67db0, StreamId: 0, Metadata:

Now the ``linear2`` layer runs in FP16, as shown in the logs above.

Autocast
---------------

Weak typing in TensorRT is deprecated, but mixed precision remains a good way to maximize performance.
Torch-TensorRT therefore provides a way to get mixed precision behavior similar to weak typing in old
TensorRT, called `Autocast`.

Before we dive into Torch-TensorRT Autocast, let's first take a look at PyTorch Autocast. PyTorch Autocast is
context-based, meaning it affects the precision of the operations executed inside the context. For example,
in PyTorch we can do the following:

.. code-block:: python

x = self.linear1(x)
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
    x = self.linear2(x)
x = self.linear3(x)

This runs ``linear2`` in FP16 while the other layers remain in FP32. Please refer to the `PyTorch Autocast documentation <https://docs.pytorch.org/docs/stable/amp.html#torch.autocast>`_ for more details.
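
As a standalone illustration (a sketch, not part of the original guide), a linear layer run under ``torch.autocast`` produces FP16 outputs, while the same layer outside the context stays in FP32:

.. code-block:: python

import torch

lin = torch.nn.Linear(10, 30).cuda()
x = torch.randn(1, 10, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y_amp = lin(x)  # linear/matmul ops are cast to FP16 inside the context
y_full = lin(x)     # outside the context, the layer runs in FP32

print(y_amp.dtype, y_full.dtype)  # torch.float16 torch.float32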

Unlike PyTorch Autocast, Torch-TensorRT Autocast is rule-based: it intelligently selects which nodes to keep
in FP32 to maintain model accuracy, while the rest of the nodes benefit from reduced precision. Torch-TensorRT
Autocast also lets users exclude specific nodes from Autocast, since some nodes may be more sensitive to
reduced precision. In addition, Torch-TensorRT Autocast can cooperate with PyTorch Autocast, so both can be
used in the same model; Torch-TensorRT Autocast respects the precision of the nodes within a PyTorch Autocast
context.

To enable Torch-TensorRT Autocast, set both ``enable_autocast=True`` and ``use_explicit_typing=True``.
On top of that, you can choose the reduced precision with ``autocast_low_precision_type`` and exclude
certain nodes or ops from Autocast with ``autocast_excluded_nodes`` or ``autocast_excluded_ops``.
For example:

.. code-block:: python

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10,10)
        self.linear2 = torch.nn.Linear(10,30)
        self.linear3 = torch.nn.Linear(30,40)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        x = self.linear3(x)
        return x

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    enable_autocast=True,
    use_explicit_typing=True,
    autocast_low_precision_type=torch.float16,
    autocast_excluded_nodes={"^linear2$"},
)

This configuration excludes ``linear2`` from Autocast, so ``linear2`` runs in FP32 while the other layers run in FP16.

In summary, there are now two ways in Torch-TensorRT to choose the precision of the nodes, as sketched below:

1. The user specifies the precisions (strong typing): ``use_explicit_typing=True`` and ``enable_autocast=False``
2. Autocast chooses the precisions (Autocast + strong typing): ``use_explicit_typing=True`` and ``enable_autocast=True``
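
A minimal sketch of the two configurations, reusing ``ep`` and ``inputs`` from the example above (the flag combinations mirror the summary; all other settings are left at their defaults):

.. code-block:: python

# 1. User-specified precision (strong typing only): the compiler honors the
#    dtypes and casts already present in the exported model.
trt_strong = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    use_explicit_typing=True,
    enable_autocast=False,
)

# 2. Autocast chooses precision (Autocast + strong typing): Torch-TensorRT
#    lowers eligible nodes to the reduced precision and keeps the rest in FP32.
trt_autocast = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    use_explicit_typing=True,
    enable_autocast=True,
    autocast_low_precision_type=torch.float16,
)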

FP32 Accumulation
-----------------
When ``use_fp32_acc=True`` is set, Torch-TensorRT will attempt to use FP32 accumulation for matmul layers.
inputs = [torch.randn((1, 10), dtype=torch.float16).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(
    ep,
    inputs=inputs,
    use_fp32_acc=True,
    use_explicit_typing=True,  # Explicit typing must be enabled
)

# Debug log info
# Layers:
examples/dynamo/autocast_example.py

"""
.. _autocast_example:

An example of using Torch-TensorRT Autocast
============================================

This example demonstrates how to use Torch-TensorRT Autocast with PyTorch Autocast to compile a mixed precision model.
"""

import torch
import torch.nn as nn
import torch_tensorrt

# %% Mixed Precision Model
#
# We define a mixed precision model that consists of a few layers, a ``log`` operation, and an ``abs`` operation.
# Among them, the ``fc1``, ``log``, and ``abs`` operations are within a PyTorch Autocast context with ``dtype=torch.float16``.


class MixedPytorchAutocastModel(nn.Module):
    def __init__(self):
        super(MixedPytorchAutocastModel, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1
        )
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(
            in_channels=8, out_channels=16, kernel_size=3, stride=1, padding=1
        )
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(16 * 8 * 8, 10)

    def forward(self, x):
        out1 = self.conv1(x)
        out2 = self.relu1(out1)
        out3 = self.pool1(out2)
        out4 = self.conv2(out3)
        out5 = self.relu2(out4)
        out6 = self.pool2(out5)
        out7 = self.flatten(out6)
        with torch.autocast(x.device.type, enabled=True, dtype=torch.float16):
            out8 = self.fc1(out7)
            out9 = torch.log(
                torch.abs(out8) + 1
            )  # log is fp32 due to PyTorch Autocast requirements
        return x, out1, out2, out3, out4, out5, out6, out7, out8, out9


# %%
# We define the model, inputs, and calibration dataloader for Autocast, and then run the original PyTorch model to get the reference outputs.

model = MixedPytorchAutocastModel().cuda().eval()
inputs = (torch.randn((8, 3, 32, 32), dtype=torch.float32, device="cuda"),)
ep = torch.export.export(model, inputs)
calibration_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(*inputs), batch_size=2, shuffle=False
)

pytorch_outs = model(*inputs)

# %% Compile the model with Torch-TensorRT Autocast
#
# We compile the model with Torch-TensorRT Autocast by setting ``enable_autocast=True``, ``use_explicit_typing=True``, and
# ``autocast_low_precision_type=torch.bfloat16``. To illustrate, we exclude the ``conv1`` node, all nodes whose names
# contain ``relu``, and the ``torch.ops.aten.flatten.using_ints`` ATen op from Autocast. In addition, we also set
# ``autocast_max_output_threshold``, ``autocast_max_depth_of_reduction``, and ``autocast_calibration_dataloader``. Please refer to
# the documentation for more details.

trt_autocast_mod = torch_tensorrt.compile(
    ep.module(),
    arg_inputs=inputs,
    min_block_size=1,
    use_python_runtime=True,
    use_explicit_typing=True,
    enable_autocast=True,
    autocast_low_precision_type=torch.bfloat16,
    autocast_excluded_nodes={"^conv1$", "relu"},
    autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},
    autocast_max_output_threshold=512,
    autocast_max_depth_of_reduction=None,
    autocast_calibration_dataloader=calibration_dataloader,
)

autocast_outs = trt_autocast_mod(*inputs)

# %% Verify the outputs
#
# We verify that both the dtypes and values of the model outputs are correct.
# As expected, ``fc1`` is in FP16 because of PyTorch Autocast;
# ``pool1``, ``conv2``, and ``pool2`` are in BF16 because of Torch-TensorRT Autocast;
# the rest remain in FP32. Note that ``log`` stays in FP32 because of PyTorch Autocast requirements.

should_be_fp32 = [
    autocast_outs[0],
    autocast_outs[1],
    autocast_outs[2],
    autocast_outs[5],
    autocast_outs[7],
    autocast_outs[9],
]
should_be_fp16 = [
    autocast_outs[8],
]
should_be_bf16 = [autocast_outs[3], autocast_outs[4], autocast_outs[6]]

assert all(
    a.dtype == torch.float32 for a in should_be_fp32
), "Some Autocast outputs are not float32!"
assert all(
    a.dtype == torch.float16 for a in should_be_fp16
), "Some Autocast outputs are not float16!"
assert all(
    a.dtype == torch.bfloat16 for a in should_be_bf16
), "Some Autocast outputs are not bfloat16!"
for i, (a, w) in enumerate(zip(autocast_outs, pytorch_outs)):
    assert torch.allclose(
        a.to(torch.float32), w.to(torch.float32), atol=1e-2, rtol=1e-2
    ), f"Autocast and Pytorch outputs do not match! autocast_outs[{i}] = {a}, pytorch_outs[{i}] = {w}"
print("All dtypes and values match!")