PyTorch 2.8, CUDA 12.8, TensorRT 10.12, Python 3.13
Torch-TensorRT 2.8.0 targets PyTorch 2.8, TensorRT 10.12, CUDA 12.6/12.8/12.9, and Python 3.9-3.13 on standard Linux x86-64 and Windows platforms.
- Linux x86-64 + Windows
- CUDA 12.8 + Python 3.9-3.13: available via PyPI: https://pypi.org/project/torch-tensorrt/
- CUDA 12.6/12.8/12.9 + Python 3.9-3.13: also available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Platform support
In addition to the standard Linux x86-64 and Windows x86-64 releases, we now provide binary builds for SBSA and Jetson:
- SBSA aarch64
  - CUDA 12.9 + Python 3.9-3.13 + Torch 2.8 + TensorRT 10.12
  - Available via PyPI: https://pypi.org/project/torch-tensorrt/
  - Available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
- Jetson Orin
  - CUDA 12.6 + Python 3.10 + Torch 2.8 + TensorRT 10.3.0
  - Available at https://pypi.jetson-ai-lab.io/jp6/cu126
Deprecations
- TensorRT implicit quantization support has been deprecated since TensorRT 10.1. Torch-TensorRT APIs related to the INT8Calibrator will be removed in Torch-TensorRT 2.9.0. Quantization users should move to a workflow based on the TensorRT Model Optimizer toolkit. See https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html for more information.
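For reference, here is a minimal sketch of the replacement post-training quantization workflow, assuming the modelopt.torch.quantization package is installed and using its INT8_DEFAULT_CFG config. The model, calibration loader, and example input names below are placeholders; the linked vgg16_ptq tutorial remains the authoritative version.

import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq

model = MyModel().eval().cuda()              # placeholder: your FP32/FP16 model
calib_loader = get_calibration_loader()      # placeholder: small representative dataset
example_input = torch.randn(1, 3, 224, 224).cuda()  # placeholder input shape

def forward_loop(m):
    # ModelOpt observes activations on calibration data to compute quantization scales
    for batch in calib_loader:
        m(batch.cuda())

# Post-training INT8 quantization via TensorRT Model Optimizer (replaces the INT8Calibrator)
quantized = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export and compile the quantized module with Torch-TensorRT
ep = torch.export.export(quantized, (example_input,))
trt_model = torch_tensorrt.dynamo.compile(
    ep,
    arg_inputs=[example_input],
    enabled_precisions={torch.int8},
)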
New Features
AOT-Inductor Pythonless Deployment
Stability: Beta
Historically, TorchScript has been used to run Torch-TensorRT programs outside of a Python interpreter. Both the dynamo/torch.compile and TorchScript frontends supported this TorchScript deployment workflow.
Old
trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(trt_model, inputs=[...])
ts_model.save("trt_model.ts")
Now you can achieve a similar result using AOT-Inductor. AOTInductor is a specialized version of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts are specifically crafted for deployment in non-Python environments.
Torch-TensorRT can embed TensorRT engines in AOTInductor libraries to further accelerate models, and Inductor kernels can be combined with TensorRT engines via this method. This allows users to deploy their models outside of Python using torch.compile-native technologies.
New
with torch.no_grad():
    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)
    torch_tensorrt.save(
        cg_trt_module,
        file_path=os.path.join(os.getcwd(), "model.pt2"),
        output_format="aot_inductor",
        retrace=True,
        arg_inputs=example_inputs,
    )
This model.pt2 file can then be loaded in either Python or C++ using Torch APIs.
Python:

import os

import torch
import torch_tensorrt

model = torch._inductor.aoti_load_package(os.path.join(os.getcwd(), "model.pt2"))
C++:

#include <iostream>
#include <vector>

#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main(int argc, const char* argv[]) {
    std::string trt_aoti_module_path = "model.pt2";

    c10::InferenceMode mode;
    torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
    std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
    std::vector<torch::Tensor> outputs = loader.run(inputs);

    std::cout << "Result from the first inference:" << std::endl;
    std::cout << outputs << std::endl;

    return 0;
}
More information can be found at https://docs.pytorch.org/TensorRT/user_guide/runtime.html, as well as a code example here: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/torchtrt_aoti_example/inference.cpp
PTX Plugins
Stability: Stable
In Torch-TensorRT 2.7.0 we introduced auto-generated plugins, which allow users to automatically wrap kernels / PyTorch custom operators into TensorRT plugins so their models run without a graph break. In 2.8.0 we extend this system to support PTX-based plugins, which enable users to serialize and run their TensorRT engines without requiring PyTorch, Triton, or Python in the runtime, and without access to the original kernel implementation. This approach also has lower overhead than the auto-generated plugin system, helping achieve maximum performance.
The example at https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/aot_plugin.py shows how to register a custom operator, generate the necessary plugin, and integrate it into the TensorRT execution graph.
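For orientation, here is a minimal sketch of the first half of that workflow, assuming the auto-plugin helper torch_tensorrt.dynamo.conversion.plugins.custom_op introduced with the 2.7.0 plugin system. The op name mylib::scale_mul and its eager implementation are placeholders, and the PTX/AOT-specific plugin registration (a kernel compiled ahead of time and registered as an AOT plugin) is shown only in the linked example.

import torch
import torch_tensorrt

# Placeholder custom operator; real deployments typically back this with a
# Triton or CUDA kernel as in the linked aot_plugin.py example.
@torch.library.custom_op("mylib::scale_mul", mutates_args=())
def scale_mul(x: torch.Tensor, y: torch.Tensor, scale: float) -> torch.Tensor:
    return (x * y) * scale

@scale_mul.register_fake
def _(x: torch.Tensor, y: torch.Tensor, scale: float) -> torch.Tensor:
    # Shape/dtype propagation used during torch.export / dynamo tracing
    return torch.empty_like(x)

# Auto-generate a TensorRT plugin and converter for the op so compilation
# proceeds without a graph break (helper from the 2.7.0 auto-plugin system).
torch_tensorrt.dynamo.conversion.plugins.custom_op(
    "mylib::scale_mul", supports_dynamic_shapes=True
)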
Hierarchical Multi-backend Adjacency Partitioner
Stability: Experimental
The Hierarchical Multi-backend Adjacency Partitioner enables sophisticated model-partitioning strategies for distributing PyTorch models across multiple backends based on operator support and priority ordering. A prototype partitioner has been added to the package that allows graphs to be split across multiple backends (e.g., TensorRT, PyTorch Inductor) based on operator capabilities. By providing a backend preference order, operators are assigned to the highest-priority backend that supports them.
Please refer to the example for usage.
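To make the priority-ordering idea concrete, here is a small, self-contained sketch that is not the Torch-TensorRT API: it walks an exported ATen-level FX graph and tags each call node with the first backend in a preference list whose (hypothetical) support set contains the op. The real partitioner additionally groups adjacent nodes into per-backend subgraphs and queries actual converter registries rather than hard-coded sets.

import torch
import torch.fx as fx

# Hypothetical per-backend support sets; the real partitioner consults converter
# registries / backend capability checks instead of hard-coded op sets.
SUPPORT = {
    "tensorrt": {torch.ops.aten.relu.default, torch.ops.aten.convolution.default},
    "inductor": {torch.ops.aten.relu.default, torch.ops.aten.topk.default},
}
PRIORITY = ["tensorrt", "inductor"]  # highest-priority backend first

def assign_backends(gm: fx.GraphModule) -> dict:
    """Tag every call_function node with the highest-priority backend that supports it."""
    assignment = {}
    for node in gm.graph.nodes:
        if node.op != "call_function":
            continue
        backend = next((b for b in PRIORITY if node.target in SUPPORT[b]), "pytorch")
        assignment[node] = backend  # fall back to eager PyTorch when unsupported
    return assignment

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.topk(torch.relu(x), k=2).values

ep = torch.export.export(TinyModel(), (torch.randn(4, 8),))
print(assign_backends(ep.graph_module))
# relu -> tensorrt (supported by the top-priority backend),
# topk -> inductor (only the lower-priority backend supports it)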
Model Optimizer-Based NVFP4 Quantization (PTQ) Support for Linux
Stability: Stable
Introducing NVFP4 for efficient and accurate low-precision inference on the Blackwell GPU architecture.
Currently, the workflow supports quantizing models from FP16 → NVFP4.
Directly quantizing from FP32 → NVFP4 is not recommended as it may lead to accuracy degradation. Instead, first convert or train the model in FP16, then quantize to NVFP4.
Full example:
https://github.com/pytorch/TensorRT/blob/release/2.8/examples/apps/flux_demo.py
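The quantization step mirrors the ModelOpt sketch shown in the Deprecations section, swapping in an NVFP4 config. The config name NVFP4_DEFAULT_CFG and the FP16 model/calibration placeholders below are assumptions; the flux_demo linked above is the authoritative end-to-end reference.

import modelopt.torch.quantization as mtq

# Assumption: the model has already been converted to (or trained in) FP16,
# per the recommendation above; quantizing FP32 -> NVFP4 directly is not recommended.
model = load_fp16_model().eval().cuda()        # placeholder

def forward_loop(m):
    for batch in get_calibration_batches():    # placeholder calibration data
        m(batch.half().cuda())

# PTQ from FP16 to NVFP4 (config name assumed from ModelOpt's naming scheme)
quantized = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
# The quantized module is then compiled with Torch-TensorRT; see flux_demo.py
# above for the exact compile settings used on Blackwell GPUs.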
run_llm and KV Caching
Stability: Beta
We’ve introduced a KV caching implementation for Torch-TensorRT using native TensorRT operations, yielding significant improvements in inference performance for autoregressive large language models (LLMs). KV caching is a crucial optimization that reduces latency by reusing attention activations across decoding steps. In our approach, the KV cache is modeled as fixed-size tensor inputs and outputs, with outputs from each decoding step looped back as inputs to update the cache incrementally. This update is performed using TensorRT-supported operations such as slice, concat, and pad. The design allows step-wise cache updates while preserving compatibility with TensorRT’s optimization workflow and engine serialization.
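As an illustration of the fixed-size update pattern (plain PyTorch, not the actual TensorRT lowering), the sketch below appends one decoding step's key tensor using only slice, concat, and pad so the cache keeps a static shape; the helper name and shapes are illustrative only.

import torch
import torch.nn.functional as F

def update_static_k_cache(k_cache: torch.Tensor, new_k: torch.Tensor,
                          seq_len: int, max_seq: int) -> torch.Tensor:
    """Insert the current step's key at position seq_len, keeping a fixed shape.

    k_cache: [batch, heads, max_seq, head_dim]  fixed-size cache fed in as an input
    new_k:   [batch, heads, 1, head_dim]        key from the current decoding step
    """
    prefix = k_cache[:, :, :seq_len, :]             # slice: entries filled so far
    updated = torch.cat([prefix, new_k], dim=2)     # concat: append current step
    pad_len = max_seq - updated.shape[2]
    return F.pad(updated, (0, 0, 0, pad_len))       # pad: back to the fixed max_seq

# Example: batch=1, heads=8, max_seq=16, head_dim=64, 5 entries already filled
k_cache = torch.zeros(1, 8, 16, 64)
k_cache = update_static_k_cache(k_cache, torch.randn(1, 8, 1, 64), seq_len=5, max_seq=16)
# The value cache is updated identically, and both are looped back as engine
# inputs for the next decoding step.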
We’ve also introduced a new utility, run_llm.py, to run inference on popular LLMs with KV caching enabled.
To run a Qwen3 model using KV caching with Torch-TensorRT, use the following command:
python run_llm.py --model Qwen/Qwen3-8B --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
Please refer to Compiling LLM models from Huggingface for more details and limitations.
Debugger
We introduced a new debugger to improve usability and the debugging experience for Torch-TensorRT. The debugger centralizes all debug settings, such as the logging level (from critical to info) and engine profiling. It also adds FX graph visualization, where you can specify the lowering pass before or after which the graph should be drawn. In addition, the debugger can emit engine profiling and layer information compatible with TREX, an engine-visualization tool from the TensorRT team, which helps explain the engine structure.
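A rough sketch of how the context-manager style debugger can be wrapped around compilation follows. The parameter names below (log_level, logging_dir, capture_fx_graph_after, save_engine_profile, profile_format) are assumptions inferred from the capabilities described above; check the Torch-TensorRT debugger documentation for the exact API.

import torch_tensorrt

# Assumed context-manager usage; parameter names are illustrative, not authoritative.
with torch_tensorrt.dynamo.Debugger(
    log_level="debug",                         # logging verbosity, critical ... info/debug
    logging_dir="./torchtrt_debug",            # where logs, graphs, and profiles are written
    capture_fx_graph_after=["constant_fold"],  # draw the FX graph after this lowering pass
    save_engine_profile=True,                  # dump engine/layer profiling information
    profile_format="trex",                     # emit profiles consumable by the TREX tool
):
    trt_module = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=example_inputs)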
Model Zoo
We have expanded support to include several popular models from the Qwen3 and Llama3 series. In this release, we’ve also addressed various performance and accuracy issues to improve overall stability. For a complete list of supported models, please refer to the Supported Models section.
Bug Fixes
Refit
- Refit has been re-enabled for Python 3.13 after being disabled in 2.7.0.
- Reduced memory overhead by offloading the model to CPU.
Performance improvements
- The linear converter was reverted to the earlier implementation because it shows better FP16 performance on some models (e.g., BERT).
- Group Norm converter was simplified to reduce unnecessary TensorRT ILayers
- The constants in the BatchNorm converter are now folded at compile time, leading to significant performance improvements.
- The SDPA op decomposition has been optimized, resulting in performance on par with or better than ONNX-TensorRT for transformer-based diffusion models such as Stable Diffusion 3, WAN2.1, and FLUX.
What's Changed
- chore: bump torch to 2.8.0.dev by @zewenli98 in #3449
- Nccl ops correction changes by @apbose in #3387
- fix: Change the translational layer from numpy to torch during conversion to handle additional data types by @peri044 in #3445
- Fix grid_sample by @HolyWu in #3340
- fix: Destory cuda graphs before setting weight streaming by @keehyuna in #3461
- tool: uv setting to avoid the pip install -e by @narendasan in #3468
- chore: reenable py313 by @zewenli98 in #3455
- bf16 support for elementwise operation by @apbose in #3462
- feat: rmsnorm lowering by @bowang007 in #3440
- feat: Support flashinfer.rmsnorm by @bowang007 in #3424
- fix: support masked_scatter by lowering path and corner case of maske… by @chohk88 in #3476
- fix: index_put converter to handle multi-shape slicing with None by @chohk88 in #3475
- slight code reorg and bug correction for cross_compile by @apbose in #3472
- Enabled refit on Python 3.13 by @cehongwang in #3481
- fix: l2_limit_for_tiling by @zewenli98 in #3479
- chore: test bf16 fixes in CI by @peri044 in #3491
- add python3.13 into the final release artifact by @lanluo-nvidia in #3499
- chore: remove pre-cxx11 abi by @zewenli98 in #3473
- disabling dla args for hope igx platform by @apbose in #3487
- chore: remove pre-cxx11 abi references in doc by @zewenli98 in #3503
- Fix Windows CI for Release 2.7 (#3505) by @narendasan in #3506
- upgrade modelopt by @lanluo-nvidia in #3511
- chore: miscellaneous fixes for handling graph breaks by @peri044 in #3488
- add nspect ignore file by @lanluo-nvidia in #3514
- Update mutable_torchtrt_module_example.py by @cehongwang in #3519
- Add Linux CI build for aarch64 by @lanluo-nvidia in #3516
- chore: update the docstring for llama2 rmsnorm automatic plugin example by @bowang007 in #3512
- chore(deps): bump undici from 5.28.5 to 5.29.0 in /.github/actions/assigner by @dependabot[bot] in #3520
- fix docker build failure: add allow_empty to true by @lanluo-nvidia in #3526
- Added CPU offloading by @cehongwang in #3452
- chore(deps): bump setuptools from 70.2.0 to 78.1.1 in /toolchains/jp_workspaces by @dependabot[bot] in #3523
- add feature gate for tensorrt plugin by @lanluo-nvidia in #3518
- chore(deps): bump transformers from 4.48.0 to 4.50.0 in /examples/dynamo by @dependabot[bot] in #3497
- Minor fix - check for DTensor on igpu platform by @apbose in #3531
- fix: wrong dtype and device in aten.full_like decomposition by @junstar92 in #3535
- feat: Implement SDPA op converter / lowering pass as extensions by @peri044 in #3534
- nvidia-modelopt dependency fix by @lanluo-nvidia in #3544
- Add jetson build on CI by @lanluo-nvidia in #3524
- feat: TensorRT AOT Plugin by @bowang007 in #3504
- Publish jetson wheel to pytorch nightly index by @lanluo-nvidia in #3550
- fix: handle device in the same way as dtype in aten.full_like decomposition by @junstar92 in #3538
- fix the jetson nightly build check bug by @lanluo-nvidia in #3552
- fix int8/fp8 constant folding issue by @lanluo-nvidia in #3543
- Upgrade to TensorRT 10.11 by @lanluo-nvidia in #3557
- Cross compile guard by @apbose in #3486
- fix: Fix constant folding failure due to modelopt by @peri044 in #3565
- add --no-deps for tests/py/requirements.txt by @lanluo-nvidia in #3569
- Add fp4 support by @lanluo-nvidia in #3532
- fix: Fix a perf regression due to weights being ITensors by @peri044 in #3568
- Added flux demo by @cehongwang in #3418
- FX graph visualization by @cehongwang in #3528
- fix main test failure bug by @lanluo-nvidia in #3590
- Verify C++ tests, fix cuda graphs union issue by @narendasan in #3589
- Fix: fix aot plugin example docstring issue by @bowang007 in #3595
- feat: working uv pyproject.toml by @narendasan in #3597
- remove torchvision dependency from build, optional for test by @lanluo-nvidia in #3598
- Changed weight map to tensor and fix the refit bug by @cehongwang in #3573
- test failed but displayed as green by @lanluo-nvidia in #3599
- Import dllist only on linux by @HolyWu in #3592
- feat: Hierarchical Partitioner to support multi-backends by @zewenli98 in #3539
- fix dynamo converter test case failure by @lanluo-nvidia in #3594
- feat: Saving modules using the AOTI format by @narendasan in #3567
- skip flashinfer-python for py3.9 due to upstream error by @lanluo-nvidia in #3605
- fix enabled_precisions error in test cases by @lanluo-nvidia in #3606
- debug flag is deprecated, remove it so that test won't complain by @lanluo-nvidia in #3610
- fix: add prefix in hierarchical_partitioner_example by @zewenli98 in #3607
- fix: pre-commit issues by @zewenli98 in #3603
- py39 does not like | E TypeError: unsupported operand type(s) for |: 'type' and 'EnumMeta' by @lanluo-nvidia in #3611
- fix cross compilation test bug by @lanluo-nvidia in #3609
- TorchTensorRTModule Serialization Fix by @cehongwang in #3572
- a few CI changes by @lanluo-nvidia in #3612
- remove debug flag by @lanluo-nvidia in #3618
- fix: Fix unbacked sym int not found issue by @peri044 in #3617
- fix ts fe test error. by @lanluo-nvidia in #3619
- disable test on aarch64 for now by @lanluo-nvidia in #3623
- disable aoti format in windows by @lanluo-nvidia in #3632
- release 2.8 branch cut by @lanluo-nvidia in #3638
- cherry pick 3636 by @lanluo-nvidia in #3640
- cherry pick 3642 by @lanluo-nvidia in #3655
- Lluo/cherry pick 3629 by @lanluo-nvidia in #3656
- Lluo/cherry pick 3620 by @lanluo-nvidia in #3658
- cherry pick 3663: fix the int8 quantization error, remove duplicated lines by @lanluo-nvidia in #3665
- cherry pick 3660 to release/2.8 by @lanluo-nvidia in #3661
- cherry pick 3685: disable jetson build in ci by @lanluo-nvidia in #3688
- cherry pick 3680: fix refit test bug by @lanluo-nvidia in #3687
- cherry-pick 3686: upgrade tensorrt from 10.11 to 10.12 by @lanluo-nvidia in #3690
- cherry pick 3689 to 2.8 release:flux fp4 by @lanluo-nvidia in #3696
- chore: cherry pick of KV cache PR (3527) by @peri044 in #3667
- Cherrypick of PR 3513 by @apbose in #3664
- Cherrypick of PR 3570 by @apbose in #3662
- chore: cherry pick of bf16 cast PR (3643) by @peri044 in #3666
- Cherrypick #3719 for release/2.8 by @zewenli98 in #3734
- Cherrypick #3703 for release/2.8 by @zewenli98 in #3735
- enable back jetpack build by @lanluo-nvidia in #3720
- add typing_extensions as test dependencies which is required by modelopt by @lanluo-nvidia in #3743
- broadcast_remove - cherry pick 3700 by @lanluo-nvidia in #3757
- fix typing-extensions issue by @lanluo-nvidia in #3761
- Fix Jetson FP4 gate issue by @lanluo-nvidia in #3764
- fix build cancellation issue by @lanluo-nvidia in #3768
Full Changelog: v2.7.0...v2.8.0-rc6