
Releases: NVIDIA/TensorRT-Model-Optimizer

ModelOpt 0.37.0 Release

08 Oct 16:43
df0882a


Deprecations

  • Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
  • Deprecated quantize_mode argument in examples/onnx_ptq/evaluate.py to support strong typing. Use engine_precision instead.
  • Deprecated TRT-LLM's TRT backend in examples/llm_ptq and examples/vlm_ptq. Support for the build and benchmark tasks is removed and replaced with quant. engine_dir is replaced with checkpoint_dir in examples/llm_ptq and examples/vlm_ptq. For performance evaluation, please use trtllm-bench directly.
  • The --export_fmt flag in examples/llm_ptq is removed. By default, we export to the unified Hugging Face checkpoint format.
  • Deprecated examples/vlm_eval as it depends on the deprecated TRT-LLM's TRT backend.

New Features

  • high_precision_dtype now defaults to fp16 in ONNX quantization, i.e., quantized output model weights are FP16 by default (see the sketch after this list).
  • Upgraded TensorRT-LLM dependency to 1.1.0rc2.
  • Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in examples/vlm_ptq.
  • Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
  • Added Minitron pruning example for the Megatron-LM framework. See examples/megatron-lm for more details.
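
Below is a minimal, hypothetical sketch of what the new default means when calling the ONNX quantization API from Python. The keyword names (onnx_path, quantize_mode, high_precision_dtype, output_path) are assumptions based on the ONNX PTQ examples and should be verified against the installed version.

    # Hypothetical sketch: ONNX PTQ where non-quantized weights now stay FP16 by default.
    # Keyword names are assumptions; check modelopt.onnx.quantization.quantize for the exact API.
    from modelopt.onnx.quantization import quantize

    quantize(
        onnx_path="model.onnx",          # input ONNX model
        quantize_mode="int8",            # quantization mode
        high_precision_dtype="fp16",     # new default: remaining weights are exported as FP16
        output_path="model.quant.onnx",  # quantized output model
    )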

ModelOpt 0.35.1 Release

20 Sep 08:32
0365238


ModelOpt 0.35.0 Release

04 Sep 05:50
c359cb7


Deprecations

  • Deprecate torch<2.6 support.
  • Deprecate NeMo 1.0 model support.

Bug Fixes

  • Fix attention head ranking logic for pruning Megatron Core GPT models.

New Features

  • ModelOpt now supports PTQ and QAT for GPT-OSS models. See examples/gpt_oss for an end-to-end PTQ/QAT example.
  • Add support for QAT with HuggingFace + DeepSpeed. See examples/gpt_oss for an example.
  • Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See examples/gpt_oss for an example.
  • ModelOpt provides convenient trainers such as :class:QATTrainer, :class:QADTrainer, :class:KDTrainer, and :class:QATSFTTrainer, which inherit from Hugging Face trainers.
    ModelOpt trainers can be used as drop-in replacements for the corresponding Hugging Face trainer (see the sketch after this list). See usage examples in examples/gpt_oss, examples/llm_qat, or examples/llm_distill.
  • (Experimental) Add quantization support for custom TensorRT op in ONNX models.
  • Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
  • Add tree decoding support for Megatron Eagle models.
  • For most VLMs, quantization of the vision components is now explicitly disabled, and those modules are added to excluded_modules during HF export.
  • Add support for mamba_num_heads, mamba_head_dim, hidden_size and num_layers pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in mcore_minitron (previously mcore_gpt_minitron) mode.
  • Add example for QAT/QAD training with LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory/tree/main). See examples/llm_qat/llama_factory for more details.
  • Upgrade TensorRT-LLM dependency to 1.0.0rc6.
  • Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
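
As referenced above, a minimal sketch of the drop-in trainer usage follows. The import path for QATTrainer and the quant_cfg argument name are assumptions for illustration only; the model, dataset, and config below are placeholders. See examples/llm_qat and examples/gpt_oss for the exact, supported usage.

    # Hypothetical sketch: swap the Hugging Face Trainer for ModelOpt's QATTrainer.
    # Import path and quant_cfg argument name are assumptions; see examples/llm_qat.
    from transformers import AutoModelForCausalLM, TrainingArguments
    import modelopt.torch.quantization as mtq
    from modelopt.torch.quantization.plugins.transformers_trainer import QATTrainer  # assumed path

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
    args = TrainingArguments(output_dir="qat_out", per_device_train_batch_size=1)

    # QATTrainer behaves like transformers.Trainer but quantizes the model before fine-tuning.
    trainer = QATTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,      # placeholder: any tokenized HF dataset
        quant_cfg=mtq.NVFP4_DEFAULT_CFG,  # assumed argument name for the quantization config
    )
    trainer.train()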

ModelOpt 0.33.1 Release

12 Aug 18:50
55b9106


Bug Fixes

  • Fix a Qwen3 MOE model export issue.

ModelOpt 0.33.0 Release

14 Jul 18:18


Backward Breaking Changes

  • PyTorch dependencies for modelopt.torch features are no longer optional, and pip install nvidia-modelopt is now the same as pip install nvidia-modelopt[torch].

New Features

  • Upgrade TensorRT-LLM dependency to 0.20.
  • Add new CNN QAT example to demonstrate how to use ModelOpt for QAT.
  • Add support for ONNX models with custom TensorRT ops in Autocast.
  • Add quantization aware distillation (QAD) support in llm_qat example.
  • Add support for BF16 in ONNX quantization.
  • Add per node calibration support in ONNX quantization.
  • ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires transformers>=4.52.0.
  • Support quantization of FSDP2 wrapped models and add FSDP2 support in the llm_qat example.
  • Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.

ModelOpt 0.31.0 Release

05 Jun 21:02


Backward Breaking Changes

  • NeMo and Megatron-LM distributed checkpoints (torch-dist) stored with legacy versions can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a torch checkpoint, and resume with 0.31 to convert to the new format. The following changes only apply to storing and resuming distributed checkpoints.
    • quantizer_state of :class:TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer> is now stored in extra_state of :class:QuantModule <modelopt.torch.quantization.nn.module.QuantModule>, where it was previously stored in the sharded modelopt_state.
    • The dtype and shape of amax and pre_quant_scale stored in the distributed checkpoint are now restored. Previously, some dtypes and shapes were changed so that all decoder layers had a homogeneous structure in the checkpoint.
    • Together with megatron.core-0.13, quantized models now store and resume distributed checkpoints in a heterogeneous format.
  • The auto_quantize API now accepts a list of quantization config dicts as the list of quantization choices (see the sketch after this list).
    • This API previously accepted a list of quantization format names (strings), so it was limited to pre-defined quantization formats unless worked around with hacks.
    • With this change, users can now easily use their own custom quantization formats with auto_quantize.
    • In addition, quantization_formats now excludes None (indicating "do not quantize") as a valid format, because auto_quantize internally always adds "do not quantize" as an option anyway.
  • Model export config is refactored. The quant config in hf_quant_config.json is converted and saved to config.json. hf_quant_config.json will be deprecated soon.
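
A minimal sketch of the new auto_quantize calling convention, passing config dicts (including custom ones) instead of format-name strings. The keyword names and the placeholder model/data objects are assumptions and should be checked against the mtq.auto_quantize documentation.

    # Hypothetical sketch: auto_quantize with quantization config dicts instead of strings.
    # `model` and `calib_loader` are placeholders; keyword names are assumptions.
    import modelopt.torch.quantization as mtq

    # Any config dict works, including a user-defined variant of a built-in config.
    custom_fp8_cfg = dict(mtq.FP8_DEFAULT_CFG)

    model, search_state = mtq.auto_quantize(
        model,
        constraints={"effective_bits": 4.8},                           # target average bit-width
        quantization_formats=[mtq.NVFP4_DEFAULT_CFG, custom_fp8_cfg],  # config dicts, not names
        data_loader=calib_loader,
        forward_step=lambda m, batch: m(**batch),
        loss_func=lambda output, batch: output.loss,
    )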

Deprecations

  • Deprecate Python 3.9 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.19.
  • Add new model support in the llm_ptq example: Qwen3 MoE.
  • ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant, and SmoothQuant for CPU-offloaded Hugging Face models.
  • Add AutoCast tool to convert ONNX models to FP16 or BF16.
  • Add --low_memory_mode flag to the llm_ptq example to initialize HF models with compressed weights, reducing the peak memory usage of PTQ and quantized checkpoint export.

ModelOpt 0.29.0 Release

09 May 05:26


Backward Breaking Changes

  • Refactor SequentialQuantizer to improve its implementation and maintainability while preserving its functionality.

Deprecations

  • Deprecate torch<2.4 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.18.
  • Add new model support in the llm_ptq example: Gemma-3, Llama-Nemotron.
  • Add INT8 real quantization support.
  • Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the mtq.compress <modelopt.torch.quantization.compress> API to accelerate evaluation of quantized models (see the sketch after this list).
  • Use the shape of PyTorch parameters and buffers of TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer> to initialize them during restore. This makes restoring quantized models more robust.
  • Support adding new custom quantization calibration algorithms. Please refer to mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate> or custom calibration algorithm doc for more details.
  • Add EAGLE3 (LlamaForCausalLMEagle3) training and unified ModelOpt checkpoint export support for Megatron-LM.
  • Add support for --override_shapes flag to ONNX quantization.
    • --calibration_shapes is reserved for the input shapes used for calibration process.
    • --override_shapes is used to override the input shapes of the model with static shapes.
  • Add support for UNet ONNX quantization.
  • Enable concat_elimination pass by default to improve the performance of quantized ONNX models.
  • Enable Redundant Cast elimination pass by default in moq.quantize <modelopt.onnx.quantization.quantize>.
  • Add new attribute parallel_state to DynamicModule <modelopt.torch.opt.dynamic.DynamicModule> to support distributed parallelism such as data parallelism and tensor parallelism.
  • Add MXFP8, NVFP4 quantized ONNX export support.
  • Add a new example of torch quantization exported to ONNX for MXFP8 and NVFP4 precisions.
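
A minimal sketch of the PTQ-then-compress flow referenced in this list; mtq.quantize and mtq.compress are the entry points named above, while the model and calibration data loader are placeholders.

    # Sketch: FP8 PTQ followed by mtq.compress so evaluation runs on real low-precision
    # weights (which the new FP8 per-tensor GEMM kernel can accelerate).
    import modelopt.torch.quantization as mtq

    def forward_loop(model):
        # Placeholder calibration loop: run a few representative batches through the model.
        for batch in calib_loader:
            model(**batch)

    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # simulated quantization + calibration
    mtq.compress(model)  # replace simulated-quantized weights with compressed real-quantized weights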

ModelOpt 0.27.1 Release

15 Apr 18:24


Add experimental quantization support for Llama4, QwQ and Qwen MOE models.

ModelOpt 0.27.0 Release

03 Apr 05:24


Deprecations

  • Deprecate real quantization configs. Please use the mtq.compress <modelopt.torch.quantization.compress> API for model compression after quantization.

New Features

  • New model support in the llm_ptq example: OpenAI Whisper.
  • Blockwise FP8 quantization support in unified model export.
  • Add quantization support to the Transformer Engine Linear module.
  • Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
  • To support resuming distributed checkpoints with expert parallelism (EP), the modelopt_state in Megatron Core distributed checkpoints (used in NeMo and Megatron-LM) is stored differently. The legacy modelopt_state in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
  • Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
  • Add a new API mtq.compress <modelopt.torch.quantization.compress> for model compression for weights after quantization.
  • Add an option to simplify the ONNX model before quantization is performed.
  • (Experimental) Improve support for ONNX models with custom TensorRT op:
    • Add support for --calibration_shapes flag.
    • Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.

Known Issues

  • Quantization of T5 models is broken. Please use nvidia-modelopt==0.25.0 with transformers<4.50 in the meantime.

ModelOpt 0.25.0 Release

03 Mar 17:41


Deprecations

  • Deprecate Torch 2.1 support.
  • Deprecate humaneval benchmark in llm_eval examples. Please use the newly added simple_eval instead.
  • Deprecate fp8_naive quantization format in llm_ptq examples. Please use fp8 instead.

New Features

  • Support fast Hadamard transform in the TensorQuantizer class (modelopt.torch.quantization.nn.modules.TensorQuantizer).
    It can be used for rotation-based quantization methods, e.g., QuaRot. Users need to install the fast_hadamard_transform package to use this feature.
  • Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
  • Add FSDP2 support. FSDP2 can now be used for QAT.
  • Add LiveCodeBench and Simple Evals to the llm_eval examples.
  • Saving the ModelOpt state in the unified HF export APIs is now disabled by default, i.e., a save_modelopt_state flag was added to the export_hf_checkpoint API and defaults to False (see the sketch after this list).
  • Add FP8 and NVFP4 real quantization support with LLM QLoRA example.
  • The modelopt.deploy.llm.LLM class now supports using the tensorrt_llm._torch.LLM backend for quantized Hugging Face checkpoints.
  • Add NVFP4 PTQ example for DeepSeek-R1.
  • Add end-to-end AutoDeploy example for AutoQuant LLM models.
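
A minimal sketch of the export behavior change mentioned above; export_hf_checkpoint is the unified HF export API, `model` is a placeholder for a quantized model, and the exact keyword names should be verified against the installed version.

    # Sketch: unified Hugging Face checkpoint export. The ModelOpt state is no longer
    # saved by default; pass save_modelopt_state=True to keep the previous behavior.
    from modelopt.torch.export import export_hf_checkpoint

    export_hf_checkpoint(model, export_dir="quantized_ckpt")  # default: modelopt state not saved
    export_hf_checkpoint(model, export_dir="quantized_ckpt_with_state", save_modelopt_state=True)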