
Releases: NVIDIA/TensorRT-Model-Optimizer

ModelOpt 0.39.0 Release

13 Nov 07:25
f329b19


Deprecations

  • Deprecated modelopt.torch._deploy.utils.get_onnx_bytes API. Please use modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata instead to access the ONNX model bytes with external data. See examples/onnx_ptq/download_example_onnx.py for example usage.
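
  For reference, a minimal migration sketch is shown below; the exact return structure of get_onnx_bytes_and_metadata is an assumption here, and examples/onnx_ptq/download_example_onnx.py remains the authoritative usage example.

    import torch
    from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

    # Any torch.nn.Module and a matching dummy input used for ONNX export.
    model = torch.nn.Linear(16, 8)
    dummy_input = torch.randn(1, 16)

    # Previously: onnx_bytes = get_onnx_bytes(model, dummy_input)
    # Assumed return layout: the serialized model plus metadata describing
    # external data files (weights stored outside the .onnx file).
    onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)

    with open("model.onnx", "wb") as f:
        f.write(onnx_bytes)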

New Features

  • Added flag op_types_to_exclude_fp16 in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating 'fp32' precision in trt_plugins_precision.
  • Added LoRA mode support for MCore in a new peft submodule: modelopt.torch.peft.update_model(model, LORA_CFG). See the sketch after this list.
  • Added support for PTQ and fake quantization (fakequant) in vLLM for fast evaluation of arbitrary quantization formats. See examples/vllm_serve for more details.
  • Added support for nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in examples/llm_ptq. Defaults to a mix of cnn_dailymail and nemotron-post-training-dataset-v2 (gated dataset accessed using the HF_TOKEN environment variable) if no dataset is specified.
  • Added support for specifying calib_seq in examples/llm_ptq to set the maximum sequence length for calibration.
  • Added support for MCore MoE PTQ/QAT/QAD.
  • Added support for multi-node PTQ and export with FSDP2 in examples/llm_ptq/multinode_ptq.py. See examples/llm_ptq/README.md for more details.
  • Added support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
  • Added flags nodes_to_include and op_types_to_include in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
  • Added support for torch.compile and benchmarking in examples/diffusers/quantization/diffusion_trt.py.
  • Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See SGLang quantization documentation for more details.
  • Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
  • Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
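
  A hedged sketch of the new LoRA mode mentioned above; the structure of LORA_CFG below is an illustrative assumption rather than the shipped schema, and the MCore model itself is assumed to be built elsewhere by your training setup.

    from modelopt.torch.peft import update_model

    # Hypothetical LoRA config -- the field names and wildcard syntax are
    # placeholders for illustration; consult the peft submodule for the real schema.
    LORA_CFG = {
        "adapter_cfg": {
            "*": {"rank": 32},  # assumed: rank-32 adapters on all supported layers
        },
    }

    # `model` is a Megatron Core (MCore) model constructed by your training setup.
    model = update_model(model, LORA_CFG)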


ModelOpt 0.37.0 Release

08 Oct 16:43
df0882a


Deprecations

  • Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
  • Deprecated quantize_mode argument in examples/onnx_ptq/evaluate.py to support strong typing. Use engine_precision instead.
  • Deprecated TRT-LLM's TRT backend in examples/llm_ptq and examples/vlm_ptq. Support for the build and benchmark tasks is removed and replaced with the quant task. engine_dir is replaced with checkpoint_dir in examples/llm_ptq and examples/vlm_ptq. For performance evaluation, please use trtllm-bench directly.
  • The --export_fmt flag in examples/llm_ptq is removed. By default, we export to the unified Hugging Face checkpoint format.
  • Deprecated examples/vlm_eval as it depends on the deprecated TRT-LLM's TRT backend.

New Features

  • high_precision_dtype defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default (see the sketch after this list).
  • Upgraded TensorRT-LLM dependency to 1.1.0rc2.
  • Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in examples/vlm_ptq.
  • Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
  • Added Minitron pruning example for the Megatron-LM framework. See examples/megatron-lm for more details.
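
  A minimal sketch of the new FP16 default in ONNX PTQ; parameter names other than high_precision_dtype are assumptions here, so check the modelopt.onnx.quantization documentation for the exact signature.

    from modelopt.onnx.quantization import quantize

    # Assumed argument names for illustration; high_precision_dtype="fp16" is now
    # the default, so unquantized weights and activations are kept in FP16.
    quantize(
        onnx_path="model.onnx",
        quantize_mode="int8",
        output_path="model.quant.onnx",
        high_precision_dtype="fp16",
    )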

ModelOpt 0.35.1 Release

20 Sep 08:32
0365238


ModelOpt 0.35.0 Release

04 Sep 05:50
c359cb7


Deprecations

  • Deprecate torch<2.6 support.
  • Deprecate NeMo 1.0 model support.

Bug Fixes

  • Fix attention head ranking logic for pruning Megatron Core GPT models.

New Features

  • ModelOpt now supports PTQ and QAT for GPT-OSS models. See examples/gpt_oss for end-to-end PTQ/QAT example.
  • Add support for QAT with HuggingFace + DeepSpeed. See examples/gpt_oss for an example.
  • Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See examples/gpt_oss for an example.
  • ModelOpt provides convenient trainers such as QATTrainer, QADTrainer, KDTrainer, and QATSFTTrainer, which inherit from the Hugging Face trainers.
    ModelOpt trainers can be used as drop-in replacements for the corresponding Hugging Face trainer. See usage examples in examples/gpt_oss, examples/llm_qat, or examples/llm_distill, and the sketch after this list.
  • (Experimental) Add quantization support for custom TensorRT op in ONNX models.
  • Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
  • Add tree decoding support for Megatron Eagle models.
  • For most VLMs, quantization of the vision part is now explicitly disabled by adding those modules to excluded_modules during HF export.
  • Add support for mamba_num_heads, mamba_head_dim, hidden_size and num_layers pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in mcore_minitron (previously mcore_gpt_minitron) mode.
  • Add example for QAT/QAD training with LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory/tree/main). See examples/llm_qat/llama_factory for more details.
  • Upgrade TensorRT-LLM dependency to 1.0.0rc6.
  • Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
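
  A drop-in-replacement sketch for the trainers mentioned above; the import path and the quant_cfg keyword are assumptions based on examples/llm_qat, so treat this as illustrative rather than the exact API.

    import modelopt.torch.quantization as mtq
    from transformers import TrainingArguments
    # Assumed import location for the ModelOpt trainer; see examples/llm_qat for the real wiring.
    from modelopt.torch.quantization.plugins.transformers_trainer import QATTrainer

    trainer = QATTrainer(
        model=model,                           # a Hugging Face model prepared elsewhere
        args=TrainingArguments(output_dir="qat_out", num_train_epochs=1),
        train_dataset=train_dataset,           # your tokenized dataset
        quant_cfg=mtq.NVFP4_DEFAULT_CFG,       # assumed keyword for the quantization recipe
    )
    trainer.train()                            # same surface as transformers.Trainer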

ModelOpt 0.33.1 Release

12 Aug 18:50
55b9106


Bug Fixes

  • Fix a Qwen3 MoE model export issue.

ModelOpt 0.33.0 Release

14 Jul 18:18


Backward Breaking Changes

  • PyTorch dependencies for modelopt.torch features are no longer optional, and pip install nvidia-modelopt is now the same as pip install nvidia-modelopt[torch].

New Features

  • Upgrade TensorRT-LLM dependency to 0.20.
  • Add a new CNN QAT example demonstrating how to use ModelOpt for QAT (see the sketch after this list).
  • Add support for ONNX models with custom TensorRT ops in AutoCast.
  • Add quantization aware distillation (QAD) support in llm_qat example.
  • Add support for BF16 in ONNX quantization.
  • Add per node calibration support in ONNX quantization.
  • ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires transformers>=4.52.0.
  • Support quantization of FSDP2 wrapped models and add FSDP2 support in the llm_qat example.
  • Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
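
  A minimal QAT sketch, assuming an existing calibration DataLoader and training loop; the CNN QAT example referenced above shows the complete flow.

    import modelopt.torch.quantization as mtq

    def forward_loop(m):
        # Run a handful of calibration batches through the model.
        for images, _ in calib_loader:      # your calibration DataLoader
            m(images)

    # Insert and calibrate fake-quantization ops.
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

    # Fine-tune as usual: the fake-quant ops are differentiable, so a standard
    # training loop performs quantization-aware training.
    train(model)                            # your existing training function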

ModelOpt 0.31.0 Release

05 Jun 21:02


Backward Breaking Changes

  • NeMo and Megatron-LM distributed checkpoints (torch-dist) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a torch checkpoint, and resume with 0.31 to convert it to the new format. The following changes only apply to storing and resuming distributed checkpoints.
    • quantizer_state of TensorQuantizer (modelopt.torch.quantization.nn.modules.TensorQuantizer) is now stored in extra_state of QuantModule (modelopt.torch.quantization.nn.module.QuantModule), where it used to be stored in the sharded modelopt_state.
    • The dtype and shape of amax and pre_quant_scale stored in the distributed checkpoint are now restored. Previously, some dtypes and shapes were changed to make all decoder layers have a homogeneous structure in the checkpoint.
    • Together with megatron.core-0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
  • auto_quantize API now accepts a list of quantization config dicts as the list of quantization choices.
    • This API previously accepted a list of quantization format names (strings) and was therefore limited to pre-defined quantization formats, except through workarounds.
    • With this change, users can now easily use their own custom quantization formats with auto_quantize (see the sketch after this list).
    • In addition, quantization_formats now excludes None (indicating "do not quantize") as a valid format, because auto_quantize always adds "do not quantize" as an option internally.
  • Model export config is refactored. The quant config in hf_quant_config.json is converted and saved to config.json. hf_quant_config.json will be deprecated soon.
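
  A sketch of auto_quantize with config dicts, assuming the 0.31 argument names (constraints, quantization_formats, data_loader, forward_step, loss_func); treat the exact keywords, the return value, and the config tweak below as assumptions.

    import copy

    import modelopt.torch.quantization as mtq

    # Start from a built-in recipe and tweak it -- an assumed, illustrative edit.
    custom_cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    custom_cfg["quant_cfg"]["*weight_quantizer"]["num_bits"] = 4

    model, search_state = mtq.auto_quantize(
        model,
        constraints={"effective_bits": 4.8},
        quantization_formats=[mtq.FP8_DEFAULT_CFG, custom_cfg],  # config dicts, not strings
        data_loader=calib_loader,                                # your calibration DataLoader
        forward_step=lambda m, batch: m(**batch),
        loss_func=lambda output, batch: output.loss,
    )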

Deprecations

  • Deprecate Python 3.9 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.19.
  • Add new model support in the llm_ptq example: Qwen3 MoE.
  • ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant, and SmoothQuant for CPU-offloaded Hugging Face models (see the sketch after this list).
  • Add AutoCast tool to convert ONNX models to FP16 or BF16.
  • Add --low_memory_mode flag support in the llm_ptq example to initialize HF models with compressed weights and reduce the peak memory usage of PTQ and quantized checkpoint export.
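
  A hedged sketch of AWQ PTQ on a CPU-offloaded Hugging Face model; the checkpoint name and the calibration loop are placeholders.

    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM

    # device_map="auto" lets Accelerate offload layers to CPU when GPU memory is tight.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",   # placeholder checkpoint
        device_map="auto",
    )

    def forward_loop(m):
        for batch in calib_loader:            # your calibration DataLoader of tokenized batches
            m(**batch)

    model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)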

ModelOpt 0.29.0 Release

09 May 05:26


Backward Breaking Changes

  • Refactor SequentialQuantizer to improve its implementation and maintainability while preserving its functionality.

Deprecations

  • Deprecate torch<2.4 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.18.
  • Add new model support in the llm_ptq example: Gemma-3, Llama-Nemotron.
  • Add INT8 real quantization support.
  • Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the mtq.compress (modelopt.torch.quantization.compress) API to accelerate evaluation of quantized models (see the sketch after this list).
  • Use the shape of PyTorch parameters and buffers of TensorQuantizer (modelopt.torch.quantization.nn.modules.TensorQuantizer) to initialize them during restore. This makes quantized model restoring more robust.
  • Support adding new custom quantization calibration algorithms. Please refer to mtq.calibrate (modelopt.torch.quantization.model_quant.calibrate) or the custom calibration algorithm documentation for more details.
  • Add EAGLE3 (LlamaForCausalLMEagle3) training and unified ModelOpt checkpoint export support for Megatron-LM.
  • Add support for --override_shapes flag to ONNX quantization.
    • --calibration_shapes is reserved for the input shapes used for calibration process.
    • --override_shapes is used to override the input shapes of the model with static shapes.
  • Add support for UNet ONNX quantization.
  • Enable concat_elimination pass by default to improve the performance of quantized ONNX models.
  • Enable Redundant Cast elimination pass by default in moq.quantize (modelopt.onnx.quantization.quantize).
  • Add new attribute parallel_state to DynamicModule (modelopt.torch.opt.dynamic.DynamicModule) to support distributed parallelism such as data parallel and tensor parallel.
  • Add MXFP8, NVFP4 quantized ONNX export support.
  • Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
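
  A short sketch of compressing a real-quantized model after PTQ, assuming an FP8 per-tensor recipe and an existing calibration forward_loop.

    import modelopt.torch.quantization as mtq

    # PTQ calibration with an FP8 recipe (forward_loop runs your calibration batches).
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Pack weights into low-precision storage so evaluation can use the
    # accelerated GEMM kernels where applicable.
    mtq.compress(model)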

ModelOpt 0.27.1 Release

15 Apr 18:24


Add experimental quantization support for Llama4, QwQ, and Qwen MoE models.

ModelOpt 0.27.0 Release

03 Apr 05:24


Deprecations

  • Deprecate real quantization configs. Please use the mtq.compress (modelopt.torch.quantization.compress) API for model compression after quantization.

New Features

  • New model support in the llm_ptq example: OpenAI Whisper.
  • Blockwise FP8 quantization support in unified model export (see the sketch after this list).
  • Add quantization support to the Transformer Engine Linear module.
  • Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
  • To support resuming distributed checkpoints with expert parallelism (EP), modelopt_state in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is now stored differently. The legacy modelopt_state in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
  • Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
  • Add a new API, mtq.compress (modelopt.torch.quantization.compress), for compressing model weights after quantization.
  • Add an option to simplify the ONNX model before quantization is performed.
  • (Experimental) Improve support for ONNX models with custom TensorRT ops:
    • Add support for --calibration_shapes flag.
    • Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
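
  A hedged export sketch for the unified checkpoint format mentioned above; the export_dir keyword is an assumption, and the model is assumed to have been quantized (e.g., with a blockwise FP8 recipe) earlier in the script.

    from modelopt.torch.export import export_hf_checkpoint

    # `model` was quantized with a blockwise FP8 recipe beforehand.
    export_hf_checkpoint(model, export_dir="exported_fp8_checkpoint")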

Known Issues

  • Quantization of T5 models is broken. Please use nvidia-modelopt==0.25.0 with transformers<4.50 in the meantime.