Releases · NVIDIA/TensorRT-Model-Optimizer
ModelOpt 0.37.0 Release
Deprecations
- Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
- Deprecated the `quantize_mode` argument in `examples/onnx_ptq/evaluate.py` to support strong typing. Use `engine_precision` instead.
- Deprecated TRT-LLM's TRT backend in `examples/llm_ptq` and `examples/vlm_ptq`. Support for the `build` and `benchmark` tasks is removed and replaced with `quant`. `engine_dir` is replaced with `checkpoint_dir` in `examples/llm_ptq` and `examples/vlm_ptq`. For performance evaluation, please use `trtllm-bench` directly.
- The `--export_fmt` flag in `examples/llm_ptq` is removed. By default, we export to the unified Hugging Face checkpoint format.
- Deprecated `examples/vlm_eval` as it depends on the deprecated TRT-LLM TRT backend.
New Features
- `high_precision_dtype` defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default; see the sketch after this list.
- Upgraded TensorRT-LLM dependency to 1.1.0rc2.
- Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in `examples/vlm_ptq`.
- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
- Added Minitron pruning example for the Megatron-LM framework. See `examples/megatron-lm` for more details.
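Below is a minimal sketch of what the FP16 default looks like when calling ONNX PTQ from Python. The keyword names (`quantize_mode`, `output_path`, `high_precision_dtype`) and the file paths are assumptions based on this release note; check the `modelopt.onnx.quantization.quantize` documentation for the authoritative signature.

```python
# Hedged sketch: ONNX PTQ with the new FP16 high-precision default.
# Keyword names are assumptions; verify against modelopt.onnx.quantization.quantize.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",            # input ONNX model (placeholder path)
    quantize_mode="int8",              # precision used for the quantized ops
    output_path="model.quant.onnx",    # where to write the quantized model
    high_precision_dtype="fp16",       # new default: non-quantized weights stay FP16
)
```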
ModelOpt 0.35.1 Release
- Import fixes
ModelOpt 0.35.0 Release
Deprecations
- Deprecate `torch<2.6` support.
- Deprecate NeMo 1.0 model support.
Bug Fixes
- Fix attention head ranking logic for pruning Megatron Core GPT models.
New Features
- ModelOpt now supports PTQ and QAT for GPT-OSS models. See `examples/gpt_oss` for an end-to-end PTQ/QAT example.
- Add support for QAT with HuggingFace + DeepSpeed. See `examples/gpt_oss` for an example.
- Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See `examples/gpt_oss` for an example.
- ModelOpt provides convenient trainers such as `QATTrainer`, `QADTrainer`, `KDTrainer`, and `QATSFTTrainer`, which inherit from the Hugging Face trainers. ModelOpt trainers can be used as drop-in replacements for the corresponding Hugging Face trainer; see the sketch after this list and the usage examples in `examples/gpt_oss`, `examples/llm_qat`, or `examples/llm_distill`.
- (Experimental) Add quantization support for custom TensorRT ops in ONNX models.
- Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
- Add tree decoding support for Megatron Eagle models.
- For most VLMs, quantization is now explicitly disabled on the vision part, so those modules are added to `excluded_modules` during HF export.
- Add support for `mamba_num_heads`, `mamba_head_dim`, `hidden_size`, and `num_layers` pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in `mcore_minitron` (previously `mcore_gpt_minitron`) mode.
- Add example for QAT/QAD training with [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory/tree/main). See `examples/llm_qat/llama_factory` for more details.
- Upgrade TensorRT-LLM dependency to 1.0.0rc6.
- Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
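As a rough illustration of the drop-in trainer usage mentioned above, the sketch below swaps the Hugging Face `Trainer` for `QATTrainer`. The import path, the `quant_cfg` argument name, the model id, and the dataset are assumptions or placeholders for illustration only; `examples/llm_qat` and `examples/gpt_oss` show the supported recipes.

```python
# Hedged sketch: a ModelOpt trainer used as a drop-in replacement for transformers.Trainer.
# The import path and quant_cfg argument are assumptions; see examples/llm_qat for the real recipe.
from transformers import AutoModelForCausalLM, TrainingArguments
from modelopt.torch.quantization.plugins.transformers_trainer import QATTrainer  # assumed path

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")  # placeholder model id

trainer = QATTrainer(
    model=model,
    args=TrainingArguments(output_dir="qat_out", num_train_epochs=1),
    train_dataset=train_dataset,      # placeholder: your tokenized dataset
    quant_cfg="NVFP4_DEFAULT_CFG",    # assumed: name of the quantization config to apply
)
trainer.train()                       # calibrates/quantizes, then fine-tunes as usual
```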
ModelOpt 0.33.1 Release
Bug Fixes
- Fix a Qwen3 MoE model export issue.
ModelOpt 0.33.0 Release
Backward Breaking Changes
- PyTorch dependencies for `modelopt.torch` features are no longer optional, and `pip install nvidia-modelopt` is now the same as `pip install nvidia-modelopt[torch]`.
New Features
- Upgrade TensorRT-LLM dependency to 0.20.
- Add new CNN QAT example to demonstrate how to use ModelOpt for QAT.
- Add support for ONNX models with custom TensorRT ops in Autocast.
- Add quantization aware distillation (QAD) support in the `llm_qat` example.
- Add support for BF16 in ONNX quantization.
- Add per node calibration support in ONNX quantization.
- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires `transformers>=4.52.0`.
- Support quantization of FSDP2 wrapped models and add FSDP2 support in the `llm_qat` example.
- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
ModelOpt 0.31.0 Release
Backward Breaking Changes
- NeMo and Megatron-LM distributed checkpoints (`torch-dist`) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a `torch` checkpoint, and resume with 0.31 to convert to the new format. The following changes only apply to storing and resuming distributed checkpoints:
  - `quantizer_state` of `TensorQuantizer` (`modelopt.torch.quantization.nn.modules.TensorQuantizer`) is now stored in the `extra_state` of `QuantModule` (`modelopt.torch.quantization.nn.module.QuantModule`), where it used to be stored in the sharded `modelopt_state`.
  - The dtype and shape of `amax` and `pre_quant_scale` stored in the distributed checkpoint are now restored. Some dtypes and shapes were previously changed to give all decoder layers a homogeneous structure in the checkpoint.
  - Together with megatron.core-0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
- The `auto_quantize` API now accepts a list of quantization config dicts as the list of quantization choices (see the sketch after this list).
  - This API previously accepted a list of quantization format names (strings), so it was limited to pre-defined quantization formats unless worked around with hacks.
  - With this change, users can easily use their own custom quantization formats with `auto_quantize`.
  - In addition, `quantization_formats` now excludes `None` (indicating "do not quantize") as a valid format, because `auto_quantize` internally always adds "do not quantize" as an option anyway.
- Model export config is refactored. The quant config in `hf_quant_config.json` is converted and saved to `config.json`. `hf_quant_config.json` will be deprecated soon.
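For concreteness, here is a hedged sketch of an `auto_quantize` call passing config dicts instead of format-name strings, per the change above. The constraint value, the dataloader, and the forward/loss helpers are placeholders you would supply for your own model; check the `mtq.auto_quantize` documentation for the exact argument set.

```python
# Hedged sketch: auto_quantize with quantization config dicts as the search choices.
# calib_loader and the helper lambdas are placeholders for your own model and data.
import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},             # example target average bit-width
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG,     # config dicts, including custom ones
                          mtq.FP8_DEFAULT_CFG],
    data_loader=calib_loader,                        # placeholder calibration dataloader
    forward_step=lambda model, batch: model(**batch),
    loss_func=lambda output, batch: output.loss,
)
```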
Deprecations
- Deprecate Python 3.9 support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.19.
- Add new model support in the `llm_ptq` example: Qwen3 MoE.
- ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant, and SmoothQuant for CPU-offloaded Huggingface models.
- Add AutoCast tool to convert ONNX models to FP16 or BF16.
- Add support for the `--low_memory_mode` flag in the `llm_ptq` example to initialize HF models with compressed weights and reduce the peak memory of PTQ and quantized checkpoint export.
ModelOpt 0.29.0 Release
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the `mtq.compress` (`modelopt.torch.quantization.compress`) API to accelerate evaluation of quantized models; see the sketch after this list.
- Use the shape of PyTorch parameters and buffers of `TensorQuantizer` (`modelopt.torch.quantization.nn.modules.TensorQuantizer`) to initialize them during restore. This makes quantized model restoring more robust.
- Support adding new custom quantization calibration algorithms. Please refer to `mtq.calibrate` (`modelopt.torch.quantization.model_quant.calibrate`) or the custom calibration algorithm doc for more details.
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization. `--calibration_shapes` is reserved for the input shapes used for the calibration process; `--override_shapes` is used to override the input shapes of the model with static shapes.
- Add support for UNet ONNX quantization.
- Enable the `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in `moq.quantize` (`modelopt.onnx.quantization.quantize`).
- Add new attribute `parallel_state` to `DynamicModule` (`modelopt.torch.opt.dynamic.DynamicModule`) to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8, NVFP4 quantized ONNX export support.
- Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
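To make the PTQ-then-compress flow above concrete, here is a minimal sketch. The toy model, random calibration data, and the choice of `FP8_DEFAULT_CFG` are placeholders; `mtq.compress` is simply called as a post-quantization step as described in this release.

```python
# Hedged sketch: FP8 PTQ followed by mtq.compress so evaluation runs with real
# (compressed) weights. The model and calibration data here are toy placeholders.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
calib_data = [torch.randn(4, 64) for _ in range(8)]  # placeholder calibration batches

def forward_loop(model):
    # Feed a handful of calibration batches through the model to collect amax stats.
    for batch in calib_data:
        model(batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # calibrate and insert quantizers
mtq.compress(model)  # replace simulated-quant weights with compressed real-quant weights
```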
ModelOpt 0.27.1 Release
Add experimental quantization support for Llama4, QwQ and Qwen MoE models.
ModelOpt 0.27.0 Release
Deprecations
- Deprecate real quantization configs; please use the `mtq.compress` (`modelopt.torch.quantization.compress`) API for model compression after quantization.
New Features
- New model support in the `llm_ptq` example: OpenAI Whisper.
- Blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- To support distributed checkpoint resume for expert-parallel (EP), the `modelopt_state` in Megatron Core distributed checkpoints (used in NeMo and Megatron-LM) is stored differently. The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
- Add a triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API `mtq.compress` (`modelopt.torch.quantization.compress`) for compressing model weights after quantization.
- Add an option to simplify the ONNX model before quantization is performed.
- (Experimental) Improve support for ONNX models with custom TensorRT op:
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
- Add support for
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.
ModelOpt 0.25.0 Release
Deprecations
- Deprecate Torch 2.1 support.
- Deprecate the `humaneval` benchmark in the `llm_eval` examples. Please use the newly added `simple_eval` instead.
- Deprecate the `fp8_naive` quantization format in the `llm_ptq` examples. Please use `fp8` instead.
New Features
- Support fast Hadamard transform in the `TensorQuantizer` class (`modelopt.torch.quantization.nn.modules.TensorQuantizer`). It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the `fast_hadamard_transform` package to use this feature.
- Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
- Add FSDP2 support. FSDP2 can now be used for QAT.
- Add LiveCodeBench and Simple Evals to the `llm_eval` examples.
- Disabled saving the ModelOpt state in the unified HF export APIs by default, i.e., added a `save_modelopt_state` flag to the `export_hf_checkpoint` API that defaults to False; see the sketch after this list.
- Add FP8 and NVFP4 real quantization support with an LLM QLoRA example.
- The `modelopt.deploy.llm.LLM` class now supports using the `tensorrt_llm._torch.LLM` backend for quantized HuggingFace checkpoints.
- Add NVFP4 PTQ example for DeepSeek-R1.
- Add end-to-end AutoDeploy example for AutoQuant LLM models.
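As a rough illustration of the new default mentioned above, the sketch below exports a quantized model to the unified HF checkpoint format with `save_modelopt_state` left at `False`. The `export_dir` value is a placeholder, and the exact keyword set should be checked against the `export_hf_checkpoint` documentation.

```python
# Hedged sketch: unified HF checkpoint export without embedding the ModelOpt state.
# export_dir is a placeholder; verify keyword names against modelopt.torch.export.
from modelopt.torch.export import export_hf_checkpoint

export_hf_checkpoint(
    model,                            # quantized Hugging Face model (placeholder)
    export_dir="quantized_ckpt",      # output directory for the unified checkpoint
    save_modelopt_state=False,        # new default: do not embed ModelOpt state
)
```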