ModelOpt 0.31.0 Release
Backward Breaking Changes
- NeMo and Megatron-LM distributed checkpoints (torch-dist) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a torch checkpoint, and resume with 0.31 to convert it to the new format. The following changes only apply to storing and resuming distributed checkpoints:
  - ``quantizer_state`` of :class:`TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` is now stored in ``extra_state`` of :class:`QuantModule <modelopt.torch.quantization.nn.module.QuantModule>`, where it used to be stored in the sharded ``modelopt_state``.
  - The dtype and shape of ``amax`` and ``pre_quant_scale`` stored in the distributed checkpoint are now restored. Previously, some dtypes and shapes were changed so that all decoder layers have a homogeneous structure in the checkpoint.
  - Together with megatron.core 0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
- The ``auto_quantize`` API now accepts a list of quantization config dicts as the list of quantization choices (see the sketch after this list).
  - This API previously accepted a list of quantization format names as strings, so it was limited to the pre-defined quantization formats unless worked around with hacks.
  - With this change, users can easily use their own custom quantization formats with ``auto_quantize``.
  - In addition, ``quantization_formats`` now excludes ``None`` (indicating "do not quantize") as a valid format, because ``auto_quantize`` internally always adds "do not quantize" as an option anyway.
- The model export config is refactored. The quant config in ``hf_quant_config.json`` is converted and saved to ``config.json``; ``hf_quant_config.json`` will be deprecated soon (see the ``config.json`` sketch after this list).
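Below is a minimal sketch of the ``auto_quantize`` change above: the quantization choices are passed as config dicts, so a custom format can be built by copying and editing a predefined config. The toy model, calibration data, and loss function are placeholders, and the exact keyword arguments and return value of ``mtq.auto_quantize`` may differ slightly between releases.

.. code-block:: python

    import copy

    import torch

    import modelopt.torch.quantization as mtq

    # Placeholder model and calibration data.
    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))
    calib_data = [torch.randn(8, 64) for _ in range(16)]

    # A custom quantization format: start from a predefined config dict and tweak it.
    custom_cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    custom_cfg["quant_cfg"]["*weight_quantizer"]["axis"] = None  # e.g. switch to per-tensor weight quantization

    model, search_state = mtq.auto_quantize(
        model,
        constraints={"effective_bits": 4.8},
        data_loader=calib_data,
        forward_step=lambda m, batch: m(batch),
        loss_func=lambda output, batch: output.abs().mean(),  # toy score; use a real loss in practice
        quantization_formats=[mtq.FP8_DEFAULT_CFG, custom_cfg],  # config dicts, not format-name strings
        num_calib_steps=len(calib_data),
        num_score_steps=len(calib_data),
    )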
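And a small sketch of where the exported quantization settings now live: they are embedded in the checkpoint's ``config.json`` rather than written to a separate ``hf_quant_config.json``. The export directory is a placeholder, and the ``quantization_config`` key name is assumed from the usual Hugging Face convention rather than stated in these notes.

.. code-block:: python

    import json
    from pathlib import Path

    export_dir = Path("exported_model")  # placeholder export directory

    # The quant config that used to live in hf_quant_config.json is now part of config.json.
    with open(export_dir / "config.json") as f:
        hf_config = json.load(f)

    # "quantization_config" is an assumption based on the standard HF convention.
    print(hf_config.get("quantization_config"))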
Deprecations
- Deprecate Python 3.9 support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.19.
- Add new model support in the ``llm_ptq`` example: Qwen3 MoE.
- ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant, and SmoothQuant for CPU-offloaded Hugging Face models (see the sketch after this list).
- Add AutoCast tool to convert ONNX models to FP16 or BF16.
- Add a ``--low_memory_mode`` flag in the ``llm_ptq`` example to initialize HF models with compressed weights, reducing the peak memory of PTQ and quantized checkpoint export.
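A minimal sketch of the CPU-offloaded quantization support, assuming offloading is set up through ``device_map="auto"`` (accelerate); the checkpoint name and calibration prompts are placeholders. ``mtq.quantize`` with ``mtq.INT4_AWQ_CFG`` performs AWQ here; SVDQuant and SmoothQuant use their corresponding configs in the same way.

.. code-block:: python

    import torch

    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" lets accelerate offload weights to CPU when GPU memory is tight.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

    calib_prompts = [
        "Hello, world!",
        "The quick brown fox jumps over the lazy dog.",
    ]

    def forward_loop(m):
        # Run the calibration prompts through the (possibly offloaded) model.
        for prompt in calib_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)

    # AWQ calibration and weight quantization.
    model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)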