
ModelOpt 0.27.0 Release


Released by @kevalmorabia97 on 03 Apr 05:24

Deprecations

  • Deprecate real quantization configs. Please use the mtq.compress (modelopt.torch.quantization.compress) API for model compression after quantization; a minimal sketch follows below.
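
    A minimal sketch of the recommended flow after this deprecation. The calibration loop (calibrate_fn) and the mtq.INT4_AWQ_CFG config are placeholders only, and the exact mtq.compress signature should be checked against the documentation for your installed version:

    ```python
    import modelopt.torch.quantization as mtq

    # Quantize with a simulated (fake-quant) config; the config and the
    # calibration forward loop below are placeholders for your own setup.
    model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate_fn)

    # Compress the quantized weights with the new API instead of relying on a
    # deprecated real-quantization config.
    mtq.compress(model)
    ```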

New Features

  • New model support in the llm_ptq example: OpenAI Whisper.
  • Blockwise FP8 quantization support in unified model export (a sketch appears after this list).
  • Add quantization support to the Transformer Engine Linear module.
  • Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
  • To support resuming distributed checkpoints with expert parallelism (EP), the modelopt_state in Megatron Core distributed checkpoints (used in NeMo and Megatron-LM) is now stored in a new format. The legacy modelopt_state in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29, but will need to be re-saved in the new format.
  • Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% faster quantization than the previous implementation.
  • Add a new API, mtq.compress (modelopt.torch.quantization.compress), for compressing model weights after quantization.
  • Add an option to simplify the ONNX model before quantization is performed (see the sketch after this list).
  • (Experimental) Improve support for ONNX models with custom TensorRT ops:
    • Add support for the --calibration_shapes flag.
    • Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
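
For the blockwise FP8 item above, a sketch of quantizing and then exporting through the unified model export path. The blockwise FP8 config constant and the export_hf_checkpoint entry point shown here are assumptions; verify the exact names in the documentation for your installed version:

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # unified export entry point (assumed name)

# Placeholder: substitute the blockwise FP8 config constant from your ModelOpt
# version; FP8_DEFAULT_CFG is used here only to keep the sketch runnable.
fp8_blockwise_cfg = mtq.FP8_DEFAULT_CFG

model = mtq.quantize(model, fp8_blockwise_cfg, forward_loop=calibrate_fn)

# Unified model export writes a Hugging Face style quantized checkpoint.
export_hf_checkpoint(model, export_dir="exported_fp8_blockwise")
```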
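
For the ONNX items above, a sketch using the quantize() entry point in modelopt.onnx.quantization. The simplify and calibration_shapes keyword arguments are assumptions that mirror the options described above, so confirm the exact parameter names and value formats in the documentation:

```python
from modelopt.onnx.quantization import quantize

# Quantize an ONNX model to INT8. The `simplify` and `calibration_shapes`
# keyword arguments below are assumed to mirror the CLI options named above.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    output_path="model.quant.onnx",
    simplify=True,                            # simplify the ONNX graph before quantization (assumed kwarg)
    calibration_shapes="input:1x3x224x224",   # fixed input shapes for calibration (assumed kwarg/format)
)
```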

Known Issues

  • Quantization of T5 models is broken. Please use nvidia-modelopt==0.25.0 with transformers<4.50 in the meantime.