ModelOpt 0.27.0 Release
Deprecations
- Deprecate real quantization configs. Please use the `mtq.compress` (`modelopt.torch.quantization.compress`) API for model compression after quantization.
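A minimal sketch of the replacement flow, assuming a toy model, random calibration data, and `mtq.INT4_AWQ_CFG` as an arbitrary example config (any supported quantization config applies):

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and throwaway calibration loop; substitute a real model and data.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 16)).cuda()

def forward_loop(m):
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(4, 128, device="cuda"))

# Quantize with a (fake-quant) config first, then compress the quantized
# weights instead of relying on the deprecated real quantization configs.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=forward_loop)
model = mtq.compress(model)
```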
New Features
- New model support in the `llm_ptq` example: OpenAI Whisper.
- Blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module (see the first sketch after this list).
- Add support for SVDQuant (see the second sketch after this list). Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- To support distributed checkpoint resume with expert parallelism (EP), `modelopt_state` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored in a new format. The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29, but will be saved in the new format going forward.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API, `mtq.compress` (`modelopt.torch.quantization.compress`), for compressing model weights after quantization (see the sketch under Deprecations above).
- Add an option to simplify the ONNX model before quantization is performed (see the ONNX sketch after this list).
- (Experimental) Improve support for ONNX models with custom TensorRT ops (see the ONNX sketch after this list):
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with the TensorRT EP.
  - Add support for
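A minimal sketch of quantizing Transformer Engine Linear layers, assuming a CUDA device, random calibration data, and `mtq.FP8_DEFAULT_CFG` as an arbitrary example config:

```python
import torch
import transformer_engine.pytorch as te
import modelopt.torch.quantization as mtq

# Toy block built from Transformer Engine's Linear module.
model = torch.nn.Sequential(te.Linear(256, 256), te.Linear(256, 64)).cuda()

def forward_loop(m):
    # Small calibration pass; replace the random tensors with representative data.
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(16, 256, device="cuda"))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```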
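SVDQuant follows the same `mtq.quantize` flow. The sketch below assumes a config named `mtq.NVFP4_SVDQUANT_DEFAULT_CFG`, which is a guess based on ModelOpt's naming convention, so check the released config list before copying. The result is simulated (fake) quantization only in this release:

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).cuda()

def forward_loop(m):
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(4, 512, device="cuda"))

# Config name is an assumption; the quantized model is simulation-only for now.
model = mtq.quantize(model, mtq.NVFP4_SVDQUANT_DEFAULT_CFG, forward_loop=forward_loop)
```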
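A sketch of the ONNX PTQ entry point with the new options, assuming they are exposed as Python keyword arguments mirroring the CLI flags; the `calibration_shapes`, `simplify`, and `trt_plugins` names here are assumptions to verify against the released signature:

```python
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model_with_custom_trt_op.onnx",
    output_path="model.quant.onnx",
    calibration_shapes="input:8x3x224x224",    # pin input shapes for calibration (assumed kwarg)
    simplify=True,                             # simplify the model before quantization (assumed kwarg)
    trt_plugins=["./libcustom_trt_plugin.so"], # custom TensorRT op plugin library (assumed kwarg)
)
```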
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.