ModelOpt 0.29.0 Release
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3 and Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the `mtq.compress` (`modelopt.torch.quantization.compress`) API to accelerate evaluation of quantized models (see the first sketch below).
- Use the shapes of the PyTorch parameters and buffers of `TensorQuantizer` (`modelopt.torch.quantization.nn.modules.TensorQuantizer`) to initialize them during restore. This makes restoring quantized models more robust (see the save/restore sketch below).
- Support adding new custom quantization calibration algorithms. Refer to `mtq.calibrate` (`modelopt.torch.quantization.model_quant.calibrate`) or the custom calibration algorithm documentation for more details (see the calibration sketch below).
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization: `--calibration_shapes` is reserved for the input shapes used during the calibration process, while `--override_shapes` overrides the input shapes of the model with static shapes (see the ONNX quantization sketch below).
- Add support for UNet ONNX quantization.
- Enable the `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in `moq.quantize` (`modelopt.onnx.quantization.quantize`).
- Add a new attribute `parallel_state` to `DynamicModule` (`modelopt.torch.opt.dynamic.DynamicModule`) to support distributed parallelism such as data parallelism and tensor parallelism.
- Add MXFP8 and NVFP4 quantized ONNX export support.
- Add a new example of torch quantization to ONNX export at MXFP8 and NVFP4 precision (see the final sketch below).
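
A minimal PTQ-then-compress sketch for the `mtq.compress` workflow mentioned above. The toy model, random calibration data, and the choice of `mtq.FP8_DEFAULT_CFG` are illustrative assumptions, not part of this release:

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and random calibration data (illustrative only).
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).cuda()
calib_data = [torch.randn(16, 64, device="cuda") for _ in range(8)]

def forward_loop(m):
    # Run a few batches through the model to collect calibration statistics.
    with torch.no_grad():
        for x in calib_data:
            m(x)

# Post-training quantization with a built-in per-tensor FP8 config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Compress the quantized weights so evaluation runs on real-quantized storage.
mtq.compress(model)
```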
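For the more robust restore path, the usual ModelOpt save/restore entry points are `mto.save` and `mto.restore`. A rough sketch, continuing from the quantized `model` above:

```python
import torch.nn as nn
import modelopt.torch.opt as mto

# Save the ModelOpt state of the quantized model.
mto.save(model, "quantized_model.pth")

# Later, rebuild the same architecture and restore the quantized state;
# TensorQuantizer parameters and buffers are re-initialized from their
# saved shapes, which is what makes this restore path more robust.
fresh = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).cuda()
fresh = mto.restore(fresh, "quantized_model.pth")
```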
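Calibration can also be invoked directly through `mtq.calibrate`. The sketch below only shows selecting a built-in algorithm by name ("max" is an illustrative choice); the registration hook for fully custom algorithms is described in the custom calibration documentation:

```python
import modelopt.torch.quantization as mtq

# Re-calibrate the quantized model with a named algorithm; custom
# algorithms plug into this same entry point per the custom calibration doc.
model = mtq.calibrate(model, algorithm="max", forward_loop=forward_loop)
```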
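For the new ONNX shape flags, here is a sketch using the Python entry point behind the CLI. The file names, the keyword arguments mirroring the `--calibration_shapes`/`--override_shapes` flags, and the `name:dim0xdim1x...` shape syntax are assumptions based on the flag descriptions above:

```python
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",
    # Input shapes used only during the calibration process (assumed format).
    calibration_shapes="input:8x3x224x224",
    # Static shapes to override the model's inputs with (assumed format).
    override_shapes="input:1x3x224x224",
    output_path="model.quant.onnx",
)
```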
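Finally, a rough sketch of the torch-quantization-to-ONNX flow at NVFP4 precision. `mtq.NVFP4_DEFAULT_CFG` is a built-in config (MXFP8 works analogously via `mtq.MXFP8_DEFAULT_CFG`); whether the new example uses the standard `torch.onnx.export` path is an assumption here:

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and random calibration data (illustrative only).
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).cuda()
calib_data = [torch.randn(16, 64, device="cuda") for _ in range(8)]

def forward_loop(m):
    with torch.no_grad():
        for x in calib_data:
            m(x)

# Quantize with a built-in NVFP4 config.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export the quantized model to ONNX (assumed exporter path).
torch.onnx.export(model, torch.randn(1, 64, device="cuda"), "model.nvfp4.onnx")
```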