CHANGELOG.rst

Model Optimizer Changelog (Linux)
=================================

0.29 (2025-05-08)
^^^^^^^^^^^^^^^^^

**Backward Breaking Changes**

- Refactor ``SequentialQuantizer`` to improve its implementation and maintainability while preserving its functionality.

**Deprecations**

- Deprecate ``torch<2.4`` support.

**New Features**

- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the ``llm_ptq`` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API to accelerate evaluation of quantized models (see the sketch after this list).
- Use the shape of PyTorch parameters and buffers of :class:`TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` to initialize them during restore. This makes restoring quantized models more robust.
- Support adding new custom quantization calibration algorithms. Please refer to :func:`mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate>` or :ref:`custom calibration algorithm <custom_calibration_algorithm>` for more details (see the calibration sketch after this list).
- Add EAGLE3 (``LlamaForCausalLMEagle3``) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the ``--override_shapes`` flag in ONNX quantization (see the shape-override sketch after this list).

  - ``--calibration_shapes`` is reserved for the input shapes used during the calibration process.
  - ``--override_shapes`` is used to override the input shapes of the model with static shapes.

- Add support for UNet ONNX quantization.
- Enable the ``concat_elimination`` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in :meth:`moq.quantize <modelopt.onnx.quantization.quantize>`.
- Add a new attribute, ``parallel_state``, to :class:`DynamicModule <modelopt.torch.opt.dynamic.DynamicModule>` to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8 and NVFP4 quantized ONNX export support.
- Add a new example of torch quantization to ONNX for MXFP8 and NVFP4 precision.
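
A hedged sketch of the real-quantization flow referenced above: quantize with PTQ, then call :meth:`mtq.compress <modelopt.torch.quantization.compress>` so that evaluation can benefit from the new FP8 per-tensor GEMM kernel. The model, the calibration loader, and the choice of ``mtq.FP8_DEFAULT_CFG`` are illustrative placeholders, not part of this release note.

.. code-block:: python

    import modelopt.torch.quantization as mtq

    model = get_model()               # hypothetical helper returning a torch.nn.Module
    calib_loader = get_calib_data()   # hypothetical helper returning calibration batches

    def forward_loop(m):
        # Run a few calibration batches so the inserted quantizers can collect statistics.
        for batch in calib_loader:
            m(batch)

    # Post-training quantization with a built-in FP8 config (assumed here for illustration).
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Compress the quantized weights; evaluation of the compressed model can then use
    # the FP8 GEMM per-tensor kernel added in this release.
    mtq.compress(model)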
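
A minimal calibration sketch, assuming a model whose quantizers have already been inserted. It uses the built-in ``max`` algorithm; a custom algorithm registered through the custom-calibration mechanism linked above would be passed through the same entry point. The helpers below are hypothetical.

.. code-block:: python

    from modelopt.torch.quantization.model_quant import calibrate

    model = get_model_with_quantizers()   # hypothetical: quantizers already inserted
    calib_loader = get_calib_data()       # hypothetical calibration batches

    def forward_loop(m):
        for batch in calib_loader:
            m(batch)

    # Calibrate the quantizers; replace "max" with a registered custom algorithm
    # to exercise the new custom-calibration support.
    model = calibrate(model, algorithm="max", forward_loop=forward_loop)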
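
To make the two shape flags concrete, here is a sketch that drives ONNX quantization from Python. The ``python -m modelopt.onnx.quantization`` entry point, the file path, and the ``name:NxCxHxW`` shape syntax are assumptions for illustration; only the two flags themselves come from this release.

.. code-block:: python

    import subprocess

    cmd = [
        "python", "-m", "modelopt.onnx.quantization",   # assumed CLI entry point
        "--onnx_path", "model.onnx",                    # placeholder input model
        # Shapes used only while feeding calibration data through the model.
        "--calibration_shapes", "input:1x3x224x224",
        # Static shapes written into the quantized output model.
        "--override_shapes", "input:1x3x224x224",
    ]
    subprocess.run(cmd, check=True)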

0.27 (2025-04-03)
^^^^^^^^^^^^^^^^^

**Deprecations**

- Deprecate real quantization configs; please use the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.

**New Features**

- Add new model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
- Add blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- Store ``modelopt_state`` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) differently to support distributed checkpoint resume with expert parallelism (EP). The legacy ``modelopt_state`` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API, :meth:`mtq.compress <modelopt.torch.quantization.compress>`, for compressing model weights after quantization.
- Add an option to simplify the ONNX model before quantization is performed.

0.25 (2025-03-03)
^^^^^^^^^^^^^^^^^

**Deprecations**

- Deprecate Torch 2.1 support.
- Deprecate the ``humaneval`` benchmark in ``llm_eval`` examples. Please use the newly added ``simple_eval`` instead.
README.md

## Latest News

- [2025/04/21] [Adobe optimized deployment using TensorRT-Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
When installing from source, please make sure to re-run the install command every time you pull new changes in the repository so that dependencies are also updated.

This command installs ModelOpt-Windows and its ONNX module, along with the *onnxruntime-directml* (v1.20.0) package. If ModelOpt-Windows is installed without the additional parameter, only the bare-minimum dependencies are installed, without the ONNX module and its dependencies.