Commit cafa7f6

Update for 0.33.0 release
1 parent 33a45be commit cafa7f6

167 files changed: +8561, -1621 lines changed


CHANGELOG.rst

Lines changed: 22 additions & 0 deletions
@@ -1,6 +1,27 @@
 Model Optimizer Changelog (Linux)
 =================================

+0.33 (2025-07-xx)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+- PyTorch dependencies for ``modelopt.torch`` features are no longer optional, and ``pip install nvidia-modelopt`` is now the same as ``pip install nvidia-modelopt[torch]``.
+
+**Deprecations**
+
+**New Features**
+
+- Upgrade TensorRT-LLM dependency to 0.20.
+- Add a new CNN QAT example to demonstrate how to use ModelOpt for QAT.
+- Add support for ONNX models with custom TensorRT ops in AutoCast.
+- Add quantization-aware distillation (QAD) support in the ``llm_qat`` example.
+- Add support for BF16 in ONNX quantization.
+- Add per-node calibration support in ONNX quantization.
+- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires ``transformers>=4.52.0``.
+- Support quantization of FSDP2-wrapped models and add FSDP2 support in the ``llm_qat`` example.
+- Add NeMo 2 Simplified Flow examples for quantization-aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
+
 0.31 (2025-06-04)
 ^^^^^^^^^^^^^^^^^

@@ -28,6 +49,7 @@ Model Optimizer Changelog (Linux)
 - ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant and SmoothQuant for cpu-offloaded Huggingface models.
 - Add AutoCast tool to convert ONNX models to FP16 or BF16.
 - Add ``--low_memory_mode`` flag in the llm_ptq example support to initialize HF models with compressed weights and reduce peak memory of PTQ and quantized checkpoint export.
+- Support ``NemotronHForCausalLM``, ``Qwen3ForCausalLM``, ``Qwen3MoeForCausalLM`` Megatron Core model import/export (from/to HuggingFace).

 0.29 (2025-05-08)
 ^^^^^^^^^^^^^^^^^

docker/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
-FROM nvcr.io/nvidia/pytorch:25.03-py3
+FROM nvcr.io/nvidia/pytorch:25.04-py3

 ARG PIP_EXTRA_INDEX_URL="https://pypi.nvidia.com"
-ARG TRT_LLM_COMMIT=v0.19.0
+ARG TRT_LLM_COMMIT=v0.20.0
 ARG REMOVE_TRT_LLM_SRC=1
 ARG CUDA_ARCH="89-real;90-real;100-real"

docs/source/deployment/3_unified_hf.rst

Lines changed: 45 additions & 11 deletions
@@ -35,21 +35,55 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
 Deployment Support Matrix
 ==============================================

-Currently, we support the following quantization formats with the unified HF export API:
-#. FP8
-#. FP8_PB
-#. NVFP4
-#. NVFP4_AWQ
-#. INT4_AWQ
-#. W4A8_AWQ
+Supported Quantization Formats
+------------------------------

-For deployment with TensorRT-LLM, we support llama 3.1, 3.3, Mixtral 8x7B, with FP8 and NVFP4 checkpoints; Medusa and Eagle FP8 checkpoints are also tested.
+The unified HF export API supports the following quantization formats:

-For deployment with vLLM, we support llama 3.1, 3.3, Mixtral 8x7B, with FP8 checkpoints.
+1. FP8 - 8-bit floating point
+2. FP8_PB - 8-bit floating point with per-block scaling
+3. NVFP4 - NVIDIA 4-bit floating point
+4. NVFP4_AWQ - NVIDIA 4-bit floating point with AWQ optimization
+5. INT4_AWQ - 4-bit integer with AWQ optimization
+6. W4A8_AWQ - 4-bit weights and 8-bit activations with AWQ optimization

-For deployment with SGLang, we support llama 3.1, 3.3, with FP8 checkpoints.
+Framework-Specific Support
+--------------------------

-Other models and quantization formats may work, but they are not thoroughly tested.
+TensorRT-LLM
+~~~~~~~~~~~~
+
+Models:
+* Llama 4, 3.1, 3.3 (FP8, NVFP4)
+* Qwen 3 (FP8, NVFP4)
+* Deepseek R1 (NVFP4)
+* Mixtral 8x7B (FP8, NVFP4)
+* Medusa (FP8)
+* Eagle (FP8)
+
+Requirements: TensorRT-LLM v0.17.0 or later
+
+vLLM
+~~~~
+
+Models:
+* Llama 3.1, 3.3 (FP8, NVFP4)
+* Mixtral 8x7B (FP8)
+* Deepseek R1 (NVFP4)
+
+Requirements: vLLM v0.9.1 or later
+
+SGLang
+~~~~~~
+
+Models:
+* Llama 3.1, 3.3 (FP8, NVFP4)
+* Deepseek R1 (NVFP4)
+* Llama 4 (FP8)
+
+Requirements: SGLang v0.4.7 or later
+
+Note: While other models and quantization formats may work, they have not been thoroughly tested and validated.


 Deployment with Selected Inference Frameworks
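For orientation, the workflow that produces one of these checkpoints looks roughly like the sketch below. The model ID, calibration prompts, and the ``FP8_DEFAULT_CFG`` config choice are illustrative assumptions; the export entry point is the ``export_hf_checkpoint`` API referenced in this file's header.

```python
# Hedged sketch: quantize an HF model and export a unified HF checkpoint.
# The model ID, prompts, and config choice are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed example model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)


def forward_loop(m):
    # Calibration pass: run a handful of representative prompts through the model.
    for prompt in ["Hello, how are you?", "Explain KV caching in one sentence."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)


# Post-training quantization to one of the supported formats (FP8 shown here).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Write a unified Hugging Face checkpoint consumable by TensorRT-LLM, vLLM, or SGLang.
export_hf_checkpoint(model, export_dir="llama-3.1-8b-fp8")
```

The resulting directory can then be passed to the chosen inference framework like any other Hugging Face checkpoint, subject to the support matrix above.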

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 5 additions & 9 deletions
@@ -16,9 +16,9 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | CUDA                    | >=12.0                      |
 +-------------------------+-----------------------------+
-| PyTorch (Optional)      | >=2.4                       |
+| PyTorch                 | >=2.4                       |
 +-------------------------+-----------------------------+
-| TensorRT-LLM (Optional) | 0.18                        |
+| TensorRT-LLM (Optional) | 0.20                        |
 +-------------------------+-----------------------------+
 | ONNX Runtime (Optional) | 1.22                        |
 +-------------------------+-----------------------------+
@@ -107,8 +107,8 @@ optional dependencies as described below.

 **Identify correct partial dependencies**

-Note that when installing ``nvidia-modelopt`` without any optional dependencies, only the barebone
-requirements are installed and none of the modules will work without the appropriate optional
+Note that when installing ``nvidia-modelopt`` without any optional dependencies, only the ``modelopt.torch`` package
+requirements are installed and other modules may not work without the appropriate optional
 dependencies or ``[all]`` optional dependencies. Below is a list of optional dependencies that
 need to be installed to correctly use the corresponding modules:

@@ -118,14 +118,10 @@ need to be installed to correctly use the corresponding modules:

 * - Module
   - Optional dependencies
-* - ``modelopt.deploy``
-  - ``[deploy]``
 * - ``modelopt.onnx``
   - ``[onnx]``
-* - ``modelopt.torch``
-  - ``[torch]``
 * - ``modelopt.torch._deploy``
-  - ``[torch, deploy]``
+  - ``[onnx]``

 Additionally, we support installing dependencies for following 3rd-party packages:
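Because the ``torch`` dependencies now ship with the base package, a plain ``pip install nvidia-modelopt`` should be enough for the ``modelopt.torch`` modules. The snippet below is an illustrative sanity check, not part of the official docs.

```python
# Illustrative check: with nvidia-modelopt 0.33+, a plain `pip install nvidia-modelopt`
# is expected to make the torch-based modules importable without the [torch] extra.
import modelopt
import modelopt.torch.quantization as mtq  # previously required nvidia-modelopt[torch]

print("ModelOpt version:", modelopt.__version__)
print("Quantization module loaded:", mtq.__name__)
```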

docs/source/guides/8_autocast.rst

Lines changed: 49 additions & 22 deletions
@@ -23,10 +23,10 @@ AutoCast can also be used programmatically through its Python API:
 .. code-block:: python

     import onnx
-    from modelopt.onnx.autocast import convert
+    from modelopt.onnx.autocast import convert_to_mixed_precision

     # Convert model to mixed precision
-    converted_model = convert(
+    converted_model = convert_to_mixed_precision(
         onnx_path="model.onnx",
         low_precision_type="fp16", # or "bf16"
         nodes_to_exclude=None, # optional list of node name patterns to keep in FP32
@@ -35,7 +35,10 @@ AutoCast can also be used programmatically through its Python API:
         init_max=65504, # threshold for initializers
         keep_io_types=False, # whether to preserve input/output types
         calibration_data=None, # optional path to input data file
-        init_conversion_max_bytes=1073741824, # maximum size in bytes for initializer conversion, 1<<20
+        init_conversion_max_bytes=None, # maximum size in bytes for initializer conversion
+        providers=["cpu"], # list of Execution Providers for ONNX-Runtime backend
+        trt_plugins=[], # list of TensorRT plugin library paths in .so format
+        max_depth_of_reduction=None, # maximum depth of reduction allowed in low precision
     )

     # Save the converted model
@@ -46,22 +49,26 @@ How It Works

 AutoCast follows these steps to convert a model:

-1. **Model Loading and Sanitization**:
+#. **Model Loading and Sanitization**:
+
 - Loads the ONNX model
 - Performs graph sanitization and optimizations
 - Ensures minimum opset version requirements (22 for BF16, 13 for FP16)

-2. **Node Classification**:
+#. **Node Classification**:
+
 - Analyzes each node in the graph
 - Determines which nodes should remain in FP32 based on input and output tensors magnitudes, operation types and node name patterns
 - If a calibration dataset is provided, it will be used to generate intermediate tensor magnitudes for more accurate node classification, otherwise random data will be used.

-3. **Precision Conversion**:
+#. **Precision Conversion**:
+
 - Converts eligible nodes to lower precision
 - Automatically inserts necessary cast operations
 - Automatically replaces initializers with lower precision values

-4. **Validation and Export**:
+#. **Validation and Export**:
+
 - Verifying that the model is a valid ONNX model (using onnx.checker)
 - Checking that the output tensors are not disconnected
 - Verifying that the original and current network inputs/outputs names match
@@ -71,36 +78,50 @@ AutoCast follows these steps to convert a model:
 Best Practices
 --------------

-1. **Start with Default Settings**:
-Begin with default thresholds and gradually adjust based on accuracy requirements.
+#. **Start with Default Settings**:
+
+- Begin with default thresholds and gradually adjust based on accuracy requirements.
+
+#. **Monitor Node Conversion**:
+
+- Use INFO level logging to see what percentage of nodes were converted to lower precision.
+- Use DEBUG level logging to see more detailed information about the node classification process.
+
+#. **Preserve Critical Operations**:
+
+- Use ``op_types_to_exclude`` for operations known to be sensitive to precision reduction.

-2. **Monitor Node Conversion**:
-Use INFO level logging to see what percentage of nodes were converted to lower precision.
-Use DEBUG level logging to see more detailed information about the node classification process.
+#. **Validate with Real Data**:

-3. **Preserve Critical Operations**:
-Use ``op_types_to_exclude`` for operations known to be sensitive to precision reduction.
+- Provide representative input data using the ``calibration_data`` option for more accurate node classification.

-4. **Validate with Real Data**:
-Provide representative input data using the ``calibration_data`` option for more accurate node classification.
+#. **Control Reduction Depth**:
+- Use ``max_depth_of_reduction`` to limit the depth of reduction operations that can be converted to low precision.
+Operations with higher reduction depths (e.g., large matrix multiplications, convolutions with large kernels) may be more sensitive to precision loss.
+
+#. **BF16 Conversion**:

-5. **BF16 Conversion**:
 - BF16 conversion is not supported for all operations.
 - AutoCast will automatically convert the model to opset 22 to enable more BF16 operations.
 - Use ``--op_types_to_exclude`` to exclude operations that are not supported in BF16.
 - BF16 accuracy may require additional tuning of the ``data_max`` and ``init_max`` thresholds.
 - TensorRT might not be able to support all BF16 converted models.

-6. **Large Initializers**
-- Attempting to convert large initializers, might cause host memory issues.
+#. **Large Initializers**
+
+- Attempting to convert very large initializers might cause host memory issues.
 - Use ``--init_conversion_max_bytes`` to limit the size of initializers that will be converted at compile time.
 - Initializers larger than ``--init_conversion_max_bytes`` will be converted at runtime (using a cast operation).
-- Increasing this value may result in smaller models and faster inference, but could also result in AutoCast crash during the conversion process.
-- For best results, use the highest ``--init_conversion_max_bytes`` that the host memory can handle.
+
+#. **TensorRT custom op support**
+
+- Refer to :ref:`TensorRT Execution Provider requirements <ort_ep_requirements>`.
+- When a custom op is detected, the TensorRT Execution Provider is automatically enabled.
+- To also enable the CUDA execution provider, use ``--providers cpu cuda:x``, where ``x`` is your device ID (``x=0`` if your system only has 1 GPU).
+- Use ``--trt_plugins`` to provide the paths to the necessary TensorRT plugin libraries (in ``.so`` format).

 Limitations and Restrictions
 ----------------------------
-- AutoCast does not yet support models with custom operators / plugins.
 - AutoCast does not yet support quantized models.
 - BF16 conversion is not supported for all operations
 - Large models (e.g. over 2GB) might cause memory issues.
@@ -134,3 +155,9 @@ Bypass data magnitude check and keep specific node names in FP32:
 .. code-block:: bash

     python -m modelopt.onnx.autocast --onnx_path model.onnx --data_max inf --nodes_to_exclude ".*attn.*"
+
+Limit depth of reduction for precision-sensitive operations:
+
+.. code-block:: bash
+
+    python -m modelopt.onnx.autocast --onnx_path model.onnx --max_depth_of_reduction 1024
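To show the new TensorRT custom-op path end to end, here is a hedged Python sketch built only from the ``convert_to_mixed_precision`` parameters added above; the plugin path and the ``cuda:0`` device string are placeholders for your own environment.

```python
# Hedged sketch: convert an ONNX model that contains custom TensorRT ops.
# The plugin path and device ID below are illustrative placeholders.
import onnx
from modelopt.onnx.autocast import convert_to_mixed_precision

converted_model = convert_to_mixed_precision(
    onnx_path="model_with_custom_ops.onnx",
    low_precision_type="fp16",
    # The TensorRT EP is enabled automatically when a custom op is detected;
    # also enable the CUDA execution provider for device 0.
    providers=["cpu", "cuda:0"],
    # Paths to the TensorRT plugin libraries implementing the custom ops.
    trt_plugins=["./libcustom_trt_plugin.so"],
    # Ops with reduction depth above this threshold stay in FP32.
    max_depth_of_reduction=1024,
)

onnx.save(converted_model, "model_with_custom_ops.fp16.onnx")
```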

docs/source/guides/_onnx_quantization.rst

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ Currently ONNX quantization supports FP8, INT4 and INT8 quantization.
 ModelOpt ONNX quantization generates new ONNX models with QDQ nodes following TensorRT rules.
 For real speedup, the generated ONNX should be compiled into TensorRT engine.

+.. _ort_ep_requirements:
+
 Requirements
 ============

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@
 - [PTQ for VLMs](./vlm_ptq/README.md) covers how to use Post-training quantization (PTQ) and export to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) for deployment for popular Vision Language Models (VLMs).
 - [PTQ for ONNX Models](./onnx_ptq/README.md) shows how to quantize the ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
 - [QAT for LLMs](./llm_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 in [NVIDIA Blackwell platform](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)).
+- [QAT for CNNs](./cnn_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT) of CNN models, which can further preserve model accuracy at low precisions like INT8, FP8, etc.
 - [AutoDeploy for AutoQuant LLM models](./llm_autodeploy/README.md) demonstrates how to deploy mixed-precision models using ModelOpt's AutoQuant and TRT-LLM's AutoDeploy.

 ### Pruning

examples/chained_optimizations/README.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ on fine-tuning and QAT.
 Install Model Optimizer with optional torch and huggingface dependencies:

 ```bash
-pip install "nvidia-modelopt[torch,hf]"
+pip install "nvidia-modelopt[hf]"
 ```

 ### Running the example
