CHANGELOG.rst (22 additions, 0 deletions)
@@ -1,6 +1,27 @@
 Model Optimizer Changelog (Linux)
 =================================

+0.33 (2025-07-xx)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+- PyTorch dependencies for ``modelopt.torch`` features are no longer optional, and ``pip install nvidia-modelopt`` is now the same as ``pip install nvidia-modelopt[torch]``.
+
+**Deprecations**
+
+**New Features**
+
+- Upgrade the TensorRT-LLM dependency to 0.20.
+- Add a new CNN QAT example to demonstrate how to use ModelOpt for QAT.
+- Add support for ONNX models with custom TensorRT ops in AutoCast.
+- Add quantization-aware distillation (QAD) support in the ``llm_qat`` example.
+- Add support for BF16 in ONNX quantization.
+- Add per-node calibration support in ONNX quantization.
+- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires ``transformers>=4.52.0``.
+- Support quantization of FSDP2-wrapped models and add FSDP2 support in the ``llm_qat`` example.
@@ -28,6 +49,7 @@ Model Optimizer Changelog (Linux)
 - ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant and SmoothQuant for CPU-offloaded Huggingface models.
 - Add AutoCast tool to convert ONNX models to FP16 or BF16.
 - Add ``--low_memory_mode`` flag in the llm_ptq example to initialize HF models with compressed weights and reduce the peak memory of PTQ and quantized checkpoint export.
+- Support ``NemotronHForCausalLM``, ``Qwen3ForCausalLM``, ``Qwen3MoeForCausalLM`` Megatron Core model import/export (from/to HuggingFace).
docs/source/deployment/3_unified_hf.rst (45 additions, 11 deletions)
@@ -35,21 +35,55 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
 Deployment Support Matrix
 ==============================================

-Currently, we support the following quantization formats with the unified HF export API:
-#. FP8
-#. FP8_PB
-#. NVFP4
-#. NVFP4_AWQ
-#. INT4_AWQ
-#. W4A8_AWQ
+Supported Quantization Formats
+------------------------------

-For deployment with TensorRT-LLM, we support llama 3.1, 3.3, Mixtral 8x7B, with FP8 and NVFP4 checkpoints; Medusa and Eagle FP8 checkpoints are also tested.
+The unified HF export API supports the following quantization formats:

-For deployment with vLLM, we support llama 3.1, 3.3, Mixtral 8x7B, with FP8 checkpoints.
+1. FP8 - 8-bit floating point
+2. FP8_PB - 8-bit floating point with per-block scaling
+3. NVFP4 - NVIDIA 4-bit floating point
+4. NVFP4_AWQ - NVIDIA 4-bit floating point with AWQ optimization
+5. INT4_AWQ - 4-bit integer with AWQ optimization
+6. W4A8_AWQ - 4-bit weights and 8-bit activations with AWQ optimization

-For deployment with SGLang, we support llama 3.1, 3.3, with FP8 checkpoints.
+Framework-Specific Support
+--------------------------

-Other models and quantization formats may work, but they are not thoroughly tested.
+TensorRT-LLM
+~~~~~~~~~~~~
+
+Models:
+
+* Llama 4, 3.1, 3.3 (FP8, NVFP4)
+* Qwen 3 (FP8, NVFP4)
+* Deepseek R1 (NVFP4)
+* Mixtral 8x7B (FP8, NVFP4)
+* Medusa (FP8)
+* Eagle (FP8)
+
+Requirements: TensorRT-LLM v0.17.0 or later
+
+vLLM
+~~~~
+
+Models:
+
+* Llama 3.1, 3.3 (FP8, NVFP4)
+* Mixtral 8x7B (FP8)
+* Deepseek R1 (NVFP4)
+
+Requirements: vLLM v0.9.1 or later
+
+SGLang
+~~~~~~
+
+Models:
+
+* Llama 3.1, 3.3 (FP8, NVFP4)
+* Deepseek R1 (NVFP4)
+* Llama 4 (FP8)
+
+Requirements: SGLang v0.4.7 or later
+
+Note: While other models and quantization formats may work, they have not been thoroughly tested and validated.
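As a rough end-to-end illustration of how such a checkpoint is produced with the export API referenced in the hunk header above, the sketch below quantizes a Hugging Face model to FP8 and writes a unified checkpoint. The model name, the tiny calibration loop, and the output directory are placeholders, and ``mtq.FP8_DEFAULT_CFG`` is assumed from ModelOpt's PTQ examples rather than taken from this page:

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint

    model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    def forward_loop(m):
        # Toy calibration loop; use representative prompts in practice.
        inputs = tokenizer("Hello from the calibration set.", return_tensors="pt").to(m.device)
        m(**inputs)

    # Post-training quantization to FP8
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Write a unified HF checkpoint for downstream deployment
    export_hf_checkpoint(model, export_dir="llama3_fp8_hf_ckpt")
    tokenizer.save_pretrained("llama3_fp8_hf_ckpt")

The exported directory can then be handed to TensorRT-LLM, vLLM, or SGLang, subject to the version requirements listed above.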
docs/source/guides/8_autocast.rst (49 additions, 22 deletions)
@@ -23,10 +23,10 @@ AutoCast can also be used programmatically through its Python API:
 .. code-block:: python

     import onnx
-    from modelopt.onnx.autocast import convert
+    from modelopt.onnx.autocast import convert_to_mixed_precision

     # Convert model to mixed precision
-    converted_model = convert(
+    converted_model = convert_to_mixed_precision(
         onnx_path="model.onnx",
         low_precision_type="fp16",  # or "bf16"
         nodes_to_exclude=None,  # optional list of node name patterns to keep in FP32
@@ -35,7 +35,10 @@ AutoCast can also be used programmatically through its Python API:
         init_max=65504,  # threshold for initializers
         keep_io_types=False,  # whether to preserve input/output types
         calibration_data=None,  # optional path to input data file
-        init_conversion_max_bytes=1073741824,  # maximum size in bytes for initializer conversion, 1<<20
+        init_conversion_max_bytes=None,  # maximum size in bytes for initializer conversion
+        providers=["cpu"],  # list of Execution Providers for ONNX-Runtime backend
+        trt_plugins=[],  # list of TensorRT plugin library paths in .so format
+        max_depth_of_reduction=None,  # maximum depth of reduction allowed in low precision
     )

     # Save the converted model
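For completeness, the save step referenced by the comment above uses the standard ``onnx`` API; a minimal sketch, with an arbitrary output filename:

.. code-block:: python

    # Persist the converted model returned by convert_to_mixed_precision
    onnx.save(converted_model, "model_fp16.onnx")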
@@ -46,22 +49,26 @@ How It Works

 AutoCast follows these steps to convert a model:

-1. **Model Loading and Sanitization**:
+#. **Model Loading and Sanitization**:
+
    - Loads the ONNX model
    - Performs graph sanitization and optimizations
    - Ensures minimum opset version requirements (22 for BF16, 13 for FP16)

-2. **Node Classification**:
+#. **Node Classification**:
+
    - Analyzes each node in the graph
    - Determines which nodes should remain in FP32 based on input and output tensor magnitudes, operation types, and node name patterns
    - If a calibration dataset is provided, it will be used to generate intermediate tensor magnitudes for more accurate node classification; otherwise, random data will be used.

-3. **Precision Conversion**:
+#. **Precision Conversion**:
+
    - Converts eligible nodes to lower precision
    - Automatically inserts necessary cast operations
    - Automatically replaces initializers with lower precision values

-4. **Validation and Export**:
+#. **Validation and Export**:
+
    - Verifies that the model is a valid ONNX model (using onnx.checker; see the snippet after this list)
    - Checks that the output tensors are not disconnected
    - Verifies that the original and current network input/output names match
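AutoCast runs these checks internally; the model-validity part corresponds to the standard ``onnx.checker`` call. A minimal sketch using the ``converted_model`` from the API example above:

.. code-block:: python

    import onnx

    # Raises onnx.checker.ValidationError if the converted graph is malformed
    onnx.checker.check_model(converted_model)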
@@ -71,36 +78,50 @@ AutoCast follows these steps to convert a model:
 Best Practices
 --------------

-1. **Start with Default Settings**:
-   Begin with default thresholds and gradually adjust based on accuracy requirements.
+#. **Start with Default Settings**:
+
+   - Begin with default thresholds and gradually adjust based on accuracy requirements.
+
+#. **Monitor Node Conversion**:
+
+   - Use INFO level logging to see what percentage of nodes were converted to lower precision.
+   - Use DEBUG level logging to see more detailed information about the node classification process.
+
+#. **Preserve Critical Operations**:
+
+   - Use ``op_types_to_exclude`` for operations known to be sensitive to precision reduction.

-2. **Monitor Node Conversion**:
-   Use INFO level logging to see what percentage of nodes were converted to lower precision.
-   Use DEBUG level logging to see more detailed information about the node classification process.
+#. **Validate with Real Data**:

-3. **Preserve Critical Operations**:
-   Use ``op_types_to_exclude`` for operations known to be sensitive to precision reduction.
+   - Provide representative input data using the ``calibration_data`` option for more accurate node classification.

-4. **Validate with Real Data**:
-   Provide representative input data using the ``calibration_data`` option for more accurate node classification.
+#. **Control Reduction Depth**:
+
+   - Use ``max_depth_of_reduction`` to limit the depth of reduction operations that can be converted to low precision.
+     Operations with higher reduction depths (e.g., large matrix multiplications, convolutions with large kernels) may be more sensitive to precision loss.
+
+#. **BF16 Conversion**:

-5. **BF16 Conversion**:
    - BF16 conversion is not supported for all operations.
    - AutoCast will automatically convert the model to opset 22 to enable more BF16 operations.
    - Use ``--op_types_to_exclude`` to exclude operations that are not supported in BF16.
    - BF16 accuracy may require additional tuning of the ``data_max`` and ``init_max`` thresholds.
    - TensorRT might not be able to support all BF16 converted models.

-6. **Large Initializers**
-   - Attempting to convert large initializers, might cause host memory issues.
+#. **Large Initializers**
+
+   - Attempting to convert very large initializers might cause host memory issues.
    - Use ``--init_conversion_max_bytes`` to limit the size of initializers that will be converted at compile time.
    - Initializers larger than ``--init_conversion_max_bytes`` will be converted at runtime (using a cast operation).
-   - Increasing this value may result in smaller models and faster inference, but could also result in AutoCast crash during the conversion process.
-   - For best results, use the highest ``--init_conversion_max_bytes`` that the host memory can handle.
+
+#. **TensorRT custom op support**
+
+   - Refer to :ref:`TensorRT Execution Provider requirements <ort_ep_requirements>`.
+   - When a custom op is detected, the TensorRT Execution Provider is automatically enabled.
+   - To also enable the CUDA execution provider, use ``--providers cpu cuda:x``, where ``x`` is your device ID (``x=0`` if your system only has 1 GPU).
+   - Use ``--trt_plugins`` to provide the paths to the necessary TensorRT plugin libraries (in ``.so`` format); see the sketch after this list.
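A minimal sketch of that custom-op flow through the Python API shown earlier, where ``providers`` and ``trt_plugins`` mirror the ``--providers`` and ``--trt_plugins`` flags (the model path, plugin path, and device ID are placeholders):

.. code-block:: python

    from modelopt.onnx.autocast import convert_to_mixed_precision

    # Model containing a custom TensorRT op; the TensorRT Execution Provider
    # is enabled automatically when such an op is detected.
    converted_model = convert_to_mixed_precision(
        onnx_path="model_with_custom_op.onnx",
        low_precision_type="fp16",
        providers=["cpu", "cuda:0"],  # also enable the CUDA Execution Provider on device 0
        trt_plugins=["./libcustom_plugin.so"],  # placeholder TensorRT plugin library path
    )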

 Limitations and Restrictions
 ----------------------------
-- AutoCast does not yet support models with custom operators / plugins.
 - AutoCast does not yet support quantized models.
 - BF16 conversion is not supported for all operations
 - Large models (e.g. over 2GB) might cause memory issues.
@@ -134,3 +155,9 @@ Bypass data magnitude check and keep specific node names in FP32:
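As an illustration only, that pattern maps onto the Python API roughly as follows; the node-name patterns here are hypothetical, and a very large ``data_max`` is assumed to effectively bypass the data magnitude check:

.. code-block:: python

    from modelopt.onnx.autocast import convert_to_mixed_precision

    converted_model = convert_to_mixed_precision(
        onnx_path="model.onnx",
        low_precision_type="fp16",
        data_max=float("inf"),  # assumed: effectively disables the data magnitude check
        nodes_to_exclude=["attention.*softmax", "final_layernorm"],  # hypothetical patterns kept in FP32
    )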
examples/README.md (1 addition, 0 deletions)
@@ -12,6 +12,7 @@
 - [PTQ for VLMs](./vlm_ptq/README.md) covers how to use Post-training quantization (PTQ) and export to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) for deployment of popular Vision Language Models (VLMs).
 - [PTQ for ONNX Models](./onnx_ptq/README.md) shows how to quantize ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
 - [QAT for LLMs](./llm_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 in [NVIDIA Blackwell platform](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)).
+- [QAT for CNNs](./cnn_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT) of CNN models, which can further preserve model accuracy at low precisions such as INT8 and FP8.
 - [AutoDeploy for AutoQuant LLM models](./llm_autodeploy/README.md) demonstrates how to deploy mixed-precision models using ModelOpt's AutoQuant and TRT-LLM's AutoDeploy.