CHANGELOG.rst (22 additions, 0 deletions)
@@ -1,6 +1,27 @@
 Model Optimizer Changelog (Linux)
 =================================

+0.33 (2025-07-xx)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+- PyTorch dependencies for ``modelopt.torch`` features are no longer optional, and ``pip install nvidia-modelopt`` is now the same as ``pip install nvidia-modelopt[torch]``.
+
+**Deprecations**
+
+**New Features**
+
+- Upgrade the TensorRT-LLM dependency to 0.20.
+- Add a new CNN QAT example to demonstrate how to use ModelOpt for QAT.
+- Add support for ONNX models with custom TensorRT ops in AutoCast.
+- Add quantization-aware distillation (QAD) support in the ``llm_qat`` example.
+- Add support for BF16 in ONNX quantization.
+- Add per-node calibration support in ONNX quantization.
+- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires ``transformers>=4.52.0``.
+- Support quantization of FSDP2-wrapped models and add FSDP2 support in the ``llm_qat`` example.
@@ -28,6 +49,7 @@ Model Optimizer Changelog (Linux)
 - ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant and SmoothQuant for CPU-offloaded Huggingface models.
 - Add AutoCast tool to convert ONNX models to FP16 or BF16.
 - Add ``--low_memory_mode`` flag in the llm_ptq example to initialize HF models with compressed weights and reduce the peak memory of PTQ and quantized checkpoint export.
+- Support ``NemotronHForCausalLM``, ``Qwen3ForCausalLM``, ``Qwen3MoeForCausalLM`` Megatron Core model import/export (from/to HuggingFace).
docs/source/deployment/3_unified_hf.rst (45 additions, 11 deletions)
@@ -35,21 +35,55 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
 Deployment Support Matrix
 ==============================================

-Currently, we support the following quantization formats with the unified HF export API:
-#. FP8
-#. FP8_PB
-#. NVFP4
-#. NVFP4_AWQ
-#. INT4_AWQ
-#. W4A8_AWQ
+Supported Quantization Formats
+------------------------------

-For deployment with TensorRT-LLM, we support llama 3.1, 3.3, Mixtral 8x7B, with FP8 and NVFP4 checkpoints; Medusa and Eagle FP8 checkpoints are also tested.
+The unified HF export API supports the following quantization formats:

-For deployment with vLLM, we support llama 3.1, 3.3, Mixtral 8x7B, with FP8 checkpoints.
+1. FP8 - 8-bit floating point
+2. FP8_PB - 8-bit floating point with per-block scaling
+3. NVFP4 - NVIDIA 4-bit floating point
+4. NVFP4_AWQ - NVIDIA 4-bit floating point with AWQ optimization
+5. INT4_AWQ - 4-bit integer with AWQ optimization
+6. W4A8_AWQ - 4-bit weights and 8-bit activations with AWQ optimization

-For deployment with SGLang, we support llama 3.1, 3.3, with FP8 checkpoints.
+Framework-Specific Support
+--------------------------

-Other models and quantization formats may work, but they are not thoroughly tested.
+TensorRT-LLM
+~~~~~~~~~~~~
+
+Models:
+
+* Llama 4, 3.1, 3.3 (FP8, NVFP4)
+* Qwen 3 (FP8, NVFP4)
+* Deepseek R1 (NVFP4)
+* Mixtral 8x7B (FP8, NVFP4)
+* Medusa (FP8)
+* Eagle (FP8)
+
+Requirements: TensorRT-LLM v0.17.0 or later
+
+vLLM
+~~~~
+
+Models:
+
+* Llama 3.1, 3.3 (FP8, NVFP4)
+* Mixtral 8x7B (FP8)
+* Deepseek R1 (NVFP4)
+
+Requirements: vLLM v0.9.1 or later
+
+SGLang
+~~~~~~
+
+Models:
+
+* Llama 3.1, 3.3 (FP8, NVFP4)
+* Deepseek R1 (NVFP4)
+* Llama 4 (FP8)
+
+Requirements: SGLang v0.4.7 or later
+
+Note: While other models and quantization formats may work, they have not been thoroughly tested and validated.
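As a rough end-to-end illustration of how such a checkpoint is produced with the export API referenced in the hunk header above, the sketch below quantizes a Hugging Face model to FP8 and writes a unified checkpoint. The model name, the tiny calibration loop, and the output directory are placeholders, and ``mtq.FP8_DEFAULT_CFG`` is assumed from ModelOpt's PTQ examples rather than taken from this page:

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint

    model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    def forward_loop(m):
        # Toy calibration loop; use representative prompts in practice.
        inputs = tokenizer("Hello from the calibration set.", return_tensors="pt").to(m.device)
        m(**inputs)

    # Post-training quantization to FP8
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Write a unified HF checkpoint for downstream deployment
    export_hf_checkpoint(model, export_dir="llama3_fp8_hf_ckpt")
    tokenizer.save_pretrained("llama3_fp8_hf_ckpt")

The exported directory can then be handed to TensorRT-LLM, vLLM, or SGLang, subject to the version requirements listed above.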
docs/source/guides/8_autocast.rst (49 additions, 22 deletions)
@@ -23,10 +23,10 @@ AutoCast can also be used programmatically through its Python API:
 .. code-block:: python

     import onnx
-    from modelopt.onnx.autocast import convert
+    from modelopt.onnx.autocast import convert_to_mixed_precision

     # Convert model to mixed precision
-    converted_model = convert(
+    converted_model = convert_to_mixed_precision(
         onnx_path="model.onnx",
         low_precision_type="fp16",  # or "bf16"
         nodes_to_exclude=None,  # optional list of node name patterns to keep in FP32
@@ -35,7 +35,10 @@ AutoCast can also be used programmatically through its Python API:
         init_max=65504,  # threshold for initializers
         keep_io_types=False,  # whether to preserve input/output types
         calibration_data=None,  # optional path to input data file
-        init_conversion_max_bytes=1073741824,  # maximum size in bytes for initializer conversion, 1<<20
+        init_conversion_max_bytes=None,  # maximum size in bytes for initializer conversion
+        providers=["cpu"],  # list of Execution Providers for ONNX-Runtime backend
+        trt_plugins=[],  # list of TensorRT plugin library paths in .so format
+        max_depth_of_reduction=None,  # maximum depth of reduction allowed in low precision
     )

     # Save the converted model
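For completeness, the save step referenced by the comment above uses the standard ``onnx`` API; a minimal sketch, with an arbitrary output filename:

.. code-block:: python

    # Persist the converted model returned by convert_to_mixed_precision
    onnx.save(converted_model, "model_fp16.onnx")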
@@ -46,22 +49,26 @@ How It Works

 AutoCast follows these steps to convert a model:

-1. **Model Loading and Sanitization**:
+#. **Model Loading and Sanitization**:
+
    - Loads the ONNX model
    - Performs graph sanitization and optimizations
    - Ensures minimum opset version requirements (22 for BF16, 13 for FP16)

-2. **Node Classification**:
+#. **Node Classification**:
+
    - Analyzes each node in the graph
    - Determines which nodes should remain in FP32 based on input and output tensor magnitudes, operation types, and node name patterns
    - If a calibration dataset is provided, it will be used to generate intermediate tensor magnitudes for more accurate node classification; otherwise, random data will be used.

-3. **Precision Conversion**:
+#. **Precision Conversion**:
+
    - Converts eligible nodes to lower precision
    - Automatically inserts necessary cast operations
    - Automatically replaces initializers with lower precision values

-4. **Validation and Export**:
+#. **Validation and Export**:
+
    - Verifies that the model is a valid ONNX model (using onnx.checker; see the snippet after this list)
    - Checks that the output tensors are not disconnected
    - Verifies that the original and current network input/output names match
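AutoCast runs these checks internally; the model-validity part corresponds to the standard ``onnx.checker`` call. A minimal sketch using the ``converted_model`` from the API example above:

.. code-block:: python

    import onnx

    # Raises onnx.checker.ValidationError if the converted graph is malformed
    onnx.checker.check_model(converted_model)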
@@ -71,36 +78,50 @@ AutoCast follows these steps to convert a model:
 Best Practices
 --------------

-1. **Start with Default Settings**:
-   Begin with default thresholds and gradually adjust based on accuracy requirements.
+#. **Start with Default Settings**:
+
+   - Begin with default thresholds and gradually adjust based on accuracy requirements.
+
+#. **Monitor Node Conversion**:
+
+   - Use INFO level logging to see what percentage of nodes were converted to lower precision.
+   - Use DEBUG level logging to see more detailed information about the node classification process.
+
+#. **Preserve Critical Operations**:
+
+   - Use ``op_types_to_exclude`` for operations known to be sensitive to precision reduction.

-2. **Monitor Node Conversion**:
-   Use INFO level logging to see what percentage of nodes were converted to lower precision.
-   Use DEBUG level logging to see more detailed information about the node classification process.
+#. **Validate with Real Data**:

-3. **Preserve Critical Operations**:
-   Use ``op_types_to_exclude`` for operations known to be sensitive to precision reduction.
+   - Provide representative input data using the ``calibration_data`` option for more accurate node classification.

-4. **Validate with Real Data**:
-   Provide representative input data using the ``calibration_data`` option for more accurate node classification.
+#. **Control Reduction Depth**:
+
+   - Use ``max_depth_of_reduction`` to limit the depth of reduction operations that can be converted to low precision.
+     Operations with higher reduction depths (e.g., large matrix multiplications, convolutions with large kernels) may be more sensitive to precision loss.
+
+#. **BF16 Conversion**:

-5. **BF16 Conversion**:
    - BF16 conversion is not supported for all operations.
    - AutoCast will automatically convert the model to opset 22 to enable more BF16 operations.
    - Use ``--op_types_to_exclude`` to exclude operations that are not supported in BF16.
    - BF16 accuracy may require additional tuning of the ``data_max`` and ``init_max`` thresholds.
    - TensorRT might not be able to support all BF16 converted models.

-6. **Large Initializers**
-   - Attempting to convert large initializers, might cause host memory issues.
+#. **Large Initializers**
+
+   - Attempting to convert very large initializers might cause host memory issues.
    - Use ``--init_conversion_max_bytes`` to limit the size of initializers that will be converted at compile time.
    - Initializers larger than ``--init_conversion_max_bytes`` will be converted at runtime (using a cast operation).
-   - Increasing this value may result in smaller models and faster inference, but could also result in AutoCast crash during the conversion process.
-   - For best results, use the highest ``--init_conversion_max_bytes`` that the host memory can handle.
+
+#. **TensorRT custom op support**
+
+   - Refer to :ref:`TensorRT Execution Provider requirements <ort_ep_requirements>`.
+   - When a custom op is detected, the TensorRT Execution Provider is automatically enabled.
+   - To also enable the CUDA execution provider, use ``--providers cpu cuda:x``, where ``x`` is your device ID (``x=0`` if your system only has 1 GPU).
+   - Use ``--trt_plugins`` to provide the paths to the necessary TensorRT plugin libraries (in ``.so`` format); see the sketch after this list.
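A minimal sketch of that custom-op flow through the Python API shown earlier, where ``providers`` and ``trt_plugins`` mirror the ``--providers`` and ``--trt_plugins`` flags (the model path, plugin path, and device ID are placeholders):

.. code-block:: python

    from modelopt.onnx.autocast import convert_to_mixed_precision

    # Model containing a custom TensorRT op; the TensorRT Execution Provider
    # is enabled automatically when such an op is detected.
    converted_model = convert_to_mixed_precision(
        onnx_path="model_with_custom_op.onnx",
        low_precision_type="fp16",
        providers=["cpu", "cuda:0"],  # also enable the CUDA Execution Provider on device 0
        trt_plugins=["./libcustom_plugin.so"],  # placeholder TensorRT plugin library path
    )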

 Limitations and Restrictions
 ----------------------------
-- AutoCast does not yet support models with custom operators / plugins.
 - AutoCast does not yet support quantized models.
 - BF16 conversion is not supported for all operations
 - Large models (e.g. over 2GB) might cause memory issues.
@@ -134,3 +155,9 @@ Bypass data magnitude check and keep specific node names in FP32:
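As an illustration only, that pattern maps onto the Python API roughly as follows; the node-name patterns here are hypothetical, and a very large ``data_max`` is assumed to effectively bypass the data magnitude check:

.. code-block:: python

    from modelopt.onnx.autocast import convert_to_mixed_precision

    converted_model = convert_to_mixed_precision(
        onnx_path="model.onnx",
        low_precision_type="fp16",
        data_max=float("inf"),  # assumed: effectively disables the data magnitude check
        nodes_to_exclude=["attention.*softmax", "final_layernorm"],  # hypothetical patterns kept in FP32
    )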
examples/README.md (1 addition, 0 deletions)
@@ -12,6 +12,7 @@
 - [PTQ for VLMs](./vlm_ptq/README.md) covers how to use Post-training quantization (PTQ) and export to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) for deployment of popular Vision Language Models (VLMs).
 - [PTQ for ONNX Models](./onnx_ptq/README.md) shows how to quantize ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
 - [QAT for LLMs](./llm_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 in [NVIDIA Blackwell platform](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)).
+- [QAT for CNNs](./cnn_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT) of CNN models, which can further preserve model accuracy at low precisions such as INT8 and FP8.
 - [AutoDeploy for AutoQuant LLM models](./llm_autodeploy/README.md) demonstrates how to deploy mixed-precision models using ModelOpt's AutoQuant and TRT-LLM's AutoDeploy.