
Commit e048fb2

Add changes for 0.27 Windows release
1 parent: d59ca04 · commit: e048fb2

27 files changed (+1,290 / −228 lines)

CHANGELOG-Windows.rst

Lines changed: 9 additions & 0 deletions
@@ -2,6 +2,15 @@
 Model Optimizer Changelog (Windows)
 ===================================

+0.27 (2025-04-30)
+^^^^^^^^^^^^^^^^^
+
+**New Features**
+
+- New LLM models such as DeepSeek are now supported with ONNX INT4 AWQ quantization on Windows. Refer to the `Windows Support Matrix <https://nvidia.github.io/TensorRT-Model-Optimizer/guides/0_support_matrix.html>`_ for details about supported features and models.
+- TensorRT Model Optimizer for Windows now supports ONNX INT8 and FP8 (W8A8) quantization of SAM2 and Whisper models. See the `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq>`_ to get started with quantizing these models.
+
 0.19 (2024-11-18)
 ^^^^^^^^^^^^^^^^^

docs/source/deployment/2_directml.rst

Lines changed: 3 additions & 1 deletion
@@ -5,7 +5,9 @@ DirectML
 ===================


-Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML backend via the `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.
+Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML (DML) backend via `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.
+
+.. note:: Currently, the DirectML backend does not support 8-bit precision, so 8-bit quantized models should be deployed on other backends such as ORT-CUDA. The DML path does, however, support INT4 quantized models.

 ONNX Runtime GenAI
 ==================
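For context, a minimal sketch of running such a quantized model on the DirectML EP with plain ONNX Runtime (not taken from the diff above); the model path, input shape, and dtype are illustrative assumptions:

.. code-block:: python

    import numpy as np
    import onnxruntime as ort

    # Create an inference session on the DirectML EP, falling back to CPU if DML is unavailable.
    session = ort.InferenceSession(
        "model_quantized.onnx",  # assumed path to the quantized ONNX model
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )

    # Run one inference with a dummy input; the shape and dtype are illustrative only.
    input_name = session.get_inputs()[0].name
    dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float16)
    outputs = session.run(None, {input_name: dummy_input})
    print(outputs[0].shape)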

docs/source/getting_started/4_quantization_windows.rst

Lines changed: 9 additions & 7 deletions
@@ -11,7 +11,7 @@ The ONNX quantization API in ModelOpt-Windows offers advanced Post-Training Quan
 ONNX Model Quantization (PTQ)
 ------------------------------

-The ONNX quantization API requires a model, calibration data, along with quantization settings like algorithm, calibration-EPs etc. Here’s an example implementing int4 AWQ:
+The ONNX quantization API requires a model, calibration data, and quantization settings such as the algorithm and calibration EPs. Here’s an example snippet that applies INT4 AWQ quantization:

 .. code-block:: python

@@ -32,22 +32,24 @@ The ONNX quantization API requires a model, calibration data, along with quantiz
         size_threshold=0,
     )

-Check :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` for details about quantization API.
+Check :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` for details about the INT4 quantization API.

 Refer :ref:`Support_Matrix` for details about supported features and models.

-To know more about ONNX PTQ, refer :ref:`ONNX_PTQ_Guide_Windows` and `example script <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_.
+To learn more about ONNX PTQ, refer to :ref:`ONNX_PTQ_Guide_Windows` and the `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_.

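For context, a minimal sketch of what such an INT4 AWQ call might look like (not taken from the diff above); the parameter names, calibration-data stub, and file paths are assumptions, so check the API reference above for the exact signature:

.. code-block:: python

    import numpy as np
    import onnx
    from modelopt.onnx.quantization.int4 import quantize as quantize_int4

    class DummyDataReader:
        """Placeholder calibration data reader; replace with one yielding real model inputs."""

        def __init__(self, batches):
            self._batches = iter(batches)

        def get_next(self):
            return next(self._batches, None)

    # Assumed input name and shape for illustration only.
    data_reader = DummyDataReader([{"input_ids": np.zeros((1, 128), dtype=np.int64)}])

    quantized_onnx_model = quantize_int4(
        "model_fp16.onnx",                  # assumed path to the base ONNX model
        calibration_method="awq_lite",      # assumed AWQ variant name
        calibration_data_reader=data_reader,
        calibration_eps=["dml", "cpu"],     # assumed EP identifiers used for calibration
    )

    onnx.save_model(
        quantized_onnx_model,
        "model_int4.onnx",
        save_as_external_data=True,
        size_threshold=0,
    )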
 Deployment
 ----------
-The quantized ONNX model is deployment-ready, equivalent to a standard ONNX model. ModelOpt-Windows uses ONNXs `DequantizeLinear <https://onnx.ai/onnx/operators/onnx__DequantizeLinear.html>`_ (DQ) nodes, which support INT4 data-type from opset version 21 onward. Ensure the model’s opset version is 21 or higher. Refer :ref:`Apply_ONNX_PTQ` for details.
+The quantized ONNX model can be deployed using frameworks such as ONNX Runtime. Ensure that the model's opset is 19+ for FP8 quantization and 21+ for INT4 quantization; this is required because ONNX's `Q <https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html>`_/`DQ <https://onnx.ai/onnx/operators/onnx__DequantizeLinear.html>`_ nodes support the FP8 and INT4 data types only from those opset versions onward. Refer to :ref:`Apply_ONNX_PTQ` for details.

 .. code-block:: python

-    # write steps (say, upgrade_opset_to_21() method) to upgrade opset to 21, if it is lower than 21.
+    # Write steps (say, an upgrade_opset() method) to upgrade or patch the model's opset, if needed.
+    # The opset upgrade, if needed, can be applied to either the base ONNX model or the quantized model.
+    # Finally, save the quantized model.

-    quantized_onnx_model = upgrade_opset_to_21(quantized_onnx_model)
+    quantized_onnx_model = upgrade_opset(quantized_onnx_model)
     onnx.save_model(
         quantized_onnx_model,
         output_path,
@@ -56,7 +58,7 @@ The quantized ONNX model is deployment-ready, equivalent to a standard ONNX mode
         size_threshold=0,
     )

-Deploy the quantized model using the DirectML backend. For detailed deployment instructions, see the :ref:`DirectML_Deployment`.
+For detailed instructions on deploying quantized models with the DirectML backend (ORT-DML), see :ref:`DirectML_Deployment`. Also refer to the `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_ for any model-specific inference guidance or scripts.

 .. note::
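For context, a minimal sketch of one way the ``upgrade_opset()`` placeholder above could be implemented with ``onnx.version_converter`` (not taken from the diff above); the default target of 21 is an assumption, and the converter may not handle every model, in which case patching ``opset_import`` directly is a common alternative:

.. code-block:: python

    import onnx
    from onnx import version_converter

    def upgrade_opset(model: onnx.ModelProto, target: int = 21) -> onnx.ModelProto:
        """Upgrade the default-domain opset to ``target`` if it is lower."""
        current = next(
            (op.version for op in model.opset_import if op.domain in ("", "ai.onnx")),
            0,
        )
        if current >= target:
            return model  # already new enough; nothing to do
        return version_converter.convert_version(model, target)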

docs/source/getting_started/windows/_installation_for_Windows.rst

Lines changed: 5 additions & 0 deletions
@@ -21,7 +21,12 @@ The following system requirements are necessary to install and use TensorRT Mode
 +-------------------------+-----------------------------+
 | Nvidia Driver           | 565.90 or newer             |
 +-------------------------+-----------------------------+
+| Nvidia GPU              | RTX 40 and 50 series        |
++-------------------------+-----------------------------+

+.. note::
+    - Make sure to use a GPU-compatible driver and other dependencies (e.g., torch). For instance, Blackwell GPU support may require an Nvidia 570+ driver and CUDA 12.8.
+    - Only *single-GPU* configurations are currently supported.

 The TensorRT Model Optimizer - Windows can be used in following ways:

docs/source/getting_started/windows/_installation_standalone.rst

Lines changed: 7 additions & 3 deletions
@@ -4,7 +4,7 @@
 Install ModelOpt-Windows as a Standalone Toolkit
 ================================================

-The TensorRT Model Optimizer - Windows (ModelOpt-Windows) can be installed as a standalone toolkit for quantizing Large Language Models (LLMs). Below are the setup steps:
+The TensorRT Model Optimizer - Windows (ModelOpt-Windows) can be installed as a standalone toolkit for quantizing ONNX models. Below are the setup steps:

 **1. Setup Prerequisites**

@@ -40,7 +40,7 @@ This command installs ModelOpt-Windows and its ONNX module, along with the *onnx

 **4. Setup ONNX Runtime (ORT) for Calibration**

-The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CudaExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:
+The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:

 - *onnxruntime-directml* provides the DirectML EP.
 - *onnxruntime-gpu* provides the CUDA EP.
@@ -68,7 +68,7 @@ If you prefer to use the CUDA EP for calibration, uninstall the existing *onnxru

 **5. Setup GPU Acceleration Tool for Quantization**

-ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ tool for GPU acceleration during the INT4 ONNX quantization process if you have CUDA 12.x.
+By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ package for GPU acceleration during the INT4 ONNX quantization process; it is compatible with CUDA 12.x.

 **6. Verify Installation**

@@ -79,6 +79,10 @@ Ensure the following steps are verified:
 - *onnxruntime-directml* (DirectML EP)
 - *onnxruntime-gpu* (CUDA EP)
 - *onnxruntime* (CPU EP)
+- **ONNX and ONNX Runtime Import**: Ensure that the following command runs successfully.
+
+  .. code-block:: bash
+
+      python -c "import onnx; import onnxruntime"
 - **Environment Variables**: For workflows using CUDA dependencies (e.g., CUDA EP-based calibration), ensure environment variables like *CUDA_PATH*, *CUDA_V12_4*, or *CUDA_V11_8* etc. are set correctly. Reopen the command-prompt if any environment variable is updated or newly created.
 - **ModelOpt-Windows Import Check**: Run the following command to ensure the installation is successful:
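For context, a minimal companion check (not taken from the diff above) that confirms which execution providers the installed ONNX Runtime package exposes:

.. code-block:: python

    import onnxruntime as ort

    # Expect "DmlExecutionProvider" with onnxruntime-directml, or
    # "CUDAExecutionProvider" with onnxruntime-gpu, plus "CPUExecutionProvider".
    print(ort.get_available_providers())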

docs/source/guides/0_support_matrix.rst

Lines changed: 78 additions & 9 deletions
@@ -29,7 +29,7 @@ Feature Support Matrix
      - PyTorch, ONNX*
      - TensorRT*, TensorRT-LLM
    * - INT8
-     - * Per-channel INT8 Weights, Per-Tensor FP8 Activations
+     - * Per-channel INT8 Weights, Per-Tensor INT8 Activations
        * Uses Smooth Quant Algorithm
        * GPUs: Ampere and Later
      - PyTorch, ONNX*
@@ -71,16 +71,18 @@ Feature Support Matrix
      - PyTorch*
      - TensorRT-LLM*
    * - FP8
-     - * Per-Tensor FP8 Weight & Activations
+     - * Per-Tensor FP8 Weight & Activations (PyTorch)
+       * Per-Tensor Activation and Per-Channel Weights quantization (ONNX)
+       * Uses Max calibration
        * GPUs: Ada and Later
-     - PyTorch*, ONNX*
-     - TensorRT*, TensorRT-LLM*
+     - PyTorch*, ONNX
+     - TensorRT*, TensorRT-LLM*, ORT-CUDA
    * - INT8
-     - * Per-channel INT8 Weights, Per-Tensor FP8 Activations
-       * Uses Smooth Quant Algorithm
+     - * Per-Channel INT8 Weights, Per-Tensor INT8 Activations
+       * Uses Smooth Quant (PyTorch)*, Max calibration (ONNX)
        * GPUs: Ada and Later
-     - PyTorch*, ONNX*
-     - TensorRT*, TensorRT-LLM*
+     - PyTorch*, ONNX
+     - TensorRT*, TensorRT-LLM*, ORT-CUDA

 .. note:: Features marked with an asterisk (*) are considered experimental.

@@ -98,16 +100,83 @@ Model Support Matrix
    :header-rows: 1

    * - Model
-     - ONNX INT4 AWQ
+     - ONNX INT4 AWQ (W4A16)
+     - ONNX INT8 Max (W8A8)
+     - ONNX FP8 Max (W8A8)
    * - Llama3.1-8B-Instruct
      - Yes
+     - No
+     - No
    * - Phi3.5-mini-Instruct
      - Yes
+     - No
+     - No
    * - Mistral-7B-Instruct-v0.3
      - Yes
+     - No
+     - No
    * - Llama3.2-3B-Instruct
      - Yes
+     - No
+     - No
    * - Gemma-2b-it
      - Yes
+     - No
+     - No
+   * - Gemma-2-2b
+     - Yes
+     - No
+     - No
+   * - Gemma-2-9b
+     - Yes
+     - No
+     - No
    * - Nemotron Mini 4B Instruct
      - Yes
+     - No
+     - No
+   * - Qwen2.5-7B-Instruct
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Llama-8B
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Qwen-1.5B
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Qwen-7B
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Qwen-14B
+     - Yes
+     - No
+     - No
+   * - Mistral-NeMo-Minitron-2B-128k-Instruct
+     - Yes
+     - No
+     - No
+   * - Mistral-NeMo-Minitron-4B-128k-Instruct
+     - Yes
+     - No
+     - No
+   * - Mistral-NeMo-Minitron-8B-128k-Instruct
+     - Yes
+     - No
+     - No
+   * - whisper-large
+     - No
+     - Yes
+     - Yes
+   * - sam2-hiera-large
+     - No
+     - Yes
+     - Yes
+
+.. note::
+    - ``ONNX INT8 Max`` means INT8 (W8A8) quantization of an ONNX model using Max calibration; the same holds for ``ONNX FP8 Max``.
+    - The LLMs in the above table are `GenAI <https://github.com/microsoft/onnxruntime-genai/>`_-built LLMs unless specified otherwise.
+    - Check the `examples <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_ for model-specific instructions and scripts.