
Commit e048fb2

Add changes for 0.27 Windows release
1 parent: d59ca04 · commit: e048fb2

27 files changed (+1,290 / −228 lines)

CHANGELOG-Windows.rst

Lines changed: 9 additions & 0 deletions
@@ -2,6 +2,15 @@
 Model Optimizer Changelog (Windows)
 ===================================

+0.27 (2025-04-30)
+^^^^^^^^^^^^^^^^^
+
+**New Features**
+
+- New LLM models such as DeepSeek are now supported with ONNX INT4 AWQ quantization on Windows. Refer to the `Windows Support Matrix <https://nvidia.github.io/TensorRT-Model-Optimizer/guides/0_support_matrix.html>`_ for details about supported features and models.
+- TensorRT Model Optimizer for Windows now supports ONNX INT8 and FP8 (W8A8) quantization of SAM2 and Whisper models. See the `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq>`_ to get started with quantizing these models.
+
 0.19 (2024-11-18)
 ^^^^^^^^^^^^^^^^^

docs/source/deployment/2_directml.rst

Lines changed: 3 additions & 1 deletion
@@ -5,7 +5,9 @@ DirectML
 ===================


-Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML backend via the `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.
+Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML (DML) backend via `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.
+
+.. note:: Currently, the DirectML backend does not support 8-bit precision, so 8-bit quantized models should be deployed on other backends such as ORT-CUDA. The DML path does, however, support INT4 quantized models.

 ONNX Runtime GenAI
 ==================
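For context, a minimal sketch of running such a quantized model on the DirectML EP with plain ONNX Runtime (not taken from the diff above); the model path, input shape, and dtype are illustrative assumptions:

.. code-block:: python

    import numpy as np
    import onnxruntime as ort

    # Create an inference session on the DirectML EP, falling back to CPU if DML is unavailable.
    session = ort.InferenceSession(
        "model_quantized.onnx",  # assumed path to the quantized ONNX model
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )

    # Run one inference with a dummy input; the shape and dtype are illustrative only.
    input_name = session.get_inputs()[0].name
    dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float16)
    outputs = session.run(None, {input_name: dummy_input})
    print(outputs[0].shape)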

docs/source/getting_started/4_quantization_windows.rst

Lines changed: 9 additions & 7 deletions
@@ -11,7 +11,7 @@ The ONNX quantization API in ModelOpt-Windows offers advanced Post-Training Quan
 ONNX Model Quantization (PTQ)
 ------------------------------

-The ONNX quantization API requires a model, calibration data, along with quantization settings like algorithm, calibration-EPs etc. Here’s an example implementing int4 AWQ:
+The ONNX quantization API requires a model, calibration data, and quantization settings such as the algorithm and calibration EPs. Here’s an example snippet that applies INT4 AWQ quantization:

 .. code-block:: python

@@ -32,22 +32,24 @@ The ONNX quantization API requires a model, calibration data, along with quantiz
         size_threshold=0,
     )

-Check :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` for details about quantization API.
+Check :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` for details about the INT4 quantization API.

 Refer :ref:`Support_Matrix` for details about supported features and models.

-To know more about ONNX PTQ, refer :ref:`ONNX_PTQ_Guide_Windows` and `example script <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_.
+To learn more about ONNX PTQ, refer to :ref:`ONNX_PTQ_Guide_Windows` and the `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_.

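For context, a minimal sketch of what such an INT4 AWQ call might look like (not taken from the diff above); the parameter names, calibration-data stub, and file paths are assumptions, so check the API reference above for the exact signature:

.. code-block:: python

    import numpy as np
    import onnx
    from modelopt.onnx.quantization.int4 import quantize as quantize_int4

    class DummyDataReader:
        """Placeholder calibration data reader; replace with one yielding real model inputs."""

        def __init__(self, batches):
            self._batches = iter(batches)

        def get_next(self):
            return next(self._batches, None)

    # Assumed input name and shape for illustration only.
    data_reader = DummyDataReader([{"input_ids": np.zeros((1, 128), dtype=np.int64)}])

    quantized_onnx_model = quantize_int4(
        "model_fp16.onnx",                  # assumed path to the base ONNX model
        calibration_method="awq_lite",      # assumed AWQ variant name
        calibration_data_reader=data_reader,
        calibration_eps=["dml", "cpu"],     # assumed EP identifiers used for calibration
    )

    onnx.save_model(
        quantized_onnx_model,
        "model_int4.onnx",
        save_as_external_data=True,
        size_threshold=0,
    )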
 Deployment
 ----------
-The quantized ONNX model is deployment-ready, equivalent to a standard ONNX model. ModelOpt-Windows uses ONNXs `DequantizeLinear <https://onnx.ai/onnx/operators/onnx__DequantizeLinear.html>`_ (DQ) nodes, which support INT4 data-type from opset version 21 onward. Ensure the model’s opset version is 21 or higher. Refer :ref:`Apply_ONNX_PTQ` for details.
+The quantized ONNX model can be deployed using frameworks such as ONNX Runtime. Ensure that the model's opset is 19+ for FP8 quantization and 21+ for INT4 quantization; this is required because ONNX's `Q <https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html>`_/`DQ <https://onnx.ai/onnx/operators/onnx__DequantizeLinear.html>`_ nodes support the FP8 and INT4 data types only from those opset versions onward. Refer to :ref:`Apply_ONNX_PTQ` for details.

 .. code-block:: python

-    # write steps (say, upgrade_opset_to_21() method) to upgrade opset to 21, if it is lower than 21.
+    # Write steps (say, an upgrade_opset() method) to upgrade or patch the model's opset, if needed.
+    # The opset upgrade, if needed, can be applied to either the base ONNX model or the quantized model.
+    # Finally, save the quantized model.

-    quantized_onnx_model = upgrade_opset_to_21(quantized_onnx_model)
+    quantized_onnx_model = upgrade_opset(quantized_onnx_model)
     onnx.save_model(
         quantized_onnx_model,
         output_path,
@@ -56,7 +58,7 @@ The quantized ONNX model is deployment-ready, equivalent to a standard ONNX mode
         size_threshold=0,
     )

-Deploy the quantized model using the DirectML backend. For detailed deployment instructions, see the :ref:`DirectML_Deployment`.
+For detailed instructions on deploying quantized models with the DirectML backend (ORT-DML), see :ref:`DirectML_Deployment`. Also refer to the `example scripts <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_ for any model-specific inference guidance or scripts.

 .. note::
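For context, a minimal sketch of one way the ``upgrade_opset()`` placeholder above could be implemented with ``onnx.version_converter`` (not taken from the diff above); the default target of 21 is an assumption, and the converter may not handle every model, in which case patching ``opset_import`` directly is a common alternative:

.. code-block:: python

    import onnx
    from onnx import version_converter

    def upgrade_opset(model: onnx.ModelProto, target: int = 21) -> onnx.ModelProto:
        """Upgrade the default-domain opset to ``target`` if it is lower."""
        current = next(
            (op.version for op in model.opset_import if op.domain in ("", "ai.onnx")),
            0,
        )
        if current >= target:
            return model  # already new enough; nothing to do
        return version_converter.convert_version(model, target)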

docs/source/getting_started/windows/_installation_for_Windows.rst

Lines changed: 5 additions & 0 deletions
@@ -21,7 +21,12 @@ The following system requirements are necessary to install and use TensorRT Mode
 +-------------------------+-----------------------------+
 | Nvidia Driver           | 565.90 or newer             |
 +-------------------------+-----------------------------+
+| Nvidia GPU              | RTX 40 and 50 series        |
++-------------------------+-----------------------------+

+.. note::
+    - Make sure to use a GPU-compatible driver and other dependencies (e.g., torch). For instance, Blackwell GPU support may require an Nvidia 570+ driver and CUDA 12.8.
+    - Only *single-GPU* configurations are currently supported.

 The TensorRT Model Optimizer - Windows can be used in following ways:

docs/source/getting_started/windows/_installation_standalone.rst

Lines changed: 7 additions & 3 deletions
@@ -4,7 +4,7 @@
 Install ModelOpt-Windows as a Standalone Toolkit
 ================================================

-The TensorRT Model Optimizer - Windows (ModelOpt-Windows) can be installed as a standalone toolkit for quantizing Large Language Models (LLMs). Below are the setup steps:
+The TensorRT Model Optimizer - Windows (ModelOpt-Windows) can be installed as a standalone toolkit for quantizing ONNX models. Below are the setup steps:

 **1. Setup Prerequisites**

@@ -40,7 +40,7 @@ This command installs ModelOpt-Windows and its ONNX module, along with the *onnx

 **4. Setup ONNX Runtime (ORT) for Calibration**

-The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CudaExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:
+The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:

 - *onnxruntime-directml* provides the DirectML EP.
 - *onnxruntime-gpu* provides the CUDA EP.
@@ -68,7 +68,7 @@ If you prefer to use the CUDA EP for calibration, uninstall the existing *onnxru

 **5. Setup GPU Acceleration Tool for Quantization**

-ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ tool for GPU acceleration during the INT4 ONNX quantization process if you have CUDA 12.x.
+By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ package for GPU acceleration during the INT4 ONNX quantization process; it is compatible with CUDA 12.x.

 **6. Verify Installation**

@@ -79,6 +79,10 @@ Ensure the following steps are verified:
 - *onnxruntime-directml* (DirectML EP)
 - *onnxruntime-gpu* (CUDA EP)
 - *onnxruntime* (CPU EP)
+- **ONNX and ONNX Runtime Import**: Ensure that the following command runs successfully.
+
+  .. code-block:: bash
+
+      python -c "import onnx; import onnxruntime"
 - **Environment Variables**: For workflows using CUDA dependencies (e.g., CUDA EP-based calibration), ensure environment variables like *CUDA_PATH*, *CUDA_V12_4*, or *CUDA_V11_8* etc. are set correctly. Reopen the command-prompt if any environment variable is updated or newly created.
 - **ModelOpt-Windows Import Check**: Run the following command to ensure the installation is successful:
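For context, a minimal companion check (not taken from the diff above) that confirms which execution providers the installed ONNX Runtime package exposes:

.. code-block:: python

    import onnxruntime as ort

    # Expect "DmlExecutionProvider" with onnxruntime-directml, or
    # "CUDAExecutionProvider" with onnxruntime-gpu, plus "CPUExecutionProvider".
    print(ort.get_available_providers())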

docs/source/guides/0_support_matrix.rst

Lines changed: 78 additions & 9 deletions
@@ -29,7 +29,7 @@ Feature Support Matrix
      - PyTorch, ONNX*
      - TensorRT*, TensorRT-LLM
    * - INT8
-     - * Per-channel INT8 Weights, Per-Tensor FP8 Activations
+     - * Per-channel INT8 Weights, Per-Tensor INT8 Activations
        * Uses Smooth Quant Algorithm
        * GPUs: Ampere and Later
      - PyTorch, ONNX*
@@ -71,16 +71,18 @@ Feature Support Matrix
      - PyTorch*
      - TensorRT-LLM*
    * - FP8
-     - * Per-Tensor FP8 Weight & Activations
+     - * Per-Tensor FP8 Weight & Activations (PyTorch)
+       * Per-Tensor Activation and Per-Channel Weights quantization (ONNX)
+       * Uses Max calibration
        * GPUs: Ada and Later
-     - PyTorch*, ONNX*
-     - TensorRT*, TensorRT-LLM*
+     - PyTorch*, ONNX
+     - TensorRT*, TensorRT-LLM*, ORT-CUDA
    * - INT8
-     - * Per-channel INT8 Weights, Per-Tensor FP8 Activations
-       * Uses Smooth Quant Algorithm
+     - * Per-Channel INT8 Weights, Per-Tensor INT8 Activations
+       * Uses Smooth Quant (PyTorch)*, Max calibration (ONNX)
        * GPUs: Ada and Later
-     - PyTorch*, ONNX*
-     - TensorRT*, TensorRT-LLM*
+     - PyTorch*, ONNX
+     - TensorRT*, TensorRT-LLM*, ORT-CUDA

 .. note:: Features marked with an asterisk (*) are considered experimental.

@@ -98,16 +100,83 @@ Model Support Matrix
    :header-rows: 1

    * - Model
-     - ONNX INT4 AWQ
+     - ONNX INT4 AWQ (W4A16)
+     - ONNX INT8 Max (W8A8)
+     - ONNX FP8 Max (W8A8)
    * - Llama3.1-8B-Instruct
      - Yes
+     - No
+     - No
    * - Phi3.5-mini-Instruct
      - Yes
+     - No
+     - No
    * - Mistral-7B-Instruct-v0.3
      - Yes
+     - No
+     - No
    * - Llama3.2-3B-Instruct
      - Yes
+     - No
+     - No
    * - Gemma-2b-it
      - Yes
+     - No
+     - No
+   * - Gemma-2-2b
+     - Yes
+     - No
+     - No
+   * - Gemma-2-9b
+     - Yes
+     - No
+     - No
    * - Nemotron Mini 4B Instruct
      - Yes
+     - No
+     - No
+   * - Qwen2.5-7B-Instruct
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Llama-8B
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Qwen-1.5B
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Qwen-7B
+     - Yes
+     - No
+     - No
+   * - DeepSeek-R1-Distill-Qwen-14B
+     - Yes
+     - No
+     - No
+   * - Mistral-NeMo-Minitron-2B-128k-Instruct
+     - Yes
+     - No
+     - No
+   * - Mistral-NeMo-Minitron-4B-128k-Instruct
+     - Yes
+     - No
+     - No
+   * - Mistral-NeMo-Minitron-8B-128k-Instruct
+     - Yes
+     - No
+     - No
+   * - whisper-large
+     - No
+     - Yes
+     - Yes
+   * - sam2-hiera-large
+     - No
+     - Yes
+     - Yes
+
+.. note::
+    - ``ONNX INT8 Max`` means INT8 (W8A8) quantization of an ONNX model using Max calibration; the same holds for ``ONNX FP8 Max``.
+    - The LLMs in the above table are `GenAI <https://github.com/microsoft/onnxruntime-genai/>`_-built LLMs unless specified otherwise.
+    - Check the `examples <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/onnx_ptq/>`_ for model-specific instructions and scripts.