Commit 845bae6

Add Windows 0.33.0 release changes
1 parent 55b9106 commit 845bae6

11 files changed: +257 −70 lines changed


CHANGELOG-Windows.rst

Lines changed: 8 additions & 0 deletions
@@ -2,6 +2,14 @@
 Model Optimizer Changelog (Windows)
 ===================================

+0.33 (2025-09-03)
+^^^^^^^^^^^^^^^^^
+
+**New Features**
+
+- TensorRT Model Optimizer for Windows now supports the `NvTensorRtRtx <https://onnxruntime.ai/docs/execution-providers/TensorRTRTX-ExecutionProvider.html>`_ execution provider.
+
+
 0.27 (2025-04-30)
 ^^^^^^^^^^^^^^^^^

docs/source/getting_started/windows/_installation_with_olive.rst

Lines changed: 9 additions & 3 deletions
@@ -24,8 +24,9 @@ Setup Steps for Olive with ModelOpt-Windows
     $ pip install onnxruntime-genai-directml>=0.4.0
     $ pip install onnxruntime-directml==1.20.0

+- The above onnxruntime and onnxruntime-genai packages enable the Olive workflow with the DirectML Execution Provider (EP). To use other EPs, install the corresponding packages.

-Additionally, ensure that dependencies for TensorRT Model Optimizer - Windows are met as mentioned in the :ref:`Install-Page-Standalone-Windows`.
+- Additionally, ensure that dependencies for TensorRT Model Optimizer - Windows are met as mentioned in the :ref:`Install-Page-Standalone-Windows`.

 **2. Configure Olive for TensorRT Model Optimizer – Windows**
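The first added bullet tells readers to install the packages that match whichever EP they plan to use. As a rough sketch only (the CUDA-EP package names below are assumptions based on the public onnxruntime and onnxruntime-genai release channels, not something this commit specifies), switching the same setup to the CUDA EP might look like:

```bash
# Sketch: CUDA EP counterparts of the DirectML packages above; verify names and
# versions against the onnxruntime-genai release notes for your configuration.
pip install "onnxruntime-genai-cuda>=0.4.0"
pip install onnxruntime-gpu
```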

@@ -36,7 +37,11 @@ Setup Steps for Olive with ModelOpt-Windows

 - **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. [Refer to the `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_ Olive example]

-**4. Run the Optimization**
+**4. Install other dependencies**
+
+- Install any other requirements needed by the Olive scripts and configuration.
+
+**5. Run the Optimization**

 - **Execute Optimization:** To start the optimization process, run the following commands:
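The commands themselves live in the unchanged part of the file, so they are not shown in this hunk. As a hedged sketch only (assuming a typical JSON-driven Olive workflow; `config.json` and `requirements.txt` are placeholder names, and older Olive releases use the module form `python -m olive.workflows.run`):

```bash
# Sketch: install the example's remaining requirements, then launch the Olive workflow.
pip install -r requirements.txt   # placeholder; use the requirements file of your Olive example
olive run --config config.json    # or: python -m olive.workflows.run --config config.json
```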

@@ -56,4 +61,5 @@ Setup Steps for Olive with ModelOpt-Windows

 **Note**:

-#. Currently, the TensorRT-Model Optimizer - Windows only supports Onnx Runtime GenAI based models in the Olive workflow.
+#. Currently, TensorRT Model Optimizer - Windows only supports ONNX Runtime GenAI based LLM models in the Olive workflow.
+#. To try out different LLMs and EPs in the ModelOpt-Windows Olive workflow, refer to the details provided in the `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_ Olive example.

examples/windows/README.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
 #### A Library to Quantize and Compress Deep Learning Models for Optimized Inference on Native Windows RTX GPUs

 [![Documentation](https://img.shields.io/badge/Documentation-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-Model-Optimizer/)
-[![version](https://img.shields.io/badge/v0.27.0-orange?label=Release)](https://pypi.org/project/nvidia-modelopt/0.27.0/)
+[![version](https://img.shields.io/badge/v0.33.0-orange?label=Release)](https://pypi.org/project/nvidia-modelopt/)
 [![license](https://img.shields.io/badge/License-Apache%202.0-blue)](../../LICENSE)

 [Examples](#examples) |

@@ -59,7 +59,7 @@ pip install onnxruntime-genai-directml>=0.4.0
 pip install onnxruntime-directml==1.20.0
 ```

-For more details, please refer to the [detailed installation instructions](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html).
+For more details, please refer to the [detailed installation instructions](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/windows/_installation_for_Windows.html).

 ## Techniques

examples/windows/accuracy_benchmark/README.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ This repository provides scripts, popular third-party benchmarks, and instructions

 The MMLU benchmark assesses LLM performance across a wide range of tasks, producing a score between 0 and 1, where a higher score indicates better accuracy. Please refer to the [MMLU Paper](https://arxiv.org/abs/2009.03300) for more details.

-### MMLU Setup
+### Setup

 The table below lists the setup steps to prepare your environment for evaluating LLMs using the MMLU benchmark.

examples/windows/onnx_ptq/genai_llm/README.md

Lines changed: 29 additions & 7 deletions
@@ -8,7 +8,7 @@ This example takes an ONNX model as input, along with the necessary quantization

 ### Setup

-1. Install ModelOpt-Windows. Refer [installation instructions](../README.md).
+1. Install ModelOpt-Windows. Refer to the [installation instructions](../../README.md).

 1. Install required dependencies
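For completeness, a minimal sketch of the first setup step, assuming the standard `nvidia-modelopt` package on PyPI with its ONNX extra as described on the linked installation page (confirm the exact extra name there):

```bash
# Sketch: install ModelOpt-Windows with ONNX quantization support (extra name assumed).
pip install "nvidia-modelopt[onnx]"
```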

@@ -43,30 +43,44 @@ The table below lists key command-line arguments of the ONNX PTQ example script.
 |---------------------------|------------------------------------------------------|-------------------------------------------------------------|
 | `--calib_size` | 32 (default), 64, 128 | Specifies the calibration size. |
 | `--dataset` | cnn (default), pilevel | Choose calibration dataset: cnn_dailymail or pile-val. |
-| `--algo` | awq_lite (default), awq_clip | Select the quantization algorithm. |
+| `--algo` | awq_lite (default), awq_clip, rtn, rtn_dq | Select the quantization algorithm. |
 | `--onnx_path` | input .onnx file path | Path to the input ONNX model. |
 | `--output_path` | output .onnx file path | Path to save the quantized ONNX model. |
-| `--use_zero_point` | True, False (default) | Enable zero-point based quantization. |
+| `--use_zero_point` | Default: zero-point is disabled | Use this option to enable zero-point based quantization. |
 | `--block-size` | 32, 64, 128 (default) | Block size for AWQ. |
 | `--awqlite_alpha_step` | 0.1 (default) | Step-size for AWQ scale search, user-defined |
-| `--awqlite_run_per_subgraph` | True, False (default) | If True, runs AWQ scale search at the subgraph level |
-| `--awqlite_fuse_nodes` | True (default), False | If True, fuses input scales in parent nodes. |
+| `--awqlite_run_per_subgraph` | Default: run_per_subgraph is disabled | Use this option to run AWQ scale search at the subgraph level. |
+| `--awqlite_disable_fuse_nodes` | Default: fuse_nodes is enabled | Use this option to disable fusion of input scales into parent nodes. |
 | `--awqclip_alpha_step` | 0.05 (default) | Step-size for AWQ weight clipping, user-defined |
 | `--awqclip_alpha_min` | 0.5 (default) | Minimum AWQ weight-clipping threshold, user-defined |
 | `--awqclip_bsz_col` | 1024 (default) | Chunk size in columns during weight clipping, user-defined |
-| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [dml,cpu]) | List of calibration endpoints. |
+| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [dml,cpu]) | List of execution providers to use for the session run during calibration. |
+| `--no_position_ids` | Default: position_ids input enabled | Use this option to disable the position_ids input in calibration data. |

 Run the following command to view all available parameters in the script:

 ```bash
 python quantize.py --help
 ```

+Note:
+
+1. For the `algo` argument, the following options are available: awq_lite, awq_clip, rtn, rtn_dq.
+   - The 'awq_lite' option performs the core AWQ scale search and INT4 quantization.
+   - The 'awq_clip' option primarily performs weight clipping and INT4 quantization.
+   - The 'rtn' option performs INT4 RTN quantization with Q->DQ nodes for weights.
+   - The 'rtn_dq' option performs INT4 RTN quantization with only DQ nodes for weights.
+1. The RTN algorithm doesn't use calibration data.
+1. If needed for the input base model, use the `--no_position_ids` command-line option to disable
+   generating the position_ids calibration input. GenAI-built LLM models produced with the DML EP have
+   a position_ids input, but ones produced with the CUDA EP or NvTensorRtRtx EP don't.
+   Use `--help` or the command-line options table above to inspect default values.
+
 Please refer to `quantize.py` for further details on command-line parameters.
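To make the table concrete, here is a hedged example invocation assembled only from the arguments documented above. The model paths are placeholders, and the exact list syntax accepted by `--calibration_eps` should be confirmed with `python quantize.py --help`:

```bash
# Sketch: INT4 AWQ quantization of a GenAI-exported ONNX LLM, calibrating via the NvTensorRtRtx EP.
# model/model.onnx and model_int4/model.onnx are placeholder paths.
python quantize.py \
    --onnx_path model/model.onnx \
    --output_path model_int4/model.onnx \
    --algo awq_lite \
    --dataset cnn \
    --calib_size 32 \
    --calibration_eps NvTensorRtRtx \
    --no_position_ids
```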
 ### Evaluate the Quantized Model

-To evaluate the quantized model, please refer to the [accuracy benchmarking](../accuracy_benchmark/README.md) and [onnxruntime-genai performance benchmarking](https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python).
+To evaluate the quantized model, please refer to the [accuracy benchmarking](../../accuracy_benchmark/README.md) and [onnxruntime-genai performance benchmarking](https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python).

 ### Deployment
@@ -107,3 +121,11 @@ Please refer to [support matrix](https://nvidia.github.io/TensorRT-Model-Optimiz
 1. **Check Input Model**

    During INT4 AWQ execution, the input onnx model (one mentioned in `--onnx_path` argument) will be run with onnxruntime (ORT) for calibration (using ORT EP mentioned in `--calibration_eps` argument). So, make sure that input onnx model is running fine with the specified ORT EP.
+
+1. **Config availability for calibration with NvTensorRtRtx EP**
+
+   Note that while using `NvTensorRtRtx` for INT4 AWQ quantization, a profile (min/max/opt ranges) of the model's input shapes is created internally using details from the model's config (e.g. config.json in the Hugging Face model card). This input-shapes profile is used during onnxruntime session creation. Make sure that config.json is available in the model directory if `model_name` is a local model path (instead of a Hugging Face model name).
+
+1. **Error - Invalid Position-IDs input to the ONNX model**
+
+   ONNX models produced with ONNX Runtime GenAI have different IO bindings depending on the execution provider (EP) used to build them. For instance, a model built with the DML EP has a position_ids input, but models built using the CUDA EP or NvTensorRtRtx EP don't. So, if the base model requires it, use the `--no_position_ids` command-line argument to disable the position_ids calibration input, or set the "add_position_ids" variable to `False` (hard-coded) in the quantize script if required.
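For the first and third items above, a quick sanity check is to open an ONNX Runtime session with the intended EP and list the model's inputs, which also shows whether a position_ids input is present. A minimal sketch, with `model.onnx` and `DmlExecutionProvider` as placeholders; substitute the provider that matches your `--calibration_eps` choice:

```bash
# Sketch: verify the input model loads with the intended ORT EP and inspect its input names.
python -c "import onnxruntime as ort; \
s = ort.InferenceSession('model.onnx', providers=['DmlExecutionProvider']); \
print('active providers:', s.get_providers()); \
print('model inputs:', [i.name for i in s.get_inputs()])"
```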
