The example script shows how to use the ModelOpt-Windows toolkit to optimize ONNX (Open Neural Network Exchange) models through quantization. The toolkit is aimed at developers who want to improve model performance, reduce model size, and accelerate inference while preserving the accuracy of neural networks deployed with backends like DirectML on local RTX GPUs running Windows.
Quantization is a technique that converts models from floating-point to lower-precision formats, such as integers, which are more computationally efficient. This process can significantly speed up execution on supported hardware, while also reducing memory and bandwidth requirements.
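As a rough illustration of the idea, the sketch below applies textbook symmetric round-to-nearest quantization to a handful of weights using plain Python. The function names `quantize_rtn` and `dequantize` are hypothetical and the scheme is simplified; it is not the ModelOpt-Windows implementation.

```python
def quantize_rtn(values, num_bits=4):
    """Symmetric round-to-nearest quantization of a list of floats.

    Returns (integer codes, scale). Textbook illustration only,
    not the ModelOpt-Windows implementation.
    """
    qmax = 2 ** (num_bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (num_bits - 1))       # -8 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, clamp to the INT range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_rtn(weights)
restored = dequantize(codes, scale)
print(codes)     # [7, -3, 1, 0]
print(restored)  # approximately [0.7, -0.3, 0.1, 0.0]
```

Each weight is stored as a 4-bit code plus one shared scale, which is where the memory saving comes from; the gap between `weights` and `restored` is the quantization error.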
This example takes an ONNX model as input, along with the necessary quantization settings, and generates a quantized ONNX model as output. The script can be used to quantize popular Large Language Models (LLMs) in ONNX format that were built with ONNX Runtime GenAI.
- Install ModelOpt-Windows. Refer to the installation instructions.
- Install the required dependencies:
  pip install -r requirements.txt
You may generate the base model using the model builder that comes with onnxruntime-genai. The ORT-GenAI model builder downloads the original PyTorch model from Hugging Face and produces an ONNX GenAI compatible base model in ONNX format. See the example command line below:
python -m onnxruntime_genai.models.builder -m meta-llama/Meta-Llama-3-8B -p fp16 -e dml -o E:\llama3-8b-fp16-dml-genai
To begin quantization, run the script as shown below:
python quantize.py --model_name=meta-llama/Meta-Llama-3-8B \
--onnx_path="E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\model.onnx" \
--output_path="E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\cnn_32_lite_0.1_16\model.onnx" \
--calib_size=32 --algo=awq_lite --dataset=cnn
The table below lists key command-line arguments of the ONNX PTQ example script.
| Argument | Supported Values | Description |
|---|---|---|
| `--calib_size` | 32 (default), 64, 128 | Specifies the calibration size. |
| `--dataset` | cnn (default), pilevel | Choose calibration dataset: cnn_dailymail or pile-val. |
| `--algo` | awq_lite (default), awq_clip, rtn, rtn_dq | Select the quantization algorithm. |
| `--onnx_path` | input .onnx file path | Path to the input ONNX model. |
| `--output_path` | output .onnx file path | Path to save the quantized ONNX model. |
| `--use_zero_point` | Default: zero-point is disabled | Use this option to enable zero-point based quantization. |
| `--block-size` | 32, 64, 128 (default) | Block size for AWQ. |
| `--awqlite_alpha_step` | 0.1 (default) | Step size for AWQ scale search (user-defined). |
| `--awqlite_run_per_subgraph` | Default: run_per_subgraph is disabled | Use this option to run AWQ scale search at the subgraph level. |
| `--awqlite_disable_fuse_nodes` | Default: fuse_nodes enabled | Use this option to disable fusion of input scales in parent nodes. |
| `--awqclip_alpha_step` | 0.05 (default) | Step size for AWQ weight clipping (user-defined). |
| `--awqclip_alpha_min` | 0.5 (default) | Minimum AWQ weight-clipping threshold (user-defined). |
| `--awqclip_bsz_col` | 1024 (default) | Chunk size in columns during weight clipping (user-defined). |
| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [dml,cpu]) | List of execution providers to use for the session run during calibration. |
| `--no_position_ids` | Default: position_ids input enabled | Use this option to disable the position_ids input in calibration data. |
| `--enable_mixed_quant` | Default: mixed quant disabled | Use this option to enable mixed precision quantization. |
| `--layers_8bit` | Default: None | Use this option to override the default mixed quant strategy. |
| `--gather_quantize_axis` | Default: None | Use this option to enable INT4 quantization of Gather nodes (choose 0 or 1). |
| `--gather_block_size` | Default: 32 | Block size for INT4 quantization of Gather nodes (when it is enabled using the gather_quantize_axis option). |
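When the script is driven from Python rather than typed by hand, the arguments above can be assembled into an argument list. The sketch below is one possible way to do that; `build_quantize_command` is a hypothetical helper, and it uses only flags that appear in this document. Passing a list to `subprocess.run` also sidesteps shell-specific line-continuation characters.

```python
import sys


def build_quantize_command(model_name, onnx_path, output_path,
                           algo="awq_lite", calib_size=32, dataset="cnn",
                           enable_mixed_quant=False):
    """Return an argv list for quantize.py (hypothetical helper).

    Defaults mirror the defaults listed in the arguments table.
    """
    cmd = [
        sys.executable, "quantize.py",
        f"--model_name={model_name}",
        f"--onnx_path={onnx_path}",
        f"--output_path={output_path}",
        f"--algo={algo}",
        f"--calib_size={calib_size}",
        f"--dataset={dataset}",
    ]
    if enable_mixed_quant:
        # Boolean switch: present or absent, no value.
        cmd.append("--enable_mixed_quant")
    return cmd


cmd = build_quantize_command(
    "meta-llama/Meta-Llama-3-8B",
    r"E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\model.onnx",
    r"E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\cnn_32_lite_0.1_16\model.onnx",
)
print(" ".join(cmd[1:]))
# To launch the script for real (from the directory containing quantize.py):
#   import subprocess; subprocess.run(cmd, check=True)
```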
Run the following command to view all available parameters in the script:
python quantize.py --help
Note:
- For the `algo` argument, the following options are available: awq_lite, awq_clip, rtn, rtn_dq.
  - The 'awq_lite' option does core AWQ scale search and INT4 quantization.
  - The 'awq_clip' option primarily does weight clipping and INT4 quantization.
  - The 'rtn' option does INT4 RTN quantization with Q->DQ nodes for weights.
  - The 'rtn_dq' option does INT4 RTN quantization with only DQ nodes for weights.
  - The RTN algorithm doesn't use calibration data.
- If needed for the input base model, use the `--no_position_ids` command-line option to disable generating the position_ids calibration input. GenAI-built LLM models produced with the DML EP have a position_ids input, but those produced with the CUDA EP or NvTensorRtRtx EP don't. Use `--help` or the command-line options table above to inspect default values.
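One way to decide whether `--no_position_ids` is needed is to list the graph inputs of the base model. The sketch below assumes the `onnx` Python package is installed and that the input is literally named `position_ids`; the helper names and the sample input lists are illustrative, and real GenAI models expose more inputs than shown here.

```python
def read_input_names(onnx_path):
    """Return the graph input names of an ONNX model.

    Requires the `onnx` package. External weight data is not loaded,
    so this stays cheap even for large models.
    """
    import onnx  # imported lazily so the helper below works without it

    model = onnx.load(onnx_path, load_external_data=False)
    return [graph_input.name for graph_input in model.graph.input]


def needs_no_position_ids_flag(input_names):
    """True when the model has no position_ids input."""
    return "position_ids" not in input_names


# Illustrative input lists (assumed, not taken from a real model).
dml_style_inputs = ["input_ids", "attention_mask", "position_ids"]
cuda_style_inputs = ["input_ids", "attention_mask"]
print(needs_no_position_ids_flag(dml_style_inputs))   # False
print(needs_no_position_ids_flag(cuda_style_inputs))  # True
```

For a real model, `needs_no_position_ids_flag(read_input_names(path_to_model))` would combine the two helpers.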
Please refer to quantize.py for further details on command-line parameters.
ModelOpt-Windows supports mixed precision quantization, where different layers in the model can be quantized to different bit-widths. This approach combines INT4 quantization for most layers (for maximum compression and speed) with INT8 quantization for important or sensitive layers (to preserve accuracy).
Mixed precision quantization provides an optimal balance between:
- Model Size: Primarily INT4 keeps the model small
- Inference Speed: INT4 layers run faster
- Accuracy Preservation: Critical layers in INT8 maintain model quality
Based on benchmark results, mixed precision quantization shows significant advantages:
| Model | Metric | INT4 RTN | Mixed RTN (INT4+INT8) | Improvement |
|---|---|---|---|---|
| DeepSeek R1 1.5B | MMLU | 32.40% | 33.90% | +1.5% |
| | Perplexity | 46.304 | 44.332 | -2.0 (lower is better) |
| Llama 3.2 1B | MMLU | 39.90% | 44.70% | +4.8% |
| | Perplexity | 16.900 | 14.176 | -2.7 (lower is better) |
| Qwen 2.5 1.5B | MMLU | 56.70% | 57.50% | +0.8% |
| | Perplexity | 10.933 | 10.338 | -0.6 (lower is better) |
As shown above, mixed precision improves MMLU accuracy by 0.8 to 4.8 percentage points and lowers perplexity on all three models, with a small increase in disk size (~85-109 MB).
The quantization strategy selects which layers to quantize to INT8 vs INT4:
- INT8 Layers (Higher Precision): Important layers that significantly impact model quality. Quantized per-channel.
- INT4 Layers (Maximum Compression): All other layers. Quantized block-wise.
This strategy preserves accuracy for the most sensitive layers while maintaining aggressive compression elsewhere.
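To see why keeping sensitive layers in INT8 helps, the toy comparison below quantizes one weight row two ways: block-wise INT4 (one scale per block) and per-channel INT8 (one scale for the whole row). It is a plain-Python sketch with made-up weights and a tiny block size of 4, using simple symmetric round-to-nearest; the helper names are hypothetical and this is not how ModelOpt-Windows is implemented internally.

```python
def quantize_dequantize(values, num_bits):
    """Symmetric round-to-nearest quantize, then dequantize, with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def blockwise_int4(row, block_size):
    """INT4 with one scale per block of `block_size` consecutive weights."""
    restored = []
    for start in range(0, len(row), block_size):
        restored.extend(quantize_dequantize(row[start:start + block_size], 4))
    return restored


def per_channel_int8(row):
    """INT8 with a single scale for the whole row (treated as one channel)."""
    return quantize_dequantize(row, 8)


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up weights for illustration only.
row = [0.91, -0.42, 0.07, 0.33, -0.88, 0.15, 0.26, -0.05]
int4_error = max_abs_error(row, blockwise_int4(row, block_size=4))
int8_error = max_abs_error(row, per_channel_int8(row))
print(f"INT4 block-wise max error:  {int4_error:.4f}")
print(f"INT8 per-channel max error: {int8_error:.4f}")
```

On this row the INT8 error is more than an order of magnitude smaller than the INT4 error, at the cost of twice the bits per weight, which is the trade-off the mixed strategy applies selectively.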
python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \
--onnx_path="E:\models\llama3.2-1b-fp16\model.onnx" \
--output_path="E:\models\llama3.2-1b-int4-int8-mixed\model.onnx" \
--algo=awq_lite \
--calib_size=32 \
--enable_mixed_quant
The `--enable_mixed_quant` flag automatically applies the default strategy.
python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \
--onnx_path="E:\models\llama3.2-1b-fp16\model.onnx" \
--output_path="E:\models\llama3.2-1b-int4-int8-custom\model.onnx" \
--algo=awq_lite \
--calib_size=32 \
--layers_8bit="layers.0,layers.1,layers.15,layers.16"
The `--layers_8bit` option allows you to manually specify which layers to quantize to INT8. You can use:
- Layer indices: `layers.0,layers.5,layers.10`
- Layer paths: `model/layers.0/attn/qkv_proj`
- Partial names: `qkv_proj,down_proj`
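The sketch below shows how such a comma-separated pattern list could select nodes, assuming plain substring matching against node names. Both the matching rule and the node names are assumptions made for illustration; the actual selection logic lives in quantize.py and may differ.

```python
def select_8bit_nodes(node_names, layers_8bit):
    """Return node names that contain any of the comma-separated patterns.

    Assumes simple substring matching (an illustration, not the
    confirmed behavior of quantize.py).
    """
    patterns = [p.strip() for p in layers_8bit.split(",") if p.strip()]
    return [name for name in node_names if any(p in name for p in patterns)]


# Hypothetical node names in the style of a GenAI-built model.
node_names = [
    "/model/layers.0/attn/qkv_proj/MatMul",
    "/model/layers.0/mlp/down_proj/MatMul",
    "/model/layers.1/attn/qkv_proj/MatMul",
    "/model/layers.10/attn/qkv_proj/MatMul",
]

print(select_8bit_nodes(node_names, "layers.0"))            # both layers.0 nodes
print(select_8bit_nodes(node_names, "down_proj"))           # the single down_proj node
print(select_8bit_nodes(node_names, "layers.1"))            # layers.1 and layers.10
```

Under this assumed rule, `layers.1` also matches `layers.10`, so a more specific path such as `layers.1/` would be needed to target a single layer.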
- Block Size: INT4 layers use block-wise quantization (default block-size=128), INT8 uses per-channel quantization
- Quantization Axis: INT4 (per-block), INT8 (per-channel row-wise)
- Compatibility: Works with both `awq_lite` and `rtn_dq` algorithms
- Automatic Detection: The `--layers_8bit` option automatically enables mixed quantization
For more benchmark results and detailed accuracy metrics, refer to the Benchmark Guide.
To evaluate the quantized model, please refer to the accuracy benchmarking and onnxruntime-genai performance benchmarking.
Once an ONNX FP16 model is quantized using ModelOpt-Windows, the resulting quantized ONNX model can be deployed on the DirectML backend using ORT-GenAI or ORT.
Refer to the following example scripts and tutorials for deployment:
Please refer to the support matrix for a full list of supported features and models.
- Configure Directories
  - Update the `cache_dir` variable in the `main()` function to specify the path where you want to store Hugging Face files (optional).
  - If you're low on space on the C: drive, change the TMP and TEMP environment variables to a different drive (e.g., `D:\temp`).
- Authentication for Restricted Models
  If the model you wish to use is hosted on Hugging Face and requires authentication, log in using the huggingface-cli before running the quantization script.
  huggingface-cli login --token <HF_TOKEN>
- Check Read/Write Permissions
  Ensure that both the input and output model paths have the necessary read and write permissions to avoid any permission-related errors.
- Check Output Path
  Ensure that the output .onnx file doesn't already exist. For example, if the output path is `C:\dir1\dir2\quant\model_quant.onnx`, then the path `C:\dir1\dir2\quant` should be valid and the directory `quant` should not already contain a `model_quant.onnx` file before quantization. If the output .onnx file already exists, the quantized model may be appended to it during saving, producing a corrupted or invalid output model.
- Check Input Model
  During INT4 AWQ execution, the input ONNX model (the one given in the `--onnx_path` argument) is run with onnxruntime (ORT) for calibration, using the ORT EP given in the `--calibration_eps` argument. Make sure that the input ONNX model runs correctly with the specified ORT EP.
- Config availability for calibration with NvTensorRtRtx EP
  When `NvTensorRtRtx` is used for INT4 AWQ quantization, a profile (min/max/opt ranges) of the model's input shapes is created internally from the model's config (e.g. config.json in the HuggingFace model card). This input-shapes profile is used during onnxruntime session creation. Make sure that config.json is available in the model directory if `model_name` is a local model path (instead of a HuggingFace model name).
- Error - Invalid Position-IDs input to the ONNX model
  ONNX models produced using ONNX Runtime GenAI have different IO bindings depending on the execution provider (EP) they were built for. For instance, a model built with the DML EP has a position-ids input, but models built using the CUDA EP or NvTensorRtRtx EP don't. If the base model requires it, use the `--no_position_ids` command-line argument to disable the position_ids calibration input, or hard-code the "add_position_ids" variable to `False` in the quantize script.
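Several of the path-related checks above can be automated before launching a long quantization run. The sketch below uses only the Python standard library; `preflight_check` is a hypothetical helper, and the demonstration runs in a temporary directory with empty placeholder files rather than real models.

```python
from pathlib import Path
import tempfile


def preflight_check(onnx_path, output_path):
    """Return a list of problems found before running quantization."""
    problems = []
    onnx_file, output_file = Path(onnx_path), Path(output_path)
    if not onnx_file.is_file():
        problems.append(f"input model not found: {onnx_file}")
    if not output_file.parent.is_dir():
        problems.append(f"output directory does not exist: {output_file.parent}")
    if output_file.exists():
        # An existing output file may be appended to and end up corrupted.
        problems.append(f"output file already exists: {output_file}")
    return problems


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model.onnx"
    model.write_bytes(b"")
    quant_dir = Path(tmp) / "quant"
    quant_dir.mkdir()
    output = quant_dir / "model_quant.onnx"

    clean = preflight_check(model, output)   # nothing wrong yet
    output.write_bytes(b"")                  # simulate a stale output file
    stale = preflight_check(model, output)

print(clean)  # []
print(stale)  # one entry reporting that the output file already exists
```

The check does not cover permissions or whether the model actually runs on the chosen execution provider; those still need to be verified separately.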