Torch Quantization to ONNX Export

This example demonstrates how to quantize PyTorch models (vision and LLM) and export them to ONNX format. The scripts leverage the ModelOpt toolkit for both quantization and ONNX export.

  • Pre-Requisites: Required packages to use this example
  • Vision Models: Quantize timm models and export to ONNX
  • LLM Export: Export LLMs to quantized ONNX
  • Mixed Precision: Auto mode for optimal per-layer quantization
  • Support Matrix: ONNX export supported LLM models
  • Resources: Extra links to relevant resources

Pre-Requisites

Docker

Please use the TensorRT docker image (e.g., nvcr.io/nvidia/tensorrt:26.02-py3) or visit our installation docs for more information.

Set the following environment variables inside the TensorRT docker.

export CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${LD_LIBRARY_PATH}"

Local Installation

Install Model Optimizer with its ONNX dependencies from PyPI using pip, then install the requirements for this example:

pip install -U "nvidia-modelopt[onnx]"
pip install -r requirements.txt

For TensorRT Compiler framework workloads:

Install the latest TensorRT from here.

Vision Models

The torch_quant_to_onnx.py script quantizes timm vision models and exports them to ONNX.

What it does

  • Loads a pretrained timm torch model (default: ViT-Base).
  • Quantizes the torch model to FP8, MXFP8, INT8, NVFP4, or INT4_AWQ using ModelOpt.
  • Exports the quantized model to ONNX.
  • Postprocesses the ONNX model to be compatible with TensorRT.
  • Saves the final ONNX model.

Opset 20 is used to export the torch models to ONNX.
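
For reference, the same flow can be reproduced directly with the ModelOpt Python API. The snippet below is a minimal sketch, assuming the mtq.FP8_DEFAULT_CFG configuration and random tensors standing in for real calibration data; depending on the ModelOpt version, the export call may need to be wrapped in ModelOpt's ONNX export helpers, and torch_quant_to_onnx.py additionally post-processes the exported graph for TensorRT.

# Minimal sketch of the quantize-then-export flow. Assumptions: mtq.FP8_DEFAULT_CFG
# as the quantization config and random tensors as stand-in calibration data.
import timm
import torch
import modelopt.torch.quantization as mtq

model = timm.create_model("vit_base_patch16_224", pretrained=True).cuda().eval()

# Calibration batches; replace with real ImageNet-like data for meaningful scales.
calib_data = [torch.randn(8, 3, 224, 224, device="cuda") for _ in range(16)]

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect activation ranges.
    with torch.no_grad():
        for batch in calib_data:
            m(batch)

# Insert fake-quantization nodes and calibrate them.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export with opset 20, as the example script does; torch_quant_to_onnx.py then
# post-processes the resulting graph for TensorRT compatibility.
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy_input, "vit_base_patch16_224.fp8.onnx", opset_version=20)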

Usage

python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=<fp8|mxfp8|int8|nvfp4|int4_awq> \
    --onnx_save_path=<path to save the exported ONNX model>

Evaluation

If the model is an image classification model, use the following script to evaluate it. The script automatically downloads and uses the ILSVRC/imagenet-1k dataset from Hugging Face. This gated repository requires authentication with a Hugging Face access token; see https://huggingface.co/docs/hub/en/security-tokens for details.
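
One way to supply the token (a minimal sketch, assuming the huggingface_hub package is installed) is to log in from Python before running the evaluation:

# Authenticate with a Hugging Face access token so the gated ImageNet-1k dataset
# can be downloaded (the token string below is a placeholder).
from huggingface_hub import login

login(token="hf_your_access_token_here")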

Note: TensorRT 10.11 or later is required to evaluate the MXFP8 or NVFP4 ONNX models.

python ../onnx_ptq/evaluate.py \
    --onnx_path=<path to the exported ONNX model> \
    --imagenet_path=<HF dataset card or local path to the ImageNet dataset> \
    --engine_precision=stronglyTyped \
    --model_name=vit_base_patch16_224

LLM Export

The llm_export.py script exports LLM models to ONNX with optional quantization.

What it does

  • Loads a HuggingFace LLM model (local path or model name).
  • Optionally quantizes the model to FP8, INT4_AWQ, or NVFP4 (a minimal sketch of this step follows the list).
  • Exports the model to ONNX format.
  • Post-processes the ONNX graph for TensorRT compatibility.
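
As referenced above, the quantization step roughly corresponds to the minimal sketch below. Assumptions: mtq.FP8_DEFAULT_CFG as the config and a couple of hard-coded prompts standing in for a real calibration set; llm_export.py then handles the ONNX export and TensorRT post-processing.

# Minimal sketch of the quantization step that precedes ONNX export.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

calib_prompts = ["Hello, how are you?", "Explain quantization in one sentence."]

def forward_loop(m):
    # Run calibration prompts through the model so activation ranges are collected.
    with torch.no_grad():
        for prompt in calib_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
# llm_export.py then exports the quantized model to ONNX and post-processes the
# graph for TensorRT compatibility.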

Usage

python llm_export.py \
    --hf_model_path=<HuggingFace model name or local path> \
    --dtype=<fp16|fp8|int4_awq|nvfp4> \
    --output_dir=<directory to save ONNX model>

Examples

Export Qwen2 to FP16 ONNX:

python llm_export.py \
    --hf_model_path=Qwen/Qwen2-0.5B-Instruct \
    --dtype=fp16 \
    --output_dir=./qwen2_fp16

Export Qwen2 to FP8 ONNX with quantization:

python llm_export.py \
    --hf_model_path=Qwen/Qwen2-0.5B-Instruct \
    --dtype=fp8 \
    --output_dir=./qwen2_fp8

Export to NVFP4 with custom calibration:

python llm_export.py \
    --hf_model_path=Qwen/Qwen3-0.6B \
    --dtype=nvfp4 \
    --calib_size=512 \
    --output_dir=./qwen3_nvfp4

Key Parameters

Parameter Description
--hf_model_path HuggingFace model name (e.g., Qwen/Qwen2-0.5B-Instruct) or local model path
--dtype Export precision: fp16, fp8, int4_awq, or nvfp4
--output_dir Directory to save the exported ONNX model
--calib_size Number of calibration samples for quantization (default: 512)
--lm_head Precision of lm_head layer (default: fp16)
--save_original Save the raw ONNX before post-processing
--trust_remote_code Trust remote code when loading from HuggingFace Hub

Mixed Precision Quantization (Auto Mode)

The auto mode enables mixed precision quantization by searching for the optimal quantization format per layer. This approach balances model accuracy and compression by assigning different precision formats (e.g., NVFP4, FP8) to different layers based on their sensitivity.

How it works

  1. Sensitivity Analysis: Computes per-layer sensitivity scores using gradient-based analysis
  2. Format Search: Searches across specified quantization formats for each layer
  3. Constraint Optimization: Finds the optimal format assignment that satisfies the effective bits constraint while minimizing accuracy loss (a simplified illustration follows this list)
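
As an illustration only (not ModelOpt's implementation), the constraint search can be pictured with the self-contained sketch below: each layer carries an estimated loss increase ("score") per candidate format, and the least sensitive layers are demoted from FP8 to NVFP4 until the average bits-per-weight budget is met.

# Simplified stand-in for the constraint search (illustrative, not ModelOpt's code).
def assign_formats(layers, budget_bits, high_bits=8.0, low_bits=4.0):
    # layers: list of dicts with 'name', 'params' (weight count),
    # 'fp8_score' and 'nvfp4_score' (estimated loss increase per format).
    assignment = {layer["name"]: "fp8" for layer in layers}
    total_params = sum(layer["params"] for layer in layers)

    def effective_bits():
        bits = {"fp8": high_bits, "nvfp4": low_bits}
        return sum(l["params"] * bits[assignment[l["name"]]] for l in layers) / total_params

    # Demote the least sensitive layers first (smallest extra loss per saved bit).
    by_sensitivity = sorted(layers, key=lambda l: (l["nvfp4_score"] - l["fp8_score"]) / l["params"])
    for layer in by_sensitivity:
        if effective_bits() <= budget_bits:
            break
        assignment[layer["name"]] = "nvfp4"
    return assignment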

Key Parameters

Parameter Default Description
--effective_bits 4.8 Target average bits per weight across the model. Lower values = more compression but potentially lower accuracy. The search algorithm finds the optimal per-layer format assignment that meets this constraint while minimizing accuracy loss. For example, 4.8 means an average of 4.8 bits per weight (mix of FP4 and FP8 layers). A worked example follows this table.
--num_score_steps 128 Number of forward/backward passes used to compute per-layer sensitivity scores via gradient-based analysis. Higher values provide more accurate sensitivity estimates but increase search time. Recommended range: 64-256.
--calibration_data_size 512 Number of calibration samples used for both sensitivity scoring and calibration. For auto mode, labels are required for loss computation.
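
As a concrete reading of the effective-bits target (illustrative arithmetic only, ignoring per-block scale-factor overhead):

# With a mix of 4-bit (NVFP4) and 8-bit (FP8) layers, the fraction of weights
# that must be kept in 4-bit to hit a 4.8 effective-bits target is:
target_bits = 4.8
fp4_bits, fp8_bits = 4, 8
fp4_fraction = (fp8_bits - target_bits) / (fp8_bits - fp4_bits)
print(fp4_fraction)  # 0.8 -> roughly 80% of the weights end up in NVFP4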

Usage

python torch_quant_to_onnx.py \
    --timm_model_name=vit_base_patch16_224 \
    --quantize_mode=auto \
    --auto_quantization_formats NVFP4_AWQ_LITE_CFG FP8_DEFAULT_CFG \
    --effective_bits=4.8 \
    --num_score_steps=128 \
    --calibration_data_size=512 \
    --evaluate \
    --onnx_save_path=vit_base_patch16_224.auto_quant.onnx

Results (ViT-Base)

Configuration Top-1 accuracy (torch) Top-5 accuracy (torch)
Torch autocast (FP16) 85.11% 97.53%
NVFP4 Quantized 84.558% 97.36%
Auto Quantized (FP8 + NVFP4, 4.78 effective bits) 84.726% 97.434%

ONNX Export Supported LLM Models

The following LLM models are supported for ONNX export (see the --dtype options above for the available precisions: fp16, fp8, int4_awq, and nvfp4):

  • Llama-3-8B-Instruct
  • Llama3.1-8B
  • Llama3.2-3B
  • Qwen2-0.5B-Instruct
  • Qwen2-1.5B-Instruct
  • Qwen2-7B-Instruct
  • Qwen2.5-0.5B-Instruct
  • Qwen2.5-1.5B-Instruct
  • Qwen2.5-3B-Instruct
  • Qwen2.5-7B-Instruct

Resources

Technical Resources

The example scripts support several quantization schemes:

  1. The FP8 format is available on Hopper and Ada GPUs with CUDA compute capability greater than or equal to 8.9 (a quick capability check is sketched after this list).

  2. INT4 AWQ is an INT4 weight-only quantization and calibration method. It is particularly effective for low-batch inference, where latency is dominated by weight loading time rather than by computation. For low-batch inference, INT4 AWQ can give lower latency than FP8/INT8 and lower accuracy degradation than INT8.

  3. NVFP4 is one of the new FP4 formats supported by NVIDIA Blackwell GPUs and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights and activations, offering the potential for both a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.
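
A quick way to check what the local GPU can run is sketched below. The thresholds are assumptions: compute capability >= 8.9 for FP8 per point 1 above, and >= 10.0 as a proxy for Blackwell-class NVFP4 support.

# Report which quantized formats the current GPU can likely execute.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FP8 supported:", (major, minor) >= (8, 9))
print("NVFP4 supported:", (major, minor) >= (10, 0))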