
ONNX model conversion failure with TensorRT 10.3 when running trtexec on Jetson Orin GPU #4494

@aw632

Description

I tried to run

trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine

on a Jetson Orin (so I am locked to TensorRT 10.3 and CUDA 12.6), but got this error:

&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine
[06/20/2025-12:02:56] [I] === Model Options ===
[06/20/2025-12:02:56] [I] Format: ONNX
[06/20/2025-12:02:56] [I] Model: model_quantized_int8_512x512_iter4.onnx
[06/20/2025-12:02:56] [I] Output:
[06/20/2025-12:02:56] [I] === Build Options ===
[06/20/2025-12:02:56] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[06/20/2025-12:02:56] [I] avgTiming: 8
[06/20/2025-12:02:56] [I] Precision: FP32+INT8
[06/20/2025-12:02:56] [I] LayerPrecisions: 
[06/20/2025-12:02:56] [I] Layer Device Types: 
[06/20/2025-12:02:56] [I] Calibration: Dynamic
[06/20/2025-12:02:56] [I] Refit: Disabled
[06/20/2025-12:02:56] [I] Strip weights: Disabled
[06/20/2025-12:02:56] [I] Version Compatible: Disabled
[06/20/2025-12:02:56] [I] ONNX Plugin InstanceNorm: Disabled
[06/20/2025-12:02:56] [I] TensorRT runtime: full
[06/20/2025-12:02:56] [I] Lean DLL Path: 
[06/20/2025-12:02:56] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[06/20/2025-12:02:56] [I] Exclude Lean Runtime: Disabled
[06/20/2025-12:02:56] [I] Sparsity: Disabled
[06/20/2025-12:02:56] [I] Safe mode: Disabled
[06/20/2025-12:02:56] [I] Build DLA standalone loadable: Disabled
[06/20/2025-12:02:56] [I] Allow GPU fallback for DLA: Disabled
[06/20/2025-12:02:56] [I] DirectIO mode: Disabled
[06/20/2025-12:02:56] [I] Restricted mode: Disabled
[06/20/2025-12:02:56] [I] Skip inference: Disabled
[06/20/2025-12:02:56] [I] Save engine: model_quantized_int8_512x512_iter4.engine
[06/20/2025-12:02:56] [I] Load engine: 
[06/20/2025-12:02:56] [I] Profiling verbosity: 0
[06/20/2025-12:02:56] [I] Tactic sources: Using default tactic sources
[06/20/2025-12:02:56] [I] timingCacheMode: local
[06/20/2025-12:02:56] [I] timingCacheFile: 
[06/20/2025-12:02:56] [I] Enable Compilation Cache: Enabled
[06/20/2025-12:02:56] [I] errorOnTimingCacheMiss: Disabled
[06/20/2025-12:02:56] [I] Preview Features: Use default preview flags.
[06/20/2025-12:02:56] [I] MaxAuxStreams: -1
[06/20/2025-12:02:56] [I] BuilderOptimizationLevel: -1
[06/20/2025-12:02:56] [I] Calibration Profile Index: 0
[06/20/2025-12:02:56] [I] Weight Streaming: Disabled
[06/20/2025-12:02:56] [I] Runtime Platform: Same As Build
[06/20/2025-12:02:56] [I] Debug Tensors: 
[06/20/2025-12:02:56] [I] Input(s)s format: fp32:CHW
[06/20/2025-12:02:56] [I] Output(s)s format: fp32:CHW
[06/20/2025-12:02:56] [I] Input build shapes: model
[06/20/2025-12:02:56] [I] Input calibration shapes: model
[06/20/2025-12:02:56] [I] === System Options ===
[06/20/2025-12:02:56] [I] Device: 0
[06/20/2025-12:02:56] [I] DLACore: 
[06/20/2025-12:02:56] [I] Plugins:
[06/20/2025-12:02:56] [I] setPluginsToSerialize:
[06/20/2025-12:02:56] [I] dynamicPlugins:
[06/20/2025-12:02:56] [I] ignoreParsedPluginLibs: 0
[06/20/2025-12:02:56] [I] 
[06/20/2025-12:02:56] [I] === Inference Options ===
[06/20/2025-12:02:56] [I] Batch: Explicit
[06/20/2025-12:02:56] [I] Input inference shapes: model
[06/20/2025-12:02:56] [I] Iterations: 10
[06/20/2025-12:02:56] [I] Duration: 3s (+ 200ms warm up)
[06/20/2025-12:02:56] [I] Sleep time: 0ms
[06/20/2025-12:02:56] [I] Idle time: 0ms
[06/20/2025-12:02:56] [I] Inference Streams: 1
[06/20/2025-12:02:56] [I] ExposeDMA: Disabled
[06/20/2025-12:02:56] [I] Data transfers: Enabled
[06/20/2025-12:02:56] [I] Spin-wait: Disabled
[06/20/2025-12:02:56] [I] Multithreading: Disabled
[06/20/2025-12:02:56] [I] CUDA Graph: Disabled
[06/20/2025-12:02:56] [I] Separate profiling: Disabled
[06/20/2025-12:02:56] [I] Time Deserialize: Disabled
[06/20/2025-12:02:56] [I] Time Refit: Disabled
[06/20/2025-12:02:56] [I] NVTX verbosity: 0
[06/20/2025-12:02:56] [I] Persistent Cache Ratio: 0
[06/20/2025-12:02:56] [I] Optimization Profile Index: 0
[06/20/2025-12:02:56] [I] Weight Streaming Budget: 100.000000%
[06/20/2025-12:02:56] [I] Inputs:
[06/20/2025-12:02:56] [I] Debug Tensor Save Destinations:
[06/20/2025-12:02:56] [I] === Reporting Options ===
[06/20/2025-12:02:56] [I] Verbose: Disabled
[06/20/2025-12:02:56] [I] Averages: 10 inferences
[06/20/2025-12:02:56] [I] Percentiles: 90,95,99
[06/20/2025-12:02:56] [I] Dump refittable layers:Disabled
[06/20/2025-12:02:56] [I] Dump output: Disabled
[06/20/2025-12:02:56] [I] Profile: Disabled
[06/20/2025-12:02:56] [I] Export timing to JSON file: 
[06/20/2025-12:02:56] [I] Export output to JSON file: 
[06/20/2025-12:02:56] [I] Export profile to JSON file: 
[06/20/2025-12:02:56] [I] 
[06/20/2025-12:02:56] [I] === Device Information ===
[06/20/2025-12:02:56] [I] Available Devices: 
[06/20/2025-12:02:56] [I]   Device 0: "Orin" UUID: GPU-109d6538-e4f4-58ad-af9f-2e602e32dc99
[06/20/2025-12:02:56] [I] Selected Device: Orin
[06/20/2025-12:02:56] [I] Selected Device ID: 0
[06/20/2025-12:02:56] [I] Selected Device UUID: GPU-109d6538-e4f4-58ad-af9f-2e602e32dc99
[06/20/2025-12:02:56] [I] Compute Capability: 8.7
[06/20/2025-12:02:56] [I] SMs: 16
[06/20/2025-12:02:56] [I] Device Global Memory: 62840 MiB
[06/20/2025-12:02:56] [I] Shared Memory per SM: 164 KiB
[06/20/2025-12:02:56] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/20/2025-12:02:56] [I] Application Compute Clock Rate: 1.3 GHz
[06/20/2025-12:02:56] [I] Application Memory Clock Rate: 1.3 GHz
[06/20/2025-12:02:56] [I] 
[06/20/2025-12:02:56] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[06/20/2025-12:02:56] [I] 
[06/20/2025-12:02:56] [I] TensorRT version: 10.3.0
[06/20/2025-12:02:56] [I] Loading standard plugins
[06/20/2025-12:02:56] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 31, GPU 3536 (MiB)
[06/20/2025-12:02:58] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +928, GPU +752, now: CPU 1002, GPU 4332 (MiB)
[06/20/2025-12:02:58] [I] Start parsing network model.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1721632923
[06/20/2025-12:02:59] [I] [TRT] ----------------------------------------------------------------
[06/20/2025-12:02:59] [I] [TRT] Input filename:   model_quantized_int8_512x512_iter4.onnx
[06/20/2025-12:02:59] [I] [TRT] ONNX IR version:  0.0.8
[06/20/2025-12:02:59] [I] [TRT] Opset version:    17
[06/20/2025-12:02:59] [I] [TRT] Producer name:    pytorch
[06/20/2025-12:02:59] [I] [TRT] Producer version: 2.7.0
[06/20/2025-12:02:59] [I] [TRT] Domain:           
[06/20/2025-12:02:59] [I] [TRT] Model version:    0
[06/20/2025-12:02:59] [I] [TRT] Doc string:       
[06/20/2025-12:02:59] [I] [TRT] ----------------------------------------------------------------
[06/20/2025-12:03:02] [I] Finished parsing network model. Parse time: 4.28002
[06/20/2025-12:03:02] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[06/20/2025-12:03:04] [W] [TRT] Calibrator won't be used in explicit quantization mode. Please insert Quantize/Dequantize layers to indicate which tensors to quantize/dequantize.
[06/20/2025-12:03:14] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/20/2025-12:03:14] [E] Error[2]: [optimizer.cpp::filterQDQFormats::5035] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. classifier.1.conv2.weight + /classifier/classifier.1/conv2/weight_quantizer/QuantizeLinear + /classifier/classifier.1/conv2/Conv + /classifier/classifier.1/Add + /classifier/classifier.1/relu_1/Relu[CONVOLUTION]: All of the candidates were removed, which points to the node being incorrectly marked as an int8 node.)
[06/20/2025-12:03:14] [E] Engine could not be created from network
[06/20/2025-12:03:14] [E] Building engine failed
[06/20/2025-12:03:14] [E] Failed to create engine from model or file.
[06/20/2025-12:03:14] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine

I built the model by following the PyTorch post-training quantization (PTQ) steps in the TensorRT Model Optimizer guide here: https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html#apply-post-training-quantization-ptq. I then exported the quantized model to ONNX.
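
For reference, the flow I followed is roughly the sketch below (the attached quantize.py in Steps To Reproduce is the authoritative script; build_foundation_stereo_model, make_calibration_loader, and the input shapes here are placeholders, not the exact code):

import torch
import modelopt.torch.quantization as mtq

# Placeholders: the real script constructs the FoundationStereo model and a
# dataloader of 512x512 stereo pairs from the calibration dataset.
model = build_foundation_stereo_model().cuda().eval()
calib_loader = make_calibration_loader()

def forward_loop(m):
    # Feed calibration batches so Model Optimizer can collect activation ranges.
    with torch.no_grad():
        for left, right in calib_loader:
            m(left.cuda(), right.cuda())

# Insert Q/DQ nodes and calibrate with the default INT8 config.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export the quantized model to ONNX (opset 17, matching the trtexec log above).
dummy = (torch.randn(1, 3, 512, 512, device="cuda"),
         torch.randn(1, 3, 512, 512, device="cuda"))
torch.onnx.export(model, dummy, "model_quantized_int8_512x512_iter4.onnx",
                  opset_version=17)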

Environment

TensorRT Version: 10.3

NVIDIA GPU: Jetson Orin

NVIDIA Driver Version: 540.4.0

CUDA Version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

CUDNN Version: 9.3.0

Relevant Files

Model link:
I'm using the FoundationStereo model: github.com/NVlabs/FoundationStereo/
https://drive.google.com/file/d/179O4HM9qAHxROi372BosA7PHusuzh_Wp/view?usp=sharing

Steps To Reproduce

I'm attaching the quantization script I used. You can run it directly from the root of the FoundationStereo repo linked above. Make sure to change the extension from .py.txt to .py.

quantize.py.txt

To run this script:

python quantize.py --mode single --quantization INT8 --save_model

The dataset I used for calibration was also the one provided by the paper authors, found here: https://drive.google.com/file/d/1dJwK5x8xsaCazz5xPGJ2OKFIWrd9rQT5/view

Use /usr/src/tensorrt/bin/trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine as the build command.

Have you tried the latest release?: I cannot; Jetson Orin only supports up to TensorRT 10.3.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, the model runs successfully in both PyTorch and Polygraphy (ONNX Runtime).
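
For completeness, this is roughly how I sanity-checked the ONNX file outside TensorRT, in addition to the polygraphy command above (a minimal ONNX Runtime run with random inputs; input shapes are read from the model itself, dynamic dimensions are just replaced with 1, and float32 inputs are assumed):

import numpy as np
import onnxruntime as ort

# Load the exported model on CPU and feed random data matching each declared
# input shape (dynamic dimensions replaced with 1; assumes float32 inputs).
sess = ort.InferenceSession("model_quantized_int8_512x512_iter4.onnx",
                            providers=["CPUExecutionProvider"])

feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.random.rand(*shape).astype(np.float32)

outputs = sess.run(None, feeds)
print([o.shape for o in outputs])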
