Description
I tried to build an INT8 engine with
trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine
on a Jetson Orin (which locks me to TensorRT 10.3 and CUDA 12.6), but the build fails with the error below:
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine
[06/20/2025-12:02:56] [I] === Model Options ===
[06/20/2025-12:02:56] [I] Format: ONNX
[06/20/2025-12:02:56] [I] Model: model_quantized_int8_512x512_iter4.onnx
[06/20/2025-12:02:56] [I] Output:
[06/20/2025-12:02:56] [I] === Build Options ===
[06/20/2025-12:02:56] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[06/20/2025-12:02:56] [I] avgTiming: 8
[06/20/2025-12:02:56] [I] Precision: FP32+INT8
[06/20/2025-12:02:56] [I] LayerPrecisions:
[06/20/2025-12:02:56] [I] Layer Device Types:
[06/20/2025-12:02:56] [I] Calibration: Dynamic
[06/20/2025-12:02:56] [I] Refit: Disabled
[06/20/2025-12:02:56] [I] Strip weights: Disabled
[06/20/2025-12:02:56] [I] Version Compatible: Disabled
[06/20/2025-12:02:56] [I] ONNX Plugin InstanceNorm: Disabled
[06/20/2025-12:02:56] [I] TensorRT runtime: full
[06/20/2025-12:02:56] [I] Lean DLL Path:
[06/20/2025-12:02:56] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[06/20/2025-12:02:56] [I] Exclude Lean Runtime: Disabled
[06/20/2025-12:02:56] [I] Sparsity: Disabled
[06/20/2025-12:02:56] [I] Safe mode: Disabled
[06/20/2025-12:02:56] [I] Build DLA standalone loadable: Disabled
[06/20/2025-12:02:56] [I] Allow GPU fallback for DLA: Disabled
[06/20/2025-12:02:56] [I] DirectIO mode: Disabled
[06/20/2025-12:02:56] [I] Restricted mode: Disabled
[06/20/2025-12:02:56] [I] Skip inference: Disabled
[06/20/2025-12:02:56] [I] Save engine: model_quantized_int8_512x512_iter4.engine
[06/20/2025-12:02:56] [I] Load engine:
[06/20/2025-12:02:56] [I] Profiling verbosity: 0
[06/20/2025-12:02:56] [I] Tactic sources: Using default tactic sources
[06/20/2025-12:02:56] [I] timingCacheMode: local
[06/20/2025-12:02:56] [I] timingCacheFile:
[06/20/2025-12:02:56] [I] Enable Compilation Cache: Enabled
[06/20/2025-12:02:56] [I] errorOnTimingCacheMiss: Disabled
[06/20/2025-12:02:56] [I] Preview Features: Use default preview flags.
[06/20/2025-12:02:56] [I] MaxAuxStreams: -1
[06/20/2025-12:02:56] [I] BuilderOptimizationLevel: -1
[06/20/2025-12:02:56] [I] Calibration Profile Index: 0
[06/20/2025-12:02:56] [I] Weight Streaming: Disabled
[06/20/2025-12:02:56] [I] Runtime Platform: Same As Build
[06/20/2025-12:02:56] [I] Debug Tensors:
[06/20/2025-12:02:56] [I] Input(s)s format: fp32:CHW
[06/20/2025-12:02:56] [I] Output(s)s format: fp32:CHW
[06/20/2025-12:02:56] [I] Input build shapes: model
[06/20/2025-12:02:56] [I] Input calibration shapes: model
[06/20/2025-12:02:56] [I] === System Options ===
[06/20/2025-12:02:56] [I] Device: 0
[06/20/2025-12:02:56] [I] DLACore:
[06/20/2025-12:02:56] [I] Plugins:
[06/20/2025-12:02:56] [I] setPluginsToSerialize:
[06/20/2025-12:02:56] [I] dynamicPlugins:
[06/20/2025-12:02:56] [I] ignoreParsedPluginLibs: 0
[06/20/2025-12:02:56] [I]
[06/20/2025-12:02:56] [I] === Inference Options ===
[06/20/2025-12:02:56] [I] Batch: Explicit
[06/20/2025-12:02:56] [I] Input inference shapes: model
[06/20/2025-12:02:56] [I] Iterations: 10
[06/20/2025-12:02:56] [I] Duration: 3s (+ 200ms warm up)
[06/20/2025-12:02:56] [I] Sleep time: 0ms
[06/20/2025-12:02:56] [I] Idle time: 0ms
[06/20/2025-12:02:56] [I] Inference Streams: 1
[06/20/2025-12:02:56] [I] ExposeDMA: Disabled
[06/20/2025-12:02:56] [I] Data transfers: Enabled
[06/20/2025-12:02:56] [I] Spin-wait: Disabled
[06/20/2025-12:02:56] [I] Multithreading: Disabled
[06/20/2025-12:02:56] [I] CUDA Graph: Disabled
[06/20/2025-12:02:56] [I] Separate profiling: Disabled
[06/20/2025-12:02:56] [I] Time Deserialize: Disabled
[06/20/2025-12:02:56] [I] Time Refit: Disabled
[06/20/2025-12:02:56] [I] NVTX verbosity: 0
[06/20/2025-12:02:56] [I] Persistent Cache Ratio: 0
[06/20/2025-12:02:56] [I] Optimization Profile Index: 0
[06/20/2025-12:02:56] [I] Weight Streaming Budget: 100.000000%
[06/20/2025-12:02:56] [I] Inputs:
[06/20/2025-12:02:56] [I] Debug Tensor Save Destinations:
[06/20/2025-12:02:56] [I] === Reporting Options ===
[06/20/2025-12:02:56] [I] Verbose: Disabled
[06/20/2025-12:02:56] [I] Averages: 10 inferences
[06/20/2025-12:02:56] [I] Percentiles: 90,95,99
[06/20/2025-12:02:56] [I] Dump refittable layers:Disabled
[06/20/2025-12:02:56] [I] Dump output: Disabled
[06/20/2025-12:02:56] [I] Profile: Disabled
[06/20/2025-12:02:56] [I] Export timing to JSON file:
[06/20/2025-12:02:56] [I] Export output to JSON file:
[06/20/2025-12:02:56] [I] Export profile to JSON file:
[06/20/2025-12:02:56] [I]
[06/20/2025-12:02:56] [I] === Device Information ===
[06/20/2025-12:02:56] [I] Available Devices:
[06/20/2025-12:02:56] [I] Device 0: "Orin" UUID: GPU-109d6538-e4f4-58ad-af9f-2e602e32dc99
[06/20/2025-12:02:56] [I] Selected Device: Orin
[06/20/2025-12:02:56] [I] Selected Device ID: 0
[06/20/2025-12:02:56] [I] Selected Device UUID: GPU-109d6538-e4f4-58ad-af9f-2e602e32dc99
[06/20/2025-12:02:56] [I] Compute Capability: 8.7
[06/20/2025-12:02:56] [I] SMs: 16
[06/20/2025-12:02:56] [I] Device Global Memory: 62840 MiB
[06/20/2025-12:02:56] [I] Shared Memory per SM: 164 KiB
[06/20/2025-12:02:56] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/20/2025-12:02:56] [I] Application Compute Clock Rate: 1.3 GHz
[06/20/2025-12:02:56] [I] Application Memory Clock Rate: 1.3 GHz
[06/20/2025-12:02:56] [I]
[06/20/2025-12:02:56] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[06/20/2025-12:02:56] [I]
[06/20/2025-12:02:56] [I] TensorRT version: 10.3.0
[06/20/2025-12:02:56] [I] Loading standard plugins
[06/20/2025-12:02:56] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 31, GPU 3536 (MiB)
[06/20/2025-12:02:58] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +928, GPU +752, now: CPU 1002, GPU 4332 (MiB)
[06/20/2025-12:02:58] [I] Start parsing network model.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1721632923
[06/20/2025-12:02:59] [I] [TRT] ----------------------------------------------------------------
[06/20/2025-12:02:59] [I] [TRT] Input filename: model_quantized_int8_512x512_iter4.onnx
[06/20/2025-12:02:59] [I] [TRT] ONNX IR version: 0.0.8
[06/20/2025-12:02:59] [I] [TRT] Opset version: 17
[06/20/2025-12:02:59] [I] [TRT] Producer name: pytorch
[06/20/2025-12:02:59] [I] [TRT] Producer version: 2.7.0
[06/20/2025-12:02:59] [I] [TRT] Domain:
[06/20/2025-12:02:59] [I] [TRT] Model version: 0
[06/20/2025-12:02:59] [I] [TRT] Doc string:
[06/20/2025-12:02:59] [I] [TRT] ----------------------------------------------------------------
[06/20/2025-12:03:02] [I] Finished parsing network model. Parse time: 4.28002
[06/20/2025-12:03:02] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[06/20/2025-12:03:04] [W] [TRT] Calibrator won't be used in explicit quantization mode. Please insert Quantize/Dequantize layers to indicate which tensors to quantize/dequantize.
[06/20/2025-12:03:14] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/20/2025-12:03:14] [E] Error[2]: [optimizer.cpp::filterQDQFormats::5035] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. classifier.1.conv2.weight + /classifier/classifier.1/conv2/weight_quantizer/QuantizeLinear + /classifier/classifier.1/conv2/Conv + /classifier/classifier.1/Add + /classifier/classifier.1/relu_1/Relu[CONVOLUTION]: All of the candidates were removed, which points to the node being incorrectly marked as an int8 node.)
[06/20/2025-12:03:14] [E] Engine could not be created from network
[06/20/2025-12:03:14] [E] Building engine failed
[06/20/2025-12:03:14] [E] Failed to create engine from model or file.
[06/20/2025-12:03:14] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine
I built the model following the PyTorch Quantization with PTQ steps here: https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html#apply-post-training-quantization-ptq, then exported the quantized model to ONNX (see the sketch below).
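For reference, here is a minimal sketch of that flow, assuming the TensorRT Model Optimizer (modelopt) PTQ API. The model and calibration data below are toy placeholders; the attached quantize.py is the actual script.

import torch
import modelopt.torch.quantization as mtq

# Placeholder model: stands in for loading the FoundationStereo checkpoint
# (the attached quantize.py does the real model loading)
model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda().eval()

# Placeholder calibration data; the real script feeds left/right stereo pairs
# from the authors' calibration dataset
calib_batches = [torch.randn(1, 3, 512, 512).cuda() for _ in range(8)]

def forward_loop(m):
    # Run calibration batches through the model so ModelOpt can collect INT8 ranges
    for x in calib_batches:
        m(x)

# Insert fake-quant (Q/DQ) modules and calibrate with the default INT8 config
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Export the calibrated model to ONNX; QuantizeLinear/DequantizeLinear nodes
# end up in the exported graph
torch.onnx.export(model, torch.randn(1, 3, 512, 512).cuda(),
                  "model_quantized_int8_512x512_iter4.onnx", opset_version=17)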
Environment
TensorRT Version: 10.3
NVIDIA GPU: Jetson Orin
NVIDIA Driver Version: 540.4.0
CUDA Version: 12.6
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
CUDNN Version: 9.3.0
Relevant Files
Model link: https://drive.google.com/file/d/179O4HM9qAHxROi372BosA7PHusuzh_Wp/view?usp=sharing
The model is FoundationStereo: github.com/NVlabs/FoundationStereo/
Steps To Reproduce
I'm attaching the quantization script I used. It can be run directly from the root of the FoundationStereo repo linked above; rename it from .py.txt to .py first.
Run it with:
python quantize.py --mode single --quantization INT8 --save_model
For calibration I used the dataset provided by the paper authors, found here: https://drive.google.com/file/d/1dJwK5x8xsaCazz5xPGJ2OKFIWrd9rQT5/view
Use /usr/src/tensorrt/bin/trtexec --int8 --onnx=model_quantized_int8_512x512_iter4.onnx --saveEngine=model_quantized_int8_512x512_iter4.engine as the build command.
Have you tried the latest release?: No, I can't; Jetson Orin (JetPack) only provides TensorRT 10.3.
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, both PyTorch and polygraphy (ONNX Runtime backend) run the model successfully; a minimal ONNX Runtime check is sketched below.
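For completeness, this is roughly the ONNX Runtime check (equivalent to the polygraphy command above); inputs are filled with random data, and dynamic dimensions are assumed to collapse to 1.

import numpy as np
import onnxruntime as ort

# Load the exported QDQ model and run one inference with random inputs
sess = ort.InferenceSession("model_quantized_int8_512x512_iter4.onnx",
                            providers=["CPUExecutionProvider"])
feeds = {}
for inp in sess.get_inputs():
    # Replace dynamic dimensions (strings/None) with 1; dtype assumed float32
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.random.rand(*shape).astype(np.float32)
outputs = sess.run(None, feeds)
print([o.shape for o in outputs])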