How to debug trt fp16 Nan output with polygraphy? #4502

@Neronjust2017

Description

I ran polygraphy run model.onnx --trt --fp16 --precision-constraints none --data-loader-script loader.py -v --validate --fail-fast to check the FP16 outputs of the TensorRT engine built from model.onnx (which is an FP32 model), and validation failed with the following errors:

[I] Output Validation | Runners: ['trt-runner-N0-06/30/25-19:45:22']
[I]     trt-runner-N0-06/30/25-19:45:22     | Validating output: pred_multipath (check_inf=True, check_nan=True)
[I]         mean=nan, std-dev=nan, var=nan, median=nan, min=nan at (0, 0, 0, 0), max=nan at (0, 0, 0, 0), avg-magnitude=nan, p90=nan, p95=nan, p99=nan
[V]             Could not generate histogram. Note: Error was: autodetected range of [nan, nan] is not finite
[V]             
[E]         NaN Detected | One or more NaNs were encountered in this output
[I]         Note: Use -vv or set logging verbosity to EXTRA_VERBOSE to display locations of NaNs
[E]         Inf Detected | One or more non-finite values were encountered in this output
[I]         Note: Use -vv or set logging verbosity to EXTRA_VERBOSE to display non-finite values
[E]         FAILED | Errors detected in output: pred_multipath
[I]     trt-runner-N0-06/30/25-19:45:22     | Validating output: path_prob (check_inf=True, check_nan=True)
[I]         mean=nan, std-dev=nan, var=nan, median=nan, min=nan at (0, 0), max=nan at (0, 0), avg-magnitude=nan, p90=nan, p95=nan, p99=nan
[V]             ---- Values ----
                    [[nan nan nan nan nan]]
[V]             Could not generate histogram. Note: Error was: autodetected range of [nan, nan] is not finite
[V]             
[E]         NaN Detected | One or more NaNs were encountered in this output
[E]         Inf Detected | One or more non-finite values were encountered in this output
[E]         FAILED | Errors detected in output: path_prob
[I]     trt-runner-N0-06/30/25-19:45:22     | Validating output: pred_target_agent_attribute (check_inf=True, check_nan=True)
[I]         mean=20.279, std-dev=30.123, var=907.39, median=3.0488, min=1.7881e-07 at (0, 3), max=80.562 at (0, 0), avg-magnitude=20.279, p90=63.444, p95=72.003, p99=78.851
[V]             ---- Values ----
                    [[8.0562500e+01 5.2031250e+01 7.6293945e-05 1.7881393e-07 3.0488281e+00
                      4.5117188e+00 1.8007812e+00]]
[V]             ---- Histogram ----
                Bin Range        |  Num Elems | Visualization
                (1.79e-07, 8.06) |          5 | ########################################
                (8.06    , 16.1) |          0 | 
                (16.1    , 24.2) |          0 | 
                (24.2    , 32.2) |          0 | 
                (32.2    , 40.3) |          0 | 
                (40.3    , 48.3) |          0 | 
                (48.3    , 56.4) |          1 | ########
                (56.4    , 64.4) |          0 | 
                (64.4    , 72.5) |          0 | 
                (72.5    , 80.6) |          1 | ########
[I]         PASSED | Output: pred_target_agent_attribute is valid
[I]     trt-runner-N0-06/30/25-19:45:22     | Validating output: pred_scores (check_inf=True, check_nan=True)
[I]         mean=0.092712, std-dev=0, var=0, median=0.092712, min=0.092712 at (0,), max=0.092712 at (0,), avg-magnitude=0.092712, p90=0.092712, p95=0.092712, p99=0.092712
[V]             ---- Values ----
                    [0.0927124]
[V]             ---- Histogram ----
                Bin Range            |  Num Elems | Visualization
                (-0.407  , -0.307  ) |          0 | 
                (-0.307  , -0.207  ) |          0 | 
                (-0.207  , -0.107  ) |          0 | 
                (-0.107  , -0.00729) |          0 | 
                (-0.00729, 0.0927  ) |          0 | 
                (0.0927  , 0.193   ) |          1 | ########################################
                (0.193   , 0.293   ) |          0 | 
                (0.293   , 0.393   ) |          0 | 
                (0.393   , 0.493   ) |          0 | 
                (0.493   , 0.593   ) |          0 | 
[I]         PASSED | Output: pred_scores is valid
[I]     trt-runner-N0-06/30/25-19:45:22     | Validating output: pred_ttc (check_inf=True, check_nan=True)
[I]         mean=4, std-dev=0, var=0, median=4, min=4 at (0,), max=4 at (0,), avg-magnitude=4, p90=4, p95=4, p99=4
[V]             ---- Values ----
                    [4.]
[V]             ---- Histogram ----
                Bin Range  |  Num Elems | Visualization
                (3.5, 3.6) |          0 | 
                (3.6, 3.7) |          0 | 
                (3.7, 3.8) |          0 | 
                (3.8, 3.9) |          0 | 
                (3.9, 4  ) |          0 | 
                (4  , 4.1) |          1 | ########################################
                (4.1, 4.2) |          0 | 
                (4.2, 4.3) |          0 | 
                (4.3, 4.4) |          0 | 
                (4.4, 4.5) |          0 | 
[I]         PASSED | Output: pred_ttc is valid
[E]     FAILED | Output Validation
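
To make a long validation log like the one above easier to scan, here is a minimal sketch that summarizes which outputs passed or failed. It relies only on the line patterns visible in this log ("Validating output:", "NaN Detected", "Inf Detected", "PASSED | Output:", "FAILED | Errors detected in output:"); the function name `summarize_validation` is hypothetical and is not part of Polygraphy.

```python
import re

# Matches lines such as:
# "[I]     trt-runner-... | Validating output: path_prob (check_inf=True, check_nan=True)"
VALIDATING = re.compile(r"Validating output: (\S+)")


def summarize_validation(log_text):
    """Return {output_name: {"status": str, "nan": bool, "inf": bool}}.

    Hypothetical helper: it only understands the log format shown above.
    """
    summary = {}
    current = None  # output whose validation block we are currently inside
    for line in log_text.splitlines():
        match = VALIDATING.search(line)
        if match:
            current = match.group(1)
            summary[current] = {"status": "UNKNOWN", "nan": False, "inf": False}
            continue
        if current is None:
            continue  # ignore anything before the first validation block
        if "NaN Detected" in line:
            summary[current]["nan"] = True
        elif "Inf Detected" in line:
            summary[current]["inf"] = True
        elif "FAILED | Errors detected in output:" in line:
            summary[current]["status"] = "FAILED"
        elif "PASSED | Output:" in line:
            summary[current]["status"] = "PASSED"
    return summary


if __name__ == "__main__":
    # A few lines copied from the log above, shortened for the demo.
    sample = "\n".join([
        "[I]     trt-runner | Validating output: path_prob (check_inf=True, check_nan=True)",
        "[E]         NaN Detected | One or more NaNs were encountered in this output",
        "[E]         FAILED | Errors detected in output: path_prob",
        "[I]     trt-runner | Validating output: pred_ttc (check_inf=True, check_nan=True)",
        "[I]         PASSED | Output: pred_ttc is valid",
    ])
    for name, info in summarize_validation(sample).items():
        print(name, info)
```

Applied to the full log, this would report pred_multipath and path_prob as failed with both NaN and Inf flags set, and the other three outputs as passed.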

I then ran polygraphy run model.onnx --trt --fp16 --precision-constraints none --data-loader-script loader.py -v --validate --fail-fast --trt-outputs mark all --save-outputs outputs.json to get layerwise outputs, but the engine build failed with the following error:

[E] 2: [myelinBuilderUtils.cpp::operator()::751] Error Code 2: Internal Error ([ShapeHostToDeviceCopy 0] requires bool or uint8 I/O but node can not be handled by Myelin. Operation is not supported.)
[!] Invalid Engine. Please ensure the engine was built correctly

As I understand it, --trt-outputs mark all can prevent TensorRT's layer fusion, which may cause performance degradation or errors (such as this Myelin optimizer error). Is there a way to get layerwise outputs from the TensorRT model? How can I debug this NaN issue?
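
One thing that might be worth trying is marking only a few intermediate tensors per run instead of using mark all. The sketch below generates one polygraphy command per batch of tensor names, reusing the flags from the command above and passing explicit names to --trt-outputs. The tensor names, the batch size, the `build_commands` helper, and the outputs_N.json file names are all hypothetical; whether a given batch avoids the Myelin error is an assumption that would need to be checked on the actual model.

```python
import shlex

# Flags copied from the command in this issue; only the --trt-outputs
# and --save-outputs arguments change between batches.
BASE_COMMAND = [
    "polygraphy", "run", "model.onnx", "--trt", "--fp16",
    "--precision-constraints", "none",
    "--data-loader-script", "loader.py",
    "-v", "--validate", "--fail-fast",
]


def batches(names, size):
    """Split a list of tensor names into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [names[i:i + size] for i in range(0, len(names), size)]


def build_commands(tensor_names, batch_size):
    """Return one shell command string per batch of tensors to mark as outputs."""
    commands = []
    for index, batch in enumerate(batches(tensor_names, batch_size)):
        argv = BASE_COMMAND + [
            "--trt-outputs", *batch,
            "--save-outputs", f"outputs_{index}.json",
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Hypothetical intermediate tensor names, listed in graph order.
    names = ["encoder_out", "attn_scores", "attn_softmax", "decoder_out"]
    for command in build_commands(names, batch_size=2):
        print(command)
```

Running the batches in graph order and validating each one should narrow down the earliest tensor that contains NaN. A batch that still fails to build would at least narrow down which marked tensor triggers the build error.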
