
Conversation

i-riyad
Contributor

@i-riyad i-riyad commented Sep 4, 2025

What does this PR do?

Type of change: new example

Overview: TensorRT has deprecated weak typing in favor of explicitly typed ONNX models. ModelOpt's evaluation and deployment utilities should reflect that.

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: No, we have changed the parameter name in the example script
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: Yes

Additional Information

Summary by CodeRabbit

  • Documentation
    • Updated ONNX PTQ docs and examples to use --engine_precision=stronglyTyped; clarified that high_precision_dtype now defaults to fp16 and deprecated the old flag.
  • Refactor
    • TensorRT engine-build behavior treats stronglyTyped as low-bit mode.
  • Chores
    • Replaced quantize_mode with engine_precision CLI and removed legacy int8_iq precision.
  • Bug Fixes
    • Quantization checks now tolerate an intermediate Cast after DequantizeLinear.
  • Tests
    • Example and unit tests updated to match new precision defaults and evaluation paths.


copy-pr-bot bot commented Sep 4, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Sep 4, 2025

Walkthrough

Replaces CLI --quantize_mode with --engine_precision in ONNX PTQ examples; defaults ONNX high-precision dtype to fp16; removes INT8_IQ precision and its flags from TRT runtime; updates QDQ traversal to skip an intervening Cast; adjusts tests and docs to match these changes.
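A minimal sketch of the resulting evaluate.py CLI surface, for orientation (argument names, choices, and the deployment keys come from this summary; the surrounding parser code, --onnx_path handling, and help text are illustrative rather than the exact file contents):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--onnx_path", type=str, required=True, help="ONNX model to evaluate.")
parser.add_argument(
    "--engine_precision",
    type=str,
    default="stronglyTyped",  # strongly typed engines are the new default; weak typing is deprecated in TensorRT
    choices=["best", "fp16", "stronglyTyped"],
    help="TensorRT precision mode used when building the evaluation engine.",
)
args = parser.parse_args()

# The chosen precision flows straight into the deployment config handed to the TRT runtime.
deployment = {
    "runtime": "TRT",
    "precision": args.engine_precision,
}
```

The example test script then calls it as `python evaluate.py --onnx_path=... --engine_precision=$precision`, as shown in the test changes discussed below.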

Changes

  • Docs: ONNX PTQ README (examples/onnx_ptq/README.md)
    Example commands switched from --quantize_mode to --engine_precision=stronglyTyped.
  • Examples: Evaluate & Evaluation (examples/onnx_ptq/evaluate.py, examples/onnx_ptq/evaluation.py)
    evaluate.py: replaced --quantize_mode with --engine_precision (choices ["best", "fp16", "stronglyTyped"]) and uses args.engine_precision for the deployment precision. evaluation.py: deployment precision set to "stronglyTyped".
  • Tests: example runner (tests/examples/test_onnx_ptq.sh)
    Test script maps quant modes to evaluation ONNX paths and to --engine_precision=$precision; int8_iq is evaluated against the FP16 ONNX model with precision "best".
  • ONNX quantization CLI & API (modelopt/onnx/quantization/__main__.py, modelopt/onnx/quantization/quantize.py, modelopt/onnx/quantization/int8.py)
    CLI default --high_precision_dtype set to "fp16"; quantize() signatures now default high_precision_dtype: str = "fp16" (removed None); docstrings and call sites aligned to the concrete default.
  • QDQ utilities (modelopt/onnx/quantization/qdq_utils.py)
    Successor-consumer traversal skips a Cast following DequantizeLinear; replaced graph.node.clear() with del graph.node[:].
  • TensorRT runtime/config (modelopt/torch/_deploy/_runtime/tensorrt/constants.py, modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py, modelopt/torch/_deploy/_runtime/trt_client.py, modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py)
    Removed the INT8_IQ constant and its flag; removed validate_precision; added STRONGLY_TYPED to low-bit detection in the engine builder; removed int8_iq from the client precision table.
  • Tests: ONNX quantization assertions (tests/_test_utils/onnx_quantization/utils.py, tests/unit/onnx/test_qdq_rules_int8.py)
    Assertions updated to tolerate an intermediate Cast before DequantizeLinear and to only assert DQ presence for Variables that have producers.
  • Changelog (CHANGELOG.rst)
    Documents deprecation of quantize_mode and defaulting high_precision_dtype to fp16.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User / CI
  participant Eval as examples/onnx_ptq/evaluate.py
  participant Deploy as Deployment config
  participant EB as TRT EngineBuilder
  participant TRT as TensorRT

  U->>Eval: run evaluate (--engine_precision)
  Eval->>Deploy: build deployment (precision = args.engine_precision)
  Deploy->>EB: build_engine(trt_mode = precision)
  alt trt_mode == STRONGLY_TYPED
    Note right of EB #b3e5fc: treated as low-bit → opt_level = 4
    EB->>TRT: create engine(opt_level=4)
  else other modes
    EB->>TRT: create engine(opt_level=builder_optimization_level)
  end
  TRT-->>Eval: engine ready / run inference

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 53.33%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The title “Making stronglyTyped default for modelopt evaluation” succinctly captures the primary change of the pull request, switching the default precision to stronglyTyped in the ModelOpt evaluation flow, without extraneous detail or unrelated content. It is specific enough for reviewers to understand the main intent at a glance.
  • Description Check (✅ Passed): The description clearly explains that TensorRT's deprecation of weak typing motivates updating ModelOpt's evaluation and deployment utilities, notes that the change is not backward compatible, and references added tests and documentation updates, which aligns with the raw summary of changes. Although some sections are placeholders, the description remains on-topic and related to the changeset.

Poem

A nibble of bits, a hop through graphs,
I skip a Cast and follow the paths.
StronglyTyped now leads the race,
INT8_IQ steps out of place.
With fp16 whiskers, engines hum—🐇⚙️



@i-riyad i-riyad force-pushed the rislam/strongly-typed-default branch from 12911e6 to 97cd184 Compare September 4, 2025 22:55

codecov bot commented Sep 4, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.87%. Comparing base (d6d2e75) to head (5d3adfc).
⚠️ Report is 3 commits behind head on main.

Files with missing lines:
  • modelopt/onnx/quantization/qdq_utils.py: patch 40.00%, 3 lines missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #287   +/-   ##
=======================================
  Coverage   73.86%   73.87%           
=======================================
  Files         172      172           
  Lines       17415    17416    +1     
=======================================
+ Hits        12864    12866    +2     
+ Misses       4551     4550    -1     

☔ View full report in Codecov by Sentry.
@i-riyad i-riyad force-pushed the rislam/strongly-typed-default branch from 97cd184 to 22dde2f Compare September 8, 2025 23:06
@i-riyad i-riyad marked this pull request as ready for review September 9, 2025 02:59
@i-riyad i-riyad requested review from a team as code owners September 9, 2025 02:59
@i-riyad i-riyad requested review from gcunhase and cjluo-nv September 9, 2025 02:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/onnx_ptq/README.md (1)

119-125: Remove unsupported flag from docs
The evaluate.py CLI no longer accepts --quantize_mode (or any precision flag); drop --quantize_mode=stronglyTyped from the ONNX PTQ snippet (lines 119–125) and the later LLM evaluation example in examples/onnx_ptq/README.md.

modelopt/onnx/quantization/int8.py (1)

156-172: Fix None deref: nodes_to_exclude may be None before .extend().

nodes_to_exclude.extend(...) will crash when the param is not provided. Normalize optionals up front.

     logger.info("Detecting GEMV patterns for TRT optimization")
     matmul_nodes_to_exclude = find_nodes_from_matmul_to_exclude(
         onnx_path,
         use_external_data_format,
         intermediate_generated_files,
         calibration_data_reader,
         calibration_eps,
         calibration_shapes,
     )
-    nodes_to_exclude.extend(matmul_nodes_to_exclude)  # type: ignore[union-attr]
+    # Normalize optionals before use
+    nodes_to_exclude = list(nodes_to_exclude or [])
+    intermediate_generated_files = list(intermediate_generated_files or [])
+    custom_ops_to_quantize = list(custom_ops_to_quantize or [])
+    nodes_to_exclude.extend(matmul_nodes_to_exclude)
🧹 Nitpick comments (8)
examples/onnx_ptq/evaluation.py (1)

29-34: Defaulting to stronglyTyped is good; consider a lightweight override.

Keeping stronglyTyped as default aligns with TRT guidance. As a convenience, allow an env override (no CLI churn) so users can quickly A/B test.

 import torch
 import torchvision.transforms as transforms
 from torchvision.datasets import ImageNet
 from tqdm import tqdm
+import os

 ...
 deployment = {
     "runtime": "TRT",
     "accelerator": "GPU",
-    "precision": "stronglyTyped",
+    "precision": os.getenv("TRT_PRECISION", "stronglyTyped"),
     "onnx_opset": "21",
 }

Please confirm the minimal TRT version that supports strongly typed networks here and mention it in the README “Evaluate” section.

modelopt/onnx/quantization/int8.py (2)

127-135: Validate or warn on unsupported high_precision_dtype values.

You only handle {"fp16","bf16"}; other strings silently no-op. Emit a warning for unexpected values to aid debugging.

-    if high_precision_dtype in ["fp16", "bf16"]:
+    if high_precision_dtype in ["fp16", "bf16"]:
         ...
+    else:
+        logger.warning(
+            "Unknown high_precision_dtype '%s'; skipping float downcast. Expected one of {'fp16','bf16'}.",
+            high_precision_dtype,
+        )

Also applies to: 275-285


113-135: Avoid mutable defaults for list parameters (future-proofing).

intermediate_generated_files, custom_ops_to_quantize have mutable defaults. You mitigated at runtime above; consider following up to switch to None defaults in a future API clean-up.
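A sketch of that follow-up using a hypothetical, trimmed-down signature (the real function in int8.py takes many more parameters; only the two list arguments called out above are shown):

```python
def quantize(
    onnx_path: str,
    intermediate_generated_files: list[str] | None = None,  # None instead of a mutable default
    custom_ops_to_quantize: list[str] | None = None,  # None instead of a mutable default
) -> None:
    # A mutable default is created once at function-definition time and shared across
    # calls; normalizing to fresh lists up front avoids that pitfall entirely.
    intermediate_generated_files = list(intermediate_generated_files or [])
    custom_ops_to_quantize = list(custom_ops_to_quantize or [])
    ...
```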

modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (1)

103-111: Confirm adding STRONGLY_TYPED to “low-bit” bucket (forces opt level 4).

Treating STRONGLY_TYPED as low-bit changes builderOptimizationLevel to 4 and overrides user-provided levels. Is that intentional for all stronglyTyped builds, including fp16/fp32-typed graphs? If yes, consider renaming _is_low_bit_mode to reflect the broader intent, or add an inline comment to avoid confusion. Otherwise, gate the opt-level bump more narrowly.

Suggested inline doc tweak:

 def _is_low_bit_mode(trt_mode: str) -> bool:
-    return trt_mode in [
+    # Modes that benefit from max builderOptimizationLevel (4), not just "low-bit".
+    return trt_mode in [
         TRTMode.INT8,
         TRTMode.INT4,
         TRTMode.FLOAT8,
         TRTMode.BEST,
         TRTMode.STRONGLY_TYPED,
     ]
tests/examples/test_onnx_ptq.sh (2)

160-177: Fix array-membership test (ShellCheck SC2199/SC2076).

[[ " ${latency_models[@]} " =~ " $model_name " ]] is brittle. Use a loop or regex without quotes.

Apply:

-        if [[ " ${latency_models[@]} " =~ " $model_name " ]]; then
+        in_latency_set=false
+        for m in "${latency_models[@]}"; do
+            if [[ "$m" == "$model_name" ]]; then in_latency_set=true; break; fi
+        done
+        if $in_latency_set; then

48-56: Consider removing int8_iq from modes to avoid confusion.

You skip quantization for int8_iq and evaluate FP16 with precision="int8". If IQ is deprecated, drop it from quant_modes and the evaluation mapping to simplify.

Proposed minimal:

-if [ $cuda_capability -ge 89 ]; then
-    quant_modes=("fp8" "int8" "int8_iq")
+if [ $cuda_capability -ge 89 ]; then
+    quant_modes=("fp8" "int8")
 else
     echo "CUDA capability is less than 89, skipping fp8 mode!"
-    quant_modes=("int8" "int8_iq")
+    quant_modes=("int8")
 fi
-all_modes=("${base_modes[@]}" "${quant_modes[@]}")
+all_modes=("${base_modes[@]}" "${quant_modes[@]}")
modelopt/onnx/quantization/quantize.py (2)

222-222: API tightening may break external callers; offer soft-landing.

Changing high_precision_dtype from optional to required can break programmatic users. Consider accepting Optional at the type level and defaulting at runtime for back-compat.

-def quantize(
+def quantize(
@@
-    high_precision_dtype: str = "fp16",
+    high_precision_dtype: str | None = "fp16",
@@
 ) -> None:

And near the top of the body:

-    configure_logging(log_level.upper(), log_file)
+    configure_logging(log_level.upper(), log_file)
+    if high_precision_dtype is None:
+        high_precision_dtype = "fp16"

289-296: Docstring reads well; minor clarity tweak optional.

Suggest noting explicitly that no conversion occurs if the input is already fp16/bf16.

-            and the input model is of dtype fp32, model's weight and activation will be converted to
-            'fp16' or 'bf16'.
+            and the input model is of dtype fp32, weights and activations are converted accordingly.
+            If the input is already fp16/bf16, no conversion is applied.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf6f1d4 and 22dde2f.

📒 Files selected for processing (13)
  • examples/onnx_ptq/README.md (1 hunks)
  • examples/onnx_ptq/evaluate.py (1 hunks)
  • examples/onnx_ptq/evaluation.py (1 hunks)
  • modelopt/onnx/quantization/__main__.py (2 hunks)
  • modelopt/onnx/quantization/int8.py (1 hunks)
  • modelopt/onnx/quantization/qdq_utils.py (2 hunks)
  • modelopt/onnx/quantization/quantize.py (3 hunks)
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py (0 hunks)
  • modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (1 hunks)
  • modelopt/torch/_deploy/_runtime/trt_client.py (0 hunks)
  • tests/_test_utils/onnx_quantization/utils.py (1 hunks)
  • tests/examples/test_onnx_ptq.sh (2 hunks)
  • tests/unit/onnx/test_qdq_rules_int8.py (1 hunks)
💤 Files with no reviewable changes (2)
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py
  • modelopt/torch/_deploy/_runtime/trt_client.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (1)
modelopt/torch/_deploy/_runtime/tensorrt/constants.py (1)
  • TRTMode (83-91)
🪛 Shellcheck (0.10.0)
tests/examples/test_onnx_ptq.sh

[error] 176-176: Arrays implicitly concatenate in [[ ]]. Use a loop (or explicit * instead of @).

(SC2199)


[warning] 176-176: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)

🔇 Additional comments (5)
tests/unit/onnx/test_qdq_rules_int8.py (1)

44-47: LGTM: safer guard before checking producer op.

Requiring inp.inputs avoids probing graph inputs/constants. Good stability improvement.
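For reference, the guarded pattern amounts to roughly the following (a sketch using the onnx-graphsurgeon API seen elsewhere in these tests; the helper name and assertion message are assumptions):

```python
import onnx_graphsurgeon as gs

def assert_node_inputs_quantized(node: gs.Node) -> None:
    """Only assert a DequantizeLinear producer for inputs that actually have one."""
    for inp in node.inputs:
        # Graph inputs and initializers have no producer node (inp.inputs is empty),
        # so there is nothing to probe; skip them instead of failing.
        if not inp.inputs:
            continue
        producer = inp.inputs[0]
        assert producer.op == "DequantizeLinear", (
            f"Input '{inp.name}' of node '{node.name}' is not quantized but should be!"
        )
```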

modelopt/onnx/quantization/qdq_utils.py (1)

998-1000: In-place node list clear is fine.

del graph.node[:] preserves references while rebuilding the node list; the change looks good.

modelopt/onnx/quantization/__main__.py (2)

183-189: Good default: --high_precision_dtype now fp16 with clear help text.


285-286: Passing dtype through unconditionally looks right.

Directly forwarding args.high_precision_dtype aligns with the stricter API and removes hidden fallbacks.
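Taken together, the two hunks behave roughly like this self-contained sketch (the quantize() stub, help text, and extra arguments are placeholders, not the real module):

```python
import argparse

def quantize(onnx_path: str, high_precision_dtype: str = "fp16") -> None:
    """Stand-in for modelopt.onnx.quantization.quantize(); the real signature is larger."""
    print(f"Quantizing {onnx_path} with high_precision_dtype={high_precision_dtype}")

parser = argparse.ArgumentParser()
parser.add_argument("--onnx_path", type=str, required=True)
parser.add_argument(
    "--high_precision_dtype",
    type=str,
    default="fp16",  # concrete default; no None placeholder to resolve later
    help="dtype kept for the non-quantized weights/activations in the output model.",
)
args = parser.parse_args()

# Forward the CLI value unconditionally: no hidden fallback between argparse and quantize().
quantize(onnx_path=args.onnx_path, high_precision_dtype=args.high_precision_dtype)
```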

modelopt/onnx/quantization/quantize.py (1)

465-466: Good: dtype forwarded into int8/fp8 paths (removes hidden fallback).

Comment on lines 57 to 67
# Note. stronglyTyped is recommended, all other modes have been deprecated in TensorRT
deployment = {
"runtime": "TRT",
"precision": args.quantize_mode,
"precision": "stronglyTyped",
}

@coderabbitai coderabbitai bot Sep 9, 2025


⚠️ Potential issue

evaluate.py drops --quantize_mode but tests still pass it → argparse error.

Current tests invoke evaluate.py with --quantize_mode; without accepting it here, argparse will fail. Easiest fix: accept and ignore it (deprecated), preserving back-compat.

Apply:

 def main():
     parser = argparse.ArgumentParser()
@@
     args = parser.parse_args()
 
-    # Note. stronglyTyped is recommended, all other modes have been deprecated in TensorRT
+    # Back-compat: accept deprecated --quantize_mode but ignore its value.
+    # StronglyTyped is recommended; other modes are deprecated in TensorRT.
+    if hasattr(args, "quantize_mode"):
+        pass
+
+    # Note. stronglyTyped is recommended, all other modes have been deprecated in TensorRT
     deployment = {
         "runtime": "TRT",
         "precision": "stronglyTyped",
     }

If you prefer to remove the flag from callers instead, I can update the test script calls too (see test comment).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In examples/onnx_ptq/evaluate.py around lines 57 to 61, the script no longer
accepts the deprecated --quantize_mode argument so tests that pass it cause an
argparse error; update the argument parser to add a --quantize_mode option
(accepting the same expected values or a string) and simply ignore its value at
runtime to preserve backward compatibility, and mark it in the help text as
deprecated; no further behavior changes are needed.


This is a terrible idea. Need to fix the tests instead.


🧩 Analysis chain

@galagam You're absolutely right! Maintaining backward compatibility for a deprecated argument goes against the clean API approach. Let me check the test files and provide fixes for them instead.


🏁 Script executed:

#!/bin/bash
# Find all files that call evaluate.py with --quantize_mode
echo "=== Searching for files that call evaluate.py with --quantize_mode ==="
rg -n --type=sh --type=py -C3 "evaluate\.py.*--quantize_mode"

echo -e "\n=== Looking at test files that might need updating ==="
fd -e sh -e py test | xargs rg -l "evaluate\.py"

Length of output: 158


🏁 Script executed:

#!/bin/bash
# Examine the test file that calls evaluate.py
echo "=== Content of tests/examples/test_onnx_ptq.sh ==="
cat tests/examples/test_onnx_ptq.sh

echo -e "\n=== Searching for any --quantize_mode usage in test files ==="
rg -n --type=sh --type=py "quantize_mode" tests/

Length of output: 11217


Perfect! I found the issue. The test script is still passing --quantize_mode to evaluate.py on lines 180 and 189, but evaluate.py no longer accepts this argument. Here's the fix:

The fix is to simply remove --quantize_mode=$precision from both evaluation calls in the test script, since evaluate.py now hardcodes "stronglyTyped" precision.

Fix for tests/examples/test_onnx_ptq.sh:

        if [[ " ${latency_models[@]} " =~ " $model_name " ]]; then
            CUDA_VISIBLE_DEVICES=$gpu_id python evaluate.py \
                --onnx_path=$eval_model_path \
                --model_name="${timm_model_name[$model_name]}" \
-               --quantize_mode=$precision \
                --results_path=$model_dir/$quant_mode/${model_name}_${quant_mode}.csv &
        else
            CUDA_VISIBLE_DEVICES=$gpu_id python evaluate.py \
                --onnx_path=$eval_model_path \
                --imagenet_path=$imagenet_path \
                --eval_data_size=$calib_size \
                --batch_size $batch_size \
                --model_name="${timm_model_name[$model_name]}" \
-               --quantize_mode=$precision \
                --results_path=$model_dir/$quant_mode/${model_name}_${quant_mode}.csv &

This removes the deprecated argument from both evaluation paths and aligns the tests with the new API where evaluate.py always uses "stronglyTyped" precision.


🏁 Script executed:

#!/bin/bash
# Show the specific lines that need to be fixed
echo "=== Lines that need to be fixed in tests/examples/test_onnx_ptq.sh ==="
sed -n '175,195p' tests/examples/test_onnx_ptq.sh

Length of output: 1097


Exactly! Here's the precise fix needed for tests/examples/test_onnx_ptq.sh:

Remove --quantize_mode=$precision from both evaluate.py calls:

Line ~180 (latency models branch):

            CUDA_VISIBLE_DEVICES=$gpu_id python evaluate.py \
                --onnx_path=$eval_model_path \
                --model_name="${timm_model_name[$model_name]}" \
-               --quantize_mode=$precision \
                --results_path=$model_dir/$quant_mode/${model_name}_${quant_mode}.csv &

Line ~189 (regular evaluation branch):

            CUDA_VISIBLE_DEVICES=$gpu_id python evaluate.py \
                --onnx_path=$eval_model_path \
                --imagenet_path=$imagenet_path \
                --eval_data_size=$calib_size \
                --batch_size $batch_size \
                --model_name="${timm_model_name[$model_name]}" \
-               --quantize_mode=$precision \
                --results_path=$model_dir/$quant_mode/${model_name}_${quant_mode}.csv &

Optional cleanup: Since $precision is no longer used, you could also remove the precision assignment logic (lines ~166-174), but that's not strictly necessary for fixing the immediate issue.

This approach is much cleaner than maintaining deprecated arguments - thanks for steering toward the right solution! 🎯


MHA accumulation dtype. One of ['fp32', 'fp16']. 'fp16' by default.
If quantize_mode == 'fp8' and mha_accumulation_dtype == 'fp32', Cast nodes will be added to
MHA's bmm1 and bmm2's input and output tensors.
MHA accumulation dtype. One of ['fp32', 'fp16']. If quantize_mode == 'fp8' and


Suggested change
MHA accumulation dtype. One of ['fp32', 'fp16']. If quantize_mode == 'fp8' and
MHA accumulation dtype. One of ['fp32', 'fp16']. 'fp16' by default. If quantize_mode == 'fp8' and

@i-riyad i-riyad force-pushed the rislam/strongly-typed-default branch from 22dde2f to f2c5ff9 Compare September 9, 2025 08:01
@i-riyad i-riyad force-pushed the rislam/strongly-typed-default branch from f2c5ff9 to 9123d26 Compare September 9, 2025 08:04

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/_test_utils/onnx_quantization/utils.py (1)

23-29: Fix Cast hop indexing and add None-guards to avoid AttributeError.

After stepping into a Cast, its input index is 0. Reusing inp_idx is wrong and there’s no guard for missing producers. This can raise on models where inputs aren’t quantized or Cast is terminal.

-                qnode = node
-                # After quantization, the quantized node can be casted
-                if qnode.i(inp_idx).op == "Cast":
-                    qnode = qnode.i(inp_idx)
-                assert qnode.i(inp_idx).op == "DequantizeLinear", (
-                    f"Input '{inp.name}' of node '{qnode.name}' is not quantized but should be!"
-                )
+                producer = node.i(inp_idx)
+                # Quantized path may include a Cast right after DQ
+                if producer and producer.op == "Cast":
+                    producer = producer.i(0)
+                assert producer and producer.op == "DequantizeLinear", (
+                    f"Input '{inp.name}' of node '{node.name}' is not quantized but should be!"
+                )
modelopt/onnx/quantization/qdq_utils.py (1)

529-536: Guard against None when skipping a Cast consumer.

quantized_node.op_type is accessed before checking for None, and the second hop after Cast isn’t validated. This can crash on graphs where DQ feeds only a Cast or Cast has no consumer.

-    quantized_node = tensor_consumers.get(dq_node.output[0], [None])[0]
-    if quantized_node.op_type == "Cast":
-        quantized_node = tensor_consumers.get(quantized_node.output[0], [None])[0]
-
-    if not quantized_node:
-        raise ValueError(f"No consumer found for {dq_node.name}")
+    quantized_node = tensor_consumers.get(dq_node.output[0], [None])[0]
+    if not quantized_node:
+        raise ValueError(f"No consumer found for {dq_node.name}")
+    if quantized_node.op_type == "Cast":
+        next_node = tensor_consumers.get(quantized_node.output[0], [None])[0]
+        if not next_node:
+            raise ValueError(f"No consumer found after Cast for {quantized_node.name}")
+        quantized_node = next_node
🧹 Nitpick comments (3)
modelopt/onnx/quantization/qdq_utils.py (1)

998-1000: Prefer explicit protobuf clearing for readability.

Minor: graph.ClearField("node") is clearer than slicing deletion on a protobuf repeated field.

-    del graph.node[:]
-    graph.node.extend(new_nodes)
+    graph.ClearField("node")
+    graph.node.extend(new_nodes)
tests/examples/test_onnx_ptq.sh (2)

176-181: Fix array membership test (ShellCheck SC2199/SC2076).

The regex-like test on arrays is brittle. Use a loop flag for exact membership.

-        if [[ " ${latency_models[@]} " =~ " $model_name " ]]; then
+        should_eval_latency=false
+        for m in "${latency_models[@]}"; do
+            if [[ "$m" == "$model_name" ]]; then
+                should_eval_latency=true
+                break
+            fi
+        done
+        if $should_eval_latency; then
             CUDA_VISIBLE_DEVICES=$gpu_id python evaluate.py \
                 --onnx_path=$eval_model_path \
                 --model_name="${timm_model_name[$model_name]}" \
-                --engine_precision=$precision \
+                --engine_precision=$precision \
                 --results_path=$model_dir/$quant_mode/${model_name}_${quant_mode}.csv &

189-190: Mirror the membership fix in the non-latency branch invocation block if you refactor the conditional.

No functional change here; just ensure consistency after the conditional refactor above.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22dde2f and 9123d26.

📒 Files selected for processing (13)
  • examples/onnx_ptq/README.md (2 hunks)
  • examples/onnx_ptq/evaluate.py (1 hunks)
  • examples/onnx_ptq/evaluation.py (1 hunks)
  • modelopt/onnx/quantization/__main__.py (2 hunks)
  • modelopt/onnx/quantization/int8.py (1 hunks)
  • modelopt/onnx/quantization/qdq_utils.py (2 hunks)
  • modelopt/onnx/quantization/quantize.py (3 hunks)
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py (0 hunks)
  • modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (1 hunks)
  • modelopt/torch/_deploy/_runtime/trt_client.py (0 hunks)
  • tests/_test_utils/onnx_quantization/utils.py (1 hunks)
  • tests/examples/test_onnx_ptq.sh (2 hunks)
  • tests/unit/onnx/test_qdq_rules_int8.py (1 hunks)
💤 Files with no reviewable changes (2)
  • modelopt/torch/_deploy/_runtime/trt_client.py
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py
  • examples/onnx_ptq/evaluation.py
  • examples/onnx_ptq/README.md
  • tests/unit/onnx/test_qdq_rules_int8.py
  • modelopt/onnx/quantization/quantize.py
  • modelopt/onnx/quantization/int8.py
  • modelopt/onnx/quantization/__main__.py
🧰 Additional context used
🪛 Shellcheck (0.10.0)
tests/examples/test_onnx_ptq.sh

[error] 176-176: Arrays implicitly concatenate in [[ ]]. Use a loop (or explicit * instead of @).

(SC2199)


[warning] 176-176: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (2)
examples/onnx_ptq/evaluate.py (1)

52-58: CLI switch to --engine_precision looks good.

Choices and default align with the PR goal of stronglyTyped by default.

tests/examples/test_onnx_ptq.sh (1)

164-173: Approve mode-to-precision mapping. Runtime honors “best” for int8_iq and maps other modes to “stronglyTyped”.

@i-riyad i-riyad requested a review from galagam September 9, 2025 08:11
@i-riyad i-riyad force-pushed the rislam/strongly-typed-default branch 2 times, most recently from 822ee09 to 262a623 Compare September 9, 2025 08:29
Signed-off-by: Riyad Islam <[email protected]>
@i-riyad i-riyad force-pushed the rislam/strongly-typed-default branch from 262a623 to 9e7cd86 Compare September 9, 2025 08:29
# Remove transpose and reshape nodes
new_nodes = [node for node in graph.node if node.name not in nodes_to_remove]
graph.node.clear()
del graph.node[:]
Contributor


Is there any reason we use this over graph.node.clear() ?

@i-riyad i-riyad enabled auto-merge (squash) September 9, 2025 22:53

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py (4)

60-63: Return type mismatch (bytes vs bytearray).

Annotated as bytes but returns bytearray. Simplify and drop the type ignore.

 def get_engine_bytes(engine: trt.tensorrt.ICudaEngine) -> bytes:
     """Return serialized TensorRT engine bytes."""
-    return bytearray(engine.serialize())  # type: ignore[return-value]
+    return bytes(engine.serialize())

65-74: Incorrect function signature: function returns a tuple but is annotated as engine only.

This will confuse type checkers and callers.

-from tensorrt import Logger
+from typing import Optional, Tuple
@@
-def load_engine(buffer: bytes, log_level: int = trt.Logger.ERROR) -> trt.tensorrt.ICudaEngine:
+def load_engine(buffer: bytes, log_level: int = trt.Logger.ERROR) -> Tuple[Optional[trt.tensorrt.ICudaEngine], str]:
@@
-            return runtime.deserialize_cuda_engine(buffer), ""
+            return runtime.deserialize_cuda_engine(buffer), ""
     except Exception as e:
         logging.exception(str(e))
         return None, str(e)

110-131: get_output_shapes returns TRT dims objects, not List[List[int]] as annotated.

Materialize to Python lists for a stable, JSON-safe structure.

-    output_shapes = []
+    output_shapes: list[list[int]] = []
@@
-        if not engine.binding_is_input(binding_index):
-            shape = context.get_binding_shape(binding_index)
-            output_shapes.append(shape)
+        if not engine.binding_is_input(binding_index):
+            dims = context.get_binding_shape(binding_index)
+            output_shapes.append(list(dims))

170-180: Hashing logic hashes the payload twice; fix and clean up docstring typo.

Current code computes SHA256(engine_bytes || engine_bytes). The docstring says hash of engine bytes only.

 def prepend_hash_to_bytes(engine_bytes: bytes) -> bytes:
     """Prepend the engine bytes with the SHA256 hash of the engine bytes
-    This has will serve as a unique identifier for the engine and will be used to manage
+    This hash will serve as a unique identifier for the engine and will be used to manage
     TRTSessions in the TRTClient.
     """
-    hash_object = hashlib.sha256(engine_bytes)
-    hash_object.update(engine_bytes)
-    hash_bytes = hash_object.digest()
+    hash_bytes = hashlib.sha256(engine_bytes).digest()
     engine_bytes = hash_bytes + engine_bytes
     return engine_bytes
🧹 Nitpick comments (5)
CHANGELOG.rst (2)

8-8: Tighten wording; show exact flag and choices.

Use “strong typing/strongly typed,” include the CLI flag, and note the TRT rationale for clarity.

-- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strongly typing. Use ``engine_precision`` instead.
+- Deprecated ``quantize_mode`` in ``examples/onnx_ptq/evaluate.py`` in favor of strong typing (to align with TensorRT deprecating weak typing). Use ``--engine_precision`` instead (choices: ``best``, ``fp16``, ``stronglyTyped``).

13-13: Fix grammar and surface behavior change.

“defaults to,” code-literal for fp16, and concise note that output weights change.

-- ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
+- ``high_precision_dtype`` now defaults to ``fp16`` in ONNX quantization; quantized output model weights are FP16 by default.
+  (This changes prior behavior for users expecting FP32 weights.)
modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py (3)

134-152: Avoid double-parsing ONNX and guard input alignment.

Parse once, validate tensor count, use integer division.

-    input_names = get_onnx_input_names(onnx.load_from_string(onnx_bytes))
-
-    batch_size = get_batch_size(onnx.load_from_string(onnx_bytes))
+    model = onnx.load_from_string(onnx_bytes)
+    input_names = get_onnx_input_names(model)
+    batch_size = get_batch_size(model)
     if not batch_size or batch_size <= 0:
         batch_size = 1
+    if len(input_tensors) != len(input_names):
+        raise ValueError(f"Expected {len(input_names)} input tensors, got {len(input_tensors)}.")
     # If input tensor batch % batch_size != 0, we don't use all input tensors for calibration.
-    num_batches = int(input_tensors[0].shape[0] / batch_size)
+    num_batches = input_tensors[0].shape[0] // batch_size

154-168: Prefer explicit exception over assert; consider FP8 support when available.

Asserts can be stripped with -O. Also, optionally map FP8 when both TRT and torch support it.

 def convert_trt_dtype_to_torch(trt_dtype: trt.tensorrt.DataType) -> torch.dtype:
@@
-    assert trt_dtype in trt_to_torch_dtype_map, f"Unsupported TensorRT data type: {trt_dtype}"
-    return trt_to_torch_dtype_map[trt_dtype]
+    if hasattr(trt.DataType, "FP8") and hasattr(torch, "float8_e4m3fn"):
+        trt_to_torch_dtype_map[trt.DataType.FP8] = torch.float8_e4m3fn
+    if trt_dtype not in trt_to_torch_dtype_map:
+        raise ValueError(f"Unsupported TensorRT data type: {trt_dtype}")
+    return trt_to_torch_dtype_map[trt_dtype]

182-196: Minor doc and determinism nits in convert_shape_to_string.

Fix example and sort keys for stable output.

     """Convert a shape dictionary to a string.
     For example, if the shape is:
         {
-            "input": [1, 3, 224, 224],
+            "input": [1, 3, 224, 224],
             "output": [1, 1000]
         }.
     The output string will be:
-        input:1x3x244x244,output:1x1000
+        input:1x3x224x224,output:1x1000
     """
     result = ""
-    for key, value in shape.items():
-        result += f"{key}:{'x'.join(map(str, value))},"
+    for key in sorted(shape):
+        value = shape[key]
+        result += f"{key}:{'x'.join(map(str, value))},"
     return result[:-1]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9123d26 and 5d3adfc.

📒 Files selected for processing (4)
  • CHANGELOG.rst (1 hunks)
  • modelopt/onnx/quantization/qdq_utils.py (2 hunks)
  • modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py (1 hunks)
  • tests/_test_utils/onnx_quantization/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/_test_utils/onnx_quantization/utils.py
  • modelopt/onnx/quantization/qdq_utils.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py (1)

27-27: Resolved: no lingering precision helper references — search confirms zero occurrences of validate_precision or INT8_IQ in the codebase; existing TRTMode uses are intentional.

@i-riyad i-riyad merged commit b913290 into main Sep 10, 2025
25 of 27 checks passed
@i-riyad i-riyad deleted the rislam/strongly-typed-default branch September 10, 2025 02:19
jingyu-ml pushed a commit that referenced this pull request Sep 10, 2025