[5532019][AutoCast] Update GraphSanitizer to convert Double to Float32 #364
Conversation
Signed-off-by: ajrasane <[email protected]>
Walkthrough

Adds an FP64-to-FP32 normalization pass to the ONNX GraphSanitizer and invokes it during autocast conversion. Updates unit tests to cover the new conversions across initializers, IO types, and node kinds. Documentation for ONNX PTQ examples adds environment variable exports for CUDA/cuDNN in TensorRT Docker usage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Convert as convert_to_f16
    participant Sanitizer as GraphSanitizer
    participant ONNX as ONNX Model
    User->>Convert: Invoke convert_to_f16(model)
    Convert->>Sanitizer: sanitize(model)
    rect rgba(200,240,255,0.3)
        note over Sanitizer: Existing sanitize steps (custom nodes, opset, LN patterns, names, cleanup)
        Sanitizer->>Sanitizer: convert_fp64_to_fp32() [NEW]
        note right of Sanitizer: Convert initializers, IO types,<br/>Cast/Constant/ConstantOfShape to FP32
        Sanitizer->>ONNX: Infer shapes after conversion
    end
    Sanitizer-->>Convert: Sanitized model
    Convert->>ONNX: Proceed with precision conversion to FP16
    ONNX-->>User: Returns converted model
```
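For orientation, a minimal usage sketch of the flow in the diagram above (the model path is a placeholder; the arguments follow the convert_to_f16 signature discussed later in this review):

```python
# Sketch only: exercises the convert_to_f16 entry point, which runs the
# GraphSanitizer (including the new FP64 -> FP32 normalization) before the
# FP16 precision conversion.
import onnx

from modelopt.onnx.autocast.convert import convert_to_f16

model = onnx.load("model_with_fp64_tensors.onnx")  # placeholder path
converted = convert_to_f16(model, low_precision_type="fp16", keep_io_types=True)
onnx.save(converted, "model_fp16.onnx")
```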
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (3)
modelopt/onnx/autocast/precisionconverter.py (3)
Lines 184-189: Keep I/O types but still clear output shape dims to let inference refresh shapes

When keep_io_types is True, you skip resetting output elem_type (good), but you also skip clearing output dims. That can leave stale shapes on outputs and prevent shape inference from updating them. Recommend: only guard elem_type, not the dim reset.
Apply this diff:
```diff
-        if not self.keep_io_types:
-            for out in self.model.graph.output:
-                out.type.tensor_type.elem_type = onnx.TensorProto.UNDEFINED
-                for idx, d in enumerate(out.type.tensor_type.shape.dim):
-                    if d.dim_value:
-                        out.type.tensor_type.shape.dim[idx].dim_param = "unk"
+        if not self.keep_io_types:
+            for out in self.model.graph.output:
+                out.type.tensor_type.elem_type = onnx.TensorProto.UNDEFINED
+        # Always clear output dims so shape inference can refresh them
+        for out in self.model.graph.output:
+            for idx, d in enumerate(out.type.tensor_type.shape.dim):
+                if d.dim_value:
+                    out.type.tensor_type.shape.dim[idx].dim_param = "unk"
```
Lines 217-242: Deduplicate and broaden restoration (cover value_info entries too)

Current logic duplicates input/output loops and only updates value_info_map, not graph.value_info entries that might mirror I/O. DRY the code and also update any matching entries in model.graph.value_info for completeness.
Apply this diff:
```diff
-    def _restore_original_io_types(self):
-        """Restore original I/O types."""
-        # Restore input types
-        for input_tensor in self.model.graph.input:
-            if input_tensor.name in self.original_network_io:
-                original_type = self.original_network_io[input_tensor.name]
-                if input_tensor.type.tensor_type.elem_type != original_type:
-                    input_tensor.type.tensor_type.elem_type = original_type
-                    # Update value_info_map if tensor exists there
-                    if input_tensor.name in self.value_info_map:
-                        self.value_info_map[
-                            input_tensor.name
-                        ].type.tensor_type.elem_type = original_type
-
-        # Restore output types
-        for output_tensor in self.model.graph.output:
-            if output_tensor.name in self.original_network_io:
-                original_type = self.original_network_io[output_tensor.name]
-                if output_tensor.type.tensor_type.elem_type != original_type:
-                    output_tensor.type.tensor_type.elem_type = original_type
-                    # Update value_info_map if tensor exists there
-                    if output_tensor.name in self.value_info_map:
-                        self.value_info_map[
-                            output_tensor.name
-                        ].type.tensor_type.elem_type = original_type
+    def _restore_original_io_types(self):
+        """Restore original I/O types."""
+        for vi in list(self.model.graph.input) + list(self.model.graph.output):
+            name = vi.name
+            if name not in self.original_network_io:
+                continue
+            original_type = self.original_network_io[name]
+            if vi.type.tensor_type.elem_type != original_type:
+                vi.type.tensor_type.elem_type = original_type
+                # Sync value_info_map entry if present
+                if name in self.value_info_map:
+                    self.value_info_map[name].type.tensor_type.elem_type = original_type
+                # Also sync a matching entry in graph.value_info if it exists
+                for extra_vi in self.model.graph.value_info:
+                    if extra_vi.name == name:
+                        extra_vi.type.tensor_type.elem_type = original_type
+                        break
```
Lines 455-456: Skip generating no-op casts at model I/O boundaries

Even though redundant casts are cleaned up later, you can avoid inserting them by checking the current tensor type. This reduces graph churn and log noise.
Apply this diff:
```diff
-        # Add cast down for network outputs (only if not keeping I/O types)
-        if not self.keep_io_types:
-            cast_to_fp16.extend([output for output in node.output if output in network_outputs])
+        # Add cast down for network outputs (only if not keeping I/O types), and only if needed
+        if not self.keep_io_types:
+            for output in node.output:
+                if output in network_outputs and self._get_tensor_type(output) != self.low_precision_type.onnx_type:
+                    cast_to_fp16.append(output)
```

Optionally do the same for inputs:

```diff
-        cast_to_fp32.extend([input for input in node.input if input in network_inputs])
+        for inp in node.input:
+            if inp in network_inputs and self._get_tensor_type(inp) != self.high_precision_type.onnx_type:
+                cast_to_fp32.append(inp)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/onnx/autocast/precisionconverter.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/onnx/autocast/precisionconverter.py (1)
Lines 204-206: LGTM: Restore original I/O types after cleanup

Restoring I/O types right after cleanup and before the sanity check is the correct spot.
Please confirm we don’t need an extra infer_shapes pass after restoring I/O elem_type for models where outputs were previously UNDEFINED. If the checker ever complains on some models, running a final infer_shapes(..., check_type=True) here would be a safe follow-up.
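For reference, a minimal sketch of that follow-up using the public onnx shape-inference API (the in-repo onnx_utils wrapper may differ):

```python
# Sketch of the suggested safety net: a final typed shape-inference pass
# after restoring the original I/O elem_types.
import onnx


def refresh_shapes(model: onnx.ModelProto) -> onnx.ModelProto:
    # check_type=True asks the inference pass to validate dtypes as well,
    # surfacing mismatches introduced by the restoration step.
    return onnx.shape_inference.infer_shapes(model, check_type=True, strict_mode=True)
```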
Actionable comments posted: 0
🧹 Nitpick comments (2)
modelopt/onnx/autocast/precisionconverter.py (2)
Lines 216-241: Also sync graph.value_info to avoid map/graph divergence

If value_info_map doesn’t alias graph.value_info entries, types can drift. Update graph.value_info alongside the map for robustness.
Apply this diff inside the existing blocks where you update value_info_map:
```diff
@@
-                    if input_tensor.name in self.value_info_map:
+                    if input_tensor.name in self.value_info_map:
                         self.value_info_map[
                             input_tensor.name
                         ].type.tensor_type.elem_type = original_type
+                        # Also update corresponding entry in graph.value_info if present
+                        for vi in self.model.graph.value_info:
+                            if vi.name == input_tensor.name:
+                                vi.type.tensor_type.elem_type = original_type
+                                break
@@
-                    if output_tensor.name in self.value_info_map:
+                    if output_tensor.name in self.value_info_map:
                         self.value_info_map[
                             output_tensor.name
                         ].type.tensor_type.elem_type = original_type
+                        # Also update corresponding entry in graph.value_info if present
+                        for vi in self.model.graph.value_info:
+                            if vi.name == output_tensor.name:
+                                vi.type.tensor_type.elem_type = original_type
+                                break
```
Lines 451-455: Honor the “only if not keeping I/O types” comment with a guard

You’re still collecting output tensors for cast-down when keep_io_types=True; cleanup will remove them, but we can avoid churn.
```diff
-        # Add cast down for network outputs (only if not keeping I/O types)
-        cast_to_fp16.extend([output for output in node.output if output in network_outputs])
+        # Add cast down for network outputs (only if not keeping I/O types)
+        if not self.keep_io_types:
+            cast_to_fp16.extend([output for output in node.output if output in network_outputs])
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/onnx/autocast/precisionconverter.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (1)
modelopt/onnx/autocast/precisionconverter.py (1)
Lines 203-205: LGTM: restoring original I/O types post-cleanup is well-placed

Calling restoration after cleanup and before sanity-check keeps graph names stable and fixes I/O dtypes as intended.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #364      +/-   ##
==========================================
+ Coverage   73.48%   73.53%   +0.05%
==========================================
  Files         172      172
  Lines       17636    17700      +64
==========================================
+ Hits        12960    13016      +56
- Misses       4676     4684       +8
```

☔ View full report in Codecov by Sentry.
|
Signed-off-by: ajrasane <[email protected]>
Force-pushed from 7ccc35e to 1e4d697
Signed-off-by: ajrasane <[email protected]>
```python
if tensor.type.tensor_type.elem_type != original_type:
    tensor.type.tensor_type.elem_type = original_type
```
You're just updating the tensor's metadata here. Shouldn't we also inject a cast here?

Looking at the bigger picture, the keep_io_types flag was covered in the unit tests, and is also set as default in MOQ, so generally we know it is working, but there's some edge case we're first seeing in https://nvbugspro.nvidia.com/bug/5532019. Inputs are clearly handled in lines 145-148, and outputs are presumably handled in _cleanup_pre_output_same_type_cast.
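For illustration, one way such a cast could be injected at an output boundary (a sketch with hypothetical helper and tensor names, not the PR's implementation):

```python
# Hypothetical sketch: rewire the producer of a graph output to an
# intermediate tensor and cast it back to the originally declared dtype.
import onnx
from onnx import helper


def inject_output_cast(graph: onnx.GraphProto, output_name: str, original_type: int) -> None:
    internal = f"{output_name}_pre_cast"  # hypothetical intermediate name
    for node in graph.node:
        for i, out in enumerate(node.output):
            if out == output_name:
                node.output[i] = internal  # producer now writes to the intermediate
    cast = helper.make_node(
        "Cast", [internal], [output_name], to=original_type, name=f"cast_restore_{output_name}"
    )
    graph.node.append(cast)  # the Cast restores the declared output dtype
```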
I have removed this function and created conversion logic in the GraphSanitizer.
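For context, a minimal sketch of what such an FP64-to-FP32 initializer pass can look like (the actual helpers in graphsanitizer.py may differ in structure and scope):

```python
# Sketch only: rewrite DOUBLE initializers as FLOAT, reporting whether the
# graph was modified so callers can decide to re-run shape inference.
import numpy as np
import onnx
from onnx import numpy_helper


def convert_fp64_initializers(graph: onnx.GraphProto) -> bool:
    modified = False
    for idx, init in enumerate(graph.initializer):
        if init.data_type == onnx.TensorProto.DOUBLE:
            arr = numpy_helper.to_array(init).astype(np.float32)
            graph.initializer[idx].CopyFrom(numpy_helper.from_array(arr, init.name))
            modified = True
    return modified
```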
Signed-off-by: ajrasane <[email protected]>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/onnx/autocast/convert.py (1)
Lines 100-103: Don’t mutate default providers list

providers has a list default and is mutated via insert, causing sticky state across calls.
Apply this diff:
```diff
-def convert_to_mixed_precision(
+def convert_to_mixed_precision(
@@
-    providers: list[str] = ["cpu"],
+    providers: list[str] | None = None,
@@
-) -> onnx.ModelProto:
+) -> onnx.ModelProto:
@@
-    # Load and process model
+    # Load and process model
     model = onnx.load(onnx_path, load_external_data=True)
     assert low_precision_type in ["fp16", "bf16"], "low_precision_type must be either fp16 or bf16"
+    providers = (providers or ["cpu"]).copy()
@@
-    if "trt" not in providers and graph_sanitizer.custom_ops:
-        providers.insert(0, "trt")
+    if "trt" not in providers and graph_sanitizer.custom_ops:
+        providers = ["trt"] + providers
```

modelopt/onnx/autocast/graphsanitizer.py (1)
Lines 320-329: Bug: scale_dimension may be referenced before assignment

If the pattern matches “bias only, no scale,” scale remains None and scale_dimension is never set before the None check, causing an UnboundLocalError.
Minimal fix:
```diff
-        if scale is not None:
-            scale_dimension = scale.shape
-
-        # Skip pattern if we can't determine the scale dimension
-        if scale_dimension is None:
+        scale_dimension = None
+        if scale is not None:
+            scale_dimension = scale.shape
+        # Skip pattern if we can't determine the scale dimension
+        if scale_dimension is None:
             logger.debug(
                 f"Could not determine scale dimension for LayerNorm pattern at {mean_node.name}"
             )
             return None
```

Optional improvement: derive scale_dimension from the input shape and axis when scale is absent.
🧹 Nitpick comments (8)
modelopt/onnx/autocast/convert.py (2)
Lines 151-157: Avoid mutable default args (op_block_list, trt_plugins)

Using [] as a default can leak state across calls. Prefer None and initialize inside.
Apply this diff:
```diff
-def convert_to_f16(
-    model: onnx.ModelProto,
-    low_precision_type: str = "fp16",
-    keep_io_types: bool = True,
-    op_block_list: list[str] = [],
-    trt_plugins: list[str] | None = [],
-) -> onnx.ModelProto:
+def convert_to_f16(
+    model: onnx.ModelProto,
+    low_precision_type: str = "fp16",
+    keep_io_types: bool = True,
+    op_block_list: list[str] | None = None,
+    trt_plugins: list[str] | None = None,
+) -> onnx.ModelProto:
@@
-    sanitizer = GraphSanitizer(
+    op_block_list = op_block_list or []
+    trt_plugins = trt_plugins or []
+    sanitizer = GraphSanitizer(
```
Lines 185-187: Avoid redundant shape inference

convert_fp64_to_fp32 already calls infer_shapes(strict_mode=True) when it modifies the graph. Consider gating the second inference to reduce overhead (e.g., have convert_fp64_to_fp32 return a modified flag).
modelopt/onnx/autocast/graphsanitizer.py (4)
Lines 70-89: Consider returning a flag to signal modifications

Returning modified (bool) can let callers skip redundant work (like extra infer_shapes).
Apply this diff:
```diff
-    def convert_fp64_to_fp32(self) -> None:
+    def convert_fp64_to_fp32(self) -> bool:
@@
-        if modified:
+        if modified:
             logger.info("Converted FP64 initializers, I/O types, and nodes to FP32")
             self.model = onnx_utils.infer_shapes(self.model, strict_mode=True)
+        return modified
```

And in sanitize():

```diff
-        self.convert_fp64_to_fp32()
+        _ = self.convert_fp64_to_fp32()
```
Lines 454-472: Guard against non-tensor value_info entries

Accessing tensor.type.tensor_type assumes tensor_type is set; sequence/map types would error. Add a defensive check.
Apply this diff:
```diff
-        def convert_tensor_list(tensors, tensor_type):
+        def convert_tensor_list(tensors, tensor_type):
             nonlocal modified
             for tensor in tensors:
-                if tensor.type.tensor_type.elem_type == onnx.TensorProto.DOUBLE:
+                # Only process TensorProto entries
+                tt = getattr(tensor.type, "tensor_type", None)
+                if tt is not None and tt.elem_type == onnx.TensorProto.DOUBLE:
-                    tensor.type.tensor_type.elem_type = onnx.TensorProto.FLOAT
+                    tt.elem_type = onnx.TensorProto.FLOAT
                     modified = True
                     logger.debug(f"Converted {tensor_type} {tensor.name} from FP64 to FP32")
```
Lines 485-506: Broaden FP64 normalization coverage (optional)

Some ops encode dtypes via attributes (e.g., RandomNormal, RandomUniform, EyeLike, CastLike) or use value_float/double forms. Consider handling those for completeness.
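A possible shape for that broader coverage (a sketch; the op list is an assumption, and value-typed Constant forms would need separate handling):

```python
# Sketch: ops that carry a `dtype` attribute can also encode DOUBLE and can be
# rewritten to FLOAT in the same normalization pass.
import onnx

DTYPE_ATTR_OPS = {"RandomNormal", "RandomUniform", "RandomNormalLike", "RandomUniformLike", "EyeLike"}


def normalize_dtype_attributes(graph: onnx.GraphProto) -> bool:
    modified = False
    for node in graph.node:
        if node.op_type in DTYPE_ATTR_OPS:
            for attr in node.attribute:
                if attr.name == "dtype" and attr.i == onnx.TensorProto.DOUBLE:
                    attr.i = onnx.TensorProto.FLOAT
                    modified = True
    return modified
```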
Lines 31-38: Avoid mutable default arg in __init__ (trt_plugins)

Use None and normalize inside to prevent shared state across instances.
Apply this diff:
```diff
-    def __init__(
+    def __init__(
         self,
         model: onnx.ModelProto,
         min_opset: int = 13,
         max_ir_version: int | None = None,
-        trt_plugins: list[str] | None = [],
+        trt_plugins: list[str] | None = None,
     ) -> None:
@@
-        self.trt_plugins = trt_plugins
+        self.trt_plugins = trt_plugins or []
```

tests/unit/onnx/autocast/test_graphsanitizer.py (2)
Lines 328-373: Add tests for keep_io_types on outputs (PR objective)

Please add coverage ensuring outputs are preserved when keep_io_types=True (and cast when False).
Here’s a minimal addition to append at the end of this file:
```diff
+def _simple_fp32_model_with_fp16able_body():
+    x = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 2])
+    y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 2])
+    w = numpy_helper.from_array(np.ones((1, 2), dtype=np.float32), name="W")
+    add = helper.make_node("Add", ["X", "W"], ["Y"])
+    graph = helper.make_graph([add], "mp_test", [x], [y], initializer=[w])
+    model = helper.make_model(graph)
+    return model
+
+def test_keep_io_types_preserves_outputs_fp32_in_convert_to_f16():
+    from modelopt.onnx.autocast.convert import convert_to_f16
+    model = _simple_fp32_model_with_fp16able_body()
+    mod = convert_to_f16(model, low_precision_type="fp16", keep_io_types=True)
+    # Output should remain FLOAT
+    assert mod.graph.output[0].type.tensor_type.elem_type == TensorProto.FLOAT
+
+def test_keep_io_types_false_allows_output_downcast_in_convert_to_f16():
+    from modelopt.onnx.autocast.convert import convert_to_f16
+    model = _simple_fp32_model_with_fp16able_body()
+    mod = convert_to_f16(model, low_precision_type="fp16", keep_io_types=False)
+    # Output is allowed to be cast down; exact check depends on implementation.
+    # If outputs are explicitly set, expect FLOAT16; otherwise cast nodes at boundary.
+    # At minimum, ensure graph contains a Cast to FLOAT16 feeding Y or output dtype is FLOAT16.
+    output_dtype = mod.graph.output[0].type.tensor_type.elem_type
+    has_cast_to_fp16 = any(
+        n.op_type == "Cast" and any(a.name == "to" and a.i == TensorProto.FLOAT16 for a in n.attribute)
+        for n in mod.graph.node
+    )
+    assert output_dtype == TensorProto.FLOAT16 or has_cast_to_fp16
```
Lines 91-111: Cover the “bias-only” LayerNorm case

Add a test with add_scale=False, add_bias=True to catch the scale_dimension bug and validate the fix.
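A hedged sketch of such a test; the model-builder helper and its flags are assumptions standing in for the existing test utilities in this file:

```python
# Hypothetical sketch: `make_decomposed_layernorm_model` is assumed to mirror
# the existing test helpers that build a decomposed LayerNorm pattern.
from modelopt.onnx.autocast.graphsanitizer import GraphSanitizer


def test_layernorm_pattern_bias_only_does_not_crash():
    model = make_decomposed_layernorm_model(add_scale=False, add_bias=True)  # hypothetical helper
    sanitizer = GraphSanitizer(model)
    # Before the fix, the bias-only pattern could hit UnboundLocalError on scale_dimension.
    sanitizer.sanitize()
```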
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/onnx_ptq/README.md (1 hunks)
- modelopt/onnx/autocast/convert.py (1 hunks)
- modelopt/onnx/autocast/graphsanitizer.py (2 hunks)
- tests/unit/onnx/autocast/test_graphsanitizer.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- examples/onnx_ptq/README.md
🧰 Additional context used
🧬 Code graph analysis (2)
modelopt/onnx/autocast/convert.py (1)

- modelopt/onnx/autocast/graphsanitizer.py (1): convert_fp64_to_fp32 (70-88)

tests/unit/onnx/autocast/test_graphsanitizer.py (1)

- modelopt/onnx/autocast/graphsanitizer.py (4): _convert_fp64_initializers (429-451), _convert_fp64_io_types (453-473), _convert_fp64_nodes (475-506), convert_fp64_to_fp32 (70-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (4)
modelopt/onnx/autocast/convert.py (2)
Lines 172-184: Good call: normalize FP64 early in convert_to_f16

Adding sanitizer.convert_fp64_to_fp32() here aligns convert_to_f16 with the sanitize() flow and prevents downstream dtype surprises.
Lines 117-126: keep_io_types semantics validated end-to-end

PrecisionConverter.convert skips I/O casting when keep_io_types=True and restores original output dtypes when False; unit tests in test_precisionconverter.py (I/O dtype assertions, lines 93–98 & 151–156) and test_autocast.py (lines 144–147) already cover both inputs and outputs.

modelopt/onnx/autocast/graphsanitizer.py (1)
Lines 68-69: LGTM: integrate FP64→FP32 into sanitize()

This centralizes normalization and keeps downstream passes consistent.
tests/unit/onnx/autocast/test_graphsanitizer.py (1)
Lines 188-410: Solid coverage for FP64→FP32 conversions

Tests exercise initializers, IO, nodes, and integration paths well.
Set the following environment variables inside the TensorRT docker.

```bash
export CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${LD_LIBRARY_PATH}"
```
This is what we do in the CICD: https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/_installation_for_Linux.html#environment-setup
If that is sufficient, we can also just link this doc here. Else we can update that file too
The steps mentioned above are a bit different for the TRT container:

- There is no /usr/local/tensorrt/ in the TRT container.
- trtexec is stored under /opt/tensorrt/bin, which is already in PATH.
- Adding /usr/lib/x86_64-linux-gnu is sufficient for getting libcudnn.so, which is required for the TRTExecutionProvider.

I would suggest we keep this as it is, as it will be more convenient for the user to run these steps as part of the setup right after they run their container. The steps you shared above should also work fine for the CI, but they are not required for the TRT container.
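One possible smoke test for this setup (not from the PR): after exporting the variables above, confirm that ONNX Runtime reports the TensorRT execution provider before running calibration.

```python
# Sketch: list the execution providers this onnxruntime build exposes.
# Missing CUDA/cuDNN libraries typically surface later, at session creation,
# but a TRT-enabled build should at least report the provider here.
import onnxruntime as ort

print(ort.get_available_providers())  # expect 'TensorrtExecutionProvider' in the list
```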
LGTM, tested with 2 CLIs:

```bash
$ python -m modelopt.onnx.quantization --onnx_path=far3d_opset17.onnx --calibration_eps trt cuda:0 cpu
$ python -m modelopt.onnx.quantization --onnx_path=far3d_opset17.onnx --calibration_eps trt cuda:0 cpu --direct_io_types
```
#364) Signed-off-by: ajrasane <[email protected]>
#364) Signed-off-by: ajrasane <[email protected]> Signed-off-by: Ye Yu <[email protected]>
What does this PR do?

Type of change: New Feature

Overview: Added convert_fp64_to_fp32() in the graph sanitizer. It is invoked via GraphSanitizer.sanitize() or convert_to_f16().
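A short sketch of invoking the new pass via the sanitizer directly (the convert_to_f16 path is shown near the walkthrough above; the model path is a placeholder):

```python
# Sketch only: run the sanitizer directly; sanitize() now includes the
# FP64 -> FP32 normalization step.
import onnx

from modelopt.onnx.autocast.graphsanitizer import GraphSanitizer

model = onnx.load("model_with_double_tensors.onnx")  # placeholder path
sanitizer = GraphSanitizer(model)
sanitizer.sanitize()
onnx.save(sanitizer.model, "model_sanitized.onnx")
```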
Testing
Summary by CodeRabbit

- Documentation
- New Features
- Tests