Enable direct INT8/FP8 input in ONNX graph #354
Conversation
Signed-off-by: gcunhase <[email protected]>
Walkthrough

Introduces a new utility to remove QuantizeLinear nodes at graph inputs, integrates it into the quantization flow when direct_io_types is enabled, and updates CLI help text for quantize_mode and direct_io_types. No public signatures changed; the ONNX model IR version is set to 10 when applying the new cleanup.
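For illustration, here is a minimal sketch of what removing a QuantizeLinear node at a graph input involves at the ONNX proto level. This is not the PR's actual implementation: the function name strip_input_q is invented here, and it hard-codes INT8 retyping for brevity.

```python
import onnx
from onnx import TensorProto

def strip_input_q(model: onnx.ModelProto) -> onnx.ModelProto:
    """Illustrative sketch: drop QuantizeLinear nodes fed directly by graph inputs."""
    graph = model.graph
    inputs_by_name = {i.name: i for i in graph.input}
    for node in list(graph.node):
        if node.op_type == "QuantizeLinear" and node.input[0] in inputs_by_name:
            q_output = node.output[0]
            # Rewire every consumer of the Q output to read the graph input directly.
            for consumer in graph.node:
                consumer.input[:] = [
                    node.input[0] if name == q_output else name for name in consumer.input
                ]
            # Retype the graph input to the quantized dtype (INT8 shown; FP8 would
            # use TensorProto.FLOAT8E4M3FN instead).
            inputs_by_name[node.input[0]].type.tensor_type.elem_type = TensorProto.INT8
            graph.node.remove(node)
    model.ir_version = 10  # per the walkthrough, IR is set to 10 with this cleanup
    return model
```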
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant QZ as quantize.py
    participant QDU as qdq_utils
    participant M as ONNX Model
    U->>QZ: run quantize(...)
    QZ->>QDU: remove_input_dq_and_output_q(M)
    QDU-->>QZ: M (updated)
    alt direct_io_types == True
        QZ->>QDU: remove_graph_input_q(M)
        QDU-->>QZ: M (inputs rewired, IR=10)
    else direct_io_types == False
        Note over QZ: Skip input Q removal
    end
    QZ->>QZ: topological sort / export
    QZ-->>U: quantized ONNX model
```
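A rough condensation of the wiring this diagram implies (signatures assumed; the real quantize.py likely threads more arguments through):

```python
from modelopt.onnx.quantization.qdq_utils import (
    remove_graph_input_q,
    remove_input_dq_and_output_q,
)

def apply_direct_io_cleanup(onnx_model, direct_io_types: bool):
    """Hypothetical condensation of the flow above, not the actual source."""
    onnx_model = remove_input_dq_and_output_q(onnx_model)  # existing Q/DQ cleanup
    if direct_io_types:
        # New in this PR: drop QuantizeLinear at graph inputs so INT8/FP8
        # tensors can be fed directly; the model IR version becomes 10.
        onnx_model = remove_graph_input_q(onnx_model)
    return onnx_model
```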
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
modelopt/onnx/quantization/__main__.py (1)
245-251: Clarify scope and behavior of --direct_io_types.

Consider making the help explicit that this currently:
- Applies only for INT8/FP8 quantization paths (inputs), not INT4; and
- Adjusts inputs only (outputs remain unchanged).
This avoids user confusion.
Apply this minimal tweak:
- "If True, the I/O types in the quantized ONNX model will be modified to be lower precision whenever " - "possible. Else, they will match the I/O types in the given ONNX model. " - "The currently supported precisions are {fp16, int8, fp8}." + "If True, attempts to use lower‑precision graph I/O where possible. Currently affects inputs only for " + "INT8/FP8 quantization; outputs are unchanged. Otherwise, I/O types match the input ONNX model. " + "Supported input precisions: {fp16, int8, fp8}."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- modelopt/onnx/quantization/__main__.py (2 hunks)
- modelopt/onnx/quantization/qdq_utils.py (1 hunk)
- modelopt/onnx/quantization/quantize.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/onnx/quantization/quantize.py (1)

- modelopt/onnx/quantization/qdq_utils.py (2)
  - remove_graph_input_q (874-943)
  - remove_input_dq_and_output_q (738-871)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (3)
modelopt/onnx/quantization/__main__.py (1)
39-40: Help text reflow looks good.

No behavioral change; wording is clear.
modelopt/onnx/quantization/quantize.py (2)
63-67: Import wiring looks correct.

New utility is imported alongside existing Q/DQ helpers; consistent with module boundaries.
505-507: Guard against models without explicit zero-points at input DQ.

remove_graph_input_q assumes a 3-input DQ (scale + zero_point) and that the zero_point is an initializer. Some toolchains emit a 2-input DQ (no zero_point) or feed the zero_point via a Constant node; either case would raise at runtime.

Action: Harden remove_graph_input_q (preferred; see the suggested patch in the qdq_utils.py comment) or add a precheck here before calling it.

Would you confirm that our INT8/FP8 activation quant flows always materialize zero_point as an initializer for input DQ? If not, I'll push a defensive change.
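For concreteness, a defensive precheck along the suggested lines could look like the sketch below. The helper names are hypothetical, and the actual hardening may instead live inside remove_graph_input_q.

```python
import onnx

def zero_point_is_initializer(model: onnx.ModelProto, node: onnx.NodeProto) -> bool:
    """True if the node's 3rd input (zero_point) resolves to a graph initializer
    rather than being omitted or produced by a Constant node."""
    if len(node.input) < 3 or not node.input[2]:
        return False  # 2-input form: zero_point omitted entirely
    initializer_names = {init.name for init in model.graph.initializer}
    return node.input[2] in initializer_names

def graph_input_q_nodes(model: onnx.ModelProto):
    """Yield QuantizeLinear nodes that consume a top-level graph input."""
    graph_inputs = {i.name for i in model.graph.input}
    for node in model.graph.node:
        if node.op_type == "QuantizeLinear" and node.input[0] in graph_inputs:
            yield node

# Only proceed with the removal when every input Q node is in the supported shape:
# safe = all(zero_point_is_initializer(model, n) for n in graph_input_q_nodes(model))
```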
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #354      +/-   ##
==========================================
- Coverage   73.84%   73.72%   -0.12%
==========================================
  Files         172      172
  Lines       17453    17484      +31
==========================================
+ Hits        12888    12890       +2
- Misses       4565     4594      +29
```

☔ View full report in Codecov by Sentry.
Signed-off-by: gcunhase <[email protected]>
Signed-off-by: gcunhase <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Type of change: Expand existing feature
Overview: Allows users to generate a quantized ONNX model with a directly quantized input via the existing --direct_io_types flag. This flag currently only supports FP16 as the direct I/O type; this PR extends it to allow INT8/FP8 precision at the graph input. For example, if the 1st layer of a model is being quantized and the user enables --direct_io_types, the graph input connects directly to the 1st DQ node, setting the graph input's type to the network's quantize_mode.

Usage
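A minimal invocation sketch, assuming --onnx_path and --output_path as flag names (--quantize_mode and --direct_io_types are the flags this PR touches):

```bash
python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int8 \
    --direct_io_types \
    --output_path=model.quant.onnx
```

With this, the quantized model's first input accepts INT8 (or FP8) tensors directly, matching the network's quantize_mode, rather than a floating-point input followed by an input-side Q node.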
Testing
Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Documentation