Fix DQ1 output type error in DQ1->DQ2 for FP4 weights in NVFP4 model (#513)
## What does this PR do?
**Type of change:** Bug Fix
**Overview:**
- In post-processing after NVFP4 PTQ and ONNX export, we convert the FP4 QDQ into a DQ1->DQ2 pair for the FP4 weights of the MatMuls. The output of DQ1 was set to the original weight type (FP16 for an FP16 base model), while its scale is in FP32; a cast-to-FP16 follows DQ2.
- In this setting, with FP16 base-model weights, DQ1 therefore had an FP32 x_scale but an FP16 output type. This mixed-precision combination is not allowed up to opset 21 (DequantizeLinear's output type must match the scale type), so such models fail when run with ONNX Runtime.
- Opset 23+ does allow this mixed-precision DequantizeLinear, but it is not yet fully supported by ONNX Runtime EPs, and we want to keep supporting opset < 23 going forward in any case.
- This change therefore sets the output of DQ1 to FP32 to match its FP32 scale; the existing cast-to-FP16 after DQ2 (before Gemm) preserves the final precision. A minimal sketch of the rewrite follows this list.
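To make the fix concrete, here is a minimal Python sketch of the kind of ONNX graph rewrite described above. The function name `fix_dq1_output_dtype` and the traversal structure are illustrative assumptions, not the actual modelopt implementation:

```python
# Hypothetical sketch of the DQ1 output-dtype fix; not the actual
# modelopt code. Assumes shape inference has populated value_info.
import onnx
from onnx import TensorProto

def fix_dq1_output_dtype(model: onnx.ModelProto) -> onnx.ModelProto:
    graph = model.graph
    initializers = {init.name: init for init in graph.initializer}
    # A tensor feeding another DequantizeLinear marks its producer as
    # DQ1 in a DQ1->DQ2 chain.
    dq_inputs = {
        node.input[0]
        for node in graph.node
        if node.op_type == "DequantizeLinear"
    }
    for node in graph.node:
        if node.op_type != "DequantizeLinear" or node.output[0] not in dq_inputs:
            continue  # only rewrite DQ1 of a DQ1->DQ2 pair
        scale = initializers.get(node.input[1])  # input 1 is x_scale
        if scale is None or scale.data_type != TensorProto.FLOAT:
            continue  # only an FP32 scale hits the opset<=21 constraint
        # Up to opset 21, DequantizeLinear's output type must match
        # x_scale, so declare DQ1's output tensor as FP32; the existing
        # cast-to-FP16 after DQ2 restores the original weight precision.
        for vi in graph.value_info:
            if vi.name == node.output[0]:
                vi.type.tensor_type.elem_type = TensorProto.FLOAT
    return model
```

Under these assumptions, the rewrite touches only the intermediate DQ1 output declaration, so the numerics and the final FP16 output of the DQ1->DQ2->Cast->Gemm chain are unchanged.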
## Testing
- Verified with the trtexec binary and the onnxruntime-trt-rtx EP, using the sd3.5-medium model on Windows with an RTX 5090.
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes/No <!--- If No, explain
why. -->
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update
[Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes/No <!--- Only for new features, API changes, critical bug fixes or
bw breaking changes. -->
## Additional Information
<!-- E.g. related issue. -->
---------
Signed-off-by: vipandya <[email protected]>