### Description
This PR allows a user to set the IO dtype (i.e. the input/output dtype) of an INT4 CUDA ONNX model to bfloat16 precision instead of float16 precision. To enable it, pass `-p/--precision int4` and `-e/--execution_provider cuda`, and set `--extra_options use_bf16_cuda=true` (the option also accepts `True` or `1`).
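A sketch of a full invocation, assuming the standard `onnxruntime_genai.models.builder` entry point and a hypothetical model/output path (only the three flags above come from this PR):

```shell
# Build an INT4 CUDA model whose inputs/outputs stay in bfloat16.
# Paths and the -m model name are illustrative placeholders.
python -m onnxruntime_genai.models.builder \
    -m meta-llama/Llama-3.1-8B-Instruct \
    -o ./llama3_int4_bf16 \
    -p int4 \
    -e cuda \
    --extra_options use_bf16_cuda=true
```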
### Motivation and Context
Models lose accuracy when converting weights from their native bfloat16 precision to float16 precision. With the [recent support](microsoft/onnxruntime#25161) for bfloat16 precision in `MatMulNBits`, this conversion is no longer always needed.
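One source of the accuracy loss is dynamic range: bfloat16 keeps float32's 8-bit exponent, while float16 has only a 5-bit exponent, so bfloat16-representable magnitudes beyond float16's maximum (65504) overflow to infinity on conversion. A minimal numpy sketch of this effect (the value `1e30` is an illustrative stand-in for a large bfloat16 weight or scale):

```python
import numpy as np

# A magnitude that bfloat16 can represent exactly in its exponent range
# (bfloat16 reaches ~3.4e38, same as float32) but float16 cannot.
bf16_like_value = np.float32(1e30)

# Casting down to float16 overflows: the 5-bit exponent tops out at 65504.
as_fp16 = np.float16(bf16_like_value)
print(as_fp16)                    # inf
print(np.finfo(np.float16).max)   # 65504.0
```

In-range values round-trip fine (float16 actually carries more mantissa bits than bfloat16), so the problem is specifically the narrower exponent range, which is why keeping the IO dtype in bfloat16 avoids the loss.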