Is NaN Propagation Necessary in ONNX Runtime? (Architectural Differences e.g. RISC-V vs x86/ARM) #24589
Unanswered
qiujiandong asked this question in General
Replies: 1 comment
-
I think it depends on the model. Most models are not trained on RISC-V, so what matters when running them on RISC-V is that accuracy remains good enough in practice. MLPerf includes ImageNet benchmarks: if we run the same model on RISC-V (with any backend) over the full ImageNet validation set and the accuracy still makes sense, the rest is not important.
-
Hi all,
I've been exploring the behavior of floating-point NaN propagation in ONNX Runtime and noticed that the specification and implementation might implicitly assume that NaNs propagate across operations (i.e., a NaN input results in a NaN output that retains its bit pattern).
However, this behavior is not consistent across architectures.
For example, on RISC-V, the floating-point units (as defined by the F and D extensions) do not propagate input NaN payloads. Instead, an arithmetic operation whose result is NaN produces a fixed, canonical NaN (e.g., 0x7fc00000 for float32). This is permitted by IEEE 754, which recommends but does not require payload propagation, but it eliminates any form of NaN payload preservation.
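To make "payload" and "canonical NaN" concrete, here is a minimal sketch that works directly on float32 bit patterns as integers, so the result does not depend on the host's own NaN handling. The helper names (`is_nan`, `payload`, `canonicalize`) are mine, for illustration only; they are not ONNX Runtime or MLAS APIs.

```python
# float32 layout: 1 sign bit, 8 exponent bits, 23 fraction bits.
EXP_MASK = 0x7F800000       # exponent field
FRAC_MASK = 0x007FFFFF      # fraction field
QUIET_BIT = 0x00400000      # most significant fraction bit
CANONICAL_NAN = 0x7FC00000  # canonical float32 NaN quoted above


def is_nan(bits: int) -> bool:
    """True if the pattern is a NaN: exponent all ones, fraction non-zero."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0


def payload(bits: int) -> int:
    """Fraction bits below the quiet bit, i.e. what canonicalization discards."""
    return bits & (FRAC_MASK & ~QUIET_BIT)


def canonicalize(bits: int) -> int:
    """Map every NaN pattern to CANONICAL_NAN; leave other patterns unchanged."""
    return CANONICAL_NAN if is_nan(bits) else bits


if __name__ == "__main__":
    tagged = 0x7FC00123  # quiet NaN carrying the illustrative payload 0x123
    print(hex(payload(tagged)), hex(canonicalize(tagged)))
```

A NaN tagged with payload 0x123 comes out of `canonicalize` as 0x7fc00000, which is the information loss I am asking about.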
By contrast, on some other architectures (like x86 or ARM), the hardware may propagate NaN payloads, and in some cases, even preserve the first encountered quiet NaN’s payload. This could cause subtle differences in outputs, particularly in deep learning pipelines that may not expect bitwise equality but still rely on NaN tracking behavior (e.g., for debugging, tracing invalid values, etc.).
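As a rough illustration of the difference, the sketch below models two policies for the NaN result of a binary operation, again on float32 bit patterns. The "first NaN operand, quieted" rule is a simplified assumption for illustration and is not an exact model of any particular x86 or ARM instruction; the function names are hypothetical.

```python
EXP_MASK = 0x7F800000
FRAC_MASK = 0x007FFFFF
QUIET_BIT = 0x00400000
CANONICAL_NAN = 0x7FC00000


def is_nan(bits):
    """True if the float32 bit pattern encodes a NaN."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0


def nan_result_propagating(a_bits, b_bits):
    """Assumed policy: return the first NaN operand with its quiet bit set."""
    for bits in (a_bits, b_bits):
        if is_nan(bits):
            return bits | QUIET_BIT
    return None  # neither operand is NaN, so the normal arithmetic result applies


def nan_result_canonical(a_bits, b_bits):
    """Canonicalizing policy: any NaN operand yields the canonical NaN."""
    if is_nan(a_bits) or is_nan(b_bits):
        return CANONICAL_NAN
    return None


if __name__ == "__main__":
    a, b = 0x7FC00123, 0x7FC00456  # two quiet NaNs with different payloads
    print(hex(nan_result_propagating(a, b)), hex(nan_result_canonical(a, b)))
```

Under the propagating policy the payload of the first operand survives; under the canonicalizing policy both payloads are lost, so the two platforms would disagree bitwise on the same inputs.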
For example, running the MLAS Activation test on RISC-V:
❓ Questions for discussion:
Should ONNX Runtime enforce or define consistent NaN propagation behavior across platforms?
For example: always propagate the first-encountered NaN payload, or always canonicalize?
Is it acceptable for NaN propagation behavior to vary by backend or architecture?
Should ONNX model authors assume anything about NaN bit patterns being preserved?
Would it make sense to explicitly document this in ONNX Runtime’s operator behavior or backend compliance expectations?
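If the answer turns out to be that NaN bit patterns are not guaranteed, one option on the testing side would be a payload-insensitive comparison. The following is only a sketch of that idea; `outputs_match` is a hypothetical helper, not something that exists in ONNX Runtime's test suite as far as I know.

```python
EXP_MASK = 0x7F800000
FRAC_MASK = 0x007FFFFF


def is_nan(bits):
    """True if the float32 bit pattern encodes a NaN."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0


def outputs_match(expected_bits, actual_bits):
    """Compare two sequences of float32 bit patterns.

    Non-NaN values must be bit-identical (so +0.0 and -0.0 differ);
    any NaN matches any other NaN regardless of sign or payload.
    """
    if len(expected_bits) != len(actual_bits):
        return False
    return all(
        e == a or (is_nan(e) and is_nan(a))
        for e, a in zip(expected_bits, actual_bits)
    )


if __name__ == "__main__":
    propagated = [0x3F800000, 0x7FC00123]     # payload preserved
    canonicalized = [0x3F800000, 0x7FC00000]  # payload dropped
    print(outputs_match(propagated, canonicalized))
```

With a check like this, a reference output that preserves a payload and a RISC-V output that canonicalizes it would compare as equal, while any non-NaN mismatch would still fail.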
⚙️ Motivation:
Understanding this could help backend developers (especially on RISC-V and other minimal/flexible hardware targets) know whether they must implement software-emulated NaN propagation for compliance, or if canonical NaN is acceptable.
Looking forward to your thoughts!