libm: Fix _status status outputs on null-mantissa inputs#1085
libm: Fix _status status outputs on null-mantissa inputs#1085tgross35 merged 4 commits intorust-lang:mainfrom
_status status outputs on null-mantissa inputs#1085Conversation
c04e284 to
7214141
Compare
| /// `x / y` when `x != 0.0` and `y == 0.0`, | ||
| #[cfg_attr(not(feature = "unstable-public-internals"), allow(dead_code))] | ||
| pub const DIVIDE_BY_ZERO: Self = Self(1 << 2); | ||
| pub const DIVIDE_BY_ZERO: Self = Self(1 << 1); |
There was a problem hiding this comment.
I'm guessing the reason why this skipped a bit is that they used the same bit positions as the x86 status registers: x86-64 MXCSR Register as well as the separate x87 status word. Those use 0x2 is for the non-standard Denormal Operand Exception.
The patched contiguous labeling matches Arm, but RISC-V has yet another layout (same low 5 bits but in reverse order).
There was a problem hiding this comment.
Great catch. Is there any reason to be consistent with these? I can't think of much purpose to that unless we use host floats with the environment via asm and load the register back to a Status.
There was a problem hiding this comment.
Yeah, and doing that would require one of
- mapping to and from the architecture specific layout at that interface
- making these constants architecture specific
But we don't use it like that, so there's no need. Now that the Debug impl prints only the names, and since the UNKNOWN bits should not be settable, the actual bit-values are private to this module.
There was a problem hiding this comment.
libm/src/math/generic/trunc.rs
Outdated
| // If the to-be-masked-out portion is already zero, we have an exact result | ||
| if (xi & !mask) == IntTy::<F>::ZERO { | ||
| return FpResult::ok(x); | ||
| } | ||
|
|
||
| // C5: Otherwise the result is inexact and we will truncate. Raise `FE_INEXACT`, mask the | ||
| // Otherwise the result is inexact and we will truncate. Raise `FE_INEXACT`, mask the | ||
| // result, and return. | ||
|
|
||
| let status = if xi & F::SIG_MASK == F::Int::ZERO { | ||
| Status::OK | ||
| } else { | ||
| Status::INEXACT | ||
| }; | ||
| xi &= mask; | ||
| FpResult::new(F::from_bits(xi), status) | ||
| FpResult::new(F::from_bits(xi), Status::INEXACT) |
There was a problem hiding this comment.
If xi & !mask == 0, then xi & mask == xi, so this can just return F::from_bits(xi & mask) and avoid the branch.
Also, it might be slightly better to use the negated mask ("bits to clear") from the start.
let cleared = xi & clear_mask;
let status = if cleared == 0 { Status::OK } else { Status::INEXACT };
FpResult::new(F::from_bits(xi ^ cleared), status)There was a problem hiding this comment.
Thanks, this came with some nice small speedups
[660](https://github.com/rust-lang/compiler-builtins/actions/runs/21939477624/job/63361314470?pr=1085#step:7:2661)
icount::icount_bench_trunc_group::icount_bench_trunc logspace:setup_trunc()
Baselines: softfloat|softfloat (old)
Instructions: 7057|7579 (-6.88745%) [-1.07397x]
L1 Hits: 7674|8197 (-6.38038%) [-1.06815x]
LL Hits: 2|1 (+100.000%) [+2.00000x]
RAM Hits: 3|3 (No change)
Total read+write: 7679|8201 (-6.36508%) [-1.06798x]
Estimated Cycles: 7789|8307 (-6.23570%) [-1.06650x]
icount::icount_bench_truncf128_group::icount_bench_truncf128 logspace:setup_truncf128()
Baselines: softfloat|softfloat (old)
Instructions: 12263|13271 (-7.59551%) [-1.08220x]
L1 Hits: 14883|15889 (-6.33142%) [-1.06759x]
LL Hits: 3|5 (-40.0000%) [-1.66667x]
RAM Hits: 6|6 (No change)
Total read+write: 14892|15900 (-6.33962%) [-1.06769x]
Estimated Cycles: 15108|16124 (-6.30117%) [-1.06725x]
icount::icount_bench_truncf16_group::icount_bench_truncf16 logspace:setup_truncf16()
Baselines: softfloat|softfloat (old)
Instructions: 7952|8726 (-8.87004%) [-1.09733x]
L1 Hits: 8520|9294 (-8.32795%) [-1.09085x]
LL Hits: 1|1 (No change)
RAM Hits: 5|5 (No change)
Total read+write: 8526|9300 (-8.32258%) [-1.09078x]
Estimated Cycles: 8700|9474 (-8.16973%) [-1.08897x]
icount::icount_bench_truncf_group::icount_bench_truncf logspace:setup_truncf()
Baselines: softfloat|softfloat (old)
Instructions: 7281|7865 (-7.42530%) [-1.08021x]
L1 Hits: 7896|8479 (-6.87581%) [-1.07383x]
LL Hits: 2|3 (-33.3333%) [-1.50000x]
RAM Hits: 5|5 (No change)
Total read+write: 7903|8487 (-6.88111%) [-1.07390x]
Estimated Cycles: 8081|8669 (-6.78279%) [-1.07276x]
| status = Status::INEXACT; | ||
| // |x| < 1.0, zero or inexact with truncation | ||
|
|
||
| if (ix & !F::SIGN_MASK) == F::Int::ZERO { |
There was a problem hiding this comment.
The performance regression didn't make sense to me, so I looked at the assembly and then LLVM-IR, and it seems that LLVM is being "smart" by turning this into a floating point comparison x == 0.0_f16, and that generates the equivalent of x as f32 == 0.0_f32 with a call to __extendhfsf2.
There was a problem hiding this comment.
Thanks for digging into all of this. I wrote up an issue here llvm/llvm-project#180630, wonder what else might be hurting from it.
There was a problem hiding this comment.
On x86-64, these contain one or more calls to __extendhfsf2 (f16 as f32):
fmaximumf16
fminimumf16
fmaximum_numf16
fminimum_numf16
fmaxf16
fminf16
fdimf16
sqrtf16
rintf16
roundf16
fmodf16
ldexpf16
The min/max -functions are doing floating point comparisons, but might benefit from inspecting the bits instead. The others do float arithmetic and also contains calls to __truncsfhf2 (f32 as f16). So none of them are explicit bit-operations being turned into float ops.
Most other libcalls are in various f128-functions.
There was a problem hiding this comment.
Given the benchmarks primarily affect f16 and f128 and that we can't really easily work around LLVM, I'm just going to accept the regressions here.
57a0c3b to
0505d7a
Compare
This comment has been minimized.
This comment has been minimized.
0505d7a to
332416d
Compare
This comment has been minimized.
This comment has been minimized.
2a4aec6 to
eeaf909
Compare
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
Values like 0.5 that have an exponent but not a mantissa are currently being reported as OK rather than INEXACT for ceil, floor, and trunc. Fix this and clean up the test cases. This tries to keep the algorithm diff as simple as possible, which introduces some small performance regressions. This is improved in a future commit.
ceil seems to optimize better, but this gets closer to the pre-fix codegen.
Suggested-by: Juho Kahala <57393910+quaternic@users.noreply.github.com>
eeaf909 to
d7e350f
Compare
Values like 0.5 that have an exponent but not a mantissa are currently being reported as OK rather than INEXACT for ceil, floor, and trunc. Fix this and clean up the test cases.
ci: allow-regressions