Improve quality of f32/f64 generation (#103)
Conversation
The previous int-to-float conversion had a bias of probability 2^-24 / 2^-53 for types f32 / f64 respectively. The new conversion has a bias of 2^-64, which is the same bias as the underlying WyRand generator. The new conversion is a slightly shorter instruction sequence on x86 and ARM, but executes as 1 more uop on x86. Seems unlikely to harm performance much, if at all. https://rust.godbolt.org/z/q3zMxEc3T
This is unavailable on thumbv7m-none-eabi.
They both end up compiling to the same code after inlining and simplification, but the former requires more work from the optimizer to get there.
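As a rough sketch of the two conversion strategies being compared (the function names and bit layout here are illustrative, not the crate's actual API):

```rust
// Illustrative sketch only; names and exact bit selection are assumptions.

/// Old-style conversion: keep the top 24 bits and scale by 2^-24, giving
/// a value in [0, 1). Only 2^24 outputs are reachable, so the bias is on
/// the order of 2^-24 (analogously 2^-53 for f64 with 53 bits).
fn old_f32(bits: u64) -> f32 {
    (bits >> 40) as f32 * (1.0 / (1u64 << 24) as f32)
}

/// New-style conversion: round all 64 bits to the nearest f32 and scale by
/// 2^-64, giving a value in [0, 1] inclusive with bias on the order of 2^-64.
/// Rust's int-to-float `as` cast rounds to nearest, and the power-of-two
/// scaling is exact, so no further rounding error is introduced.
fn new_f32_inclusive(bits: u64) -> f32 {
    bits as f32 * 2.0f32.powi(-64)
}

fn main() {
    assert_eq!(new_f32_inclusive(0), 0.0);
    // 2^64 - 1 rounds up to 2^64 as an f32, so the inclusive endpoint 1.0
    // is reachable (hence the rejection loop in f32() below).
    assert_eq!(new_f32_inclusive(u64::MAX), 1.0);
    assert!(old_f32(u64::MAX) < 1.0);
}
```

The inclusive endpoint is why the exclusive `f32()` variant discussed below needs a rejection loop.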
```rust
}

#[test]
fn digit() {
```
Sorry, I accidentally removed several unrelated tests when resolving conflicts. Filed #110 to re-add them.
```rust
    loop {
        let x = self.f32_inclusive();
        if x < 1.0 {
            return x;
        }
    }
}
```
Was the perf of this measured? Seems like adding a branch and a loop to f32 might have performance implications I'm not comfortable with releasing.
I didn't measure at the time since there was no existing benchmark; I only looked at assembly.
I've now added a benchmark in #112 and measured the results on AArch64 (M2 Pro) and x86-64 (Zen4). I get:
AArch64:
0.449 ns/iter: old f32() (2, 3, or 4 iterations unrolled)
0.344 ns/iter: new f32_inclusive() (4 iterations unrolled)
0.587 ns/iter: new f32() (2 iterations unrolled)
x86-64:
0.856 ns/iter: old f32() (3 iterations)
0.649 ns/iter: new f32_inclusive()
0.772 ns/iter: new f32() (2 iterations unrolled)
So the change is always a win if you switch to f32_inclusive(), and it's a win on x86-64 (but a loss on AArch64) if you stay with f32().
Note that the branch is almost-always taken (probability 1 - 2^-23) so the branch predictor will predict this ~perfectly.
(These numbers are the benchmark numbers, divided by 10_000, since the benchmark itself generates 10_000 f32s.)
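A per-call figure like the ones above can be reproduced with a minimal timing harness along these lines. This is a hedged sketch, not the benchmark from #112; the WyRand step uses the constants from the public wyrand reference and is included only to make the harness self-contained:

```rust
use std::time::Instant;

// Minimal WyRand step (constants from the public wyrand reference; illustrative).
fn wyrand(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0xa076_1d64_78bd_642f);
    let t = (*state as u128) * ((*state ^ 0xe703_7ed1_a0b4_28db) as u128);
    ((t >> 64) as u64) ^ (t as u64)
}

// [0, 1] inclusive: round all 64 bits to f32 and scale by 2^-64.
fn f32_inclusive(state: &mut u64) -> f32 {
    wyrand(state) as f32 * 2.0f32.powi(-64)
}

fn main() {
    let mut state = 0x2d35_8dcc_aa6c_78a5_u64; // arbitrary nonzero seed
    const N: u32 = 10_000;
    let start = Instant::now();
    let mut sum = 0.0f32; // accumulate so the loop isn't optimized away
    for _ in 0..N {
        sum += f32_inclusive(&mut state);
    }
    let elapsed = start.elapsed();
    // Divide by N to get the per-call figure, as described above.
    println!("{:.3} ns/iter (sum = {sum})", elapsed.as_nanos() as f64 / N as f64);
}
```

Note that a one-shot harness like this is noisier than a proper benchmark runner; the numbers quoted above should be taken from the repository's benchmark, not this sketch.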
See comments in lib.rs for a full explanation of the bias change. The added tests in smoke.rs fail with the old conversion and pass with the new one.