Optimize Integer <-> ByteString conversions #7439
Conversation
zliu41
left a comment
This is quite low level stuff, but I've gone through the logic and it all checks out.
```haskell
    LittleEndian -> pure ()
    BigEndian -> reverseBuffer ptr desiredLength

goSmallLE :: Ptr Word8 -> Int -> Int# -> IO ()
goSmallLE ptr offset remaining#
```
Add a bang to `offset`? Ditto for all the other `Int` arguments.
Good catch, no idea how I missed that.
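For context, the kind of strictness annotation being suggested looks like the sketch below. `sumTo` is a hypothetical stand-in for the PR's loops, not code from this change; the point is just that a bang on each `Int` argument lets GHC keep it unboxed instead of accumulating thunks across iterations:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical example (not from the PR). Without the bangs, `acc`
-- would build up a chain of thunks across recursive calls; with them,
-- GHC can unbox both arguments and keep them in registers.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go :: Int -> Int -> Int
    go !acc !i
      | i > n = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 100)  -- 5050
```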
```haskell
copyByteArrayToAddr ptr (ByteArray ba#) 0 minLength
case requestedByteOrder of
  LittleEndian -> pure ()
  BigEndian -> reverseBuffer ptr desiredLength
```
Any idea how much overhead `reverseBuffer` has?
The short answer: not very much.
The slightly longer answer: In the 'big endian' case, we have to do a reverse copy. We can do this in one of two ways:
- Manually perform a reverse copy by reading bytes from the source, then writing them to the destination with flipped indices; or
- Copy normally, then reverse in-place (which is what I chose to do).
A reverse copy is always going to have overheads compared to copyByteArrayToAddr, for several reasons. Some of these are inherent: the way caches work on current-day CPUs means that, in general, traversing an array in ascending order of indices will be much faster than in descending order, for example. However, in our case, the main reason is that copyByteArrayToAddr is implemented using memcpy, which the runtime calls without an FFI penalty. Among other things (also for reasons of caching), for arrays that aren't significantly larger than a memory page (8KiB for all the platforms we care about), copyByteArrayToAddr might as well be a constant-time operation. Nothing I could implement would even come close to that.
Reversing in-place, while still carrying a penalty, is less bad, for two reasons:
- The number of loop iterations needed to reverse an array in-place is only half of what a reverse copy would require. In line with what I mentioned above about `copyByteArrayToAddr`'s performance, we actually win as a result.
- The implementation of an in-place reversal, especially when accelerated with loop sectioning, is much less confusing than that of a reverse copy: reverse copies require 'reading behind yourself', which is written weirdly, and if you want to loop-section, you have to worry about byte order as well.
Unsurprisingly, I found reversing in-place was slightly faster, or at least no worse at the upper end of what I benchmarked with. When compared with the 'little endian' case, it's about a factor of 3 overhead at worst.
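The in-place approach can be sketched as follows. This is a simplified illustration, not the PR's `reverseBuffer`: it omits the loop sectioning mentioned above and just shows the core idea of swapping from both ends, which is where the halved iteration count comes from:

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Simplified sketch of an in-place byte reversal: swap the bytes at
-- both ends and walk the indices toward the middle, so only len/2
-- iterations are needed. The real reverseBuffer additionally uses
-- loop sectioning; this version shows only the scalar loop.
reverseBufferSketch :: Ptr Word8 -> Int -> IO ()
reverseBufferSketch ptr len = go 0 (len - 1)
  where
    go !i !j
      | i >= j = pure ()
      | otherwise = do
          x <- peekElemOff ptr i
          y <- peekElemOff ptr j
          pokeElemOff ptr i y
          pokeElemOff ptr j x
          go (i + 1) (j - 1)

main :: IO ()
main = do
  out <- withArray [1, 2, 3, 4, 5 :: Word8] $ \p -> do
    reverseBufferSketch p 5
    peekArray 5 p
  print out  -- [5,4,3,2,1]
```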
I don't understand all of the details yet, but I'm going to merge this to get it out of the way and update the costs. I'll come back with a pencil and paper and check the logic of the conversions after that.
kwxm
left a comment
I'm going to come back and work through the details, but let's merge it.
@kwxm - if you need any assistance understanding what's going on here, I'm happy to explain.
This optimizes the conversions between `Integer` and `ByteString`, as we can now rely on `Integer` internals, which lets us use direct copying instead of digit-by-digit extraction and reconstruction.

I did not changelog this, as this change is designed to be invisible at the API level. This will require these operations to be re-costed, as this change means we should see an algorithmic improvement (from $\Theta(n^2)$ to $\Theta(n)$).
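To illustrate the asymptotic claim: a digit-by-digit extraction in the style this PR replaces looks roughly like the hypothetical sketch below (not taken from the old implementation). Each `quotRem` on an n-limb `Integer` costs $\Theta(n)$, and it runs $\Theta(n)$ times, hence $\Theta(n^2)$ overall; copying limbs directly out of the `Integer`'s internal representation is instead a single $\Theta(n)$ pass.

```haskell
import Data.Word (Word8)

-- Hypothetical digit-by-digit conversion in the style the PR replaces
-- (non-negative inputs only, for simplicity). Each quotRem over an
-- n-limb Integer is Theta(n), and we perform Theta(n) of them,
-- giving Theta(n^2) overall.
toBytesLE :: Integer -> [Word8]
toBytesLE 0 = []
toBytesLE n = fromIntegral r : toBytesLE q
  where
    (q, r) = n `quotRem` 256

main :: IO ()
main = print (toBytesLE 0x01020304)  -- little-endian: [4,3,2,1]
```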