Optimize toByteString and toASCIIBytes #80
| @phadej This PR is waiting for a review. | 
I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.
I don't see a point in complicating it.
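For readers unfamiliar with the allocate-and-poke style described above, here is a minimal sketch of the idea applied to a single `Word32` rather than a whole UUID. The names `hexDigit`, `pokeHex32`, and `word32Hex` are hypothetical and are not taken from the library; only `unsafeCreate` and `pokeByteOff` are real API.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8, Word32)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff)

-- Lowercase ASCII hex digit for a nibble in the range 0..15.
hexDigit :: Word32 -> Word8
hexDigit n
  | n < 10    = 48 + fromIntegral n  -- '0'..'9'
  | otherwise = 87 + fromIntegral n  -- 'a'..'f' (97 - 10)

-- Hypothetical helper: write the 8 hex digits of a Word32,
-- most significant nibble first, starting at byte offset 'off'.
pokeHex32 :: Ptr Word8 -> Int -> Word32 -> IO ()
pokeHex32 p off w = mapM_ pokeNibble [0 .. 7]
  where
    pokeNibble i =
      pokeByteOff p (off + i) (hexDigit ((w `shiftR` (28 - 4 * i)) .&. 0xf))

-- Allocate exactly 8 bytes and fill them in place.
word32Hex :: Word32 -> B.ByteString
word32Hex w = BI.unsafeCreate 8 (\p -> pokeHex32 p 0 w)
```

The buffer size and every write offset are the caller's responsibility here, which is the safety trade-off the rest of this thread weighs against the builder-based version.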
| It is hard to tell what makes the high-level implementation faster. The  Yet, the benchmarks show that a high-level implementation is faster. It might be that it unlocks some GHC optimizations or that bytestring has some other non-trivial optimizations. If we compare only the functions that do not have any extra overhead ( If  There is also more room for speedup:  | 
| If I had to guess, the reason the  | 
| @clyring clarified that the module   | 
| With the recent update this branch has  @phadej Would you give this another look? This is a significant improvement and it reduces the amount of code to maintain in  | 
| 
 @phadej I appreciate that you'd like to keep the code uncomplicated, but benchmarking seems to indicate a solid improvement, so I don't understand the resistance to the improvement. | 
| @iand675 the original patch was more complicated. And my review comment made lykahb improve on it. Your comment is not fair. | 
| It's true that @lykahb made a further improvement, but I don't think it's unfair to state that he had already provided benchmarks showing a solid performance increase before you said that you didn't want to "complicate the code". | 
| where | ||
| (w0, w1, w2, w3) = toWords uuid | ||
| wordFixedPrim :: BBP.FixedPrim (Word32, (Word16, (Word16, (Word16, (Word16, Word32))))) | ||
| wordFixedPrim = BBP.word32HexFixed BBP.>*< | 
Note to self: the word<N>HexFixed primitives call into a C function in bytestring:
char* _hs_bytestring_uint_hex (unsigned int x, char* buf) {
    // write hex representation in reverse order
    char c, *ptr = buf, *next_free;
    do {
        *ptr++ = digits[x & 0xf];
        x >>= 4;
    } while ( x );
    // invert written digits
    next_free = ptr--;
    while(buf < ptr) {
        c      = *ptr;
        *ptr-- = *buf;
        *buf++ = c;
    }
    return next_free;
};
Fascinating that the loop is faster (?) than the unrolled version. Maybe GCC does magic. For "random" UUIDs, there shouldn't be a win in short-circuiting the loop.
No, that's wrong, the Fixed variants just split the number into halves:
-- | Encode a 'Word8' using 2 nibbles (hexadecimal digits).
{-# INLINE word8HexFixed #-}
word8HexFixed :: FixedPrim Word8
word8HexFixed = fixedPrim 2 $ \x op -> do
  enc <- encode8_as_16h lowerTable x
  unalignedWriteU16 enc op
-- | Encode a 'Word16' using 4 nibbles.
{-# INLINE word16HexFixed #-}
word16HexFixed :: FixedPrim Word16
word16HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 8, fromIntegral x))
      >$< pairF word8HexFixed word8HexFixed
-- | Encode a 'Word32' using 8 nibbles.
{-# INLINE word32HexFixed #-}
word32HexFixed :: FixedPrim Word32
word32HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 16, fromIntegral x))
      >$< pairF word16HexFixed word16HexFixed
-- | Encode a 'Word64' using 16 nibbles.
{-# INLINE word64HexFixed #-}
word64HexFixed :: FixedPrim Word64
word64HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 32, fromIntegral x))
      >$< pairF word32HexFixed word32HexFixed
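As a rough illustration of how these fixed-width primitives compose, the sketch below glues `word32HexFixed`, a constant dash, and `word16HexFixed` into one primitive with `>*<` and `>$<`. The names `dash`, `hexPair`, and `render` are hypothetical and chosen only for this example; it is not the code from the PR.

```haskell
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Builder.Prim as BBP
import Data.ByteString.Builder.Prim ((>$<), (>*<))
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Word (Word16, Word32)

-- A primitive that ignores its input and always writes one '-'.
dash :: BBP.FixedPrim ()
dash = const '-' >$< BBP.char8

-- Writes 8 hex digits, a dash, then 4 hex digits: 13 bytes in total.
-- (>*<) is right-associative, so the combined input is nested to the right.
hexPair :: BBP.FixedPrim (Word32, Word16)
hexPair =
  (\(a, b) -> (a, ((), b)))
    >$< (BBP.word32HexFixed >*< dash >*< BBP.word16HexFixed)

-- Run the primitive through a Builder and return the result as a String.
render :: (Word32, Word16) -> String
render = BLC.unpack . BB.toLazyByteString . BBP.primFixed hexPair
```

Because every component has a statically known width, the combined primitive knows its total size up front, so no manual offset arithmetic is needed. The `Fixed` variants also pad with leading zeros, so `render (0xdeadbeef, 0x00ff)` yields `"deadbeef-00ff"`.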
So that's what #80 (comment) was saying
| 
 If we don't understand why code works as it does, does it work? In my opinion it doesn't in the long term. I'm the one maintaining this, so I'm making the judgment calls. This PR has a lot of good in it, but it's not perfect; I'll take it from here. Note to self: benchmark on x86_64 | 
| For the sake of clarity, is your objection about the use  | 
| 
 Is there a trick to reducing these variance values? On an i9-13900K:  | 
This PR leverages the bytestring fixed-length builders to simplify and speed up the conversions. The re-implementation of toASCIIBytes is now more high-level and safe.
I bumped the bytestring dependency lower bound to the version that introduces Data.ByteString.Builder.Prim. It was released in 2012, so that is plenty of backwards compatibility.
Here are benchmarks on MacBook M1 Max:
Before
After