- 
                Notifications
    
You must be signed in to change notification settings  - Fork 144
 
Avoid per-byte loop in cstring{,Utf8} builders #569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
96880aa    to
    266d6da      
    Compare
  
    | 
           The emulated CI build failures are spurious/systemic, not related to the PR. If I add a couple of new benchmarks that use somewhat longer string literals in builders: --- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
         , benchB' "UTF-8 String"  () $ \() -> P.cstringUtf8 "hello world\0"#
         , benchB' "String (naive)" "hello world!" fromString
         , benchB' "String"        () $ \() -> P.cstring "hello world!"#
+        , benchB' "AsciiLit64"   () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+        , benchB' "Utf8Lit64"   () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
         ]
 
       , bgroup "Encoding wrappers"The relevant benchmark results (GHC 9.4.5) are: The baseline master branch run was:  | 
    
| 
           Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0. I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass. Your  I will take a closer look later.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branching logic can potentially be simplified some. Currently we ask:
- Are we done?
 - Is there a null to decode?
 - Is the output buffer full?
 - Are there any non-nulls to copy?
 
But we can also ask only:
- Is there a null to decode? (If we are done, the answer will be no.)
 - Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)
 
That would mean we perform extra zero-length memcpys in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.
9086b60    to
    e6cc4a2      
    Compare
  
    | 
           nitpick: could   | 
    
          
 Sure. Done. I do hope we won't forget to squash before merging...  | 
    
        
          
                Data/ByteString/Builder.hs
              
                Outdated
          
        
      | 
               | 
          ||
| -- | Char8 encode a 'String'. | ||
| {-# INLINE [1] string8 #-} -- phased to allow P.cstring rewrite | ||
| {-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chessai , @Bodigrim , @clyring A question for the reviewers:
Why is the phase specified here equal to 1?  When I add tests to see whether string8 and stringUtf8 actually benefit from the RULES, I only get improvement when the phase is set to 0:
--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -255,6 +255,10 @@ ascBuf, utfBuf :: Ptr Word8
 ascBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
 utfBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
 
+ascStr, utfStr :: String
+ascStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+utfStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+
 asclit, utflit :: Ptr Word8 -> Builder
 asclit str@(Ptr addr) = BI.ascLiteralCopy str (byteCountLiteral addr)
 utflit str@(Ptr addr) = BI.modUtf8LitCopy str (byteCountLiteral addr)
@@ -273,6 +277,8 @@ main = do
         , benchB' "String"        () $ \() -> asclit (Ptr "hello world!"#)
         , benchB' "AsciiLit"      () $ \() -> asclit ascBuf
         , benchB' "Utf8Lit"       () $ \() -> utflit utfBuf
+        , benchB' "strLit"        () $ \() -> string8 ascStr
+        , benchB' "utfLit"        () $ \() -> stringUtf8 utfStr
         ]
 
       , bgroup "Encoding wrappers"With the phase set to 1 as a baseline, testing with phase 0 the bench report is:
All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (2.64s)
        243  ns ± 8.5 ns,       same as baseline
      Utf8Lit:  OK (1.45s)
        286  ns ±  15 ns,       same as baseline
      strLit:   OK (1.23s)
        243  ns ±  19 ns, 51% less than baseline
      utfLit:   OK (1.38s)
        279  ns ±  15 ns, 44% less than baseline
This is with GHC 9.4.5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with GHC 8.10 the phase 1 -> phase 0 delta is:
All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (1.00s)
        179  ns ± 9.5 ns, 47% less than baseline
      Utf8Lit:  OK (0.53s)
        198  ns ±  15 ns,       same as baseline
      strLit:   OK (0.49s)
        187  ns ±  19 ns, 51% less than baseline
      utfLit:   OK (0.52s)
        199  ns ±  20 ns,       same as baseline
With GHC 9.2:
All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (1.36s)
        236  ns ±  22 ns,       same as baseline
      Utf8Lit:  OK (1.36s)
        274  ns ±  16 ns,       same as baseline
      strLit:   OK (1.20s)
        237  ns ±  14 ns, 69% less than baseline
      utfLit:   OK (1.36s)
        275  ns ±  15 ns, 64% less than baseline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. In principle these rules can fire in phase 2, and I do observe "Rule fired: string8/unpackFoldrCString# (Data.ByteString.Builder)" if I inline ascStr in your example and compile with ghc-9.4.4.
These should probably just be NOINLINE pragmas. primMapList{Fixed,Bounded} are themselves marked inline (to encourage good specialization with the particular BoundedPrim used) and produce lots of code, but will not actually fuse with good list producers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In HEAD (master), with no changes other than phase [1] -> [0]:
All
  Data.ByteString.Builder
    Small payload
      strLit: OK (1.90s)
        688  ns ±  43 ns, 27% less than baseline
      utfLit: OK (0.93s)
        736  ns ±  62 ns, 31% less than baseline
Which seems to suggest that the original phase control was not helping, indeed simply removing the rules and inlining (GHC 9.2) gives:
All
  Data.ByteString.Builder
    Small payload
      strLit: OK (1.14s)
        790  ns ±  52 ns, 16% less than baseline
      utfLit: OK (0.99s)
        791  ns ±  66 ns, 26% less than baseline
All 2 tests passed (2.18s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the issue be that I'm giving string8 and stringUtf8 named constant strings, rather than inline string constants? That is, first inline the arguments at the call site, and only then inline string8?
This looks rather fragile. Is there a downside to setting the phase number to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8# gets rewritten in terms of unpackFoldrCStringUtf8# before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:
| {-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites | |
| {-# NOINLINE string8 #-} -- allow literal cstring rewrites | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the
unpackCStringUtf8#gets rewritten in terms ofunpackFoldrCStringUtf8#before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:
I am beginning to agree.  And I don't think this impairs the efficiency of the non-literal input case, where the particular BoundedPrim is already available, and that's all that needs to be optimised, the input data does not have be seen for good code generation.  And rewrite RULES fire without inlining the result, so this makes sense. I'll push a commit with NOINLINE and the additional benchmark variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, there is no explicit rewrite rule for the UTF8 build case, adding one doesn't seem to make a difference, I think that the build (unpackFoldr ...) form gets rewritten back when no fusion happens, and then the existing rule fires?
+#if __GLASGOW_HASKELL__ >= 811
+{-# RULES
+"stringUtf8/unpackFoldrCStringUtf8#" forall s.
+  stringUtf8 (build (unpackFoldrCStringUtf8# s)) =
+    modUtf8LitCopy (Ptr s) (byteCountLiteral s)
+ #-}
+#endifThe above is harmless, and can be added, but does not appear to be necessary.
| 
           If there's anything further I need to do, please let me know...  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been a bit sidetracked the last few weeks, sorry.
How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.
        
          
                Data/ByteString/Builder/Internal.hs
              
                Outdated
          
        
      | !op' = op0 `plusPtr` (nullFree + 1) | ||
| nullAt' <- c_strstr ip' modifiedUtf8NUL | ||
| modUtf8_step ip' len' nullAt' k (BufferRange op' ope) | ||
| | avail > 0 = do | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question, but also avail == 0 should be a very rare case.
| 
           @vdukhovni please rebase to trigger updated CI jobs.  | 
    
44fdcbc    to
    0645428      
    Compare
  
    
          
 Done.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM module naming nitpicking!
@vdukhovni could you possibly address @clyring's questions?
        
          
                Data/ByteString/Builder/Internal.hs
              
                Outdated
          
        
      | -- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding, | ||
| -- which is part of "modified UTF-8" (GHC does not also implement CESU-8). | ||
| modifiedUtf8NUL :: CString | ||
| modifiedUtf8NUL = Ptr "\xc0\x80"# | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| modifiedUtf8NUL = Ptr "\xc0\x80"# | |
| modUtf8NUL = Ptr "\xc0\x80"# | 
Let's keep the prefix consistent.
| 
           ping @vdukhovni Do you plan to come back to this patch? Would you like to pass this off to a maintainer?  | 
    
          
 It's basically ready, right. There were just some cosmetic issues that perhaps a maintainer could tweak to suite their preference and I can review the result? Does that work?  | 
    
0645428    to
    01b5f36      
    Compare
  
    | 
           Perhaps I can get this over the line. What remains to be done?  | 
    
        
          
                Data/ByteString/Builder/Internal.hs
              
                Outdated
          
        
      | -- available buffer space. If the string is long enough, we may have asked | ||
| -- for less than its full length, filling the buffer with the rest will go | ||
| -- into the next builder step. | ||
| | avail > nullFree = do | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please check with hpc that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)
| 
           This PR is languishing. Where do we go from here?  | 
    
| 
           Removing milestone for now.  | 
    
111456a    to
    0a7d5c8      
    Compare
  
    | 
           I've rebased this PR and significantly improved its performance. Please look again. The only possibly improvement (if worth it) is to rewrite  The Haskell version is however likely not too far from the expected C performance. So no sure this warranted new "cbits".  | 
    
| 
           Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to  
 We want native Haskell implementations anyway due to  I will take a look at the changes you have pushed tomorrow. I'm not sure what's going on with the OpenBSD job. It superficially looks like tests are failing... ...but we pass   | 
    
| 
           @clyring I'm pretty sure OpenBSD failure is completely unrelated to this PR, it currently fails across all my projects which have a OpenBSD job. (I suspect the root partition is deliberately very small on those runners and we need a hack similar to haskellari/splitmix@1a7118d, but let's leave it for another day)  | 
    
          
 Well, the delay gave me an opportunity to tackle it afresh and come up with a cleaner, more performant design. The core idea is to observe that a   | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to think more about the awkward middle ground list of 'things modUtf8Lit attempts to translate' chosen by the current version of this patch.
        
          
                Data/ByteString/Builder/Internal.hs
              
                Outdated
          
        
      | if | ch /= 0xC0 -> do | ||
| poke op ch | ||
| let !cnt = ipe `minusPtr` ip' | ||
| !runend <- S.memchr ip' 0xC0 (fromIntegral cnt) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| !runend <- S.memchr ip' 0xC0 (fromIntegral cnt) | |
| !runend <- S.memchr ip' 0xC0 (fromIntegral @Int @CSize cnt) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly CSize is not in scope in this module, and I don't think making it available is worthwhile. I added the explicit @Int, perhaps that's "progress".
3d72c68    to
    b202ddb      
    Compare
  
    | 
           I've fixed the issues with older GHC compatibility, CI passes. Perhaps close to done now...  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| 
           (I checked manually that this branch passes new tests from #714 if rebased) If there are no more comments / suggestions by October 12, I'll merge it as is.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ASCII side of things is perfect.
On the UTF-8 side of things I am not 100% sold on the specification yet:
- The current version of this patch changes the observable behavior of 
cstringUtf8on (admittedly very sketchy) inputs with0xC0not followed by0x80.- This isn't necessarily a major problem: I'd be surprised if there is any actual breakage, and wouldn't be surprised if literally nobody directly uses the current 
cstringUtf8. But it might make the backporting and deprecation story a bit messier. - How much of a performance penalty is there for matching the old behavior exactly? And if we don't care to match that old behavior exactly, would always ignoring the next input byte after 
0xC0be a further performance improvement? 
 - This isn't necessarily a major problem: I'd be surprised if there is any actual breakage, and wouldn't be surprised if literally nobody directly uses the current 
 - The demand that 
modUtf8LitCopybe given a null-terminated buffer (for safety if0xC0is the last byte) and its length not including that null terminator makes for a very weird interface! 
| testCStringUtf8 :: Int -> TestTree | ||
| testCStringUtf8 sz = testProperty "cstringUtf8" $ | ||
| BE.toLazyByteStringWith (BE.untrimmedStrategy sz sz) L.empty | ||
| (BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0"#) == | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0"#) == | |
| (BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0\x80"#) == | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it is best to test that the code does not blow up with input that unexpectedly ends with just \xC0 before the raw (implicit) NUL terminator. So this is more of a robustness test, that then also encodes the implemented handling, than a promise to users that this is how that's handled.
        
          
                Data/ByteString/Builder/Internal.hs
              
                Outdated
          
        
      | 
               | 
          ||
| , byteStringCopy | ||
| , asciiLiteralCopy | ||
| , modUtf8LitCopy | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the earlier modUtf8LitCopy -> modUtf8LiteralCopy naming suggestion.
Copy chunks of the input to the output buffer with, up to the shorter
of the available buffer space and the "null-free" portion of the remaining
string.  Actually "null-free" here means not containing any denormalised
two-byte encodings starting with 0xC0 (so possibly also other ASCII
bytes if the UTF-8 encoding is oddball).
This substantially improves performance, with just one "15%" increase
that looks like a spurious measurement error (perhaps code layout
difference artefact).
      UTF-8 String (12B):                            OK
        16.7 ns ± 1.3 ns, 60% less than baseline
      UTF-8 String (64B, one null):                  OK
        22.6 ns ± 1.3 ns, 87% less than baseline
      UTF-8 String (64B, one null, no shared work):  OK
        30.1 ns ± 2.6 ns, 83% less than baseline
      UTF-8 String (64B, half nulls):                OK
        92.6 ns ± 5.3 ns, 49% less than baseline
      UTF-8 String (64B, all nulls):                 OK
        76.3 ns ± 4.5 ns, 57% less than baseline
      UTF-8 String (64B, all nulls, no shared work): OK
        82.3 ns ± 5.6 ns, 54% less than baseline
      ASCII String (12B):                            OK
        6.50 ns ± 326 ps, 76% less than baseline
      ASCII String (64B):                            OK
        8.03 ns ± 334 ps, 94% less than baseline
      AsciiLit:                                      OK
        8.02 ns ± 648 ps, 94% less than baseline
      Utf8Lit:                                       OK
        21.8 ns ± 1.3 ns, 88% less than baseline
      strLit:                                        OK
        8.90 ns ± 788 ps, 94% less than baseline
      stringUtf8:                                    OK
        22.4 ns ± 1.3 ns, 87% less than baseline
      strLitInline:                                  OK
        8.26 ns ± 676 ps, 94% less than baseline
      utf8LitInline:                                 OK
        23.2 ns ± 1.3 ns, 87% less than baseline
      foldMap byteStringInsert (10000):              OK
        46.0 μs ± 4.0 μs, 15% less than baseline
-->   lazyByteStringHex (10000):                     OK
-->     4.74 μs ± 337 ns, 15% more than baseline
      foldMap integerDec (small) (10000):            OK
        205  μs ±  12 μs,  9% less than baseline
    char8 (10000):                                   OK
      2.58 μs ± 234 ns, 30% less than baseline
      foldMap (left-assoc) (10000):                  OK
        73.2 μs ± 2.9 μs, 54% less than baseline
      foldMap (right-assoc) (10000):                 OK
        43.0 μs ± 4.2 μs, 65% less than baseline
      foldMap [manually fused, left-assoc] (10000):  OK
        81.4 μs ± 5.3 μs, 48% less than baseline
      foldMap [manually fused, right-assoc] (10000): OK
        47.3 μs ± 785 ns, 61% less than baseline
    
          
 Thanks. 
 Such overlong encodings, of which  What is perhaps a bit more bold is that I ignore the top two bits, because the input is an  My take is that after  The original code would have retained any  
 Only invalid UTF8 would result in possibly surprising behaviour, but not more surprising than the current code. This code implements the  
 It would definitely be noticeably costlier for inputs with many (overlong encoded) NULs. And what would you then do with other overlong inputs? 
 We can't "ignore" the next byte by producing no output, that's the byte that's encoding the NUL as  
 The requirement for a final (unencoded) NUL is a result of the input being just a raw  The main idea is to no throw errors, this code will produce valid output for valid "modified UTF8", and something closely related to the input otherwise, never throwing any errors (just as before).  | 
    
Review feedback
b202ddb    to
    833ed24      
    Compare
  
    | 
           Is there something further I need to do here.  | 
    
Avoid per-byte loop in cstring{,Utf8} builders
Copy chunks of the input to the output buffer with, up to the shorter
of the available buffer space and the "null-free" portion of the remaining
string. Actually "null-free" here means not containing any denormalised
two-byte encodings starting with 0xC0 (so possibly also other ASCII
bytes if the UTF-8 encoding is oddball).
This substantially improves performance, with just one "15%" increase
that looks like a spurious measurement error (perhaps code layout
difference artefact).