-
Notifications
You must be signed in to change notification settings - Fork 145
Avoid per-byte loop in cstring{,Utf8} builders #569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
96880aa
to
266d6da
Compare
The emulated CI build failures are spurious/systemic, not related to the PR. If I add a couple of new benchmarks that use somewhat longer string literals in builders: --- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
, benchB' "UTF-8 String" () $ \() -> P.cstringUtf8 "hello world\0"#
, benchB' "String (naive)" "hello world!" fromString
, benchB' "String" () $ \() -> P.cstring "hello world!"#
+ , benchB' "AsciiLit64" () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+ , benchB' "Utf8Lit64" () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
]
, bgroup "Encoding wrappers" The relevant benchmark results (GHC 9.4.5) are:
The baseline master branch run was:
|
Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0. I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass. Your I will take a closer look later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branching logic can potentially be simplified some. Currently we ask:
- Are we done?
- Is there a null to decode?
- Is the output buffer full?
- Are there any non-nulls to copy?
But we can also ask only:
- Is there a null to decode? (If we are done, the answer will be no.)
- Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)
That would mean we perform extra zero-length memcpy
s in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.
9086b60
to
e6cc4a2
Compare
nitpick: could |
Sure. Done. I do hope we won't forget to squash before merging... |
Data/ByteString/Builder.hs
Outdated
@@ -440,18 +442,20 @@ char8 :: Char -> Builder | |||
char8 = P.primFixed P.char8 | |||
|
|||
-- | Char8 encode a 'String'. | |||
{-# INLINE [1] string8 #-} -- phased to allow P.cstring rewrite | |||
{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chessai , @Bodigrim , @clyring A question for the reviewers:
Why is the phase specified here equal to 1
? When I add tests to see whether string8
and stringUtf8
actually benefit from the RULES, I only get improvement when the phase is set to 0
:
--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -255,6 +255,10 @@ ascBuf, utfBuf :: Ptr Word8
ascBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
utfBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+ascStr, utfStr :: String
+ascStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+utfStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+
asclit, utflit :: Ptr Word8 -> Builder
asclit str@(Ptr addr) = BI.ascLiteralCopy str (byteCountLiteral addr)
utflit str@(Ptr addr) = BI.modUtf8LitCopy str (byteCountLiteral addr)
@@ -273,6 +277,8 @@ main = do
, benchB' "String" () $ \() -> asclit (Ptr "hello world!"#)
, benchB' "AsciiLit" () $ \() -> asclit ascBuf
, benchB' "Utf8Lit" () $ \() -> utflit utfBuf
+ , benchB' "strLit" () $ \() -> string8 ascStr
+ , benchB' "utfLit" () $ \() -> stringUtf8 utfStr
]
, bgroup "Encoding wrappers"
With the phase set to 1
as a baseline, testing with phase 0
the bench
report is:
All
Data.ByteString.Builder
Small payload
AsciiLit: OK (2.64s)
243 ns ± 8.5 ns, same as baseline
Utf8Lit: OK (1.45s)
286 ns ± 15 ns, same as baseline
strLit: OK (1.23s)
243 ns ± 19 ns, 51% less than baseline
utfLit: OK (1.38s)
279 ns ± 15 ns, 44% less than baseline
This is with GHC 9.4.5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with GHC 8.10 the phase 1 -> phase 0 delta is:
All
Data.ByteString.Builder
Small payload
AsciiLit: OK (1.00s)
179 ns ± 9.5 ns, 47% less than baseline
Utf8Lit: OK (0.53s)
198 ns ± 15 ns, same as baseline
strLit: OK (0.49s)
187 ns ± 19 ns, 51% less than baseline
utfLit: OK (0.52s)
199 ns ± 20 ns, same as baseline
With GHC 9.2:
All
Data.ByteString.Builder
Small payload
AsciiLit: OK (1.36s)
236 ns ± 22 ns, same as baseline
Utf8Lit: OK (1.36s)
274 ns ± 16 ns, same as baseline
strLit: OK (1.20s)
237 ns ± 14 ns, 69% less than baseline
utfLit: OK (1.36s)
275 ns ± 15 ns, 64% less than baseline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. In principle these rules can fire in phase 2, and I do observe "Rule fired: string8/unpackFoldrCString# (Data.ByteString.Builder)" if I inline ascStr in your example and compile with ghc-9.4.4.
These should probably just be NOINLINE pragmas. primMapList{Fixed,Bounded} are themselves marked inline (to encourage good specialization with the particular BoundedPrim
used) and produce lots of code, but will not actually fuse with good list producers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In HEAD (master), with no changes other than phase [1] -> [0]:
All
Data.ByteString.Builder
Small payload
strLit: OK (1.90s)
688 ns ± 43 ns, 27% less than baseline
utfLit: OK (0.93s)
736 ns ± 62 ns, 31% less than baseline
Which seems to suggest that the original phase control was not helping, indeed simply removing the rules and inlining (GHC 9.2) gives:
All
Data.ByteString.Builder
Small payload
strLit: OK (1.14s)
790 ns ± 52 ns, 16% less than baseline
utfLit: OK (0.99s)
791 ns ± 66 ns, 26% less than baseline
All 2 tests passed (2.18s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the issue be that I'm giving string8
and stringUtf8
named constant strings, rather than inline string constants? That is, first inline the arguments at the call site, and only then inline string8
?
This looks rather fragile. Is there a downside to setting the phase number to 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8#
gets rewritten in terms of unpackFoldrCStringUtf8#
before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:
{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites | |
{-# NOINLINE string8 #-} -- allow literal cstring rewrites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the
unpackCStringUtf8#
gets rewritten in terms ofunpackFoldrCStringUtf8#
before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:
I am beginning to agree. And I don't think this impairs the efficiency of the non-literal input case, where the particular BoundedPrim is already available, and that's all that needs to be optimised, the input data does not have be seen for good code generation. And rewrite RULES fire without inlining the result, so this makes sense. I'll push a commit with NOINLINE
and the additional benchmark variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, there is no explicit rewrite rule for the UTF8 build
case, adding one doesn't seem to make a difference, I think that the build (unpackFoldr ...)
form gets rewritten back when no fusion happens, and then the existing rule fires?
+#if __GLASGOW_HASKELL__ >= 811
+{-# RULES
+"stringUtf8/unpackFoldrCStringUtf8#" forall s.
+ stringUtf8 (build (unpackFoldrCStringUtf8# s)) =
+ modUtf8LitCopy (Ptr s) (byteCountLiteral s)
+ #-}
+#endif
The above is harmless, and can be added, but does not appear to be necessary.
If there's anything further I need to do, please let me know... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been a bit sidetracked the last few weeks, sorry.
How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.
Data/ByteString/Builder/Internal.hs
Outdated
!op' = op0 `plusPtr` (nullFree + 1) | ||
nullAt' <- c_strstr ip' modifiedUtf8NUL | ||
modUtf8_step ip' len' nullAt' k (BufferRange op' ope) | ||
| avail > 0 = do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question, but also avail == 0
should be a very rare case.
@vdukhovni please rebase to trigger updated CI jobs. |
44fdcbc
to
0645428
Compare
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM module naming nitpicking!
@vdukhovni could you possibly address @clyring's questions?
Data/ByteString/Builder/Internal.hs
Outdated
-- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding, | ||
-- which is part of "modified UTF-8" (GHC does not also implement CESU-8). | ||
modifiedUtf8NUL :: CString | ||
modifiedUtf8NUL = Ptr "\xc0\x80"# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modifiedUtf8NUL = Ptr "\xc0\x80"# | |
modUtf8NUL = Ptr "\xc0\x80"# |
Let's keep the prefix consistent.
ping @vdukhovni Do you plan to come back to this patch? Would you like to pass this off to a maintainer? |
It's basically ready, right. There were just some cosmetic issues that perhaps a maintainer could tweak to suite their preference and I can review the result? Does that work? |
0645428
to
01b5f36
Compare
Perhaps I can get this over the line. What remains to be done? |
Data/ByteString/Builder/Internal.hs
Outdated
-- available buffer space. If the string is long enough, we may have asked | ||
-- for less than its full length, filling the buffer with the rest will go | ||
-- into the next builder step. | ||
| avail > nullFree = do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please check with hpc
that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)
This PR is languishing. Where do we go from here? |
Heads-up: I'll probably push some updates and finishing touches to this branch later today or tomorrow. |
The small-builder benchmarks were set up in a terrible way that made using them to investigate performance very difficult. My recent push should hopefully fix that. |
@@ -84,6 +84,8 @@ module Data.ByteString.Builder.Internal ( | |||
-- , sizedChunksInsert | |||
|
|||
, byteStringCopy | |||
, asciiLiteralCopy | |||
, modUtf8LitCopy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, modUtf8LitCopy | |
, modUtf8LiteralCopy |
For consistency with asciiLiteralCopy
(or we might as well chose to use Lit
for both)
Here's what the benchmarks currently look like on my machine with Baseline (3ce0346):
Topic (2603009):
Lots of big changes. Some are expected:
But there's also a nasty surprise:
I also wanted to see how the
|
I have confirmed that reducing the syntactic arity of |
Removing milestone for now. |
cd02c61
to
111456a
Compare
Copy chunks of the input to the output buffer with, up to the shorter of the available buffer space and the "null-free" portion of the remaining string. Actually "null-free" here means not containing any denormalised two-byte encodings starting with 0xC0 (so possibly also other ASCII bytes if the UTF-8 encoding is oddball). This substantially improves performance, with just one "15%" increase that looks like a spurious measurement error (perhaps code layout difference artefact). UTF-8 String (12B): OK 16.7 ns ± 1.3 ns, 60% less than baseline UTF-8 String (64B, one null): OK 22.6 ns ± 1.3 ns, 87% less than baseline UTF-8 String (64B, one null, no shared work): OK 30.1 ns ± 2.6 ns, 83% less than baseline UTF-8 String (64B, half nulls): OK 92.6 ns ± 5.3 ns, 49% less than baseline UTF-8 String (64B, all nulls): OK 76.3 ns ± 4.5 ns, 57% less than baseline UTF-8 String (64B, all nulls, no shared work): OK 82.3 ns ± 5.6 ns, 54% less than baseline ASCII String (12B): OK 6.50 ns ± 326 ps, 76% less than baseline ASCII String (64B): OK 8.03 ns ± 334 ps, 94% less than baseline AsciiLit: OK 8.02 ns ± 648 ps, 94% less than baseline Utf8Lit: OK 21.8 ns ± 1.3 ns, 88% less than baseline strLit: OK 8.90 ns ± 788 ps, 94% less than baseline stringUtf8: OK 22.4 ns ± 1.3 ns, 87% less than baseline strLitInline: OK 8.26 ns ± 676 ps, 94% less than baseline utf8LitInline: OK 23.2 ns ± 1.3 ns, 87% less than baseline foldMap byteStringInsert (10000): OK 46.0 μs ± 4.0 μs, 15% less than baseline --> lazyByteStringHex (10000): OK --> 4.74 μs ± 337 ns, 15% more than baseline foldMap integerDec (small) (10000): OK 205 μs ± 12 μs, 9% less than baseline char8 (10000): OK 2.58 μs ± 234 ns, 30% less than baseline foldMap (left-assoc) (10000): OK 73.2 μs ± 2.9 μs, 54% less than baseline foldMap (right-assoc) (10000): OK 43.0 μs ± 4.2 μs, 65% less than baseline foldMap [manually fused, left-assoc] (10000): OK 81.4 μs ± 5.3 μs, 48% less than baseline foldMap [manually fused, right-assoc] (10000): OK 47.3 μs ± 785 ns, 61% less than baseline
111456a
to
0a7d5c8
Compare
I've rebased this PR and significantly improved its performance. Please look again. The only possibly improvement (if worth it) is to rewrite The Haskell version is however likely not too far from the expected C performance. So no sure this warranted new "cbits". |
Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to
We want native Haskell implementations anyway due to I will take a look at the changes you have pushed tomorrow. I'm not sure what's going on with the OpenBSD job. It superficially looks like tests are failing...
...but we pass |
@clyring I'm pretty sure OpenBSD failure is completely unrelated to this PR, it currently fails across all my projects which have a OpenBSD job. (I suspect the root partition is deliberately very small on those runners and we need a hack similar to haskellari/splitmix@1a7118d, but let's leave it for another day) |
Well, the delay gave me an opportunity to tackle it afresh and come up with a cleaner, more performant design. The core idea is to observe that a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to think more about the awkward middle ground list of 'things modUtf8Lit
attempts to translate' chosen by the current version of this patch.
@@ -839,6 +844,25 @@ wrappedBytesCopyStep bs0 k = | |||
where | |||
outRemaining = ope `minusPtr` op | |||
|
|||
-- | Copy the bytes from a 'BufferRange' into the output stream. | |||
wrappedBufferRangeCopyStep :: BufferRange -- ^ Input 'BufferRange'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering how important the duplication with wrappedBytesCopyStep
is. My guess is that it probably makes the function passed in the 'buffer full' signal one word smaller (the ForeignPtrContents
to be kept alive), and maybe results in a few extra arithmetic operations due to converting back and forth between start/len and start/end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well what we have in this case is a Ptr Word8
rather a byte string, and there is no foreign ptr to keep alive, so it seemed natural to use a corresponding copy loop. But if you feel that this should be done differently, please speak up.
Data/ByteString/Builder/Internal.hs
Outdated
-- strings that are free of embedded (overlong-encoded as the two-byte sequence | ||
-- @0xC0 0x80@) null characters. | ||
-- | ||
-- @since 0.11.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This will need updating (again) before merging.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Changed to 0.13.0.0 for now.
Data/ByteString/Builder/Prim.hs
Outdated
-- | Byte count of null-terminated primitive literal string excluding the | ||
-- terminating null byte. | ||
byteCountLiteral :: Addr# -> Int | ||
byteCountLiteral addr# = | ||
#if HS_cstringLength_AND_FinalPtr_AVAILABLE | ||
I# (cstringLength# addr#) | ||
#else | ||
fromIntegral (pure_strlen (Ptr addr#)) | ||
|
||
foreign import ccall unsafe "string.h strlen" pure_strlen | ||
:: CString -> CSize | ||
#endif | ||
{-# INLINE byteCountLiteral #-} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belongs in D.B.Internal.Type
, right next to unsafePackLiteral
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bench/BenchAll.hs
Outdated
, benchB'_ "AsciiLit" $ asciiLit asciiBuf | ||
, benchB'_ "Utf8Lit" $ utf8Lit utf8Buf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are duplicates of the existing benchmarks "ASCII String (64B)" and "UTF-8 String (64B, one null)", respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped.
Data/ByteString/Builder/Internal.hs
Outdated
if | ch /= 0xC0 -> do | ||
poke op ch | ||
let !cnt = ipe `minusPtr` ip' | ||
!runend <- S.memchr ip' 0xC0 (fromIntegral cnt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!runend <- S.memchr ip' 0xC0 (fromIntegral cnt) | |
!runend <- S.memchr ip' 0xC0 (fromIntegral @Int @CSize cnt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly CSize
is not in scope in this module, and I don't think making it available is worthwhile. I added the explicit @Int
, perhaps that's "progress".
458df41
to
3d72c68
Compare
Matthew Craven review fixes
3d72c68
to
b202ddb
Compare
I've fixed the issues with older GHC compatibility, CI passes. Perhaps close to done now... |
Avoid per-byte loop in cstring{,Utf8} builders
Copy chunks of the input to the output buffer with, up to the shorter
of the available buffer space and the "null-free" portion of the remaining
string. Actually "null-free" here means not containing any denormalised
two-byte encodings starting with 0xC0 (so possibly also other ASCII
bytes if the UTF-8 encoding is oddball).
This substantially improves performance, with just one "15%" increase
that looks like a spurious measurement error (perhaps code layout
difference artefact).