vdukhovni
Contributor

@vdukhovni vdukhovni commented Jan 13, 2023

Avoid per-byte loop in cstring{,Utf8} builders

Copy chunks of the input to the output buffer, each as long as the shorter
of the available buffer space and the "null-free" portion of the remaining
string. Here "null-free" actually means not containing any denormalised
two-byte encodings starting with 0xC0 (so possibly also other ASCII
bytes if the UTF-8 encoding is oddball).

This substantially improves performance, with just one "15%" increase
that looks like a spurious measurement error (perhaps an artefact of
code layout differences).

      UTF-8 String (12B):                            OK
        16.7 ns ± 1.3 ns, 60% less than baseline
      UTF-8 String (64B, one null):                  OK
        22.6 ns ± 1.3 ns, 87% less than baseline
      UTF-8 String (64B, one null, no shared work):  OK
        30.1 ns ± 2.6 ns, 83% less than baseline
      UTF-8 String (64B, half nulls):                OK
        92.6 ns ± 5.3 ns, 49% less than baseline
      UTF-8 String (64B, all nulls):                 OK
        76.3 ns ± 4.5 ns, 57% less than baseline
      UTF-8 String (64B, all nulls, no shared work): OK
        82.3 ns ± 5.6 ns, 54% less than baseline
      ASCII String (12B):                            OK
        6.50 ns ± 326 ps, 76% less than baseline
      ASCII String (64B):                            OK
        8.03 ns ± 334 ps, 94% less than baseline
      AsciiLit:                                      OK
        8.02 ns ± 648 ps, 94% less than baseline
      Utf8Lit:                                       OK
        21.8 ns ± 1.3 ns, 88% less than baseline
      strLit:                                        OK
        8.90 ns ± 788 ps, 94% less than baseline
      stringUtf8:                                    OK
        22.4 ns ± 1.3 ns, 87% less than baseline
      strLitInline:                                  OK
        8.26 ns ± 676 ps, 94% less than baseline
      utf8LitInline:                                 OK
        23.2 ns ± 1.3 ns, 87% less than baseline
      foldMap byteStringInsert (10000):              OK
        46.0 μs ± 4.0 μs, 15% less than baseline
-->   lazyByteStringHex (10000):                     OK
-->     4.74 μs ± 337 ns, 15% more than baseline
      foldMap integerDec (small) (10000):            OK
        205  μs ±  12 μs,  9% less than baseline
    char8 (10000):                                   OK
      2.58 μs ± 234 ns, 30% less than baseline
      foldMap (left-assoc) (10000):                  OK
        73.2 μs ± 2.9 μs, 54% less than baseline
      foldMap (right-assoc) (10000):                 OK
        43.0 μs ± 4.2 μs, 65% less than baseline
      foldMap [manually fused, left-assoc] (10000):  OK
        81.4 μs ± 5.3 μs, 48% less than baseline
      foldMap [manually fused, right-assoc] (10000): OK
        47.3 μs ± 785 ns, 61% less than baseline

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch 2 times, most recently from 96880aa to 266d6da Compare January 13, 2023 04:30
@vdukhovni
Contributor Author

The emulated CI build failures are spurious/systemic, not related to the PR.

If I add a couple of new benchmarks that use somewhat longer string literals in builders:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
         , benchB' "UTF-8 String"  () $ \() -> P.cstringUtf8 "hello world\0"#
         , benchB' "String (naive)" "hello world!" fromString
         , benchB' "String"        () $ \() -> P.cstring "hello world!"#
+        , benchB' "AsciiLit64"   () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+        , benchB' "Utf8Lit64"   () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
         ]
 
       , bgroup "Encoding wrappers"

The relevant benchmark results (GHC 9.4.5) are:

$ cabal run bytestring-bench -- --baseline baseline-lit-9.4.csv --csv new-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.43s)
        278  ns ±  19 ns, 66% less than baseline
      Utf8Lit64:  OK (1.72s)
        356  ns ±  23 ns, 58% less than baseline

All 2 tests passed (3.19s)

The baseline master branch run was:

$ cabal run bytestring-bench -- --csv baseline-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.07s)
        832  ns ±  79 ns
      Utf8Lit64:  OK (1.06s)
        846  ns ±  75 ns

All 2 tests passed (2.16s)

@clyring
Member

clyring commented Jan 13, 2023

Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0.

I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass.

Your cstring_step does more or less the same thing as byteStringCopyStep in Builder.Internal.

I will take a closer look later.

Member

@clyring clyring left a comment

The branching logic can potentially be simplified some. Currently we ask:

  1. Are we done?
  2. Is there a null to decode?
  3. Is the output buffer full?
  4. Are there any non-nulls to copy?

But we can also ask only:

  1. Is there a null to decode? (If we are done, the answer will be no.)
  2. Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)

That would mean we perform extra zero-length memcpys in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.
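
The two-question variant can be sketched as a pure model over strict ByteStrings. This is only an illustration, not the PR's step function: the names `encodedNul`, `fillBuffer` and `decodeWith` are hypothetical, the real code works on raw pointers inside a builder step, and the sketch assumes an encoded null is exactly the pair 0xC0 0x80.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- | Overlong two-byte encoding of NUL (assumed to be exactly 0xC0 0x80).
encodedNul :: ByteString
encodedNul = B.pack [0xC0, 0x80]

-- | Model of one buffer fill (hypothetical helper). Given the free space
-- and the remaining input, return the bytes written and the input left.
fillBuffer :: Int -> ByteString -> (ByteString, ByteString)
fillBuffer avail input
  -- Question 1: is there a null to decode? If not (which also covers
  -- "we are done"), copy as much of the rest as fits.
  | B.null rest = B.splitAt avail run
  -- Question 2: does the run plus the decoded null fit? If so, emit both
  -- and keep going with the space that is left.
  | B.length run + 1 <= avail =
      let (out, left) = fillBuffer (avail - B.length run - 1) (B.drop 2 rest)
      in (run <> B.singleton 0 <> out, left)
  -- Otherwise copy what fits and report a full buffer; the encoded null
  -- stays in the input for the next buffer.
  | otherwise =
      let (out, left) = B.splitAt avail run
      in (out, left <> rest)
  where
    (run, rest) = B.breakSubstring encodedNul input

-- | Drive 'fillBuffer' with fresh buffers of a fixed size (at least 1).
decodeWith :: Int -> ByteString -> ByteString
decodeWith bufSize = B.concat . go
  where
    size = max 1 bufSize
    go input
      | B.null input = []
      | otherwise = let (out, left) = fillBuffer size input in out : go left
```

With consecutive encoded nulls the `run` is empty each time, which corresponds to the zero-length copies mentioned above.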

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch 3 times, most recently from 9086b60 to e6cc4a2 Compare January 14, 2023 10:42
@chessai
Member

chessai commented Jan 15, 2023

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

@vdukhovni
Contributor Author

> nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

Sure. Done. I do hope we won't forget to squash before merging...

@@ -440,18 +442,20 @@ char8 :: Char -> Builder
char8 = P.primFixed P.char8

-- | Char8 encode a 'String'.
-{-# INLINE [1] string8 #-} -- phased to allow P.cstring rewrite
+{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites

Contributor Author

@chessai , @Bodigrim , @clyring A question for the reviewers:

Why is the phase specified here equal to 1? When I add tests to see whether string8 and stringUtf8 actually benefit from the RULES, I only get improvement when the phase is set to 0:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -255,6 +255,10 @@ ascBuf, utfBuf :: Ptr Word8
 ascBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
 utfBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
 
+ascStr, utfStr :: String
+ascStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+utfStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+
 asclit, utflit :: Ptr Word8 -> Builder
 asclit str@(Ptr addr) = BI.ascLiteralCopy str (byteCountLiteral addr)
 utflit str@(Ptr addr) = BI.modUtf8LitCopy str (byteCountLiteral addr)
@@ -273,6 +277,8 @@ main = do
         , benchB' "String"        () $ \() -> asclit (Ptr "hello world!"#)
         , benchB' "AsciiLit"      () $ \() -> asclit ascBuf
         , benchB' "Utf8Lit"       () $ \() -> utflit utfBuf
+        , benchB' "strLit"        () $ \() -> string8 ascStr
+        , benchB' "utfLit"        () $ \() -> stringUtf8 utfStr
         ]
 
       , bgroup "Encoding wrappers"

With the phase set to 1 as a baseline, testing with phase 0 the bench report is:

All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (2.64s)
        243  ns ± 8.5 ns,       same as baseline
      Utf8Lit:  OK (1.45s)
        286  ns ±  15 ns,       same as baseline
      strLit:   OK (1.23s)
        243  ns ±  19 ns, 51% less than baseline
      utfLit:   OK (1.38s)
        279  ns ±  15 ns, 44% less than baseline

This is with GHC 9.4.5.

Contributor Author

Testing with GHC 8.10 the phase 1 -> phase 0 delta is:

All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (1.00s)
        179  ns ± 9.5 ns, 47% less than baseline
      Utf8Lit:  OK (0.53s)
        198  ns ±  15 ns,       same as baseline
      strLit:   OK (0.49s)
        187  ns ±  19 ns, 51% less than baseline
      utfLit:   OK (0.52s)
        199  ns ±  20 ns,       same as baseline

With GHC 9.2:

All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (1.36s)
        236  ns ±  22 ns,       same as baseline
      Utf8Lit:  OK (1.36s)
        274  ns ±  16 ns,       same as baseline
      strLit:   OK (1.20s)
        237  ns ±  14 ns, 69% less than baseline
      utfLit:   OK (1.36s)
        275  ns ±  15 ns, 64% less than baseline

Member

I'm not sure. In principle these rules can fire in phase 2, and I do observe "Rule fired: string8/unpackFoldrCString# (Data.ByteString.Builder)" if I inline ascStr in your example and compile with ghc-9.4.4.

These should probably just be NOINLINE pragmas. primMapList{Fixed,Bounded} are themselves marked inline (to encourage good specialization with the particular BoundedPrim used) and produce lots of code, but will not actually fuse with good list producers.
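
The NOINLINE-plus-RULES pattern can be sketched in isolation as follows. This is a toy under stated assumptions, not bytestring's code: `byteLength` and `literalLength` are hypothetical names standing in for `string8` and the literal copy path, and the rule only targets literals that GHC desugars via `unpackCString#`. Both paths return the same result, so the program behaves identically whether or not the rule fires (for example at -O0, where it does not).

```haskell
{-# LANGUAGE MagicHash #-}

import Data.Word (Word8)
import Foreign.Storable (peekByteOff)
import GHC.Exts (Addr#, Ptr (..), unpackCString#)
import System.IO.Unsafe (unsafePerformIO)

-- | General path: works for any 'String'. NOINLINE keeps calls to
-- 'byteLength' visible to the simplifier so the rule below can match;
-- once a call is inlined there is nothing left for the rule to match.
byteLength :: String -> Int
byteLength = length
{-# NOINLINE byteLength #-}

-- | Literal path (hypothetical): count the bytes of a primitive string
-- literal up to its terminating NUL. Literal memory is immutable, so
-- reading it via unsafePerformIO is safe in this sketch.
literalLength :: Addr# -> Int
literalLength addr = unsafePerformIO (go 0)
  where
    go :: Int -> IO Int
    go n = do
      b <- peekByteOff (Ptr addr) n :: IO Word8
      if b == 0 then pure n else go (n + 1)

-- Rewrite calls on plain ASCII literals to the literal path.
{-# RULES
"byteLength/unpackCString#" forall s.
  byteLength (unpackCString# s) = literalLength s
  #-}
```

Compiling with `-O -ddump-rule-firings` shows whether the rule fired for a given call site.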

Contributor Author

In HEAD (master), with no changes other than phase [1] -> [0]:

All
  Data.ByteString.Builder
    Small payload
      strLit: OK (1.90s)
        688  ns ±  43 ns, 27% less than baseline
      utfLit: OK (0.93s)
        736  ns ±  62 ns, 31% less than baseline

This seems to suggest that the original phase control was not helping. Indeed, simply removing the rules and inlining (GHC 9.2) gives:

All
  Data.ByteString.Builder
    Small payload
      strLit: OK (1.14s)
        790  ns ±  52 ns, 16% less than baseline
      utfLit: OK (0.99s)
        791  ns ±  66 ns, 26% less than baseline

All 2 tests passed (2.18s)

Contributor Author

Could the issue be that I'm giving string8 and stringUtf8 named constant strings, rather than inline string constants? That is, first inline the arguments at the call site, and only then inline string8?

This looks rather fragile. Is there a downside to setting the phase number to 0?

Member

It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8# gets rewritten in terms of unpackFoldrCStringUtf8# before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:

Suggested change
-{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites
+{-# NOINLINE string8 #-} -- allow literal cstring rewrites

Contributor Author

> It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8# gets rewritten in terms of unpackFoldrCStringUtf8# before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:

I am beginning to agree. And I don't think this impairs the efficiency of the non-literal input case, where the particular BoundedPrim is already available, and that's all that needs to be optimised; the input data does not have to be seen for good code generation. And rewrite RULES fire without inlining the result, so this makes sense. I'll push a commit with NOINLINE and the additional benchmark variants.

Contributor Author

By the way, there is no explicit rewrite rule for the UTF8 build case, and adding one doesn't seem to make a difference. I think that the build (unpackFoldr ...) form gets rewritten back when no fusion happens, and then the existing rule fires?

+#if __GLASGOW_HASKELL__ >= 811
+{-# RULES
+"stringUtf8/unpackFoldrCStringUtf8#" forall s.
+  stringUtf8 (build (unpackFoldrCStringUtf8# s)) =
+    modUtf8LitCopy (Ptr s) (byteCountLiteral s)
+ #-}
+#endif

The above is harmless, and can be added, but does not appear to be necessary.

@vdukhovni
Contributor Author

If there's anything further I need to do, please let me know...

Member

@clyring clyring left a comment

I've been a bit sidetracked the last few weeks, sorry.

How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.

!op' = op0 `plusPtr` (nullFree + 1)
nullAt' <- c_strstr ip' modifiedUtf8NUL
modUtf8_step ip' len' nullAt' k (BufferRange op' ope)
| avail > 0 = do

Member

Same question, but also avail == 0 should be a very rare case.

@Bodigrim
Contributor

Bodigrim commented Feb 8, 2023

@vdukhovni please rebase to trigger updated CI jobs.

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch from 44fdcbc to 0645428 Compare February 9, 2023 04:45
@vdukhovni
Contributor Author

> @vdukhovni please rebase to trigger updated CI jobs.

Done.

Contributor

@Bodigrim Bodigrim left a comment

LGTM modulo naming nitpicking!

@vdukhovni could you possibly address @clyring's questions?

-- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding,
-- which is part of "modified UTF-8" (GHC does not also implement CESU-8).
modifiedUtf8NUL :: CString
modifiedUtf8NUL = Ptr "\xc0\x80"#

Contributor

Suggested change
-modifiedUtf8NUL = Ptr "\xc0\x80"#
+modUtf8NUL = Ptr "\xc0\x80"#

Let's keep the prefix consistent.

@clyring clyring modified the milestones: 0.11.5.0, 0.12.1.0 Jul 6, 2023
@clyring
Copy link
Member

clyring commented Sep 27, 2023

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

@vdukhovni
Contributor Author

> ping @vdukhovni
>
> Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

It's basically ready, right. There were just some cosmetic issues that perhaps a maintainer could tweak to suit their preference, and I can review the result? Does that work?

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch from 0645428 to 01b5f36 Compare October 9, 2023 20:57
@vdukhovni
Contributor Author

Perhaps I can get this over the line. What remains to be done?

-- available buffer space. If the string is long enough, we may have asked
-- for less than its full length, filling the buffer with the rest will go
-- into the next builder step.
| avail > nullFree = do

Contributor

Could you please check with hpc that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)

@vdukhovni
Contributor Author

This PR is languishing. Where do we go from here?

@clyring
Member

clyring commented Feb 11, 2024

Heads-up: I'll probably push some updates and finishing touches to this branch later today or tomorrow.

@clyring
Member

clyring commented Feb 15, 2024

The small-builder benchmarks were set up in a terrible way that made using them to investigate performance very difficult. My recent push should hopefully fix that.

@@ -84,6 +84,8 @@ module Data.ByteString.Builder.Internal (
-- , sizedChunksInsert

, byteStringCopy
, asciiLiteralCopy
, modUtf8LitCopy

Contributor

Suggested change
-, modUtf8LitCopy
+, modUtf8LiteralCopy

For consistency with asciiLiteralCopy (or we might as well choose to use Lit for both)

@clyring
Member

clyring commented Feb 15, 2024

Here's what the benchmarks currently look like on my machine with ghc-9.8.1:

Baseline (3ce0346):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.7 ns ± 810 ps
      toLazyByteString mempty:        OK
        452  ns ±  24 ns
      empty (10000 times):            OK
        126  μs ± 4.0 μs
      ensureFree 8:                   OK
        16.6 ns ± 624 ps
      intHost 1:                      OK
        25.7 ns ± 1.2 ns
      UTF-8 String (12B, naive):      OK
        101  ns ± 1.6 ns
      UTF-8 String (12B):             OK
        104  ns ±  35 ns
      UTF-8 String (64B, naive):      OK
        311  ns ±  16 ns
      UTF-8 String (64B):             OK
        356  ns ± 9.7 ns
      UTF-8 String (64B, half nulls): OK
        501  ns ±  15 ns
      UTF-8 String (64B, all nulls):  OK
        335  ns ±  12 ns
      String (12B, naive):            OK
        122  ns ± 2.5 ns
      String (12B):                   OK
        86.8 ns ± 3.1 ns
      String (64B, naive):            OK
        327  ns ±  17 ns
      String (64B):                   OK
        279  ns ±  11 ns

Topic (2603009):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.5 ns ± 830 ps,       same as baseline
      toLazyByteString mempty:        OK
        452  ns ±  27 ns,       same as baseline
      empty (10000 times):            OK
        133  μs ± 5.8 μs,  5% more than baseline
      ensureFree 8:                   OK
        16.9 ns ± 844 ps,       same as baseline
      intHost 1:                      OK
        25.5 ns ± 818 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        489  ns ±  29 ns, 383% more than baseline
      UTF-8 String (12B):             OK
        61.2 ns ± 1.5 ns, 41% less than baseline
      UTF-8 String (64B, naive):      OK
        2.36 μs ±  82 ns, 657% more than baseline
      UTF-8 String (64B):             OK
        61.4 ns ± 3.0 ns, 82% less than baseline
      UTF-8 String (64B, half nulls): OK
        563  ns ±  21 ns, 12% more than baseline
      UTF-8 String (64B, all nulls):  OK
        765  ns ±  24 ns, 128% more than baseline
      String (12B, naive):            OK
        499  ns ±  12 ns, 310% more than baseline
      String (12B):                   OK
        24.7 ns ± 3.3 ns, 71% less than baseline
      String (64B, naive):            OK
        2.37 μs ± 108 ns, 623% more than baseline
      String (64B):                   OK
        23.3 ns ± 1.3 ns, 91% less than baseline

Lots of big changes. Some are expected:

  • The ASCII memcpy implementation is much faster except perhaps on trivially small strings: ~70% less run-time on "hello world!" and ~90% less run-time on the 64-byte case.
  • For modified UTF-8 strings without many embedded nulls, the new implementation is much faster. But it suffers when there are many embedded nulls, roughly breaking even at half-nulls and taking ~2.3x as long when there are only nulls. (Perhaps it would make sense to directly check if the first byte is 0xC0 before calling the C search function, to reduce this regression's magnitude a little.)

But there's also a nasty surprise:

  • The benchmarks for the "naive" case (where rewriting to the CString/Addr# versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new {-# NOINLINE stringUtf8 #-} the primMapListBounded in stringUtf8 only gets one argument and primMapListBounded needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of primMapListBounded and see if that fixes this.

I also wanted to see how the memchr implementation compares with the strstr implementation. It seems they're about the same. Here are the memchr numbers, with strstr as the "baseline":

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.4 ns ± 456 ps,       same as baseline
      toLazyByteString mempty:        OK
        445  ns ±  21 ns,       same as baseline
      empty (10000 times):            OK
        124  μs ± 6.3 μs,  6% less than baseline
      ensureFree 8:                   OK
        17.7 ns ± 658 ps,       same as baseline
      intHost 1:                      OK
        25.8 ns ± 484 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        490  ns ±  11 ns,       same as baseline
      UTF-8 String (12B):             OK
        62.7 ns ± 1.6 ns,       same as baseline
      UTF-8 String (64B, naive):      OK
        2.40 μs ±  84 ns,       same as baseline
      UTF-8 String (64B):             OK
        64.4 ns ± 1.7 ns,       same as baseline
      UTF-8 String (64B, half nulls): OK
        616  ns ±  24 ns,  9% more than baseline
      UTF-8 String (64B, all nulls):  OK
        839  ns ± 349 ns,       same as baseline
      String (12B, naive):            OK
        500  ns ±  16 ns,       same as baseline
      String (12B):                   OK
        26.3 ns ± 356 ps,       same as baseline
      String (64B, naive):            OK
        2.28 μs ±  68 ns,       same as baseline
      String (64B):                   OK
        22.1 ns ± 752 ps,       same as baseline

@clyring
Member

clyring commented Feb 15, 2024

> * The benchmarks for the "naive" case (where rewriting to the `CString`/`Addr#` versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new `{-# NOINLINE stringUtf8 #-}` the `primMapListBounded` in `stringUtf8` only gets one argument and `primMapListBounded` needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of `primMapListBounded` and see if that fixes this.

I have confirmed that reducing the syntactic arity of primMapListBounded fixes this regression.
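
For readers unfamiliar with the arity point: GHC only uses an INLINE unfolding at call sites that supply at least as many arguments as appear to the left of the `=` in the definition. The sketch below illustrates this with hypothetical stand-ins (`mapBoth`, `mapOne` and the two `shout` wrappers are not from bytestring); it mirrors a NOINLINE wrapper that passes just one argument to an INLINE helper.

```haskell
import Data.Char (toUpper)

-- | Syntactic arity 2: the INLINE unfolding is only used when a call
-- site supplies both 'f' and 'xs'.
mapBoth :: (a -> b) -> [a] -> [b]
mapBoth f xs = map f xs
{-# INLINE mapBoth #-}

-- | Syntactic arity 1: same function, but a call that supplies only 'f'
-- already counts as saturated for inlining purposes.
mapOne :: (a -> b) -> [a] -> [b]
mapOne f = \xs -> map f xs
{-# INLINE mapOne #-}

-- | Passes one argument to an arity-2 helper, so the helper may be left
-- as an ordinary, unspecialised call inside this wrapper.
shoutUnsaturated :: String -> String
shoutUnsaturated = mapBoth toUpper
{-# NOINLINE shoutUnsaturated #-}

-- | Passes one argument to an arity-1 helper, so the helper can inline
-- here and be specialised to 'toUpper'.
shoutSaturated :: String -> String
shoutSaturated = mapOne toUpper
{-# NOINLINE shoutSaturated #-}
```

Both wrappers compute the same result; any difference shows up only in the generated Core (for example via `-ddump-simpl`), not in behaviour.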

@clyring clyring modified the milestones: 0.12.1.0, 0.12.2.0 Feb 15, 2024
@Bodigrim
Contributor

Removing milestone for now.

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch from cd02c61 to 111456a Compare August 18, 2025 06:41
@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch from 111456a to 0a7d5c8 Compare August 18, 2025 08:57
@vdukhovni
Contributor Author

I've rebased this PR and significantly improved its performance. Please look again. The only possible improvement (if worth it) is to rewrite utf8_copyBytes in C. It is a buffer-to-buffer copy that "normalises" any "denormalised" 2-byte sequence `0xC0 ??` to the last 6 bits of the second byte, including the case where the NUL-terminated input ends in `0xC0` (i.e. `0xC0 0x00`).

The Haskell version is however likely not too far from the expected C performance. So I'm not sure this warrants new "cbits".

Cc: @Bodigrim @clyring

@clyring
Member

clyring commented Aug 24, 2025

Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to cstringUtf8 for the same literal to be shared, at an acceptably minimal cost. But that is not a requirement for moving forward!

> The Haskell version is however likely not too far from the expected C performance. So I'm not sure this warrants new "cbits".

We want native Haskell implementations anyway due to -fpure-haskell and the JS backend.


I will take a look at the changes you have pushed tomorrow.


I'm not sure what's going on with the OpenBSD job. It superficially looks like tests are failing...

Running 1 test suites...
Test suite bytestring-tests: RUNNING...
Test suite bytestring-tests: FAIL
Test suite logged to:
/tmp/cirrus-ci-build/./dist-newstyle/build/x86_64-openbsd/ghc-9.8.3/bytestring-0.13.0.0/t/bytestring-tests/test/bytestring-0.13.0.0-bytestring-tests.log
0 of 1 test suites (0 of 1 test cases) passed.
Error: [Cabal-7125]
Tests failed for test:bytestring-tests from bytestring-0.13.0.0.

...but we pass --test-show-details=direct so cabal should print the testsuite output to the job log. The fact that nothing is visible at all suggests that a broken test executable is being produced...

@Bodigrim
Contributor

@clyring I'm pretty sure the OpenBSD failure is completely unrelated to this PR; it currently fails across all my projects which have an OpenBSD job. (I suspect the root partition is deliberately very small on those runners and we need a hack similar to haskellari/splitmix@1a7118d, but let's leave it for another day)

@vdukhovni
Contributor Author

> Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to cstringUtf8 for the same literal to be shared, at an acceptably minimal cost. But that is not a requirement for moving forward!
>
> > The Haskell version is however likely not too far from the expected C performance. So I'm not sure this warrants new "cbits".
>
> We want native Haskell implementations anyway due to -fpure-haskell and the JS backend.

Well, the delay gave me an opportunity to tackle it afresh and come up with a cleaner, more performant design. The core idea is to observe that a 0xC0 byte is necessarily the first byte of a denormalised encoding of some 6-bit code point, NUL or not. So instead of fixing up just 0xC0 0x80, we can fix up any 0xC0 nn by taking the last 6 bits of nn. This makes it possible to use memchr() to find runs of bytes that don't require special treatment. It also helps that the input is NUL-terminated, so even a final 0xC0 doesn't require any special logic: it is fine to read the terminal NUL byte (use $n+1$ bytes of an $n$-byte + NUL-terminator input).
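
A pure model of that normalisation, for illustration only: `normalise` is a hypothetical name, the real code copies between raw buffers, and the missing NUL terminator is modelled here by treating the byte after a final 0xC0 as 0. `B.elemIndex` plays the role of memchr() in finding the next 0xC0.

```haskell
import Data.Bits ((.&.))
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- | Replace every two-byte sequence 0xC0 nn by the single byte made of the
-- low 6 bits of nn, copying the runs in between unchanged.
normalise :: ByteString -> ByteString
normalise = B.concat . go
  where
    go bs = case B.elemIndex 0xC0 bs of
      -- No 0xC0 left: the rest is one plain run.
      Nothing -> [bs]
      Just i ->
        let (run, rest) = B.splitAt i bs
            -- 'rest' starts with 0xC0. If nothing follows it, stand in for
            -- the NUL terminator that the real input would provide.
            nn = if B.length rest > 1 then B.index rest 1 else 0
        in run : B.singleton (nn .&. 0x3F) : go (B.drop 2 rest)
```

For example, the bytes `61 62 C0 80 63` come out as `61 62 00 63`, and a trailing lone `C0` comes out as a single `00`.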

Member

@clyring clyring left a comment

I'll have to think more about the awkward middle ground list of 'things modUtf8Lit attempts to translate' chosen by the current version of this patch.

@@ -839,6 +844,25 @@ wrappedBytesCopyStep bs0 k =
where
outRemaining = ope `minusPtr` op

-- | Copy the bytes from a 'BufferRange' into the output stream.
wrappedBufferRangeCopyStep :: BufferRange -- ^ Input 'BufferRange'.

Member

I'm wondering how important the duplication with wrappedBytesCopyStep is. My guess is that it probably makes the function passed in the 'buffer full' signal one word smaller (the ForeignPtrContents to be kept alive), and maybe results in a few extra arithmetic operations due to converting back and forth between start/len and start/end.

Contributor Author

Well, what we have in this case is a Ptr Word8 rather than a byte string, and there is no foreign ptr to keep alive, so it seemed natural to use a corresponding copy loop. But if you feel that this should be done differently, please speak up.

-- strings that are free of embedded (overlong-encoded as the two-byte sequence
-- @0xC0 0x80@) null characters.
--
-- @since 0.11.5.0

Member

(This will need updating (again) before merging.)

Contributor Author

Thanks. Changed to 0.13.0.0 for now.

Comment on lines 682 to 694
-- | Byte count of null-terminated primitive literal string excluding the
-- terminating null byte.
byteCountLiteral :: Addr# -> Int
byteCountLiteral addr# =
#if HS_cstringLength_AND_FinalPtr_AVAILABLE
  I# (cstringLength# addr#)
#else
  fromIntegral (pure_strlen (Ptr addr#))

foreign import ccall unsafe "string.h strlen" pure_strlen
  :: CString -> CSize
#endif
{-# INLINE byteCountLiteral #-}

Member

This belongs in D.B.Internal.Type, right next to unsafePackLiteral.

Contributor Author

Done.

Comment on lines 330 to 331
, benchB'_ "AsciiLit" $ asciiLit asciiBuf
, benchB'_ "Utf8Lit" $ utf8Lit utf8Buf

Member

These are duplicates of the existing benchmarks "ASCII String (64B)" and "UTF-8 String (64B, one null)", respectively.

Contributor Author

Dropped.

if | ch /= 0xC0 -> do
poke op ch
let !cnt = ipe `minusPtr` ip'
!runend <- S.memchr ip' 0xC0 (fromIntegral cnt)

Member

Suggested change
-!runend <- S.memchr ip' 0xC0 (fromIntegral cnt)
+!runend <- S.memchr ip' 0xC0 (fromIntegral @Int @CSize cnt)

Contributor Author

Sadly CSize is not in scope in this module, and I don't think making it available is worthwhile. I added the explicit @Int, perhaps that's "progress".

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch from 458df41 to 3d72c68 Compare August 27, 2025 06:58
@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch from 3d72c68 to b202ddb Compare August 31, 2025 02:58
@vdukhovni
Contributor Author

I've fixed the issues with older GHC compatibility, CI passes. Perhaps close to done now...
