Avoid per-byte loop in cstring{,Utf8} builders #569

vdukhovni · 2023-01-13T04:14:08Z

Avoid per-byte loop in cstring{,Utf8} builders

Copy chunks of the input to the output buffer, each chunk being the shorter
of the available buffer space and the "null-free" portion of the remaining
string. Strictly speaking, "null-free" here means not containing any denormalised
two-byte encodings starting with 0xC0 (so possibly also other ASCII
bytes if the UTF-8 encoding is oddball).

This substantially improves performance, with just one "15%" increase
that looks like a spurious measurement error (perhaps code layout
difference artefact).

      UTF-8 String (12B):                            OK
        16.7 ns ± 1.3 ns, 60% less than baseline
      UTF-8 String (64B, one null):                  OK
        22.6 ns ± 1.3 ns, 87% less than baseline
      UTF-8 String (64B, one null, no shared work):  OK
        30.1 ns ± 2.6 ns, 83% less than baseline
      UTF-8 String (64B, half nulls):                OK
        92.6 ns ± 5.3 ns, 49% less than baseline
      UTF-8 String (64B, all nulls):                 OK
        76.3 ns ± 4.5 ns, 57% less than baseline
      UTF-8 String (64B, all nulls, no shared work): OK
        82.3 ns ± 5.6 ns, 54% less than baseline
      ASCII String (12B):                            OK
        6.50 ns ± 326 ps, 76% less than baseline
      ASCII String (64B):                            OK
        8.03 ns ± 334 ps, 94% less than baseline
      AsciiLit:                                      OK
        8.02 ns ± 648 ps, 94% less than baseline
      Utf8Lit:                                       OK
        21.8 ns ± 1.3 ns, 88% less than baseline
      strLit:                                        OK
        8.90 ns ± 788 ps, 94% less than baseline
      stringUtf8:                                    OK
        22.4 ns ± 1.3 ns, 87% less than baseline
      strLitInline:                                  OK
        8.26 ns ± 676 ps, 94% less than baseline
      utf8LitInline:                                 OK
        23.2 ns ± 1.3 ns, 87% less than baseline
      foldMap byteStringInsert (10000):              OK
        46.0 μs ± 4.0 μs, 15% less than baseline
-->   lazyByteStringHex (10000):                     OK
-->     4.74 μs ± 337 ns, 15% more than baseline
      foldMap integerDec (small) (10000):            OK
        205  μs ±  12 μs,  9% less than baseline
    char8 (10000):                                   OK
      2.58 μs ± 234 ns, 30% less than baseline
      foldMap (left-assoc) (10000):                  OK
        73.2 μs ± 2.9 μs, 54% less than baseline
      foldMap (right-assoc) (10000):                 OK
        43.0 μs ± 4.2 μs, 65% less than baseline
      foldMap [manually fused, left-assoc] (10000):  OK
        81.4 μs ± 5.3 μs, 48% less than baseline
      foldMap [manually fused, right-assoc] (10000): OK
        47.3 μs ± 785 ns, 61% less than baseline

vdukhovni · 2023-01-13T06:36:36Z

The emulated CI build failures are spurious/systemic, not related to the PR.

If I add a couple of new benchmarks that use somewhat longer string literals in builders:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
         , benchB' "UTF-8 String"  () $ \() -> P.cstringUtf8 "hello world\0"#
         , benchB' "String (naive)" "hello world!" fromString
         , benchB' "String"        () $ \() -> P.cstring "hello world!"#
+        , benchB' "AsciiLit64"   () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+        , benchB' "Utf8Lit64"   () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
         ]
 
       , bgroup "Encoding wrappers"

The relevant benchmark results (GHC 9.4.5) are:

$ cabal run bytestring-bench -- --baseline baseline-lit-9.4.csv --csv new-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.43s)
        278  ns ±  19 ns, 66% less than baseline
      Utf8Lit64:  OK (1.72s)
        356  ns ±  23 ns, 58% less than baseline

All 2 tests passed (3.19s)

The baseline master branch run was:

$ cabal run bytestring-bench -- --csv baseline-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.07s)
        832  ns ±  79 ns
      Utf8Lit64:  OK (1.06s)
        846  ns ±  75 ns

All 2 tests passed (2.16s)

clyring · 2023-01-13T13:05:19Z

Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0.

I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass.

Your cstring_step does more or less the same thing as byteStringCopyStep in Builder.Internal.

I will take a closer look later.

clyring

The branching logic can potentially be simplified some. Currently we ask:

1. Are we done?
2. Is there a null to decode?
3. Is the output buffer full?
4. Are there any non-nulls to copy?

But we can also ask only:

1. Is there a null to decode? (If we are done, the answer will be no.)
2. Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)

That would mean we perform extra zero-length memcpys in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.
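The two-question variant can be sketched as a pure model over lists. This is only an illustration: `Step`, `splitAtEncodedNul`, `step` and `runAll` are hypothetical names, lists stand in for the real `Ptr`/`BufferRange` machinery, and only the 0xC0 0x80 encoding of NUL is recognised.

```haskell
import Data.Word (Word8)

-- | Outcome of one step of the toy model (hypothetical type).
data Step
  = Done [Word8]          -- ^ input exhausted; bytes written in this step
  | Full [Word8] [Word8]  -- ^ buffer full; bytes written, input still pending
  deriving (Eq, Show)

-- | Split at the first encoded NUL (0xC0 0x80). The second component is
-- 'Nothing' when no encoded NUL remains, which also covers "we are done".
splitAtEncodedNul :: [Word8] -> ([Word8], Maybe [Word8])
splitAtEncodedNul = go []
  where
    go acc (0xC0 : 0x80 : rest) = (reverse acc, Just rest)
    go acc (b : rest)           = go (b : acc) rest
    go acc []                   = (reverse acc, Nothing)

-- | One step with @avail@ free bytes in the output buffer.
step :: Int -> [Word8] -> Step
step avail input =
  case splitAtEncodedNul input of
    -- Question 1 answered "no": the remainder is a plain copy.
    (run, Nothing)
      | length run <= avail -> Done run
      | otherwise           -> full run []
    -- Question 2: do the run and the decoded NUL both fit?
    (run, Just rest)
      | length run + 1 <= avail ->
          case step (avail - length run - 1) rest of
            Done out       -> Done (run ++ 0 : out)
            Full out later -> Full (run ++ 0 : out) later
      | otherwise -> full run (0xC0 : 0x80 : rest)
  where
    -- Copy as much of the run as fits and report a full buffer.
    full run pending =
      let (now, later) = splitAt avail run
      in Full now (later ++ pending)

-- | Drive 'step' with fresh buffers of @cap@ bytes each (requires cap >= 1).
runAll :: Int -> [Word8] -> [Word8]
runAll cap input =
  case step cap input of
    Done out       -> out
    Full out later -> out ++ runAll cap later
```

In this model, consecutive encoded NULs show up as empty `run`s, which is where the real code would issue the zero-length memcpy calls mentioned above.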

Data/ByteString/Internal.hs

Data/ByteString/Builder/Prim.hs

Data/ByteString/Builder/Internal.hs

chessai · 2023-01-15T17:53:12Z

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

vdukhovni · 2023-01-15T19:20:28Z

> nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

Sure. Done. I do hope we won't forget to squash before merging...

vdukhovni · 2023-01-15T20:46:18Z

Data/ByteString/Builder.hs


 -- | Char8 encode a 'String'.
-{-# INLINE [1] string8 #-} -- phased to allow P.cstring rewrite
+{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites


@chessai , @Bodigrim , @clyring A question for the reviewers:

Why is the phase specified here equal to 1? When I add tests to see whether string8 and stringUtf8 actually benefit from the RULES, I only get improvement when the phase is set to 0:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -255,6 +255,10 @@ ascBuf, utfBuf :: Ptr Word8
 ascBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
 utfBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
 
+ascStr, utfStr :: String
+ascStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+utfStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+
 asclit, utflit :: Ptr Word8 -> Builder
 asclit str@(Ptr addr) = BI.ascLiteralCopy str (byteCountLiteral addr)
 utflit str@(Ptr addr) = BI.modUtf8LitCopy str (byteCountLiteral addr)
@@ -273,6 +277,8 @@ main = do
         , benchB' "String" () $ \() -> asclit (Ptr "hello world!"#)
         , benchB' "AsciiLit" () $ \() -> asclit ascBuf
         , benchB' "Utf8Lit" () $ \() -> utflit utfBuf
+        , benchB' "strLit" () $ \() -> string8 ascStr
+        , benchB' "utfLit" () $ \() -> stringUtf8 utfStr
         ]
 
       , bgroup "Encoding wrappers"

With the phase set to 1 as a baseline, testing with phase 0 the bench report is:

All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (2.64s)
        243 ns ± 8.5 ns, same as baseline
      Utf8Lit:  OK (1.45s)
        286 ns ± 15 ns, same as baseline
      strLit:   OK (1.23s)
        243 ns ± 19 ns, 51% less than baseline
      utfLit:   OK (1.38s)
        279 ns ± 15 ns, 44% less than baseline

This is with GHC 9.4.5.

Testing with GHC 8.10 the phase 1 -> phase 0 delta is:

All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (1.00s)
        179 ns ± 9.5 ns, 47% less than baseline
      Utf8Lit:  OK (0.53s)
        198 ns ± 15 ns, same as baseline
      strLit:   OK (0.49s)
        187 ns ± 19 ns, 51% less than baseline
      utfLit:   OK (0.52s)
        199 ns ± 20 ns, same as baseline

With GHC 9.2:

All
  Data.ByteString.Builder
    Small payload
      AsciiLit: OK (1.36s)
        236 ns ± 22 ns, same as baseline
      Utf8Lit:  OK (1.36s)
        274 ns ± 16 ns, same as baseline
      strLit:   OK (1.20s)
        237 ns ± 14 ns, 69% less than baseline
      utfLit:   OK (1.36s)
        275 ns ± 15 ns, 64% less than baseline

I'm not sure. In principle these rules can fire in phase 2, and I do observe "Rule fired: string8/unpackFoldrCString# (Data.ByteString.Builder)" if I inline ascStr in your example and compile with ghc-9.4.4.

These should probably just be NOINLINE pragmas. primMapList{Fixed,Bounded} are themselves marked inline (to encourage good specialization with the particular BoundedPrim used) and produce lots of code, but will not actually fuse with good list producers.

In HEAD (master), with no changes other than phase [1] -> [0]:

All
  Data.ByteString.Builder
    Small payload
      strLit: OK (1.90s)
        688 ns ± 43 ns, 27% less than baseline
      utfLit: OK (0.93s)
        736 ns ± 62 ns, 31% less than baseline

This seems to suggest that the original phase control was not helping. Indeed, simply removing the rules and inlining (GHC 9.2) gives:

All
  Data.ByteString.Builder
    Small payload
      strLit: OK (1.14s)
        790 ns ± 52 ns, 16% less than baseline
      utfLit: OK (0.99s)
        791 ns ± 66 ns, 26% less than baseline

All 2 tests passed (2.18s)

Could the issue be that I'm giving string8 and stringUtf8 named constant strings, rather than inline string constants? That is, first inline the arguments at the call site, and only then inline string8?

This looks rather fragile. Is there a downside to setting the phase number to 0?

It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8# gets rewritten in terms of unpackFoldrCStringUtf8# before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:

Suggested change

-{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites
+{-# NOINLINE string8 #-} -- allow literal cstring rewrites

> It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8# gets rewritten in terms of unpackFoldrCStringUtf8# before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:

I am beginning to agree. I don't think this impairs the efficiency of the non-literal input case: there the particular BoundedPrim is already available, and that is all that needs to be optimised; the input data does not have to be seen for good code generation. And rewrite RULES fire without inlining the result, so this makes sense. I'll push a commit with NOINLINE and the additional benchmark variants.

By the way, there is no explicit rewrite rule for the UTF-8 build case, and adding one doesn't seem to make a difference. I think that the build (unpackFoldr ...) form gets rewritten back when no fusion happens, and then the existing rule fires?

+#if __GLASGOW_HASKELL__ >= 811
+{-# RULES
+"stringUtf8/unpackFoldrCStringUtf8#" forall s.
+  stringUtf8 (build (unpackFoldrCStringUtf8# s)) =
+    modUtf8LitCopy (Ptr s) (byteCountLiteral s)
+  #-}
+#endif

The above is harmless, and can be added, but does not appear to be necessary.

vdukhovni · 2023-01-23T06:25:41Z

If there's anything further I need to do, please let me know...

clyring

I've been a bit sidetracked the last few weeks, sorry.

How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.

Data/ByteString/Builder/Internal.hs

clyring · 2023-02-08T01:35:14Z

Data/ByteString/Builder/Internal.hs

+            !op' = op0 `plusPtr` (nullFree + 1)
+        nullAt' <- c_strstr ip' modifiedUtf8NUL
+        modUtf8_step ip' len' nullAt' k (BufferRange op' ope)
+    | avail > 0 = do


Same question, but also avail == 0 should be a very rare case.

Bodigrim · 2023-02-08T23:23:32Z

@vdukhovni please rebase to trigger updated CI jobs.

vdukhovni · 2023-02-09T04:45:58Z

> @vdukhovni please rebase to trigger updated CI jobs.

Done.

Bodigrim

LGTM modulo naming nitpicking!

@vdukhovni could you possibly address @clyring's questions?

Data/ByteString/Builder/Internal.hs

Bodigrim · 2023-06-12T21:36:58Z

Data/ByteString/Builder/Internal.hs

+-- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding,
+-- which is part of "modified UTF-8" (GHC does not also implement CESU-8).
+modifiedUtf8NUL :: CString
+modifiedUtf8NUL = Ptr "\xc0\x80"#


Suggested change

-modifiedUtf8NUL = Ptr "\xc0\x80"#
+modUtf8NUL = Ptr "\xc0\x80"#

Let's keep the prefix consistent.

clyring · 2023-09-27T01:50:37Z

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

vdukhovni · 2023-09-27T02:30:01Z

> ping @vdukhovni
>
> Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

It's basically ready, right? There were just some cosmetic issues that perhaps a maintainer could tweak to suit their preference, and I can then review the result. Does that work?

vdukhovni · 2023-10-11T02:02:18Z

Perhaps I can get this over the line. What remains to be done?

Data/ByteString/Builder/Internal.hs

Bodigrim · 2023-10-11T20:40:55Z

Data/ByteString/Builder/Internal.hs

+    -- available buffer space. If the string is long enough, we may have asked
+    -- for less than its full length, filling the buffer with the rest will go
+    -- into the next builder step.
+    | avail > nullFree = do


Could you please check with hpc that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)

vdukhovni · 2024-01-21T00:27:10Z

This PR is languishing. Where do we go from here?

Bodigrim · 2024-10-15T22:32:13Z

Removing milestone for now.

vdukhovni · 2025-08-19T01:18:15Z

I've rebased this PR and significantly improved its performance. Please look again. The only possible improvement (if worth it) is to rewrite utf8_copyBytes in C. It is a buffer-to-buffer copy that "normalises" any "denormalised" 2-byte sequence 0xC0 ?? to the last 6 bits of the second byte, including the case where the NUL-terminated input ends in 0xC0 (i.e. 0xC0 0x00).

The Haskell version is, however, likely not too far from the expected C performance, so I'm not sure this warrants new "cbits".

Cc: @Bodigrim @clyring

clyring · 2025-08-24T16:39:35Z

Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to cstringUtf8 for the same literal to be shared, at an acceptably minimal cost. But that is not a requirement for moving forward!

> The Haskell version is, however, likely not too far from the expected C performance, so I'm not sure this warrants new "cbits".

We want native Haskell implementations anyway due to -fpure-haskell and the JS backend.

I will take a look at the changes you have pushed tomorrow.

I'm not sure what's going on with the OpenBSD job. It superficially looks like tests are failing...

Running 1 test suites...
Test suite bytestring-tests: RUNNING...
Test suite bytestring-tests: FAIL
Test suite logged to:
/tmp/cirrus-ci-build/./dist-newstyle/build/x86_64-openbsd/ghc-9.8.3/bytestring-0.13.0.0/t/bytestring-tests/test/bytestring-0.13.0.0-bytestring-tests.log
0 of 1 test suites (0 of 1 test cases) passed.
Error: [Cabal-7125]
Tests failed for test:bytestring-tests from bytestring-0.13.0.0.

...but we pass --test-show-details=direct so cabal should print the testsuite output to the job log. The fact that nothing is visible at all suggests that a broken test executable is being produced...

Bodigrim · 2025-08-24T16:50:32Z

@clyring I'm pretty sure the OpenBSD failure is completely unrelated to this PR; it currently fails across all my projects which have an OpenBSD job. (I suspect the root partition is deliberately very small on those runners and we need a hack similar to haskellari/splitmix@1a7118d, but let's leave it for another day.)

vdukhovni · 2025-08-25T01:06:23Z

> Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to cstringUtf8 for the same literal to be shared, at an acceptably minimal cost. But that is not a requirement for moving forward!
>
> > The Haskell version is, however, likely not too far from the expected C performance, so I'm not sure this warrants new "cbits".
>
> We want native Haskell implementations anyway due to -fpure-haskell and the JS backend.

Well, the delay gave me an opportunity to tackle it afresh and come up with a cleaner, more performant design. The core idea is to observe that a 0xC0 byte is necessarily the first byte of a denormalised encoding of some 6-bit code point, NUL or not. So instead of fixing up just 0xC0 0x80, we can fix up any 0xC0 nn by taking the last 6 bits of nn. This makes it possible to use memchr() to find runs of bytes that don't require special treatment. It also helps that the input is NUL-terminated: even a final 0xC0 doesn't require any special logic, since it is fine to read the terminal NUL byte (use $n+1$ bytes of an $n$-byte + NUL-terminator input).
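As a rough illustration of that design (not the code in this PR), here is a pure list model. `decodeLiteral` is a hypothetical name, `break (== 0xC0)` stands in for the memchr() scan, and the body is assumed to contain no raw 0x00 byte, as is the case for GHC string literals.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | Decode the body of a modified-UTF-8 literal (terminator not included).
-- Assumes, as for GHC string literals, that the body has no raw 0x00 byte.
decodeLiteral :: [Word8] -> [Word8]
decodeLiteral body = go (body ++ [0])  -- model the raw NUL terminator
  where
    go bytes =
      case break (== 0xC0) bytes of
        -- No 0xC0 left: copy the run up to the terminator.
        (run, []) -> takeWhile (/= 0) run
        -- 0xC0 nn: copy the run, then emit the low six bits of nn.
        -- If nn is the terminator, this emits 0x00 and stops.
        (run, _c0 : nn : rest)
          | nn == 0   -> run ++ [0]
          | otherwise -> run ++ (nn .&. 0x3F) : go rest
        -- Unreachable here because the terminator is always appended.
        (run, _) -> run
```

For example, `decodeLiteral [0x41, 0xC0]` evaluates to `[0x41, 0x00]`: the trailing 0xC0 pairs with the terminator, so no separate end-of-input check is needed after a 0xC0.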

clyring

I'll have to think more about the awkward middle ground list of 'things modUtf8Lit attempts to translate' chosen by the current version of this patch.

Data/ByteString/Builder/Internal.hs

Data/ByteString/Builder/Prim.hs

bench/BenchAll.hs

clyring · 2025-08-26T04:15:39Z

Data/ByteString/Builder/Internal.hs

+        if | ch /= 0xC0 -> do
+               poke op ch
+               let !cnt = ipe `minusPtr` ip'
+               !runend <- S.memchr ip' 0xC0 (fromIntegral cnt)


Suggested change

-!runend <- S.memchr ip' 0xC0 (fromIntegral cnt)
+!runend <- S.memchr ip' 0xC0 (fromIntegral @Int @CSize cnt)

Sadly CSize is not in scope in this module, and I don't think making it available is worthwhile. I added the explicit @Int, perhaps that's "progress".

vdukhovni · 2025-08-31T03:05:09Z

I've fixed the issues with older GHC compatibility, CI passes. Perhaps close to done now...

vdukhovni · 2025-09-09T14:44:45Z

@clyring @hsyl20 @Bodigrim I believe this is done. If there's anything else outstanding, please let me know.

hsyl20

LGTM

Bodigrim · 2025-10-02T21:51:49Z

(I checked manually that this branch passes new tests from #714 if rebased)

If there are no more comments / suggestions by October 12, I'll merge it as is.

clyring

The ASCII side of things is perfect.

On the UTF-8 side of things I am not 100% sold on the specification yet:

The current version of this patch changes the observable behavior of cstringUtf8 on (admittedly very sketchy) inputs with 0xC0 not followed by 0x80.
- This isn't necessarily a major problem: I'd be surprised if there is any actual breakage, and wouldn't be surprised if literally nobody directly uses the current cstringUtf8. But it might make the backporting and deprecation story a bit messier.
- How much of a performance penalty is there for matching the old behavior exactly? And if we don't care to match that old behavior exactly, would always ignoring the next input byte after 0xC0 be a further performance improvement?
The demand that modUtf8LitCopy be given a null-terminated buffer (for safety if 0xC0 is the last byte) and its length not including that null terminator makes for a very weird interface!

tests/builder/Data/ByteString/Builder/Prim/Tests.hs

clyring · 2025-10-13T03:08:54Z

tests/builder/Data/ByteString/Builder/Prim/Tests.hs

+testCStringUtf8 :: Int -> TestTree
+testCStringUtf8 sz = testProperty "cstringUtf8" $
+    BE.toLazyByteStringWith (BE.untrimmedStrategy sz sz) L.empty
+      (BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0"#) ==


Suggested change

-(BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0"#) ==
+(BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0\x80"#) ==

I do think it is best to test that the code does not blow up with input that unexpectedly ends with just \xC0 before the raw (implicit) NUL terminator. So this is more of a robustness test, that then also encodes the implemented handling, than a promise to users that this is how that's handled.

clyring · 2025-10-13T03:11:21Z

Data/ByteString/Builder/Internal.hs


  , byteStringCopy
+  , asciiLiteralCopy
+  , modUtf8LitCopy


I agree with the earlier modUtf8LitCopy -> modUtf8LiteralCopy naming suggestion.


vdukhovni · 2025-10-13T12:27:41Z

> The ASCII side of things is perfect.

Thanks.

> On the UTF-8 side of things I am not 100% sold on the specification yet:
>
> The current version of this patch changes the observable behavior of cstringUtf8 on (admittedly very sketchy) inputs with 0xC0 not followed by 0x80.

Such overlong encodings, of which 0xC080 is but one example, though not canonical and not strictly valid UTF-8, are nevertheless unambiguous. So if 0xC0xx has a meaning, it would be the bottom six bits of xx, provided the top two bits are 10.

What is perhaps a bit more bold is that I ignore the top two bits, because the input is an Addr# compile-time representation of a literal string, where the only overlong encoding is that of NUL.

My take is that after 0xC0 we can either support only 0x80 and otherwise throw an error (which I am reluctant to do in this code), or just take the most reasonable interpretation of the input.

The original code would have retained any 0xC0xx (other than 0xC080) unmodified. I think that's worse.

> This isn't necessarily a major problem: I'd be surprised if there is any actual breakage, and wouldn't be surprised if literally nobody directly uses the current cstringUtf8. But it might make the backporting and deprecation story a bit messier.

Only invalid UTF-8 would result in possibly surprising behaviour, but not more surprising than the current code. This code implements the stringUtf8 builder when the input is a literal. The comments say:

> Note that 'stringUtf8' performs no codepoint validation and consequently may
> emit invalid UTF-8 if asked (e.g. single surrogates).

> How much of a performance penalty is there for matching the old behavior exactly?

It would definitely be noticeably costlier for inputs with many (overlong encoded) NULs. And what would you then do with other overlong inputs?

> And if we don't care to match that old behavior exactly, would always ignoring the next input byte after 0xC0 be a further performance improvement?

We can't "ignore" the next byte by producing no output; that's the byte that's encoding the NUL as 0x80. I think you're suggesting always emitting 0x00 even if that byte is not 0x80. I think that's more surprising than supporting other overlong forms by emitting the bottom six bits.
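To make the three options in this exchange concrete, the sketch below gives the bytes each would emit for a pair 0xC0 nn. The function names are hypothetical and this is a per-pair model only; it assumes nn is not the raw terminator.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | Old behaviour, as described above: only 0xC0 0x80 is rewritten,
-- any other 0xC0 nn pair is retained unmodified.
pairOld :: Word8 -> [Word8]
pairOld 0x80 = [0x00]
pairOld nn   = [0xC0, nn]

-- | Behaviour in this patch: emit the bottom six bits of the second byte,
-- ignoring its top two bits.
pairNew :: Word8 -> [Word8]
pairNew nn = [nn .&. 0x3F]

-- | The alternative raised in review: always emit 0x00 after 0xC0.
pairAlwaysNul :: Word8 -> [Word8]
pairAlwaysNul _ = [0x00]
```

All three agree on valid modified UTF-8 (nn = 0x80 gives 0x00). They differ only on other inputs: for nn = 0xA1, `pairOld` keeps both bytes, `pairNew` emits 0x21, and `pairAlwaysNul` emits 0x00.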

> The demand that modUtf8LitCopy be given a null-terminated buffer (for safety if 0xC0 is the last byte) and its length not including that null terminator makes for a very weird interface!

The requirement for a final (unencoded) NUL is a result of the input being just a raw Addr# pointer, with no length information. The need to encode non-final NULs is then a consequence rather than a cause of the final NUL. It is IMHO simplest to treat a final 0xC000 the same as 0xC08000: both end the input and produce a \0 in the output.

The main idea is to not throw errors: this code will produce valid output for valid "modified UTF-8", and something closely related to the input otherwise, never throwing any errors (just as before).

Review feedback

vdukhovni · 2025-10-29T15:09:56Z

Is there something further I need to do here?

vdukhovni force-pushed the chunky-cstring-builder branch 2 times, most recently from 96880aa to 266d6da (January 13, 2023 04:30)

clyring reviewed Jan 14, 2023

Data/ByteString/Internal.hs (outdated, resolved)

Data/ByteString/Builder/Prim.hs (outdated, resolved)

Data/ByteString/Builder/Prim.hs (outdated, resolved)

vdukhovni force-pushed the chunky-cstring-builder branch 3 times, most recently from 9086b60 to e6cc4a2 (January 14, 2023 10:42)

clyring reviewed Jan 14, 2023

Data/ByteString/Builder/Prim.hs (outdated, resolved)

Bodigrim reviewed Jan 14, 2023

Data/ByteString/Builder/Internal.hs (outdated, resolved)

clyring mentioned this pull request Jan 15, 2023: 0.12.0.0 release planning #573 (Closed)

vdukhovni commented Jan 15, 2023

clyring mentioned this pull request Jan 15, 2023: Test that our rewrite rules and list fusion actually work #574 (Open)

clyring added this to the 0.11.5.0 milestone Jan 19, 2023

clyring reviewed Feb 8, 2023

vdukhovni force-pushed the chunky-cstring-builder branch from 44fdcbc to 0645428 (February 9, 2023 04:45)

Bodigrim reviewed Jun 12, 2023

clyring modified the milestones: 0.11.5.0, 0.12.1.0 Jul 6, 2023

vdukhovni force-pushed the chunky-cstring-builder branch from 0645428 to 01b5f36 (October 9, 2023 20:57)

Bodigrim reviewed Oct 11, 2023

clyring added this to the 0.12.2.0 milestone Feb 15, 2024

Bodigrim approved these changes Feb 15, 2024

clyring mentioned this pull request Jun 5, 2024: Improve benchmarks for small Builders #680 (Merged)

clyring mentioned this pull request Jun 26, 2024: Fix several bugs around the 'byteString' family of Builders #671 (Merged)

Bodigrim removed this from the 0.12.2.0 milestone Oct 15, 2024

vdukhovni force-pushed the chunky-cstring-builder branch 2 times, most recently from 111456a to 0a7d5c8 (August 18, 2025 08:57)

vdukhovni requested review from Bodigrim, clyring and hsyl20 August 21, 2025 03:05

vdukhovni mentioned this pull request Aug 24, 2025: Implemented TH splices for validated ByteString literals #712 (Merged)

clyring reviewed Aug 26, 2025

vdukhovni force-pushed the chunky-cstring-builder branch 2 times, most recently from 3d72c68 to b202ddb (August 31, 2025 02:58)

vdukhovni requested a review from clyring September 9, 2025 14:39

hsyl20 approved these changes Sep 12, 2025

clyring reviewed Oct 13, 2025

fixup! Avoid per-byte loop in cstring{,Utf8} builders

833ed24

Review feedback

vdukhovni force-pushed the chunky-cstring-builder branch from b202ddb to 833ed24 (October 13, 2025 12:42)
