SliceUtf8 performance optimizations #190

Closed
dain wants to merge 15 commits into airlift:master from dain:user/dain/sliceutf8-perf

Conversation


@dain dain commented Mar 7, 2026

Summary

  • Refactored core implementations to operate on byte[] + offset + length, with Slice overloads delegating.
  • Added/expanded ASCII fast paths across key algorithms.
  • Reduced repeated decode work in loop-heavy code paths.
  • Added new UTF-8/code-point conversion helpers for Trino-style usage.
  • Expanded JMH coverage for existing methods and Trino-representative loops.

High-level optimization approaches

  • byte[] first internals: better JVM bounds-check hoisting and easier raw-array integration.
  • ASCII specialization: skip full decode work when all bytes are ASCII.
  • SWAR/chunked scanning where applicable (long/int lanes via var handles) to skip equal ASCII regions quickly.
  • Fewer passes over data: APIs/helpers that decode once and reuse derived results.
  • Explicit API boundary validation with inner loops kept lean.
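The SWAR-style scanning mentioned above can be sketched as follows. This is a minimal illustration of the technique, not the PR's actual code: `AsciiScan` and `isAscii` are hypothetical names, and the loop simply ORs 8-byte lanes together and tests the high-bit mask once at the end.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class AsciiScan
{
    private static final VarHandle LONG_HANDLE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // Any non-ASCII byte has its top bit (0x80) set. OR eight bytes at a time
    // into an accumulator, fold in the tail byte-by-byte, then test the mask once.
    static boolean isAscii(byte[] utf8, int offset, int length)
    {
        int i = offset;
        int end = offset + length;
        long bits = 0;
        for (; i + Long.BYTES <= end; i += Long.BYTES) {
            bits |= (long) LONG_HANDLE.get(utf8, i);
        }
        for (; i < end; i++) {
            bits |= utf8[i];  // sign extension sets high bits for any byte >= 0x80
        }
        return (bits & 0x8080_8080_8080_8080L) == 0;
    }
}
```

Endianness does not matter for this check, since only the per-byte high bits are tested.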

New APIs

  • toCodePoints(byte[] utf8, int offset, int length)
  • fromCodePoints(int[] codePoints, int offset, int length)
  • codePointByteLengths(byte[] utf8, int offset, int length)
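
The intended round-trip semantics of the two conversion helpers can be illustrated with a reference model built on `java.lang.String`. This is not the PR's implementation (which works directly on the bytes without materializing a String); `toCodePointsReference` and `fromCodePointsReference` are hypothetical names used here only to show the contract:

```java
import java.nio.charset.StandardCharsets;

final class CodePointReference
{
    // Reference semantics only: decode a UTF-8 byte range into code points.
    static int[] toCodePointsReference(byte[] utf8, int offset, int length)
    {
        return new String(utf8, offset, length, StandardCharsets.UTF_8)
                .codePoints()
                .toArray();
    }

    // Reference semantics only: encode a code-point range back into UTF-8 bytes.
    static byte[] fromCodePointsReference(int[] codePoints, int offset, int length)
    {
        return new String(codePoints, offset, length).getBytes(StandardCharsets.UTF_8);
    }
}
```

Encoding the output of the decoder should reproduce the original valid UTF-8 input byte-for-byte.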

Benchmark highlights (JMH)

Most results below are for length=1000 code points unless noted.

  • benchmarkCompareUtf16BE

    • ASCII: 3.483 -> 0.102 ns/codepoint (~34x)
    • non-ASCII: 8.214 -> 6.395 ns/codepoint (~1.28x)
  • benchmarkToLowerCase

    • ASCII: 3.029 -> 0.501 ns/codepoint (~6.0x)
    • non-ASCII: 7.145 -> 4.183 ns/codepoint (~1.71x)
  • benchmarkToUpperCase

    • ASCII: 3.053 -> 0.601 ns/codepoint (~5.1x)
    • non-ASCII: 7.254 -> 5.019 ns/codepoint (~1.45x)
  • benchmarkTrimCustom

    • ASCII: 2.702 -> 0.474 ns/codepoint (~5.7x)
    • non-ASCII: 5.224 -> 4.329 ns/codepoint (~1.21x)
  • benchmarkLeftTrim

    • ASCII: 1.919 -> 0.344 ns/codepoint (~5.6x)
    • non-ASCII: 3.137 -> 2.201 ns/codepoint (~1.42x)
  • benchmarkRightTrim

    • ASCII: 0.551 -> 0.359 ns/codepoint (~1.53x)
    • non-ASCII: 2.939 -> 2.534 ns/codepoint (~1.16x)
  • benchmarkToCodePointsApi (ns/byte)

    • ASCII: 2.4902 -> 0.2319 (~10.7x vs two-pass baseline)
    • non-ASCII: 1.6643 -> 1.0820 (~1.54x vs two-pass baseline)
  • benchmarkFromCodePointsApi

    • ASCII: 0.500 -> 0.326 ns/codepoint (~1.53x)
    • non-ASCII: 3.230 -> 2.062 ns/codepoint (~1.57x)
  • benchmarkFixInvalidUtf8WithoutReplacement (inputLength=1024, ns/byte)

    • valid non-ASCII: 6.341 -> 3.978 (~1.59x)
    • invalid non-ASCII: 6.242 -> 4.549 (~1.37x)
  • benchmarkReverse

    • ASCII: 0.318 -> 0.067 ns/codepoint (~4.7x)
    • non-ASCII: 3.397 -> 3.406 ns/codepoint (flat/noise)
  • codePointByteLengths helper benchmark (length=128)

    • ASCII: 1.020 -> 0.696 ns/codepoint (~1.47x)
    • non-ASCII: 3.596 -> 2.129 ns/codepoint (~1.69x)

Small-string sanity (tail paths)

Ran a dedicated JMH sanity pass at lengths 7 and 31 (deliberately not multiples of 8, with ascii=true and ascii=false) for:
compareUtf16BE, toLowerCase, toUpperCase, trimCustom, toCodePointsApi, and fromCodePointsApi.

  • compareUtf16BE:
    • ASCII 7.332 / 10.781 ns/op (len=7 / 31)
    • non-ASCII 41.308 / 193.126 ns/op
  • fromCodePointsApi:
    • ASCII 7.702 / 14.937 ns/op
    • non-ASCII 25.337 / 64.678 ns/op
  • toCodePointsApi:
    • ASCII 5.843 / 12.206 ns/op
    • non-ASCII 29.565 / 123.929 ns/op
  • toLowerCase:
    • ASCII 12.790 / 29.095 ns/op
    • non-ASCII 27.257 / 124.587 ns/op
  • toUpperCase:
    • ASCII 7.894 / 23.449 ns/op
    • non-ASCII 36.579 / 124.978 ns/op
  • trimCustom:
    • ASCII 18.294 / 29.761 ns/op
    • non-ASCII 54.013 / 170.847 ns/op

Conclusion: no obvious small-string regressions; short-input behavior is consistent with expected fixed-overhead effects.

@dain dain requested review from electrum and wendigo March 7, 2026 17:23
@dain dain force-pushed the user/dain/sliceutf8-perf branch from 4e2d22a to 3c7465e on March 7, 2026 22:09

wendigo commented Mar 8, 2026

@dain do you think we can add a snapshot release workflow, release it and benchmark it in Trino?

Review comment on this diff hunk:

return isAsciiRaw(utf8, offset, length);
}

private static boolean isAsciiRaw(byte[] utf8, int utf8Offset, int utf8Length)
Contributor comment:

Why operate on a byte[] rather than Slice? This duplicates the LONG_HANDLEs that are already defined in Slice

Member Author (dain) reply:

Two reasons:

  1. It makes it much easier to see where the bounds checks are.
  2. It is easier for the JVM to optimize when it is working with raw byte arrays (that said, the JVM is also very good at unwinding the stack).

As a side reason, there are cases where you are already operating on a raw byte array, and having to wrap it in a Slice just to do some basic string manipulation is annoying.
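
The byte[]-first pattern under discussion can be sketched like this. The names (`SliceView`, `Utf8Core`) are hypothetical stand-ins: `SliceView` is a toy record playing the role of airlift's Slice, just to show the delegation shape, not the library's actual API.

```java
// Hypothetical stand-in for a Slice-like view over a byte array.
record SliceView(byte[] bytes, int offset, int length) {}

final class Utf8Core
{
    // byte[]-first core: bounds are validated once at the API boundary,
    // and the inner loop works directly on the raw array.
    static boolean isAscii(byte[] utf8, int offset, int length)
    {
        java.util.Objects.checkFromIndexSize(offset, length, utf8.length);
        for (int i = offset; i < offset + length; i++) {
            if (utf8[i] < 0) {  // negative byte means the 0x80 bit is set
                return false;
            }
        }
        return true;
    }

    // The wrapper overload simply unwraps and delegates to the raw-array core.
    static boolean isAscii(SliceView slice)
    {
        return isAscii(slice.bytes(), slice.offset(), slice.length());
    }
}
```

Callers holding a raw byte[] can use the core directly, without allocating a wrapper first.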


dain commented Mar 8, 2026

@wendigo if you put up a PR for the workflow you're asking about, I'll check it in.


wendigo commented Mar 9, 2026

@dain #191

dain added 15 commits March 9, 2026 09:53
Unwrapping the slice makes it easier to see what is happening in these
algorithms and makes them easier to optimize. Additionally, this makes
these functions usable without having to wrap a byte array in a Slice.
Benchmark (benchmarkCompareUtf16BE, length=1000):

- ascii=true: 3.483 -> 0.102 ns/codepoint

- ascii=false: 8.214 -> 6.395 ns/codepoint
Benchmark (benchmarkReverse, length=1000):

- ascii=true: 0.318 -> 0.067 ns/codepoint

- ascii=false: 3.397 -> 3.406 ns/codepoint (flat within noise)
Benchmark (benchmarkToUpperCase, length=1000):

- ascii=true: 3.053 -> 0.601 ns/codepoint

- ascii=false: 7.254 -> 5.019 ns/codepoint
Benchmark (benchmarkToLowerCase, length=1000):

- ascii=true: 3.029 -> 0.501 ns/codepoint

- ascii=false: 7.145 -> 4.183 ns/codepoint
Benchmark (benchmarkFixInvalidUtf8WithoutReplacement, inputLength=1024):

- valid_non_ascii: 6.341 -> 3.978 ns/byte

- invalid_non_ascii: 6.242 -> 4.549 ns/byte
Benchmark (benchmarkLeftTrim, length=1000):

- ascii=true: 1.919 -> 0.344 ns/codepoint

- ascii=false: 3.137 -> 2.201 ns/codepoint
Benchmark (benchmarkRightTrim, length=1000):

- ascii=true: 0.551 -> 0.359 ns/codepoint

- ascii=false: 2.939 -> 2.534 ns/codepoint
Benchmark (benchmarkTrimCustom, length=1000):

- ascii=true: 2.702 -> 0.474 ns/codepoint

- ascii=false: 5.224 -> 4.329 ns/codepoint
Benchmark (benchmarkSetCodePointAt, length=1000):

- ascii=true: 0.336 -> 0.332 ns/codepoint

- ascii=false: 2.259 -> 2.334 ns/codepoint

Related benchmark (benchmarkCodePointToUtf8, length=1000):

- ascii=false: 2.404 -> 2.154 ns/codepoint
Useful for Trino VARCHAR->code points casts and similar decode loops.

Benchmark (ns/byte, length=1000):

- toCodePointsApi ascii: 0.2319 (baseline two-pass: 2.4902)

- toCodePointsApi non-ascii: 1.0820 (baseline two-pass: 1.6643)
Adds fromCodePoints to encode code-point arrays directly into UTF-8
Slice output. This is useful for Trino-style loops that currently
pre-size and encode with repeated setCodePointAt calls.

Benchmark (SliceUtf8Benchmark, length=1000 code points):

- ascii=true: fromCodePointsApi 0.326 ns/codepoint vs Trino baseline 0.500 ns/codepoint

- ascii=false: fromCodePointsApi 2.062 ns/codepoint vs Trino baseline 3.230 ns/codepoint
Adds codePointByteLengths so callers can decode UTF-8 once and directly
materialize per-code-point byte widths (1..4) for padding/loop planning.

Benchmark (SliceUtf8Benchmark, length=128 code points):

- ascii=true: helper(byte[]) 0.696 ns/codepoint vs Trino byte[] baseline 1.020 ns/codepoint

- ascii=false: helper(byte[]) 2.129 ns/codepoint vs Trino byte[] baseline 3.596 ns/codepoint
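The per-code-point width computation described above can be sketched from the UTF-8 lead-byte ranges. This is an illustrative helper (hypothetical name `codePointByteLengthsSketch`, not the PR's code) that assumes well-formed input:

```java
final class Utf8Widths
{
    // Derive each code point's encoded width (1..4 bytes) from its lead byte.
    // Assumes well-formed UTF-8; continuation bytes (10xxxxxx) never appear
    // as lead bytes in valid input.
    static int[] codePointByteLengthsSketch(byte[] utf8, int offset, int length)
    {
        int[] widths = new int[length];  // worst case: every byte is ASCII
        int count = 0;
        for (int i = offset; i < offset + length; ) {
            int b = utf8[i] & 0xFF;
            int width =
                    b < 0x80 ? 1 :   // 0xxxxxxx: ASCII
                    b < 0xE0 ? 2 :   // 110xxxxx: two-byte sequence
                    b < 0xF0 ? 3 :   // 1110xxxx: three-byte sequence
                    4;               // 11110xxx: four-byte sequence
            widths[count++] = width;
            i += width;
        }
        return java.util.Arrays.copyOf(widths, count);
    }
}
```

Decoding once and reusing the widths is what lets padding and loop-planning callers avoid a second pass over the bytes.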

wendigo commented Mar 9, 2026

@dain you'd need to push it as a branch to upstream in order to release a snapshot

@dain dain force-pushed the user/dain/sliceutf8-perf branch from 3c7465e to 605b373 on March 9, 2026 16:55
@dain dain closed this Mar 9, 2026