Use direct byte[] utf-8 conversions #136053

Tim-Brooks · 2025-10-06T19:09:51Z

Currently Elasticsearch is using StandardCharsets#decode and encode
methods when working with optimized text. These variants are not as
performant as the direct implementations in String when working with
byte[]. If we are going to one-shot convert without validation then the
String variants should be preferred.
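
For readers unfamiliar with the two variants, here is a minimal sketch of what is being compared. The class and method names (`Utf8Conversions`, `encodeViaBuffer`, and so on) are hypothetical and chosen for illustration; they are not the names used in the Elasticsearch code, and only standard JDK calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Conversions {

    // Buffer-based encode: Charset#encode returns a ByteBuffer that is
    // then copied into a right-sized byte[].
    static byte[] encodeViaBuffer(String s) {
        ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // Direct encode: String produces the byte[] itself.
    static byte[] encodeDirect(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Buffer-based decode: wrap the bytes, decode to a CharBuffer,
    // then materialize a String from it.
    static String decodeViaBuffer(byte[] bytes, int offset, int length) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes, offset, length)).toString();
    }

    // Direct decode: the String constructor reads from the byte[] range.
    static String decodeDirect(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}
```

For well-formed input both pairs should return identical results, so the choice between them comes down to cost rather than output.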


elasticsearchmachine · 2025-10-06T19:10:16Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Tim-Brooks · 2025-10-06T19:15:13Z

I noticed this a few months ago when I was benchmarking something, and it seemed to me that we should switch variants. I can't think of a scenario where we would prefer the byte buffer encoder variant for String -> bytes. I could see a scenario where we would prefer the byte buffer variant for bytes -> String if we wanted to catch UTF-8 errors. However, the default Charset#decode variant uses REPLACE for malformed input, which is the same behavior as the String constructor variant:

    public final CharBuffer decode(ByteBuffer bb) {
        try {
            return ThreadLocalCoders.decoderFor(this)
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(bb);
        } catch (CharacterCodingException x) {
            throw new Error(x);         // Can't happen
        }
    }

I would note that this is probably different behavior than Jackson would give prior to this change. Are we okay with silently replacing invalid UTF-8 input?
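
To make the question concrete, here is a small sketch contrasting lenient and strict decoding. The class and method names are hypothetical; the strict path is just one way to surface malformed input using the standard `CharsetDecoder` API, not something this PR adds.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class MalformedUtf8Demo {

    // Lenient: malformed sequences are replaced with U+FFFD instead of
    // raising an error.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: a decoder configured with REPORT throws a
    // CharacterCodingException on malformed input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }
}
```

With an input such as `{ 'a', (byte) 0xFF, 'b' }`, the lenient path yields a string containing a replacement character while the strict path throws, which is the behavioral difference being asked about.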

@jordan-powers - added the code
@felixbarny - recently worked with the code
@martijnvg - reviewed a change I did a while back related to optimized input

Tim-Brooks · 2025-10-06T19:15:47Z

UTF8StringBytesBenchmark.getBytesByteBufferEncoder           uuid  avgt    3    36.610 ±   2.235  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder          short  avgt    3    36.502 ±   5.117  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder           long  avgt    3   122.581 ±  10.805  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder       nonAscii  avgt    3   244.617 ± 101.226  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder       veryLong  avgt    3  1149.046 ±  82.400  ns/op

UTF8StringBytesBenchmark.getBytesJDK                         uuid  avgt    3     3.772 ±   1.984  ns/op
UTF8StringBytesBenchmark.getBytesJDK                        short  avgt    3     3.723 ±   1.799  ns/op
UTF8StringBytesBenchmark.getBytesJDK                         long  avgt    3     6.737 ±   3.997  ns/op
UTF8StringBytesBenchmark.getBytesJDK                     nonAscii  avgt    3   134.981 ±  22.179  ns/op
UTF8StringBytesBenchmark.getBytesJDK                     veryLong  avgt    3    31.704 ±   0.360  ns/op

UTF8StringBytesBenchmark.getBytesUnicodeUtils                uuid  avgt    3    29.224 ±  23.463  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils               short  avgt    3    29.634 ±  23.561  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils                long  avgt    3    44.193 ±   8.370  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils            nonAscii  avgt    3   196.680 ±   9.583  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils            veryLong  avgt    3   417.743 ±  14.938  ns/op

UTF8StringBytesBenchmark.getStringByteBufferDecoder          uuid  avgt    3    20.832 ±   0.228  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder         short  avgt    3    20.872 ±   0.444  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder          long  avgt    3    27.744 ±   0.130  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder      nonAscii  avgt    3   174.942 ±  36.993  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder      veryLong  avgt    3   126.116 ±  13.467  ns/op

UTF8StringBytesBenchmark.getStringJDK                        uuid  avgt    3     6.353 ±   0.564  ns/op
UTF8StringBytesBenchmark.getStringJDK                       short  avgt    3     6.497 ±   2.621  ns/op
UTF8StringBytesBenchmark.getStringJDK                        long  avgt    3    10.237 ±   0.547  ns/op
UTF8StringBytesBenchmark.getStringJDK                    nonAscii  avgt    3   161.460 ±  13.172  ns/op
UTF8StringBytesBenchmark.getStringJDK                    veryLong  avgt    3    34.948 ±   2.518  ns/op


Tim-Brooks requested review from martijnvg, felixbarny and jordan-powers October 6, 2025 19:09

Tim-Brooks requested a review from a team as a code owner October 6, 2025 19:09

Tim-Brooks added the >non-issue, :Distributed Indexing/CRUD, and v9.3.0 labels Oct 6, 2025

elasticsearchmachine added the Team:Distributed Indexing label Oct 6, 2025

Merge remote-tracking branch 'origin/main' into use_string_to_byte_ar…

9bfd281

…ray_methods
