Skip to content

Conversation

Tim-Brooks
Copy link
Contributor

Currently Elasticsearch is using StandardCharsets#decode and encode
methods when working with optimized text. These variants are not as
performant as the direct implementations in String when working with
byte[]. If we are going to one-shot convert without validation then the
String variants should be preferred.

Currently Elasticsearch is using StandardCharsets#decode and encode
methods when working with optimized text. These variants are not as
performant as the direct implementations in String when working with
byte[]. If we are going to one-shot convert without validation then the
String variants should be preferred.
@Tim-Brooks Tim-Brooks requested a review from a team as a code owner October 6, 2025 19:09
@Tim-Brooks Tim-Brooks added >non-issue :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. v9.3.0 labels Oct 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Oct 6, 2025
@Tim-Brooks
Copy link
Contributor Author

Tim-Brooks commented Oct 6, 2025

I noticed this a few months ago when I was benchmarking something and it seemed to be we should switch variants. I can't think of a scenario where we would prefer the byte buffer encoder variant for String -> bytes. I could see the scenario where we would prefer the byte buffer variant for byte -> String if we wanted to catch UTF-8 errors. However, the default malformed input variant uses replace which is the same as the String ctor variant.

    public final CharBuffer decode(ByteBuffer bb) {
        try {
            return ThreadLocalCoders.decoderFor(this)
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(bb);
        } catch (CharacterCodingException x) {
            throw new Error(x);         // Can't happen
        }
    }

I would note that this is probably different behavior than Jackson would give prior to this change. Are we okay with silently replacing invalid UTF-8 input?

@jordan-powers - added the code
@felixbarny - recently worked with the code
@martijnvg - reviewed a change a did a while back related to optimized input

@Tim-Brooks
Copy link
Contributor Author

UTF8StringBytesBenchmark.getBytesByteBufferEncoder           uuid  avgt    3    36.610 ±   2.235  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder          short  avgt    3    36.502 ±   5.117  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder           long  avgt    3   122.581 ±  10.805  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder       nonAscii  avgt    3   244.617 ± 101.226  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder       veryLong  avgt    3  1149.046 ±  82.400  ns/op

UTF8StringBytesBenchmark.getBytesJDK                         uuid  avgt    3     3.772 ±   1.984  ns/op
UTF8StringBytesBenchmark.getBytesJDK                        short  avgt    3     3.723 ±   1.799  ns/op
UTF8StringBytesBenchmark.getBytesJDK                         long  avgt    3     6.737 ±   3.997  ns/op
UTF8StringBytesBenchmark.getBytesJDK                     nonAscii  avgt    3   134.981 ±  22.179  ns/op
UTF8StringBytesBenchmark.getBytesJDK                     veryLong  avgt    3    31.704 ±   0.360  ns/op

UTF8StringBytesBenchmark.getBytesUnicodeUtils                uuid  avgt    3    29.224 ±  23.463  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils               short  avgt    3    29.634 ±  23.561  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils                long  avgt    3    44.193 ±   8.370  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils            nonAscii  avgt    3   196.680 ±   9.583  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils            veryLong  avgt    3   417.743 ±  14.938  ns/op

UTF8StringBytesBenchmark.getStringByteBufferDecoder          uuid  avgt    3    20.832 ±   0.228  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder         short  avgt    3    20.872 ±   0.444  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder          long  avgt    3    27.744 ±   0.130  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder      nonAscii  avgt    3   174.942 ±  36.993  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder      veryLong  avgt    3   126.116 ±  13.467  ns/op

UTF8StringBytesBenchmark.getStringJDK                        uuid  avgt    3     6.353 ±   0.564  ns/op
UTF8StringBytesBenchmark.getStringJDK                       short  avgt    3     6.497 ±   2.621  ns/op
UTF8StringBytesBenchmark.getStringJDK                        long  avgt    3    10.237 ±   0.547  ns/op
UTF8StringBytesBenchmark.getStringJDK                    nonAscii  avgt    3   161.460 ±  13.172  ns/op
UTF8StringBytesBenchmark.getStringJDK                    veryLong  avgt    3    34.948 ±   2.518  ns/op

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >non-issue Team:Distributed Indexing Meta label for Distributed Indexing team v9.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants