[Performance] Optimize TermsQueryBuilder serialization for large homogeneous term lists #20445

@kanatti

Is your feature request related to a problem? Please describe

We observed high latencies on some of our search requests that use terms queries with large term lists (thousands of terms) when queries hit many shards. Hot threads showed significant CPU time spent in the transport serialization path:

  Node: data-eu-south-2b-1-19 (HsYQ-G0FRmam5NrgnKQoqw)
    Thread: opensearch[data-eu-south-2b-1-19][transport_worker][T#15]
    CPU: 100.3% (1s out of 1s)
      Stack trace:
        java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3383)
        java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3483)
        org.opensearch.core.common.io.stream.Writeable$WriteableRegistry.getCustomClassFromInstance(Writeable.java:110)
        org.opensearch.core.common.io.stream.StreamOutput.getGenericType(StreamOutput.java:791)
        org.opensearch.core.common.io.stream.StreamOutput.writeGenericValue(StreamOutput.java:837)
        org.opensearch.core.common.io.stream.StreamOutput.lambda$static$10(StreamOutput.java:703)
        org.opensearch.core.common.io.stream.StreamOutput.writeGenericValue(StreamOutput.java:839)
        org.opensearch.index.query.TermsQueryBuilder.doWriteTo(TermsQueryBuilder.java:215)
        ...

The cost multiplies with the number of shards since the query is serialized once per shard.

Describe the solution you'd like

TermsQueryBuilder already has an optimization in convert() that compacts homogeneous term lists into efficient representations:

  • All numbers → backed by long[]
  • All strings → backed by single BytesReference + int[] offsets

TermsQueryBuilder.java#L332-L387
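
To make the two compact representations concrete, here is a minimal, self-contained sketch of the idea. It is not the OpenSearch code: the class names (CompactTermLists, LongList, StringList) are hypothetical, and a plain byte[] stands in for BytesReference.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractList;
import java.util.List;

// Hypothetical illustration only; names and layout are assumptions, not OpenSearch APIs.
public class CompactTermLists {

    /** Read-only list view over a primitive long[]; elements are boxed only on access. */
    static final class LongList extends AbstractList<Long> {
        final long[] values;

        LongList(long[] values) {
            this.values = values;
        }

        @Override
        public Long get(int i) {
            return values[i];
        }

        @Override
        public int size() {
            return values.length;
        }
    }

    /** Read-only list view over concatenated UTF-8 bytes; endOffsets[i] marks the end of element i. */
    static final class StringList extends AbstractList<String> {
        final byte[] bytes;      // stand-in for a single BytesReference
        final int[] endOffsets;

        StringList(List<String> terms) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            endOffsets = new int[terms.size()];
            int i = 0;
            for (String term : terms) {
                byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
                buf.write(utf8, 0, utf8.length);
                endOffsets[i++] = buf.size();
            }
            bytes = buf.toByteArray();
        }

        @Override
        public String get(int i) {
            int start = (i == 0) ? 0 : endOffsets[i - 1];
            return new String(bytes, start, endOffsets[i] - start, StandardCharsets.UTF_8);
        }

        @Override
        public int size() {
            return endOffsets.length;
        }
    }
}
```

Both views keep the whole list in one or two flat arrays, which is what makes a bulk write possible in principle.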

However, serialization does not take advantage of this optimization. The doWriteTo() method calls out.writeGenericValue(values), which dispatches to the generic List writer. For each element, writeGenericValue calls getGenericType() → WriteableRegistry.getCustomClassFromInstance(), which iterates over all registered custom classes in a ConcurrentHashMap.
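
The toy model below illustrates why this scales poorly: if every element triggers a scan over a registry, the work is roughly (number of terms) × (number of registered classes). Everything here (PerElementLookupCost, REGISTRY, PROBES, resolveType) is a hypothetical stand-in for illustration and does not reproduce the actual WriteableRegistry logic.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy model only; NOT the OpenSearch WriteableRegistry.
public class PerElementLookupCost {
    // Stand-in registry of "custom classes", keyed by an arbitrary name.
    static final Map<String, Class<?>> REGISTRY = new ConcurrentHashMap<>();
    // Counts how many registry entries were inspected.
    static final AtomicLong PROBES = new AtomicLong();

    /** Scans every registered class for a match, falling back to the runtime class. */
    static Class<?> resolveType(Object value) {
        for (Class<?> candidate : REGISTRY.values()) {
            PROBES.incrementAndGet();
            if (candidate.isInstance(value)) {
                return candidate;
            }
        }
        return value.getClass();
    }

    /** Returns the number of registry probes needed to resolve a type for each element. */
    static long probesForGenericWrite(List<?> values) {
        PROBES.set(0);
        for (Object value : values) {
            resolveType(value);
        }
        return PROBES.get();
    }
}
```

In this model, a list whose elements match none of the registered classes pays the full scan for every element, whereas a homogeneous list only needs its type established once.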

We can optimize this by adding specialized serialization in TermsQueryBuilder that detects the compact list representations and uses bulk serialization:

  FORMAT_LONG (1):    marker byte + long[] array (bulk write)
  FORMAT_STRING (2):  marker byte + BytesReference + int[] offsets (bulk write)
  FORMAT_GENERIC (0): marker byte + writeGenericValue (existing fallback)

This requires detecting the compact AbstractList implementations created by convert() and serializing their backing data directly rather than element-by-element.
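
As a rough sketch of the marker-byte layout described above, the following uses java.nio.ByteBuffer in place of StreamOutput so that it runs on its own. The class and method names (TermsWireFormatSketch, writeLongs, writeStrings, readLongs) and the fixed-width length prefixes are assumptions for illustration; an actual change would use the StreamOutput/StreamInput primitives and whatever encoding they provide.

```java
import java.nio.ByteBuffer;

// Hypothetical wire-format sketch; not the proposed OpenSearch implementation.
public class TermsWireFormatSketch {
    static final byte FORMAT_GENERIC = 0; // existing element-by-element fallback (not shown)
    static final byte FORMAT_LONG = 1;
    static final byte FORMAT_STRING = 2;

    /** marker byte + element count + the long[] written in one bulk put. */
    static byte[] writeLongs(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES + Long.BYTES * values.length);
        buf.put(FORMAT_LONG);
        buf.putInt(values.length);
        buf.asLongBuffer().put(values); // bulk copy, no per-element type lookup
        return buf.array();
    }

    /** marker byte + byte length + concatenated term bytes + offset count + int[] offsets. */
    static byte[] writeStrings(byte[] bytes, int[] endOffsets) {
        ByteBuffer buf = ByteBuffer.allocate(
            1 + Integer.BYTES + bytes.length + Integer.BYTES + Integer.BYTES * endOffsets.length);
        buf.put(FORMAT_STRING);
        buf.putInt(bytes.length);
        buf.put(bytes);
        buf.putInt(endOffsets.length);
        buf.asIntBuffer().put(endOffsets);
        return buf.array();
    }

    /** Reads back a FORMAT_LONG payload; rejects any other marker. */
    static long[] readLongs(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire);
        byte marker = buf.get();
        if (marker != FORMAT_LONG) {
            throw new IllegalArgumentException("unexpected format marker: " + marker);
        }
        long[] values = new long[buf.getInt()];
        buf.asLongBuffer().get(values);
        return values;
    }
}
```

The reader dispatches on the first byte, so lists that are not in a compact representation can keep using the FORMAT_GENERIC path unchanged.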

We tried this internally on our clusters, for some very specific call patterns with large term lists and many shards in scope:

  Metric   Before    After     Improvement
  P50      3378 ms   2124 ms   37%
  P90      5486 ms   3112 ms   43%
  P99      6594 ms   3473 ms   47%

Looking for input on this approach. I can open a pull request if it sounds good.

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response
