Closed
Description
The doc values codec iterates several times over each doc values instance that needs to be written to disk. During a merge with index sorting enabled this is much more expensive, because every iteration over the doc values instance performs a merge sort (in order to return the doc ids from the different segments in index sort order).
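To illustrate why repeated iteration is costly, here is a minimal sketch in plain Java that does not use the Lucene API. It assumes each segment exposes its documents as an ascending array of doc ids already mapped into the merged segment, and merges them with a priority queue. The class `MergeSortedIterationSketch`, the `SegmentCursor` type and the `heapOperations` counter are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSortedIterationSketch {

    /** Hypothetical cursor over one segment's doc ids, mapped to the merged segment (ascending). */
    static final class SegmentCursor {
        final int[] mappedDocIds;
        int pos = 0;

        SegmentCursor(int[] mappedDocIds) {
            this.mappedDocIds = mappedDocIds;
        }

        int current() {
            return mappedDocIds[pos];
        }

        boolean advance() {
            return ++pos < mappedDocIds.length;
        }
    }

    /** Counts priority queue insertions and removals across all passes. */
    static int heapOperations = 0;

    /** One full iteration: a k-way merge that yields doc ids in merged order. */
    static List<Integer> iterateMerged(int[][] segments) {
        PriorityQueue<SegmentCursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (int[] segment : segments) {
            if (segment.length > 0) {
                queue.add(new SegmentCursor(segment));
                heapOperations++;
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            // Remove the cursor before advancing it, so the queue never holds a stale key.
            SegmentCursor top = queue.poll();
            heapOperations++;
            merged.add(top.current());
            if (top.advance()) {
                queue.add(top);
                heapOperations++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        int[][] segments = { { 0, 3, 4 }, { 1, 5 }, { 2, 6, 7 } };
        // Each pass over the merged view repeats the whole k-way merge.
        for (int pass = 1; pass <= 4; pass++) {
            List<Integer> merged = iterateMerged(segments);
            System.out.println("pass " + pass + ": " + merged + ", heap operations so far: " + heapOperations);
        }
    }
}
```

In this sketch the heap work grows linearly with the number of passes, which is why reducing the number of iterations pays off when index sorting is enabled.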
There are several reasons why the doc value instance is iterated multiple times:
- To compute stats (num values, number of docs with value) required for writing values to disk.
- To write the bitset that indicates which documents have a value (IndexedDISI, jump table).
- To write the actual values to disk.
- To write the addresses to disk (in case docs have multiple values)
This applies to numeric doc values, but also to the ordinals of sorted (set) doc values.
The following changes should be made to address this performance issue:
- Change the tsdb doc values format to allow storing numDocsWithField as metadata and to store the jump table after the values (Prepare tsdb doc values format for merging optimizations. #125933).
- Reuse statistics from the metadata during merging instead of computing them on the fly by creating a merged SortedNumericDocValues (First step optimizing tsdb doc values codec merging. #125403).
- Keep track of documents with a value while iterating over values and use that to write the jump table later (Tsdb doc values inline building jump table #126499).
- Keep track of docValueCount while iterating over values and write it later for the address offsets (Coalesce getSortedNumeric calls for ES819 doc values merging #126732).
- Optimize merging of binary doc values by accumulating offsets and the DISI, so that we iterate only once (Apply TSDB jump table and offset construction optimizations to binary doc values #127278).
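The general idea behind these changes, collecting everything in one pass, can be sketched as follows. This is a simplified illustration in plain Java and not the codec's actual implementation: it assumes the merged doc values are available as parallel arrays of doc ids and per-document values, and it buffers everything in memory. The names `SinglePassAccumulationSketch`, `Result` and `consumeOnce` are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SinglePassAccumulationSketch {

    /** Everything the writer needs, gathered during a single iteration. */
    static final class Result {
        int numDocsWithField;                              // stats
        long numValues;                                    // stats
        final BitSet docsWithField = new BitSet();         // input for the docs-with-value structure
        final List<Long> values = new ArrayList<>();       // the values themselves
        final List<Long> addresses = new ArrayList<>();    // start offset of each doc's values
    }

    /**
     * Iterates the (already merged) documents exactly once.
     * docIds[i] is the doc id and valuesPerDoc[i] holds that document's values.
     */
    static Result consumeOnce(int[] docIds, long[][] valuesPerDoc) {
        Result result = new Result();
        result.addresses.add(0L);
        for (int i = 0; i < docIds.length; i++) {
            long[] docValues = valuesPerDoc[i];
            result.numDocsWithField++;
            result.docsWithField.set(docIds[i]);
            for (long value : docValues) {
                result.values.add(value);
            }
            // The per-document value count feeds the running address offset.
            result.numValues += docValues.length;
            result.addresses.add(result.numValues);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docIds = { 0, 2, 5 };
        long[][] valuesPerDoc = { { 10 }, { 20, 21 }, { 30 } };
        Result result = consumeOnce(docIds, valuesPerDoc);
        System.out.println("numDocsWithField=" + result.numDocsWithField);
        System.out.println("numValues=" + result.numValues);
        System.out.println("docsWithField=" + result.docsWithField);
        System.out.println("addresses=" + result.addresses);
    }
}
```

Compared with separate passes for stats, the docs-with-value bitset, values and addresses, this trades a small amount of buffering for a single traversal of the merge-sorted view.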
felixbarny