-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Speed up loading keyword fields with index sorts #132950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ce26e27 to
b2ae4bf
Compare
b2ae4bf to
503b524
Compare
3081cfa to
8639922
Compare
165df8b to
e24a8e0
Compare
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this looks great - LGTM.
| try (var reader = DirectoryReader.open(iw)) { | ||
| int gaugeIndex = numDocs; | ||
| for (var leaf : reader.leaves()) { | ||
| var timestampDV = getBulkNumericDocValues(leaf.reader(), timestampField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like we talked about we need to adjust these test to also use sorted(set) doc values.
| readFields(in, state.fieldInfos); | ||
|
|
||
| final var indexSort = state.segmentInfo.getIndexSort(); | ||
| if (indexSort != null && indexSort.getSort().length > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 - good idea to extract this information from segment info, no need to check for index setting.
| if (lookaheadBlockIndex + 1 != blockIndex) { | ||
| lookaheadData.seek(indexReader.get(blockIndex)); | ||
| } | ||
| if (maxOrd >= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this is only invoked from getSorted(...) and in that case maxOrd is always >= 0.
I think we should keep this branch, otherwise we can't reuse lookAheadValueAt(...) for numeric doc value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe replace this with a switch statement:
switch (maxOrdAsInt) {
case -1:
decoder.decode(lookaheadData, lookaheadBlock);
break;
default:
decoder.decodeOrdinals(lookaheadData, lookaheadBlock, bitsPerOrd);
}
I think this is a little bit more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried but Intellij suggested this 91d32d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 - I suspect that is also easier to be optimized at runtime by jvm.
| * to load the requested values, for example due to unsupported underlying data. | ||
| * This allows callers to optimistically try optimized loading strategies first, and fall back if necessary. | ||
| */ | ||
| interface OptionalColumnAtATimeReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 This is better than previous abstraction (BulkNumericDocValues).
| } | ||
|
|
||
| @Override | ||
| public BytesRefBlock constantBytes(BytesRef value, int count) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this also be used in SingletonOrdinalsBuilder#tryBuildConstantBlock(...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point. Unfortunately, we don't have DelegatingBlockLoaderFactory here. I will move this to BlockFactory in a follow-up.
| final class ES819TSDBDocValuesProducer extends DocValuesProducer { | ||
| final IntObjectHashMap<NumericEntry> numerics; | ||
| private int primarySortFieldNumber = -1; | ||
| private boolean primarySortFieldReversed = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this field can be converted to a variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I removed a0ce374
|
|
||
| @Override | ||
| long lookAheadValueAt(int targetDoc) throws IOException { | ||
| return 0L; // Only one ordinal! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
easy one :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test found it :)
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @dnhatn, I've created a changelog YAML for you. |
* upstream/main: (32 commits) Speed up loading keyword fields with index sorts (elastic#132950) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964 Simplify EsqlSession (elastic#132848) Implement WriteLoadConstraintDecider#canAllocate (elastic#132041) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965 Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941) Breakdown undesired allocations by shard routing role (elastic#132235) Implement v_magnitude function (elastic#132765) Introduce execution location marker for better handling of remote/local compatibility (elastic#132205) Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957 Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946) Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956 Fix failing UT by adding a required capability (elastic#132947) Precompute the BitsetCacheKey hashCode (elastic#132875) Adding simulate ingest effective mapping (elastic#132833) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948 Rename skipping logic to remove hard link to skip_unavailable (elastic#132861) Store ignored source in unique stored fields per entry (elastic#132142) Add random tests with match_only_text multi-field (elastic#132380) ...
Reading keyword fields that are the primary sort in the index can be sped up by skip reading, as identical values are stored together. In this case, we can use values from the doc_values skipper instead of loading values from doc_values. However, the doc_values skipper is not enabled yet. Here, we use two buffers when reading ordinals: one for the beginning of the block and one for the end. If both return the same value, we can skip the middle. There is a follow-up step where we fill the values in the middle until we reach the last value. This optimization should speed up time-series queries.
…-stats * upstream/main: (36 commits) Fix reproducability of builds against Java EA versions (elastic#132847) Speed up loading keyword fields with index sorts (elastic#132950) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964 Simplify EsqlSession (elastic#132848) Implement WriteLoadConstraintDecider#canAllocate (elastic#132041) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965 Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941) Breakdown undesired allocations by shard routing role (elastic#132235) Implement v_magnitude function (elastic#132765) Introduce execution location marker for better handling of remote/local compatibility (elastic#132205) Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957 Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946) Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956 Fix failing UT by adding a required capability (elastic#132947) Precompute the BitsetCacheKey hashCode (elastic#132875) Adding simulate ingest effective mapping (elastic#132833) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948 Rename skipping logic to remove hard link to skip_unavailable (elastic#132861) Store ignored source in unique stored fields per entry (elastic#132142) ...
Reading keyword fields that are the primary sort in the index can be sped up by skip reading, as identical values are stored together. In this case, we can use values from the doc_values skipper instead of loading values from doc_values. However, the doc_values skipper is not enabled yet. Here, we use two buffers when reading ordinals: one for the beginning of the block and one for the end. If both return the same value, we can skip the middle. There is a follow-up step where we fill the values in the middle until we reach the last value. This optimization should speed up time-series queries.