-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Enable a sparse doc values index for @timestamp in time-series indices
#123191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable a sparse doc values index for @timestamp in time-series indices
#123191
Conversation
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
| public static final IndexVersion TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_011_0_00, Version.LUCENE_10_1_0); | ||
| public static final IndexVersion TIME_SERIES_ID_DOC_VALUES_SPARSE_INDEX = def(9_012_0_00, Version.LUCENE_10_1_0); | ||
| public static final IndexVersion SYNTHETIC_SOURCE_STORE_ARRAYS_NATIVELY_KEYWORD = def(9_013_0_00, Version.LUCENE_10_1_0); | ||
| public static final IndexVersion TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_014_0_00, Version.LUCENE_10_1_0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we actually need an index version here, given that doc value skipper is only enabled in snapshot build. The same I think also applies to previous doc value skipper related changes. But let's keep it for now, given that we added also index version for previous doc value skipper related changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't certain we needed the new IndexVersion, but I figured better safe than sorry
| && indexSortConfig.hasSortOnField(fullFieldName) | ||
| && DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName); | ||
| } else if (IndexMode.TIME_SERIES.equals(indexMode)) { | ||
| return indexCreatedVersion.onOrAfter(IndexVersions.TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment here why we don't check for index sorting in case for time series index mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this maybe this can be simplified to this:
return indexCreatedVersion.onOrAfter(IndexVersions.TIMESTAMP_DOC_VALUES_SPARSE_INDEX)
&& useDocValuesSkipper
&& hasDocValues
&& (indexMode == IndexMode.LOGSDB || indexMode == IndexMode.TIME_SERIES)
&& indexSortConfig != null
&& indexSortConfig.hasSortOnField(fullFieldName)
&& DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName);
We don't have to check index sorting for tsdb, but it should aways exist and timestamp should be part of it.
This makes this check easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then should I take out the IndexVersions.TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX and just use the IndexVersions.TIMESTAMP_DOC_VALUES_SPARSE_INDEX for both index modes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is fine. Since this only applies to snapshot builds and bwc isn't an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the latest changes, I left two more comments.
| this.loadedTimestampRange = loadTimestampRange; | ||
| // Also check for the existence of the timestamp field, because sometimes a segment can only contain tombstone documents, | ||
| // which don't have any mapped fields (also not the timestamp field) and just some meta fields like _id, _seq_no etc. | ||
| if (loadTimestampRange && reader.getFieldInfos().fieldInfo(DataStream.TIMESTAMP_FIELD_NAME) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe leave this if statement in tact and add the following check below here:
if (IndexSettings.DOC_VALUES_SKIPPER.isEnabled()) {
DocValuesSkipper skipper = reader.getDocValuesSkipper(DataStream.TIMESTAMP_FIELD_NAME);
assert skipper != null : "no skipper for reader:" + reader + " and parent:" + reader.getContext().parent.reader();
minTimestamp = skipper.minValue();
maxTimestamp = skipper.maxValue();
} else {
PointValues tsPointValues = reader.getPointValues(DataStream.TIMESTAMP_FIELD_NAME);
assert tsPointValues != null
: "no timestamp field for reader:" + reader + " and parent:" + reader.getContext().parent.reader();
minTimestamp = LongPoint.decodeDimension(tsPointValues.getMinPackedValue(), 0);
maxTimestamp = LongPoint.decodeDimension(tsPointValues.getMaxPackedValue(), 0);
}
The reason here is that when loadTimestampRange is requested and timestamp field exists only two scenarios can be true. Either there is PointValues for timestamp field or if feature flag is enabled skipper should exist for timestamp field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, there isn't an easy way to get the current IndexSettings instance into this method. I could update the constructor signature to take some extra info (either the whole IndexSettings instance or just the boolean indicating if doc values skipper is enabled), but it seems easier to just check the FieldInfo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to get an IndexSettings instance, IndexSettings.DOC_VALUES_SKIPPER is a static field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, you're saying I should check the feature flag, not the index setting. But what if the feature flag is enabled but the skipper is disabled by the index setting? Then we still want to check the point values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let's keep this as is then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually still had to update this check. I'm keeping the same logic, still checking the FieldInfo to see if the doc values skipper is enabled, but it seems retrieving the FieldInfo even if loadTimestampRange is false was causing the CI to fail (something to do with the translogInMemorySegmentCount that I don't fully understand).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are cases where we fake the reader. I think when we read from translog, for realtime get. I suspect those IndexReaders sometimes don't work well with field infos.
| && indexSortConfig.hasSortOnField(fullFieldName) | ||
| && DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName); | ||
| } else if (IndexMode.TIME_SERIES.equals(indexMode)) { | ||
| return indexCreatedVersion.onOrAfter(IndexVersions.TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this maybe this can be simplified to this:
return indexCreatedVersion.onOrAfter(IndexVersions.TIMESTAMP_DOC_VALUES_SPARSE_INDEX)
&& useDocValuesSkipper
&& hasDocValues
&& (indexMode == IndexMode.LOGSDB || indexMode == IndexMode.TIME_SERIES)
&& indexSortConfig != null
&& indexSortConfig.hasSortOnField(fullFieldName)
&& DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName);
We don't have to check index sorting for tsdb, but it should aways exist and timestamp should be part of it.
This makes this check easier to read.
| if (loadTimestampRange && info != null) { | ||
| if (info.docValuesSkipIndexType() == DocValuesSkipIndexType.RANGE) { | ||
| DocValuesSkipper skipper = reader.getDocValuesSkipper(DataStream.TIMESTAMP_FIELD_NAME); | ||
| minTimestamp = skipper.minValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an assert here: assert skipper != null : "no skipper for reader:" + reader + " and parent:" + reader.getContext().parent.reader();
daa77ba to
c8607b2
Compare
| // Timeseries indices' leaf readers should be sorted by desc order of their timestamp field, as it allows search time optimizations | ||
| public static final Comparator<LeafReader> TIMESERIES_LEAF_READERS_SORTER = Comparator.comparingLong((LeafReader r) -> { | ||
| try { | ||
| FieldInfo info = r.getFieldInfos().fieldInfo(TIMESTAMP_FIELD_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| if (isIndexed() == false && pointsMetadataAvailable == false && hasDocValues()) { | ||
| // we don't have a quick way to run this check on doc values, so fall back to default assuming we are within bounds | ||
| return Relation.INTERSECTS; | ||
| if (hasDocValuesSkipper() == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you make this change because a test failed or as an optimization? Asking because I don't expect this change to be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not required, I'm doubting whether we should include this particular change now. I prefer to do this change in isolation and add the necessary tests for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a failing test, org.elasticsearch.index.shard.SearchIdleIT.testSearchIdleBoolQueryMatchOneIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did add tests to DateFieldTypeTests.java to test isFieldWithinQuery when the doc values skipper is enabled. But it makes sense to me to make this change in isolation. If you want, I can break this out into a separate PR and just mute the test for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, I can break this out into a separate PR and just mute the test for now
Maybe change testSearchIdleBoolQueryMatchOneIndex(...) test to set index.mapping.use_doc_values_skipper setting to false if doc_values_skipper feature flag is enabled? Then I think the test should pass without adding this logic.
Then we can add this logic with associated tests in a follow up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, follow-up PR #123930 opened.
| this.loadedTimestampRange = loadTimestampRange; | ||
| // Also check for the existence of the timestamp field, because sometimes a segment can only contain tombstone documents, | ||
| // which don't have any mapped fields (also not the timestamp field) and just some meta fields like _id, _seq_no etc. | ||
| if (loadTimestampRange && reader.getFieldInfos().fieldInfo(DataStream.TIMESTAMP_FIELD_NAME) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are cases where we fake the reader. I think when we read from translog, for realtime get. I suspect those IndexReaders sometimes don't work well with field infos.
…ipper (#123930) When running a timestamp range query, as an optimization we check if the query range overlaps with the total range of values within a shard before executing the query on that shard. That way, if the range is disjoint, we can skip execution for that shard. To get the range of values within a shard, we usually use the PointValues index on the shard. However, when the doc values skipper is enabled, the point values are not (as the reason for the skipper is to reduce storage overhead by removing the point values index). In this case, we need to instead get the range of values within the shard by using the skipper. This patch implements that logic. Follow-up to #123191
…ces (elastic#123191) This patch builds on the work done in elastic#122161 by also enabling the sparse doc values index for @timestamp in time-series indices.
…ipper (elastic#123930) When running a timestamp range query, as an optimization we check if the query range overlaps with the total range of values within a shard before executing the query on that shard. That way, if the range is disjoint, we can skip execution for that shard. To get the range of values within a shard, we usually use the PointValues index on the shard. However, when the doc values skipper is enabled, the point values are not (as the reason for the skipper is to reduce storage overhead by removing the point values index). In this case, we need to instead get the range of values within the shard by using the skipper. This patch implements that logic. Follow-up to elastic#123191
This patch builds on the work done in #122161 by also enabling the sparse doc values index for
@timestampin time-series indices.