Skip to content

Conversation

@jordan-powers
Copy link
Contributor

This patch builds on the work done in #122161 by also enabling the sparse doc values index for @timestamp in time-series indices.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Feb 21, 2025
@jordan-powers jordan-powers added >non-issue :StorageEngine/TSDB You know, for Metrics and removed needs:triage Requires assignment of a team area label labels Feb 21, 2025
@jordan-powers jordan-powers self-assigned this Feb 21, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

public static final IndexVersion TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_011_0_00, Version.LUCENE_10_1_0);
public static final IndexVersion TIME_SERIES_ID_DOC_VALUES_SPARSE_INDEX = def(9_012_0_00, Version.LUCENE_10_1_0);
public static final IndexVersion SYNTHETIC_SOURCE_STORE_ARRAYS_NATIVELY_KEYWORD = def(9_013_0_00, Version.LUCENE_10_1_0);
public static final IndexVersion TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_014_0_00, Version.LUCENE_10_1_0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we actually need an index version here, given that doc value skipper is only enabled in snapshot build. The same I think also applies to previous doc value skipper related changes. But let's keep it for now, given that we added also index version for previous doc value skipper related changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't certain we needed the new IndexVersion, but I figured better safe than sorry

&& indexSortConfig.hasSortOnField(fullFieldName)
&& DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName);
} else if (IndexMode.TIME_SERIES.equals(indexMode)) {
return indexCreatedVersion.onOrAfter(IndexVersions.TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment here why we don't check for index sorting in case for time series index mode?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this maybe this can be simplified to this:

return indexCreatedVersion.onOrAfter(IndexVersions.TIMESTAMP_DOC_VALUES_SPARSE_INDEX)
            && useDocValuesSkipper
            && hasDocValues
            && (indexMode == IndexMode.LOGSDB || indexMode == IndexMode.TIME_SERIES)
            && indexSortConfig != null
            && indexSortConfig.hasSortOnField(fullFieldName)
            && DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName);

We don't have to check index sorting for tsdb, but it should aways exist and timestamp should be part of it.
This makes this check easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then should I take out the IndexVersions.TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX and just use the IndexVersions.TIMESTAMP_DOC_VALUES_SPARSE_INDEX for both index modes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that is fine. Since this only applies to snapshot builds and bwc isn't an issue.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the latest changes, I left two more comments.

this.loadedTimestampRange = loadTimestampRange;
// Also check for the existence of the timestamp field, because sometimes a segment can only contain tombstone documents,
// which don't have any mapped fields (also not the timestamp field) and just some meta fields like _id, _seq_no etc.
if (loadTimestampRange && reader.getFieldInfos().fieldInfo(DataStream.TIMESTAMP_FIELD_NAME) != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave this if statement in tact and add the following check below here:

if (IndexSettings.DOC_VALUES_SKIPPER.isEnabled()) {
                DocValuesSkipper skipper = reader.getDocValuesSkipper(DataStream.TIMESTAMP_FIELD_NAME);
                assert skipper != null : "no skipper for reader:" + reader + " and parent:" + reader.getContext().parent.reader();
                minTimestamp = skipper.minValue();
                maxTimestamp = skipper.maxValue();
            } else {
                PointValues tsPointValues = reader.getPointValues(DataStream.TIMESTAMP_FIELD_NAME);
                assert tsPointValues != null
                    : "no timestamp field for reader:" + reader + " and parent:" + reader.getContext().parent.reader();
                minTimestamp = LongPoint.decodeDimension(tsPointValues.getMinPackedValue(), 0);
                maxTimestamp = LongPoint.decodeDimension(tsPointValues.getMaxPackedValue(), 0);
            }

The reason here is that when loadTimestampRange is requested and timestamp field exists only two scenarios can be true. Either there is PointValues for timestamp field or if feature flag is enabled skipper should exist for timestamp field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, there isn't an easy way to get the current IndexSettings instance into this method. I could update the constructor signature to take some extra info (either the whole IndexSettings instance or just the boolean indicating if doc values skipper is enabled), but it seems easier to just check the FieldInfo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to get an IndexSettings instance, IndexSettings.DOC_VALUES_SKIPPER is a static field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, you're saying I should check the feature flag, not the index setting. But what if the feature flag is enabled but the skipper is disabled by the index setting? Then we still want to check the point values

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Let's keep this as is then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually still had to update this check. I'm keeping the same logic, still checking the FieldInfo to see if the doc values skipper is enabled, but it seems retrieving the FieldInfo even if loadTimestampRange is false was causing the CI to fail (something to do with the translogInMemorySegmentCount that I don't fully understand).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are cases where we fake the reader. I think when we read from translog, for realtime get. I suspect those IndexReaders sometimes don't work well with field infos.

&& indexSortConfig.hasSortOnField(fullFieldName)
&& DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName);
} else if (IndexMode.TIME_SERIES.equals(indexMode)) {
return indexCreatedVersion.onOrAfter(IndexVersions.TSDB_TIMESTAMP_DOC_VALUES_SPARSE_INDEX)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this maybe this can be simplified to this:

return indexCreatedVersion.onOrAfter(IndexVersions.TIMESTAMP_DOC_VALUES_SPARSE_INDEX)
            && useDocValuesSkipper
            && hasDocValues
            && (indexMode == IndexMode.LOGSDB || indexMode == IndexMode.TIME_SERIES)
            && indexSortConfig != null
            && indexSortConfig.hasSortOnField(fullFieldName)
            && DataStreamTimestampFieldMapper.DEFAULT_PATH.equals(fullFieldName);

We don't have to check index sorting for tsdb, but it should aways exist and timestamp should be part of it.
This makes this check easier to read.

if (loadTimestampRange && info != null) {
if (info.docValuesSkipIndexType() == DocValuesSkipIndexType.RANGE) {
DocValuesSkipper skipper = reader.getDocValuesSkipper(DataStream.TIMESTAMP_FIELD_NAME);
minTimestamp = skipper.minValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an assert here: assert skipper != null : "no skipper for reader:" + reader + " and parent:" + reader.getContext().parent.reader();

@jordan-powers jordan-powers force-pushed the tsdb-timestamp-sparse-index branch from daa77ba to c8607b2 Compare February 25, 2025 18:53
// Timeseries indices' leaf readers should be sorted by desc order of their timestamp field, as it allows search time optimizations
public static final Comparator<LeafReader> TIMESERIES_LEAF_READERS_SORTER = Comparator.comparingLong((LeafReader r) -> {
try {
FieldInfo info = r.getFieldInfos().fieldInfo(TIMESTAMP_FIELD_NAME);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if (isIndexed() == false && pointsMetadataAvailable == false && hasDocValues()) {
// we don't have a quick way to run this check on doc values, so fall back to default assuming we are within bounds
return Relation.INTERSECTS;
if (hasDocValuesSkipper() == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you make this change because a test failed or as an optimization? Asking because I don't expect this change to be required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not required, I'm doubting whether we should include this particular change now. I prefer to do this change in isolation and add the necessary tests for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a failing test, org.elasticsearch.index.shard.SearchIdleIT.testSearchIdleBoolQueryMatchOneIndex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add tests to DateFieldTypeTests.java to test isFieldWithinQuery when the doc values skipper is enabled. But it makes sense to me to make this change in isolation. If you want, I can break this out into a separate PR and just mute the test for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, I can break this out into a separate PR and just mute the test for now

Maybe change testSearchIdleBoolQueryMatchOneIndex(...) test to set index.mapping.use_doc_values_skipper setting to false if doc_values_skipper feature flag is enabled? Then I think the test should pass without adding this logic.

Then we can add this logic with associated tests in a follow up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, follow-up PR #123930 opened.

this.loadedTimestampRange = loadTimestampRange;
// Also check for the existence of the timestamp field, because sometimes a segment can only contain tombstone documents,
// which don't have any mapped fields (also not the timestamp field) and just some meta fields like _id, _seq_no etc.
if (loadTimestampRange && reader.getFieldInfos().fieldInfo(DataStream.TIMESTAMP_FIELD_NAME) != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are cases where we fake the reader. I think when we read from translog, for realtime get. I suspect those IndexReaders sometimes don't work well with field infos.

@jordan-powers jordan-powers merged commit 737ab62 into elastic:main Mar 3, 2025
17 checks passed
@jordan-powers jordan-powers deleted the tsdb-timestamp-sparse-index branch March 3, 2025 19:29
jordan-powers added a commit that referenced this pull request Mar 4, 2025
…ipper (#123930)

When running a timestamp range query, as an optimization we check if the
query range overlaps with the total range of values within a shard before
executing the query on that shard. That way, if the range is disjoint, we
can skip execution for that shard. To get the range of values within a
shard, we usually use the PointValues index on the shard. However, when the
doc values skipper is enabled, the point values are not (as the reason for
the skipper is to reduce storage overhead by removing the point values
index). In this case, we need to instead get the range of values within the
shard by using the skipper. This patch implements that logic.

Follow-up to #123191
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025
…ces (elastic#123191)

This patch builds on the work done in elastic#122161 by also enabling the sparse doc values index for @timestamp in time-series indices.
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025
…ipper (elastic#123930)

When running a timestamp range query, as an optimization we check if the
query range overlaps with the total range of values within a shard before
executing the query on that shard. That way, if the range is disjoint, we
can skip execution for that shard. To get the range of values within a
shard, we usually use the PointValues index on the shard. However, when the
doc values skipper is enabled, the point values are not (as the reason for
the skipper is to reduce storage overhead by removing the point values
index). In this case, we need to instead get the range of values within the
shard by using the skipper. This patch implements that logic.

Follow-up to elastic#123191
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants