Skip to content

Conversation

@salvatore-campagna
Copy link
Contributor

@salvatore-campagna salvatore-campagna commented Feb 5, 2025

In this PR, we change how the doc values sparse index is enabled for the host.name keyword field.
The initial implementation of the sparse index for host.name was introduced in #120741.

Previously, the choice between using an inverted index or a doc values sparse index was determined by the index parameter. With this change, we introduce a new final index-level setting, index.mapping.use_doc_values_sparse_index:

  • When the setting is true, we enable the sparse index and omit the inverted index for host.name.
  • When the setting is false (default), we retain the inverted index instead.

Additionally, this setting is only exposed if the doc_values_sparse_index feature flag is enabled.

This change simplifies enabling the doc values sparse index and makes the selection of indexing strategies explicit at the index level. Moreover, the setting is not dynamic and is exposed only for stateful deployments.

The plan is to enable this setting in our nightly benchmarks and evaluate its impact on LogsDB indexing throughput, storage footprint and query latency. Based on benchmarking results, we will decide whether to adopt the sparse index and determine the best way to configure it.

@salvatore-campagna salvatore-campagna self-assigned this Feb 5, 2025
@salvatore-campagna salvatore-campagna changed the title Refactoring Doc Values Sparse Index Enabling for host.name Refactoring doc values sparse index enabling for the host.name field Feb 5, 2025
Property.Final
);

public static final FeatureFlag DOC_VALUES_SPARSE_INDEX = new FeatureFlag("doc_values_sparse_index");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the feature flag here since we now use it to enable the setting or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the setting here, but move the setting registration to logsdb plugin?

}
return indexed.isConfigured() == false
&& hasDocValues.getValue()
return hasDocValues.getValue()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I remove the check on index value and refactored the way we check sorting on host.name.

assertTrue(mapper.fieldType().hasDocValues());
assertFalse(mapper.fieldType().isIndexed());
assertFalse(mapper.fieldType().hasDocValuesSparseIndex());
assertTrue(mapper.fieldType().hasDocValuesSparseIndex());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has changed because the setting overrides the index parameter.

@salvatore-campagna salvatore-campagna marked this pull request as ready for review February 6, 2025 13:00
@salvatore-campagna salvatore-campagna requested a review from a team as a code owner February 6, 2025 13:00
@salvatore-campagna salvatore-campagna requested review from martijnvg and removed request for a team February 6, 2025 13:00
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

public void testFieldTypeWithSkipDocValues_LogsDbMode() throws IOException {
assumeTrue("Needs feature flag to be enabled", FieldMapper.DOC_VALUES_SPARSE_INDEX.isEnabled());

public void testFieldTypeWithSkipDocValues_LogsDbModeDisabledSetting() throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now all these tests written in such a way that they work both for snapshot and release builds.

@salvatore-campagna salvatore-campagna added the test-release Trigger CI checks against release build label Feb 6, 2025
@salvatore-campagna
Copy link
Contributor Author

@elasticsearchmachine test this please.

&& IndexMode.LOGSDB.equals(indexMode)
&& HOST_NAME.equals(fullFieldName)
&& (indexSortConfig != null && indexSortConfig.hasPrimarySortOnField(HOST_NAME));
&& indexSortConfigByHostName(indexSortConfig);
Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I assume we would like to create the sparse index on host.name no matter if it is the first field or not in the index sort configuration.

*/
public final class IndexScopedSettings extends AbstractScopedSettings {

public static final Set<Setting<?>> BUILT_IN_INDEX_SETTINGS = Set.of(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe register the setting in logsdb plugin? Then we don't have to make this big change here?

Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to move also tests in KeywordFieldMapperTests which are LogsDB-specific in LogsDB plugin...because getPlugins() needs the LogsDB plugin too which we can't have as a dependency in the existing KeywordFieldMapperTests. Not having the LogsDB plugin in the test results in failures because the setting is not registered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, can we avoid this big diff here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that MapperTestCase does not register settings provided by other plugins (like LogsDB)...so I would need to implement the logic there to be able to register settings provided by other plugins.

Copy link

@salvatorecampagna salvatorecampagna Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUILT_IN_INDEX_SETTINGS is used by a few tests directly...which is why I had to do it this way. Those tests run both in snapshot and release builds, which means they fail il the setting is not registered because we register the setting conditionally depending on the feature flag. So if the setting is registered only in snapshot builds...release tests will fail with something like..."setting ... has not been registered".

Copy link

@salvatorecampagna salvatorecampagna Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify why I need the change. The original set BUILT_IN_INDEX_SETTINGS is defined using Set.of which returns an immutable set. As a result I cannot just conditionally add a new setting based on a feature flag. The add method call on an immutable set would result in throwing an UnsupportedOperationException as a result of trying to modify a collection that is immutable. So here what I do its...create a mutable set, conditionally add the setting and later on make a copy to have the static variable as an immutable set again.

Another option would be to change existing tests to conditionally use a different set based on a feature flag...I would not do that anyway.

private final int ignoreAboveDefault;
private final IndexMode indexMode;
private final IndexSortConfig indexSortConfig;
private final boolean useDocValuesSParseIndex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to: useDocValuesSkipper ?


public static final FeatureFlag DOC_VALUES_SPARSE_INDEX = new FeatureFlag("doc_values_sparse_index");
public static final Setting<Boolean> USE_DOC_VALUES_SPARSE_INDEX = Setting.boolSetting(
"index.mapping.use_doc_values_sparse_index",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the lucene name. So use_doc_values_skipper?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in IndexVersions I think I cannot rename public static final IndexVersion HOSTNAME_DOC_VALUES_SPARSE_INDEX = def(9_008_0_00, Version.LUCENE_10_0_0);
I will rename everything else.

Property.Final
);

public static final FeatureFlag DOC_VALUES_SPARSE_INDEX = new FeatureFlag("doc_values_sparse_index");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the setting here, but move the setting registration to logsdb plugin?

@lkts
Copy link
Contributor

lkts commented Feb 6, 2025

Can we avoid hardcoding host.name and have this setting take a list of fields that use it or something? Maybe just expose it as a mapping parameter - it seems generally useful and like something that some users would find helpful.

@martijnvg
Copy link
Member

Can we avoid hardcoding host.name and have this setting take a list of fields that use it or something? Maybe just expose it as a mapping parameter

The idea for now is the hardcode it. Enabling doc value skipper only makes sense for fields part of index sorting, so I don't think for now it makes sense to make it generic. Which fields / field types end up supporting this depends on research yet to be done. For tsdb maybe _tsid, @timestamp and all dimension fields, for logsdb other fields that have a relation to host.name are also good candidates. For now, we are going to evaluate to effect replacing the indexed data structures with doc value skippers for host.name and @timestamp field with logsdb.

@salvatorecampagna
Copy link

salvatorecampagna commented Feb 7, 2025

Can we avoid hardcoding host.name and have this setting take a list of fields that use it or something? Maybe just expose it as a mapping parameter - it seems generally useful and like something that some users would find helpful.

Also if we expose it as a mapping parameter users will be able to use

Can we avoid hardcoding host.name and have this setting take a list of fields that use it or something? Maybe just expose it as a mapping parameter - it seems generally useful and like something that some users would find helpful.

This has been discussed and we decided not to have it exposed as a mapping parameter to avoid users start defining sparse indices on fields of their choice. Also, putting the logic to expose the mapping parameter behind a feature flag:

  • makes this PR more complex expanding its scope
  • introduces a new parameter that once introduced need to be kept and supported forever (bwc after removing the feature flag)
  • can be done later on if use cases arise and we see benefits for other fields too

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, LGTM otherwise.

}
}

public boolean hasSortOnFiled(final String fieldName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public boolean hasSortOnFiled(final String fieldName) {
public boolean hasSortOnField(final String fieldName) {

final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
assertTrue(mapper.fieldType().hasDocValues());
assertFalse(IndexSettings.USE_DOC_VALUES_SKIPPER.get(settings) && mapper.fieldType().isIndexed());
assertEquals(IndexSettings.USE_DOC_VALUES_SKIPPER.get(settings), mapper.fieldType().hasDocValuesSkipper());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertEquals(IndexSettings.USE_DOC_VALUES_SKIPPER.get(settings), mapper.fieldType().hasDocValuesSkipper());
if (FieldMapper.DOC_VALUES_SPARSE_INDEX.isEnabled()) {
assertTrue(mapper.fieldType().hasDocValuesSkipper());
} else {
assertFalse(mapper.fieldType().hasDocValuesSkipper());
}

I think this is easier to read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to have a different name for the setting and the feature flag...also to make it easier later to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the index setting is actually index.mapping.use_doc_values_skipper

final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
assertTrue(mapper.fieldType().hasDocValues());
assertFalse(IndexSettings.USE_DOC_VALUES_SKIPPER.get(settings) && mapper.fieldType().isIndexed());
assertEquals(IndexSettings.USE_DOC_VALUES_SKIPPER.get(settings), mapper.fieldType().hasDocValuesSkipper());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertEquals(IndexSettings.USE_DOC_VALUES_SKIPPER.get(settings), mapper.fieldType().hasDocValuesSkipper());
if (FieldMapper.DOC_VALUES_SPARSE_INDEX.isEnabled()) {
assertTrue(mapper.fieldType().hasDocValuesSkipper());
} else {
assertFalse(mapper.fieldType().hasDocValuesSkipper());
}

and same for the other similar assertions?

@salvatore-campagna salvatore-campagna merged commit 27adf20 into elastic:main Feb 10, 2025
18 checks passed
salvatore-campagna added a commit that referenced this pull request Feb 17, 2025
This PR extends the work done in #121751 by enabling a sparse doc values index for the @timestamp field in LogsDB.

Similar to the previous PR, the setting index.mapping.use_doc_values_skipper will override the index mapping parameter when all of the following conditions are met:

* The index mode is LogsDB.
* The field name is @timestamp.
* Index sorting is configured on @timestamp (regardless of whether it is a primary sort field or not).
* Doc values are enabled.

This ensures that only one index structure is defined on the @timestamp field:
* If the conditions above are met, the inverted index is replaced with a sparse doc values index.
* This prevents both the inverted index and sparse doc values index from being enabled together, reducing unnecessary storage overhead.

This change aligns with our goal of optimizing LogsDB for storage efficiency while possibly maintaining reasonable query latency performance. It will enable us to run benchmarks and evaluate the impact of sparse indexing on the @timestamp field as well.
jordan-powers added a commit that referenced this pull request Feb 19, 2025
This patch extends the work done in #122161 and #121751 to also use the doc values sparse index for the _tsid fields in time-series mode indices.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>non-issue :StorageEngine/Logs You know, for Logs Team:StorageEngine test-release Trigger CI checks against release build v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants