Store ignored source in unique stored fields per entry #132142
Conversation
martijnvg
left a comment
Thanks @jordan-powers! At a high level this looks good.
I wonder what the impact is on the elastic/logs with unmapped fields (insist_chicken) benchmark. Did you already have a chance to check this?
I also think we should introduce a bwc test suite for synthetic source with many unmapped fields that would use synthetic source / logsdb. I think that is a bit under-tested at the moment. Something like LogsdbIndexingRollingUpgradeIT but with unmapped fields.
    return fieldsToLoadForSyntheticSource;
}

public enum IgnoredFieldsLoader {
👍 to encapsulating synthesizing logic here
    };
}

return new CustomFieldsVisitor(fields, loadSource);
With this per field ignored source this can be improved in a followup.
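For illustration, here is a minimal, self-contained sketch of the kind of improvement this enables (this is not the actual follow-up; the class name and the collected-map shape are assumptions). With one stored field per ignored-source entry, a Lucene StoredFieldVisitor can decide from the field name alone whether an entry is needed, without decoding any values:

    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.StoredFieldVisitor;

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical visitor: needsField filters on the "_ignored_source.<field>" name,
    // so entries for unrequested fields are never materialized.
    class PerFieldIgnoredSourceVisitor extends StoredFieldVisitor {
        private static final String PREFIX = "_ignored_source.";
        private final Set<String> requestedFields;
        private final Map<String, byte[]> collected = new HashMap<>();

        PerFieldIgnoredSourceVisitor(Set<String> requestedFields) {
            this.requestedFields = requestedFields;
        }

        @Override
        public Status needsField(FieldInfo fieldInfo) {
            if (fieldInfo.name.startsWith(PREFIX)) {
                // Keep only the entries for fields we actually need.
                return requestedFields.contains(fieldInfo.name.substring(PREFIX.length())) ? Status.YES : Status.NO;
            }
            return Status.NO;
        }

        @Override
        public void binaryField(FieldInfo fieldInfo, byte[] value) {
            // Decoding of the entry itself is left out of this sketch.
            collected.put(fieldInfo.name.substring(PREFIX.length()), value);
        }

        Map<String, byte[]> collected() {
            return collected;
        }
    }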
IgnoredSourceFieldMapper.NameValue nameValue = IgnoredSourceFieldMapper.decode(value);
if (fieldNames.contains(nameValue.name())) {
    valuesForFieldAndParents.computeIfAbsent(nameValue.name(), k -> new ArrayList<>()).add(nameValue);
if (indexCreatedVersion.onOrAfter(IndexVersions.IGNORED_SOURCE_FIELDS_PER_ENTRY)) {
Maybe it makes sense to have two IgnoredSourceRowStrideReader implementations? Something like LegacySingleIgnoredSourceRowStrideReader and PerFieldIgnoredSourceRowStrideReader?
I agree that it's messy to check the IndexVersion here. But instead of creating a second IgnoredSourceRowStrideReader implementation, I instead opted to re-use the IgnoredFieldsLoader I added for the SourceLoader.
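To make the difference between the two loading strategies concrete, here is a small self-contained sketch (the toy "name:value" encoding and method names are assumptions, not the ES implementation): the legacy single-field format has to decode every entry under "_ignored_source" and filter by name, while the per-field format can address "_ignored_source.<field>" directly.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    final class IgnoredFieldsLoaderSketch {

        // Legacy single stored field: decode every value and keep only the ones we want.
        static List<String> loadLegacy(Map<String, List<String>> storedFields, String field) {
            List<String> values = new ArrayList<>();
            for (String encoded : storedFields.getOrDefault("_ignored_source", List.of())) {
                String name = encoded.substring(0, encoded.indexOf(':')); // toy decoding
                if (name.equals(field)) {
                    values.add(encoded.substring(encoded.indexOf(':') + 1));
                }
            }
            return values;
        }

        // Per-field stored fields: a direct lookup, no decoding of unrelated entries.
        static List<String> loadPerField(Map<String, List<String>> storedFields, String field) {
            return storedFields.getOrDefault("_ignored_source." + field, List.of());
        }
    }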
Ok, I've run the insist_🐔 benchmarks comparing this branch to baseline, here are the results: baseline: 68eff34

Seems to be already an overall positive change, likely because this change allows the
Pinging @elastic/es-storage-engine (Team:StorageEngine)
lkts
left a comment
I haven't reviewed in depth (e.g. I am not familiar with FieldSubsetReader) but it looks good to me.
Looks like chicken_4 actually regressed btw? Do you know why?
}

private static class IgnoredSourceRowStrideReader<T> implements RowStrideReader {
    // Contains name of the field and all its parents
nit: move the comment together with the line it comments
    return NAME + "." + fieldName;
}

static BytesRef encodeMulti(List<NameValue> values) {
nit: encodeMultipleValuesForField?
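For context, a hedged sketch of what coalescing could look like: multiple already-encoded entries for one field are packed into a single stored-field payload with a count and length prefixes. The actual encoding used by IgnoredSourceFieldMapper may differ; this is only an illustration of the idea.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;

    final class IgnoredSourceEncodingSketch {
        // Coalesce several encoded values for the same field into one byte[] payload.
        static byte[] encodeMulti(List<byte[]> encodedValues) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(encodedValues.size());   // number of coalesced entries
                for (byte[] value : encodedValues) {
                    out.writeInt(value.length);       // length prefix
                    out.write(value);                 // the already-encoded entry
                }
            }
            return bytes.toByteArray();
        }
    }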
@Override
public void writeIgnoredFields(Collection<NameValue> ignoredFieldValues) {
    throw new UnsupportedOperationException();
nit: is it possible to split the write and read code paths and avoid the UnsupportedOperationException?
It's not immediately obvious to me how to split it up. Maybe I can change the UnsupportedOperationException into an assert false, so that in production it's just a no-op instead of a fatal error?
I don't really have a preference, I think it's surprising for the user of this API in either case. Thanks for considering it; let's move on if it's not viable.
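For reference, the alternative discussed above looks roughly like this (a minimal sketch with illustrative names, not the actual class): an assert trips in tests run with assertions enabled but degrades to a no-op in production, whereas throwing fails hard in both.

    import java.util.Collection;
    import java.util.List;

    interface IgnoredFieldsWriter {
        void writeIgnoredFields(Collection<String> ignoredFieldValues);
    }

    class ReadOnlyIgnoredFields implements IgnoredFieldsWriter {
        @Override
        public void writeIgnoredFields(Collection<String> ignoredFieldValues) {
            // Fails loudly in tests (-ea), silently ignored in production builds.
            assert false : "this format is read-only; writes are not supported";
        }

        public static void main(String[] args) {
            new ReadOnlyIgnoredFields().writeIgnoredFields(List.of("field_a"));
        }
    }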
Thanks for the review, Sasha!
Looking at the nightlies, chicken_4 is really noisy, with 50th percentile latency ranging from 86,605ms to 816,208ms over the past 30 days. So I think the regression here is just noise.
martijnvg
left a comment
Two minor comments, otherwise LGTM.
this.reader = reader;
this.fieldName = fieldName;
this.ignoredSourceFormat = ignoredSourceFormat;
this.fieldPaths = splitIntoFieldPaths(fieldName);
Nice, now we do this once per field and shard instead of once per field and segment.
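As an illustration of that precomputation (the helper name and exact output are assumptions, not the ES code), splitting a dotted field name into itself plus all of its parent paths once, at construction time, could look like:

    import java.util.ArrayList;
    import java.util.List;

    final class FieldPathsSketch {
        static List<String> splitIntoFieldPaths(String fieldName) {
            List<String> paths = new ArrayList<>();
            paths.add(fieldName);
            int dot = fieldName.lastIndexOf('.');
            while (dot > 0) {
                fieldName = fieldName.substring(0, dot);
                paths.add(fieldName);            // e.g. "a.b.c" -> ["a.b.c", "a.b", "a"]
                dot = fieldName.lastIndexOf('.');
            }
            return paths;
        }

        public static void main(String[] args) {
            // Prints [parent.child.leaf, parent.child, parent]
            System.out.println(splitIntoFieldPaths("parent.child.leaf"));
        }
    }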
    Map<String, List<Object>> storedFields
);

public abstract Map<String, List<IgnoredSourceFieldMapper.NameValue>> loadSingleIgnoredField(
++ to making the distinction between loading everything and loading just one field (which works better for ES|QL)
{
- Automaton automaton = Automatons.patterns(Arrays.asList("fieldA", IgnoredSourceFieldMapper.NAME));
+ Automaton automaton = Automatons.patterns(Arrays.asList("fieldA", IgnoredSourceFieldMapper.ignoredFieldName("*")));
Maybe randomly set the index version in these tests so that we keep unit testing both the single and per-field ignored source formats here?
}

public static IgnoredSourceFormat ignoredSourceFormat(IndexVersion indexCreatedVersion) {
    return indexCreatedVersion.onOrAfter(IndexVersions.IGNORED_SOURCE_FIELDS_PER_ENTRY)
Maybe add a feature flag here for per-field ignored source? The idea would be to let nightly benchmarks run the new code for ~1 week to get a better picture of the effect on other benchmarks that use synthetic source, and after that period remove the feature flag.
I know this is more work (removing the FF and then adding another index version), but I lean towards being careful. Wdyt?
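A self-contained sketch of the gating being proposed (all names, the version constant, and the system-property mechanism are illustrative stand-ins for ES's IndexVersion and feature-flag machinery): the new per-field format is chosen only when the index version is new enough and the flag opts in.

    final class IgnoredSourceFormatGateSketch {

        enum Format { SINGLE_IGNORED_SOURCE, PER_FIELD_IGNORED_SOURCE }

        // Stand-in for IndexVersions.IGNORED_SOURCE_FIELDS_PER_ENTRY.
        static final int IGNORED_SOURCE_FIELDS_PER_ENTRY_VERSION = 9_000_001;

        static Format ignoredSourceFormat(int indexCreatedVersion, boolean featureFlagEnabled) {
            return featureFlagEnabled && indexCreatedVersion >= IGNORED_SOURCE_FIELDS_PER_ENTRY_VERSION
                ? Format.PER_FIELD_IGNORED_SOURCE
                : Format.SINGLE_IGNORED_SOURCE;
        }

        public static void main(String[] args) {
            // Hypothetical flag, off by default; older indices always keep the single-field format.
            boolean flagEnabled = Boolean.getBoolean("per_field_ignored_source_feature_flag");
            System.out.println(ignoredSourceFormat(9_000_002, flagEnabled));
        }
    }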
var document = reader.storedFields().document(0);
Set<String> storedFieldNames = new LinkedHashSet<>(document.getFields().stream().map(IndexableField::name).toList());
- assertThat(storedFieldNames, contains("_ignored_source"));
+ assertThat(storedFieldNames, contains("_ignored_source.parent.field"));
Maybe also randomly run with an index version from before this change and with this change, to test both single and multiple ignored source stored fields?
This is an integration test, so it doesn't seem that I'm able to specify an index version when creating the index.
This reverts commit 6d6b9f0.
* upstream/main: (32 commits)
  Speed up loading keyword fields with index sorts (elastic#132950)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964
  Simplify EsqlSession (elastic#132848)
  Implement WriteLoadConstraintDecider#canAllocate (elastic#132041)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965
  Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941)
  Breakdown undesired allocations by shard routing role (elastic#132235)
  Implement v_magnitude function (elastic#132765)
  Introduce execution location marker for better handling of remote/local compatibility (elastic#132205)
  Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957
  Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946)
  Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956
  Fix failing UT by adding a required capability (elastic#132947)
  Precompute the BitsetCacheKey hashCode (elastic#132875)
  Adding simulate ingest effective mapping (elastic#132833)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948
  Rename skipping logic to remove hard link to skip_unavailable (elastic#132861)
  Store ignored source in unique stored fields per entry (elastic#132142)
  Add random tests with match_only_text multi-field (elastic#132380)
  ...
This PR does the following:
* Stores each _ignored_source entry in a unique stored field called _ignored_source.<field_name>
* Coalesces multiple entries for the same field name into a single Lucene stored field
* Adds the WildcardFieldMaskingReader so that when running synthetic source roundtrip tests, we can ignore differences in fields that match the pattern _ignored_source.*
For now, these changes are disabled by default behind a feature flag.
…-stats
* upstream/main: (36 commits)
  Fix reproducability of builds against Java EA versions (elastic#132847)
  Speed up loading keyword fields with index sorts (elastic#132950)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964
  Simplify EsqlSession (elastic#132848)
  Implement WriteLoadConstraintDecider#canAllocate (elastic#132041)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965
  Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941)
  Breakdown undesired allocations by shard routing role (elastic#132235)
  Implement v_magnitude function (elastic#132765)
  Introduce execution location marker for better handling of remote/local compatibility (elastic#132205)
  Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957
  Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946)
  Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956
  Fix failing UT by adding a required capability (elastic#132947)
  Precompute the BitsetCacheKey hashCode (elastic#132875)
  Adding simulate ingest effective mapping (elastic#132833)
  Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948
  Rename skipping logic to remove hard link to skip_unavailable (elastic#132861)
  Store ignored source in unique stored fields per entry (elastic#132142)
  ...
In #132142 and #132428 we split up ignored_source entries into distinct Lucene fields, then added an optimized field visitor to speed up retrieving unmapped values for INSIST_🐔. However, since this approach creates a unique Lucene field for every ignored_source entry, we can very quickly end up with a lot of Lucene fields if there are many unique unmapped fields per document. This can cause significant slowdowns in indexing throughput and merge time. This PR addresses those limitations by reverting to keeping all ignored_source entries under the same Lucene field. However, we still keep some of the speedups from that prior work by continuing to coalesce multiple ignored_source entries for the same field into a single entry, allowing the field visitor to exit early. Unfortunately, we do lose some time compared to the original optimizations because the field visitor can no longer look at the fieldInfo to decide whether or not to visit a field; it instead needs to actually visit and materialize each ignored_source entry before it can decide whether or not to keep it.
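To illustrate the "exit early" point, here is a hedged, self-contained sketch (the class name and the decode step are placeholders, not the real encoding or visitor): because coalescing guarantees at most one entry per field, a visitor can stop scanning stored fields as soon as every requested field has been seen, even though all entries share the single _ignored_source field name.

    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.StoredFieldVisitor;

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class EarlyExitIgnoredSourceVisitor extends StoredFieldVisitor {
        private final Set<String> remaining;
        private final Map<String, byte[]> collected = new HashMap<>();

        EarlyExitIgnoredSourceVisitor(Set<String> requestedFields) {
            this.remaining = new HashSet<>(requestedFields);
        }

        @Override
        public Status needsField(FieldInfo fieldInfo) {
            if (remaining.isEmpty()) {
                return Status.STOP; // every requested field already found; skip the rest
            }
            // All entries share the "_ignored_source" field name, so we must materialize
            // the value itself to learn which field an entry belongs to.
            return "_ignored_source".equals(fieldInfo.name) ? Status.YES : Status.NO;
        }

        @Override
        public void binaryField(FieldInfo fieldInfo, byte[] value) {
            String fieldName = decodeFieldName(value); // placeholder for the real decoding
            if (remaining.remove(fieldName)) {
                collected.put(fieldName, value);
            }
        }

        private static String decodeFieldName(byte[] value) {
            // Toy "name:value" format, only for this sketch.
            return new String(value, StandardCharsets.UTF_8).split(":", 2)[0];
        }
    }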