-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Optimized field visitor for ES|QL loading from _ignored_source #132428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized field visitor for ES|QL loading from _ignored_source #132428
Conversation
Here are the results for the insist_🐔 benchmarks run against main:
Seems to be an good improvement across the board, especially with chicken_2 and chicken_3 showing around 50% reduction in latency. |
96fb75e
to
0d59c1d
Compare
…cial-field-visitor-2
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
RowStrideReader rowStrideReader(LeafReaderContext context) throws IOException; | ||
|
||
StoredFieldsSpec rowStrideStoredFieldSpec(); | ||
record FieldsSpec(StoredFieldsSpec storedFieldsSpec, IgnoredFieldsSpec ignoredFieldsSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to extend StoredFieldsSpec
to know about ignored fields, instead of having this new abstraction on top? I think that would make a lot of the logic here easier to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was reluctant to do so because StoredFieldsSpec is used in a few other places, and I only need the IgnoredFieldsSpec in the ESQL blockloader context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also leaning towards keeping FieldsSpec
in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking a second look at this, if we're able to add IgnoredFieldsSpec
to StoredFieldsSpec
and make it optional then I think it would reduce the size of this PR? By adding a constructor to StoredFieldSpec
:
public StoredFieldsSpec(boolean requiresSource, boolean requiresMetadata, Set<String> requiredStoredFields) {
this(requiresSource, requiresMetadata, requiredStoredFields, IgnoredFieldsSpec.NONE);
}
(the StoredFieldsSpec#merge(...)
method also needs to be updated, but from there we can delegate to IgnoredFieldsSpec)
StoredFieldSpec
is used in fetch phase and value fetchers, but I think we don't need to make any changes in these areas of the code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I moved IgnoredFieldsSpec
into StoredFieldsSpec
in 9fb1125
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jordan-powers, this looks good! I think we should add more unit testing around IgnoredSourceFieldLoader
and StoredFieldLoader#fromSpec(...)
.
for (String requiredIgnoredField : spec.ignoredFieldsSpec().requiredIgnoredFields()) { | ||
for (String potentialStoredField : spec.ignoredFieldsSpec().format().requiredStoredFields(requiredIgnoredField)) { | ||
potentialFieldsInIgnoreSource.computeIfAbsent(potentialStoredField, k -> new HashSet<>()).add(requiredIgnoredField); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hoping that we can optimize/simplify in a follow up change. This logic is now required for fields with dots and then we don't know which field part there is actually a stored field for.
Ideally we would know based from the mapping what stored field we need. For example if attributes.host.ip
is requested and no attributes
field is mapped, then the stored field should be _ignored_source.attributes
. If attributes is mapped, but host isn't then the required stored field to load would be _ignored_source.attributes.host
.
I also think requiredStoredFields(...)
also always include the _doc
field (via FallbackSyntheticSourceBlockLoader.splitIntoFieldPaths(...)
)?
Would be great if we need less maps and no sets in SFV
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, this can probably be optimized by using the mapping to figure out which stored field contains the value instead of just looking at all possible parent stored fields. But let's leave that as a follow-up.
return potentialFieldsInIgnoreSource.keySet().stream().toList(); | ||
} | ||
|
||
static class SFV extends StoredFieldVisitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we can perhaps explore an implementation that is tuned for just fetching one stored field.
/** | ||
* Crates a new StoredFieldLaoader using a BlockLoader.FieldsSpec | ||
*/ | ||
public static StoredFieldLoader fromSpec(BlockLoader.FieldsSpec spec, boolean forceSequentialReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a unit test that tests whether the ignored source field loader is loaded when expected? Maybe we can add a test to BlockSourceReaderTests
and then also test the produced block? I think current tests in BlockSourceReaderTests
don't test ignored source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe adding a test to IgnoredSourceFieldMapperTests
? But I don't see any block related tests there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added StoredFieldLoaderTests
and IgnoredFieldLoaderTests
RowStrideReader rowStrideReader(LeafReaderContext context) throws IOException; | ||
|
||
StoredFieldsSpec rowStrideStoredFieldSpec(); | ||
record FieldsSpec(StoredFieldsSpec storedFieldsSpec, IgnoredFieldsSpec ignoredFieldsSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also leaning towards keeping FieldsSpec
in this PR.
ddc51a6
to
ac52d06
Compare
…cial-field-visitor-2
…cial-field-visitor-2
…cial-field-visitor-2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
…cial-field-visitor-2
…cial-field-visitor-2
In #132142 and #132428 we split up ignored_source entries into distinct lucene fields, then added an optimized field visitor to speed up retrieving unmapped values for INSIST_🐔. However, since this approach creates a unique lucene field for every ignored_source entry, we can very quickly have a lot of lucene fields if there are a lot of unique unmapped fields per document. This can cause significant slowdowns in indexing throughput and merge time. This PR addresses those limitations by reverting back to keeping all ignored_source entries under the same lucene field. However, we still keep some of the speedups from that prior work by continuing to coalesce multiple ignored_source entries for the same field into a single entry, allowing the field visitor to exit early. Unfortunately, we do lose some time compared to the original optimizations because now the field visitor cannot look at the fieldInfo to decide whether or not to visit a field, and it instead needs to actually visit and materialize each ignored_source entry before it can decide whether or not to keep it.
This PR builds on the work in #132142 to optimize loading values from _ignored_source by stopping the
FieldVisitor
early, once all required fields have been visited.Part of this change is to change the signature of
BlockLoader#FieldsSpec
so that the returned object tracks fields from ignored source separately from fields from other stored fields. This is needed so that the custom field visitor knows which fields to look for.Relates to #130886.