ES|QL: Fix Fork field reference tracking #131723
Changes from 28 commits
```diff
@@ -0,0 +1,5 @@
+pr: 131723
+summary: Tests for FORK's evaluation of field names used in `field_caps` resolve calls
+area: Search
+type: bug
+issues: []
```
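For context: FORK evaluates several branches over the same input, and each branch can reference different fields. Before this change, any query containing FORK fell back to requesting all fields from `field_caps` (see the removed TODO in the diff below); with it, the references of each branch are collected and merged. A made-up query of the affected kind, with hypothetical index and field names, would be `FROM employees | FORK (WHERE salary > 50000 | KEEP first_name, salary) (STATS avg_salary = AVG(salary) BY gender)`.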
The second file in the diff changes `resolveFieldNames`:

```diff
@@ -49,6 +49,7 @@
 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;
+import java.util.function.Function;
 import java.util.stream.Collectors;
 
 import static org.elasticsearch.xpack.esql.core.util.StringUtils.WILDCARD;
@@ -76,11 +77,6 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
             return new PreAnalysisResult(enrichResolution, IndexResolver.ALL_FIELDS, Set.of());
         }
 
-        // TODO: Improve field resolution for FORK - right now we request all fields
-        if (parsed.anyMatch(p -> p instanceof Fork)) {
-            return new PreAnalysisResult(enrichResolution, IndexResolver.ALL_FIELDS, Set.of());
-        }
-
         Holder<Boolean> projectAll = new Holder<>(false);
         parsed.forEachExpressionDown(UnresolvedStar.class, us -> {// explicit "*" fields selection
             if (projectAll.get()) {
@@ -93,7 +89,7 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
             return new PreAnalysisResult(enrichResolution, IndexResolver.ALL_FIELDS, Set.of());
         }
 
-        var referencesBuilder = AttributeSet.builder();
+        var referencesBuilder = new Holder<>(AttributeSet.builder());
         // "keep" and "drop" attributes are special whenever a wildcard is used in their name, as the wildcard can cover some
         // attributes ("lookup join" generated columns among others); steps like removal of Aliases should ignore fields matching the
         // wildcards.
@@ -110,19 +106,42 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
         // lookup indices where we request "*" because we may require all their fields
         Set<String> wildcardJoinIndices = new java.util.HashSet<>();
 
-        boolean[] canRemoveAliases = new boolean[] { true };
+        var canRemoveAliases = new Holder<>(true);
+        var needsAllFields = new Holder<>(false);
 
-        parsed.forEachDown(p -> {// go over each plan top-down
-            if (p instanceof RegexExtract re) { // for Grok and Dissect
+        var processingLambda = new Holder<Function<LogicalPlan, Boolean>>();
+        processingLambda.set((LogicalPlan p) -> {// go over each plan top-down
+            if (p instanceof Fork fork) {
+                // Early return from forEachDown. We will iterate over the children manually.
+                var forkRefsResult = AttributeSet.builder();
+                forkRefsResult.addAll(referencesBuilder.get());
+                for (var child : fork.children()) {
+                    referencesBuilder.set(AttributeSet.builder());
+                    var return_result = child.forEachDownMayReturnEarly(processingLambda.get());
+                    // No nested Forks for now...
+                    assert return_result;
+                    if (referencesBuilder.get().isEmpty()) {
```
Contributor
What is the logic behind this check? Why would an empty set of references mean that we need to request all fields?
Contributor (Author)
See the comment below on line 232.
Contributor
I am not sure that's correct. The logic at line 232 is about queries like this:
Contributor (Author)
I agree that the logic seems pretty confusing, but at the same time I believe the statement you make regarding the fields is only partially true. Consider the following query with `match`, and in particular its "stats" stage: it has no referenced columns (`p.references()` is empty), so by the logic of the comment I mentioned above we should retain all columns. This information needs to be propagated back to the caller (the Fork). Thus, if any of the Fork's children need all the fields, then we can break out early and just retain all fields.
I can try to make this a bit clearer in the following way: the lambda can return a pair <bool, ColumnSet>, where the bool indicates whether we need to retain all columns. How does this sound?
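A rough, hypothetical sketch of that pair-returning idea (none of these names exist in the PR; `BranchFieldResult` and its helpers are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the "return a pair <bool, ColumnSet>" proposal above;
// neither this type nor its names appear in the PR.
record BranchFieldResult(boolean needsAllFields, Set<String> fieldNames) {

    static BranchFieldResult allFields() {
        // An empty name set plus the flag stands for "request every field".
        return new BranchFieldResult(true, Set.of());
    }

    // Merging two branches: if either branch needs all fields, so does the union.
    BranchFieldResult merge(BranchFieldResult other) {
        if (needsAllFields || other.needsAllFields) {
            return allFields();
        }
        Set<String> union = new HashSet<>(fieldNames);
        union.addAll(other.fieldNames);
        return new BranchFieldResult(false, union);
    }
}
```

Under that scheme a Fork would merge its children's results and fall back to all fields as soon as one branch reports that it needs them.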
Contributor
It's not about the lambda here, but about the logic of the field name resolution. I have tested this query with the logic as proposed, and it resolves to "*". Thank you for bringing up the example.
Contributor (Author)
So I have added the query you have in your comment as a test, and it returns "*". I'm not following whether there is another change you are proposing, or you simply have a question/observation about the existing (pre-fork) code?
Contributor (Author)
I'm not following this part. I'm not sure I understand how fork behaves differently here... If we end up with an empty set of resolved fields, we will convert to _index, no? In that regard this behavior is the same as it should be, but maybe I'm missing something in your explanation.
```diff
+                        needsAllFields.set(true);
+                        // Early return.
+                        return false;
+                    }
+                    forkRefsResult.addAll(referencesBuilder.get());
+                }
+                forkRefsResult.removeIf(attr -> attr.name().equals(Fork.FORK_FIELD));
+                referencesBuilder.set(forkRefsResult);
+                return false;
+            } else if (p instanceof RegexExtract re) { // for Grok and Dissect
                 // keep the inputs needed by Grok/Dissect
-                referencesBuilder.addAll(re.input().references());
+                referencesBuilder.get().addAll(re.input().references());
             } else if (p instanceof Enrich enrich) {
                 AttributeSet enrichFieldRefs = Expressions.references(enrich.enrichFields());
                 AttributeSet.Builder enrichRefs = enrichFieldRefs.combine(enrich.matchField().references()).asBuilder();
                 // Enrich adds an EmptyAttribute if no match field is specified
                 // The exact name of the field will be added later as part of enrichPolicyMatchFields Set
                 enrichRefs.removeIf(attr -> attr instanceof EmptyAttribute);
-                referencesBuilder.addAll(enrichRefs);
+                referencesBuilder.get().addAll(enrichRefs);
             } else if (p instanceof LookupJoin join) {
                 if (join.config().type() instanceof JoinTypes.UsingJoinType usingJoinType) {
                     joinRefs.addAll(usingJoinType.columns());
@@ -135,15 +154,15 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
                     joinRefs.addAll(keepRefs);
                 }
             } else {
-                referencesBuilder.addAll(p.references());
+                referencesBuilder.get().addAll(p.references());
                 if (p instanceof UnresolvedRelation ur && ur.indexMode() == IndexMode.TIME_SERIES) {
                     // METRICS aggs generally rely on @timestamp without the user having to mention it.
-                    referencesBuilder.add(new UnresolvedAttribute(ur.source(), MetadataAttribute.TIMESTAMP_FIELD));
+                    referencesBuilder.get().add(new UnresolvedAttribute(ur.source(), MetadataAttribute.TIMESTAMP_FIELD));
                 }
                 // special handling for UnresolvedPattern (which is not an UnresolvedAttribute)
                 p.forEachExpression(UnresolvedNamePattern.class, up -> {
                     var ua = new UnresolvedAttribute(up.source(), up.name());
-                    referencesBuilder.add(ua);
+                    referencesBuilder.get().add(ua);
                     if (p instanceof Keep) {
                         keepRefs.add(ua);
                     } else if (p instanceof Drop) {
@@ -168,10 +187,10 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
             //
             // and ips_policy enriches the results with the same name ip field),
             // these aliases should be kept in the list of fields.
-            if (canRemoveAliases[0] && p.anyMatch(FieldNameUtils::couldOverrideAliases)) {
-                canRemoveAliases[0] = false;
+            if (canRemoveAliases.get() && p.anyMatch(FieldNameUtils::couldOverrideAliases)) {
+                canRemoveAliases.set(false);
             }
-            if (canRemoveAliases[0]) {
+            if (canRemoveAliases.get()) {
                 // remove any already discovered UnresolvedAttributes that are in fact aliases defined later down in the tree
                 // for example "from test | eval x = salary | stats max = max(x) by gender"
                 // remove the UnresolvedAttribute "x", since that is an Alias defined in "eval"
@@ -187,21 +206,28 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
                     if (fieldNames.contains(ne.name())) {
                         return;
                     }
-                    referencesBuilder.removeIf(
-                        attr -> matchByName(attr, ne.name(), keepRefs.contains(attr) || dropWildcardRefs.contains(attr))
-                    );
+                    referencesBuilder.get()
+                        .removeIf(attr -> matchByName(attr, ne.name(), keepRefs.contains(attr) || dropWildcardRefs.contains(attr)));
                 });
             }
+
+            // No early return.
+            return true;
         });
+        parsed.forEachDownMayReturnEarly(processingLambda.get());
+
+        if (needsAllFields.get()) {
+            return new PreAnalysisResult(enrichResolution, IndexResolver.ALL_FIELDS, Set.of());
+        }
 
         // Add JOIN ON column references afterward to avoid Alias removal
-        referencesBuilder.addAll(joinRefs);
+        referencesBuilder.get().addAll(joinRefs);
         // If any JOIN commands need wildcard field-caps calls, persist the index names
 
         // remove valid metadata attributes because they will be filtered out by the IndexResolver anyway
         // otherwise, in some edge cases, we will fail to ask for "*" (all fields) instead
-        referencesBuilder.removeIf(a -> a instanceof MetadataAttribute || MetadataAttribute.isSupported(a.name()));
-        Set<String> fieldNames = referencesBuilder.build().names();
+        referencesBuilder.get().removeIf(a -> a instanceof MetadataAttribute || MetadataAttribute.isSupported(a.name()));
+        Set<String> fieldNames = referencesBuilder.get().build().names();
 
         if (fieldNames.isEmpty() && enrichPolicyMatchFields.isEmpty()) {
             // there cannot be an empty list of fields, we'll ask the simplest and lightest one instead: _index
```
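The Node-side half of the change (the new `forEachDownMayReturnEarly` traversal) is not part of this diff. As a reading aid only, here is a minimal sketch of the contract its call sites above seem to rely on, written against a simplified stand-in type rather than the real Node class: the visitor returns false to stop the walk, and the method reports whether the walk completed.

```java
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for the real Node/LogicalPlan class; it illustrates only the
// traversal contract implied by the diff: the visitor returns true to keep descending,
// false to stop, and the method reports whether the walk completed without an early exit.
class TreeNode {
    private final List<TreeNode> children;

    TreeNode(List<TreeNode> children) {
        this.children = children;
    }

    List<TreeNode> children() {
        return children;
    }

    boolean forEachDownMayReturnEarly(Function<TreeNode, Boolean> visitor) {
        if (visitor.apply(this) == false) {
            return false; // visitor asked to stop; propagate the early exit upwards
        }
        for (TreeNode child : children) {
            if (child.forEachDownMayReturnEarly(visitor) == false) {
                return false;
            }
        }
        return true; // whole subtree visited
    }
}
```

Under that reading, the `assert return_result;` in the Fork branch asserts that the child traversal did not exit early, matching the "No nested Forks for now" comment.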
Contributor
The Node class is fundamental to the QL code in ESQL and it very rarely (if at all) changes. I don't think this change here is really necessary to warrant a Node change. Having this early "exit" from tree traversal outside Node is also helpful for whoever writes code that traverses the tree, to be aware of the need for an early "exit" and to use it explicitly. There are multiple examples in ESQL code for this; one of them is here.
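For readers unfamiliar with the pattern being referenced, here is a small, self-contained illustration (deliberately generic, not ES|QL code) of a flag-based early exit that works with a traversal API that cannot stop on its own: once the flag flips, the remaining visits return immediately.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: an early exit kept outside the traversal API, using a mutable flag.
// After the flag flips, later visits become no-ops, so the walk stays cheap even though
// the iteration itself technically continues.
class EarlyExitExample {
    static boolean anyNegative(List<Integer> values) {
        AtomicBoolean found = new AtomicBoolean(false);
        values.forEach(v -> {
            if (found.get()) {
                return; // decision already made; skip the remaining elements cheaply
            }
            if (v < 0) {
                found.set(true);
            }
        });
        return found.get();
    }

    public static void main(String[] args) {
        System.out.println(anyNegative(List.of(3, -1, 7))); // prints: true
    }
}
```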
Contributor (Author)
Thanks for pointing out the example. I was aware of such examples. Two issues:
Contributor
Let's use the pattern that we already use in ES|QL.
I agree that the Node class is fundamental, and if this is the only place where we need to use something like a forEachDownMayReturnEarly it might not be worth adding it. If there's an opportunity for us to add something like a forEachDownMayReturnEarly, because we see this pattern in multiple places, we can have that as a separate conversation.
Contributor (Author)
It is clearly not the case: there is at least one other example, as Andrei pointed out, and probably many others, where it will be useful to limit the traversal from a performance point of view.
Again, if there is a specific amount of testing required for this, or an API change, let's discuss that, and not reject it blindly based on existing usage (and its limitations). I'm looking for constructive feedback here.
Contributor
I know the change doesn't seem like much and looks very trivial to evaluate and review, but there is some experience and some paranoia involved with this class, and the edge cases are many and not easy to spot. And I do also understand your point of view.
I need more time to look at this in depth; there is a lot on my plate this week and I won't have time to properly review it until early next week.
If you cannot wait until then, please go ahead and ask for a review from @ioanatia (she's the author of fork and knows all there is to know about this command) only, and merge when the PR is ready.
Contributor
Ok, I've looked in depth at the code change proposal, and these are my arguments for not approving this change as it is now:
- It adds complexity (both in Node and in the needed changes in resolveFieldNames) for whoever is looking at the code as a first-timer with the goal of understanding it. I've tried to refactor it to see if there is a better structure that makes it more easily readable (isolating the true branch of the forEachDownMayReturnEarly only to fork needs), but I couldn't find one.
- Other code traversing the tree (through the forEachDownMayReturnEarly method) suddenly needs to be aware of the early exit from forEachDown even though it doesn't need to know this; only fork code must be aware of the early-exit logic. This is the main reason why a different exit logic, isolated to fork as much as possible, is better.
- The logic in resolveFieldNames is complex enough, and many other people who looked at it and contributed to it mentioned that the code is complex and that code comments are essential. Let's aim to make it less complex (with code comments for the next person looking at the code) or at least keep it at the same complexity.
Contributor (Author)
Responded in a new thread here: https://github.com/elastic/elasticsearch/pull/131723/files#r2252384903 since this one has become outdated.