-
Notifications
You must be signed in to change notification settings - Fork 25.6k
ESQL: Allow pruning columns added by InlineJoin #131204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This updates PruneColumns rule to allow optimizing away the columns added by InlineJoin, dropping unnecessary agg'ing.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @bpintea, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's looking good, but the refactoring can be made a bit more digestible with more comments and smarter method names.
I didn't check the correctness of the optimizer rule unit tests, only looked at the queries they are using.
I have noticed that, after chasing another bug I discovered, that it's fixing another behavior as well.
In #124715, it was described as FROM kibana_sample_data_logs | EVAL timestamp=DATE_TRUNC(5 minute, @timestamp) | INLINESTATS results = count(*) by timestamp | keep results, timestamp. For the dataset used in csv-spec files:
FROM employees | WHERE still_hired == false | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | SORT emp_no
FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | WHERE still_hired == false
FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal) BY gender | keep emp_no, still_hired, total
row salary = 12300, emp_no = 5, gender = "F" | eval sal = salary/1000 | inlinestats sum(sal) by gender | keep emp_no
^ here the presence of that evaluation before inlinestats is at least one of the triggers. We do have some queries with eval before inlinestats and with keep after it which succeed. Those above do not. I recommend adding some flavors of these to the csv-spec file.
| // track inlinestats' own aggregation output (right-hand side of the join) so that any other plan on the left-hand side of the | ||
| // inline join won't have its columns pruned due to the lack of "visibility" into the right hand side output/Attributes | ||
| var inlineJoinRightOutput = new ArrayList<Attribute>(); | ||
| return pruneColumns(plan, plan.outputSet().asBuilder(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a harder to follow code, but with some comments it can be made easier to digest.
For one, I would add two methods; one LogicalPlan pruneColumns(LogicalPlan plan, AttributeSet.Builder used) and use it here. This first method would call pruneColumns(plan, plan.outputSet().asBuilder(), false).
The second method would be pruneInlineJoinColumns(LogicalPlan plan, AttributeSet.Builder used) and call pruneColumns(plan, plan.outputSet().asBuilder(), true).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked the uniformity of the previous namings, but I realise that it doesn't make the flow obvious at call sites, so I've renamed the methods to make it clearer what happens with InlineJoin instances. But happy to iterate if still unclear.
| p = aggregate.with(aggregate.groupings(), remaining); | ||
| } | ||
| } else { | ||
| if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment tbh // not expecting high groups cardinality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The editor suggests the use of HashSets for faster inclusion check. But I think that'd be more costly, for most of the cases, since typically queries won't aggregate by hundred of groups.
|
|
||
| var remaining = pruneUnusedAndAddReferences(aggregate.aggregates(), used); | ||
|
|
||
| if (remaining != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if you checked here if (remaining == null) return p;? to have a smaller if branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I typically prefer this style (you suggest) too. But I think generally the code base uses the one I used. However, I applied your suggestion.
| if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality | ||
| // It's an INLINEJOIN and all remaining attributes are groupings, which are already part of the IJ output (from the | ||
| // left-hand side). | ||
| if (aggregate.child() instanceof StubRelation stub) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this condition will need an update when we start supporting filters on inlinestats aggs... if that guess is valid, I think a TODO is in order here. If you will add that TODO, for some (all, most I hope) I added // TODO Inlinestats:.... to find them quicker later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the TODO (as used in EsqlSession too).
| return p; | ||
| } | ||
|
|
||
| private static LogicalPlan pruneColumns(Project project, AttributeSet.Builder used) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the meat of the PR, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling Project is part of it, but the main change is calling (the now only method called) pruneColumns() within the (now renamed) pruneColumnsInInlineJoin() and using updated used set, since INLINEJOIN can be followed by any other command and each of these will need to have columns inspected for potential pruning, but aware of the fact that the pruning is done following INLINEJOIN; this "awareness" allows dropping the INLINEJOIN node when its output is entirely contained within the left-hand side of the join already.
| p = remaining.isEmpty() || remaining.stream().allMatch(FieldAttribute.class::isInstance) | ||
| ? emptyLocalRelation(project) | ||
| : new Project(project.source(), project.child(), remaining); | ||
| } else if (project.output().stream().allMatch(FieldAttribute.class::isInstance)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we interested in this aspect? (the fact that the projection is made of fields only)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explanation is the one given in the previous comment -- I've added a code comment too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks, Andrei! |
…-tracking * upstream/main: (26 commits) Add release notes for v9.1.0 release (elastic#131953) Unmute multi_node generative tests (elastic#132021) Avoid re-enqueueing merge tasks (elastic#132020) Fix file entitlements for shared data dir (elastic#131748) ES|QL brute force l2_norm vector function (elastic#132025) Make ES|QL SAMPLE not a pipeline breaker (elastic#132014) Speed up tail computation in MemorySegmentES91OSQVectorsScorer (elastic#132001) Remove deprecated usages in `TransportPutFollowAction` (elastic#132038) Simulate impact of shard movement using shard-level write load (elastic#131406) Remove RemoteClusterService.getConnections() method (elastic#131948) Fix off by one in ValuesBytesRefAggregator (elastic#132032) Use unicode strings in data generation by default (elastic#132028) Adding index.refresh_interval as a data stream setting (elastic#131482) [ES|QL] Add more Min/MaxOverTime CSV tests (elastic#131070) Restrict remote ENRICH after FORK (elastic#131945) Fix decoding of non-ascii field names in ignored source (elastic#132018) [docs] Use centrally maintained version variables (elastic#131939) Configurable Inference timeout during Query time (elastic#131551) ESQL: Allow pruning columns added by InlineJoin (elastic#131204) ESQL: Fail `profile` on text response formats (elastic#128627) ...
This updates
PruneColumnsrule to allow optimizing away the columns added byInlineJoin, dropping unnecessary agg'ing.