-
Notifications
You must be signed in to change notification settings - Fork 25.7k
ESQL: Pull OrderBy followed by InlineJoin on top of it
#137648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add optimisation rule to pull OrderBy above InlineJoin. This is required since otherwise the SORT won't be moved out of the left hand-side of the join, ending up as a stand-alone node that can't be turned into an executable -- it'll be rejected by verification already as an "unbounded sort". Since the InlineJoin is sort agnostic, the OrderBy can be moved upwards past it, ending up as a top TopN.
|
Hi @bpintea, I've created a changelog YAML for you. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
astefan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the tests, and I always like more tests :-).
The fact that this PR pulls up OrderBy passed InlineJoin without "using" or be made aware by the existence of SortAgnostic (as opposed to the original PR which considered SortAgnostic for something related to OrderBy) shows that the presence of SortAgnostic is not well enforced in code (for the developer's visibility and awareness), something that (reading between lines) has been made clear here. In other words, SortAgnostic defies half of its purpose - to be a tool developers are aware of (or be made aware of reliably) when they develop new commands but also (my own addition here) when developers (new or seasoned ones) touch any planning code (not only related to the new command).
The other half of SortAgnostics purpose is well defined: TRYING to reveal issues with using a command that doesn't fully considers all the subtleties and implication of a sort before that command that might silently give wrong results. But, this holds true considering the following:
- the new command PR has good and, as much as possible, enough tests that reveal what
SortAgnosticis trying to guard - the new command PR is being reviewed by people who are aware of
sorttweaks andSortAgnosticpresence and can point out thatsortcan be something to look into further, something that is not always true these days
IMHO, just like my previous review to the original PR, is that this PR brings a much needed improvement to the intricacies of inline stats + sort usage: an irrelevant sort should be pulled up passed inline stats because inline stats right-hand side should use the "original" relevant set of data (and an unbounded sorted set of data is irrelevant) and should be considered for adoption.
I still need to dig through the surgical take from rewriteMidProjections method and see why it's so surgical :-), but this is for the second round of review.
| * <p> | ||
| * See also {@link PruneRedundantOrderBy}. | ||
| */ | ||
| public final class PullUpOrderByBeforeInlineJoin extends OptimizerRules.OptimizerRule<LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense for this rule to be OptimizerRules.CoordinatorOnly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can't find a scenario where this wouldn't need to run on coordinator, but could run on the data node. Will update.
| */ | ||
| public void testInlineJoinPrunedAfterSortAndLookupJoin() { | ||
| var query = """ | ||
| FROM airports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is interesting in that if I add count to the keep I start getting Unbounded SORT not supported yet [SORT abbrev DESC] please add a LIMIT\nline 31:15: INLINE STATS [INLINE STATS count=COUNT(*) BY scalerank] cannot yet have an unbounded SORT [SORT abbrev DESC] before it : either move the SORT after it, or add a LIMIT before the SORT which might be correct, but it is confusing to the user. SORT abbrev was there before when count was not kept.
We eliminate the inline stats because we don't really need (and the sort is irrelevant from the point of view of inline stats) it but the outcome is definitely confusing to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed.
astefan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left some suggestions, more or less optional.
| 10029 |4 |null |74999 | ||
| ; | ||
|
|
||
| mixedShadowingInlineStatsAfterSort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a variant of this test, as well:
FROM employees
| KEEP salary, emp_no, first_name, gender
| EVAL old_salary = salary
| SORT old_salary, emp_no
| DROP old_salary
| INLINE STATS salary = COUNT(*) BY gender
| LIMIT 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And another one:
FROM employees
| KEEP salary, emp_no, first_name, gender
| EVAL old_salary = salary
| SORT old_salary, emp_no
| DROP old_salary
| EVAL salary = salary / 1000
| RENAME salary AS s
| SORT s, first_name
| INLINE STATS s = COUNT(*) BY gender
| LIMIT 5
| protected LogicalPlan rule(LogicalPlan plan) { | ||
| return plan.transformUp(LogicalPlan.class, PullUpOrderByBeforeInlineJoin::pullUpOrderByBeforeInlineJoin); | ||
| } | ||
|
|
||
| private static LogicalPlan pullUpOrderByBeforeInlineJoin(LogicalPlan plan) { | ||
| if (plan instanceof InlineJoin inlineJoin) { | ||
| Holder<OrderBy> orderByHolder = new Holder<>(); | ||
| inlineJoin.forEachDownMayReturnEarly((node, breakEarly) -> { | ||
| if (node instanceof OrderBy orderBy) { | ||
| orderByHolder.set(orderBy); | ||
| breakEarly.set(true); | ||
| } else { | ||
| breakEarly.set(isSortBreaker(node)); | ||
| } | ||
| }); | ||
|
|
||
| OrderBy orderBy = orderByHolder.get(); | ||
| plan = orderBy == null ? plan : pullUpOrderByBeforeInlineJoin(inlineJoin, orderBy); | ||
| } | ||
| return plan; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| protected LogicalPlan rule(LogicalPlan plan) { | |
| return plan.transformUp(LogicalPlan.class, PullUpOrderByBeforeInlineJoin::pullUpOrderByBeforeInlineJoin); | |
| } | |
| private static LogicalPlan pullUpOrderByBeforeInlineJoin(LogicalPlan plan) { | |
| if (plan instanceof InlineJoin inlineJoin) { | |
| Holder<OrderBy> orderByHolder = new Holder<>(); | |
| inlineJoin.forEachDownMayReturnEarly((node, breakEarly) -> { | |
| if (node instanceof OrderBy orderBy) { | |
| orderByHolder.set(orderBy); | |
| breakEarly.set(true); | |
| } else { | |
| breakEarly.set(isSortBreaker(node)); | |
| } | |
| }); | |
| OrderBy orderBy = orderByHolder.get(); | |
| plan = orderBy == null ? plan : pullUpOrderByBeforeInlineJoin(inlineJoin, orderBy); | |
| } | |
| return plan; | |
| } | |
| protected LogicalPlan rule(LogicalPlan plan) { | |
| return plan.transformUp(InlineJoin.class, ij -> pullUpOrderByBeforeInlineJoin(ij)); | |
| } | |
| private static LogicalPlan pullUpOrderByBeforeInlineJoin(InlineJoin inlineJoin) { | |
| Holder<OrderBy> orderByHolder = new Holder<>(); | |
| inlineJoin.forEachDownMayReturnEarly((node, breakEarly) -> { | |
| if (node instanceof OrderBy orderBy) { | |
| orderByHolder.set(orderBy); | |
| breakEarly.set(true); | |
| } else { | |
| breakEarly.set(isSortBreaker(node)); | |
| } | |
| }); | |
| OrderBy orderBy = orderByHolder.get(); | |
| return orderBy == null ? inlineJoin : pullUpOrderByBeforeInlineJoin(inlineJoin, orderBy); | |
| } |
| return plan.transformUp(lp -> { | ||
| if (lp == eval) { | ||
| evalVisited.set(true); | ||
| } else if (lp instanceof Project project && evalVisited.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find a problem, but I wanted to check the aggregation (since it's considered to be a variant of projection) and I would like to propose unit test/csv test with (I know you have one unit test with sort ... stats ... inline stats):
FROM employees
| KEEP salary, emp_no, first_name, gender
| SORT salary
| STATS salary = MAX(salary) BY gender
| SORT salary DESC
| INLINE STATS s = COUNT(*) BY gender
| LIMIT 5
and a slightly different variant:
FROM employees
| KEEP salary, emp_no, first_name, gender
| SORT salary
| STATS salary = MAX(salary) BY gender
| INLINE STATS s = COUNT(*) BY gender
| LIMIT 5
alex-spies
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly superficial review, but I think the approach makes sense. There's very likely a hidden problem with projects that's worth looking into, even though it may not affect current production queries at this time.
And we need to align on opt-in vs. opt-out optimizer rules and the use of interfaces. For now, I'd really like us to use interfaces to opt-in to optimizer rules so that correctness is guaranteed rather than finding out later, when more commands get added to ESQL, that we ran queries that always returned wrong results to the user because a new command silently opted into an optimization (like hoisting a sort before inline stats while the new command is in between) that it doesn't actually work with.
At the very least, let's be consistent: we've already started using opt-in interfaces and have 3 (or more) rules that use them; having an opt-out list of commands as part of the optimizer rule just for 1 rule would be confusing and breaking the pattern.
| * <p> | ||
| * See also {@link PruneRedundantOrderBy}. | ||
| */ | ||
| public final class PullUpOrderByBeforeInlineJoin extends OptimizerRules.OptimizerRule<LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we've started calling "push up" "hoist" in other rules.
| /** | ||
| * Returns `true` if the {@code plan}'s position cannot be swapped with a SORT, `false` otherwise. | ||
| */ | ||
| private static boolean isSortBreaker(LogicalPlan plan) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to become an interface; a list of instanceofs is too easy to go stale. It's also inconsistent that multiple optimizer rules already use interfaces that allow logical plan nodes to opt into the optimizations, but here we have an instanceof list instead.
E.g. Aggregate will become a sort breaker most likely once we support window functions; when someone who's unaware of this rule works on window functions, they are almost guaranteed to miss this and the problem will only surface through sufficient test coverage by humans - so the reviewer would have to keep this in mind and say "hey, but what about inline stats with window functions when there is an agg before the inline stats and before that there's a sort"? That makes reviews pretty hard.
I also think the interface should be one that opts into this optimizer rule, not one that opts out of it.
If we're worried about the zoo of interfaces being too large to manage by non-expert devs, we can have an abstract super class StandardLogicalPlan (probably with a much better name - RowStreamingLogicalPlan or so, for commands like Eval, Grok, Enrich etc.?) that extends all the opt-in interfaces that most plans would normally work with; and then new commands would generally just extend this class unless they're special enough.
| case MvExpand m -> true; // generative node, can destabilize the order | ||
| case Limit l -> true; | ||
| case TopN t -> true; | ||
| default -> false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--
|
|
||
| return evalAliases.isEmpty() | ||
| ? orderBy.replaceChild(inlineJoin.transformUp(OrderBy.class, ob -> ob == orderBy ? orderBy.child() : ob)) | ||
| : pullUpRewritingMidProjections(inlineJoin, orderBy, evalAliases, orderByAttrMapBuilder.build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this doesn't account for potential project nodes that are between the inline join and the sort, which would discard the temporary attributes that are created to keep the order attributes around.
Reproducing this is hard because it requires a plan node that doesn't get pushed down past projects BUT isn't a sort breaker. We may not have such nodes atm.
But someone may add such a plan node later; we've seen real bugs from MV_EXPAND because it prevented some projections to bubble up to the top, and then similar creation of temp attributes upstream were broken because the temp attributes got dropped before the place where they were needed.
Also, if there's any project node after the order by right now, at the very least this rule will put the plan into an inconsistent state and we shouldn't rely on the leniency of our push-down-past-project rules to fix that inconsistency (even though they do fix such problems at the moment, quasi by accident).
Add optimisation rule to pull
OrderByaboveInlineJoin. This is required since otherwise theOrderBywon't be moved out of the left hand-side of the join, ending up as a stand-alone node that can't be turned into an executable -- it'll just be rejected by the verification as an "unbounded sort". Since theInlineJoinis sort agnostic, theOrderBycan be moved upwards past it, ending up as a topTopN.This doesn't entirely solve the issue of unbounded sorts, however: some nodes, such as
MV_EXPANDorLOOKUP JOINbreak the inbound sort, so pulling aSORTon top of them isn't possible. In this cases the query will still fail. This is notINLINE JOINspecific, though.This is a revival of #132417, but with a new approach on how to deal with the attributes used by
SORT, but dropped or overshadowed by the timeINLINE JOINis reached. In this case, we'll add temporary internal attributes kept until the INLINE JOIN and dropped afterwards.Related #124715 (#113727)
Related #124721 (#133120)