Skip to content

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Jun 12, 2025

The local plan optimizer should not change the layout, as it has already been agreed upon. However, CombineProjections can violate this when some grouping elements refer to the same attribute. This occurs when ReplaceFieldWithConstantOrNull replaces missing fields with the same reference for a given data type.

I see several ways to address this issue:

  1. Skip CombineProjections in the local plan. This optimization is useful in the global plan, but likely unnecessary in the local plan, especially since we now need to retain the same number of aggregates and groupings.

  2. Update RuleUtils#aliasedNulls to generate an eval for each missing field separately, rather than one per data type.

This change adopts option 2 as it is simpler and more robust. However, either option is fine, and I am open to other suggestions.

Closes #128054
Closes #129811

@dnhatn dnhatn added v8.18.3 v8.19.0 v9.0.3 auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL >bug labels Jun 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn marked this pull request as ready for review June 12, 2025 22:48
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@costin costin requested a review from astefan June 14, 2025 05:10
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a first look at it, but I need more time to understand it.
It seems to be doing what the comment in RuleUtils.aliasedNulls is suggesting, which is good.

When I tested this I really had to use "max_concurrent_shards_per_node":1 (as you have it in your IT test). Why is this bit needed, if you can shed some light?

@bpintea
Copy link
Contributor

bpintea commented Jun 17, 2025

Nice find of the cause!

This change adopts option 2 as it is simpler and more robust.

I think skipping CombineProjections in LocalLogicalPlanOptimizer should be (1) even simpler (basically, similar to this line) and likely needed anyways, as indeed, the rule (1) shouldn't normally apply locally, but (2) could also change the layout (as it happened).

So IMO we should at a minimum apply 1. Then, the NULL aliasing should make a difference with many locally missing fields, so even it has some fragility, I'd opt to keep it (at least until we substitute with something else).

@alex-spies
Copy link
Contributor

alex-spies commented Jun 18, 2025

Hey, nice find @dnhatn !

Update RuleUtils#aliasedNulls to generate an eval for each missing field separately, rather than one per data type.

I believe this is a good quick fix.

I think skipping CombineProjections in LocalLogicalPlanOptimizer should be (1) even simpler (basically, similar to this line) and likely needed anyways, as indeed, the rule (1) shouldn't normally apply locally, but (2) could also change the layout (as it happened)

I wouldn't disable this locally altogether. There are very valid reasons to combine projections on the local node; for one, ProjectAwayColumns introduces a projection that's only present on local nodes; also, ReplaceFieldWithConstantOrNull introduces a projection only locally; combining those makes sense and simplifies our plans.

I think the crux is that on a local node Aggregate can actually mean "this will be the initial aggregate that corresponds to an intermediate/final aggregate downstream and thus mustn't have its output changed".

Thus, I think a more robust approach would be to mark such Aggregates and avoid the deduplication of the groupings here for any aggregates marked in such a way.

I know this requires a new transport version, but we could add a fixedOutput field to Aggregate and check it in CombineProjections.

@bpintea
Copy link
Contributor

bpintea commented Jun 18, 2025

I wouldn't disable this locally altogether. There are very valid reasons to combine projections on the local node; for one, ProjectAwayColumns introduces a projection that's only present on local nodes; also, ReplaceFieldWithConstantOrNull introduces a projection only locally; combining those makes sense and simplifies our plans.

I think ProjectAwayColumns should matter later, as a physical rule, but the projections that ReplaceFieldWithConstantOrNull produces should be combined, you're right!

Thus, I think a more robust approach would be to mark such Aggregates and avoid the deduplication of the groupings here for any aggregates marked in such a way.

CombineProjections would need to become aware of a special Aggregate flavour, then, plus have a rule to mark these instances (coming/going from/in a fragment). An alternative might be to just make CombineProjections location-aware (coordinator vs. local) and based on this discriminator just use a list, rather then a set, when building the output of the agg in #combineUpperGroupingsAndLowerProjections? That shouldn't require transport changes.

@alex-spies
Copy link
Contributor

I think ProjectAwayColumns should matter later, as a physical rule,

It adds a logical projection into the fragment here. Currently, this top projection gets normally combined with anything that rules like ReplaceFieldWithConstantOrNull add.

An alternative might be to just make CombineProjections location-aware (coordinator vs. local) and based on this discriminator just use a list, rather then a set, when building the output of the agg in #combineUpperGroupingsAndLowerProjections? That shouldn't require transport changes.

Oh, right - I thought this wasn't an option because there may be other aggs in the local plan, but of course there can be at most one aggregate 🤦 . So, a local flavor of CombineProjections would do the trick!

@bpintea
Copy link
Contributor

bpintea commented Jun 18, 2025

Laterally, I was also wondering if we couldn't do a verification of the outputs before and after local replanning.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The code looks good to me, maybe a bit more unit tests (not sure exactly what) that show the different "context" (coordinator/data node) of CombineProjections in "action", like LogicalPlan optimizer tests on both coordinator and data node branches.

@dnhatn dnhatn merged commit 2bc6284 into elastic:main Jun 25, 2025
32 checks passed
@dnhatn dnhatn deleted the local-rules branch June 25, 2025 18:48
@dnhatn
Copy link
Member Author

dnhatn commented Jun 25, 2025

Thanks friends for reviews and feedback!

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 25, 2025
The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes elastic#128054
Closes elastic#129811

(cherry picked from commit 2bc6284)
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 25, 2025
The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes elastic#128054
Closes elastic#129811

(cherry picked from commit 2bc6284)
elasticsearchmachine pushed a commit that referenced this pull request Jun 26, 2025
The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes #128054
Closes #129811

(cherry picked from commit 2bc6284)
dnhatn added a commit that referenced this pull request Jun 26, 2025
The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes #128054
Closes #129811

(cherry picked from commit 2bc6284)
@elastic elastic deleted a comment from elasticsearchmachine Jun 26, 2025
@dnhatn
Copy link
Member Author

dnhatn commented Jun 26, 2025

💚 All backports created successfully

Status Branch Result
9.0
8.19
8.18
8.17

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 26, 2025
The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes elastic#128054
Closes elastic#129811

(cherry picked from commit 2bc6284)
elasticsearchmachine pushed a commit that referenced this pull request Jun 26, 2025
The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes #128054
Closes #129811

(cherry picked from commit 2bc6284)
@alex-spies
Copy link
Contributor

@dnhatn , I realized that 8.17 might also be affected - the optimization of missing fields changed with #126187 (which triggered the bug in CombineProjections to surface on main).

Should we backport one more time, to 8.17?

@dnhatn dnhatn added the v8.17.9 label Jun 26, 2025
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 26, 2025
…lastic#130060)

The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes elastic#128054
Closes elastic#129811
@dnhatn
Copy link
Member Author

dnhatn commented Jun 26, 2025

Should we backport one more time, to 8.17?

++ I opened #130128. Thanks Alex!

elasticsearchmachine pushed a commit that referenced this pull request Jun 26, 2025
…#130128)

The local plan optimizer should not change the layout, as it has already
been agreed upon. However, CombineProjections can violate this when some
grouping elements refer to the same attribute. This occurs when
ReplaceFieldWithConstantOrNull replaces missing fields with the same
reference for a given data type.

Closes #128054
Closes #129811
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.17.9 v8.18.4 v8.19.0 v9.0.4 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ArrayIndexOutOfBoundsException in ValuesBytesRefGroupingAggregatorFunction [ES|QL] ClassCastException in PercentileDoubleGroupingAggregatorFunction

5 participants