ESQL: Allow pruning columns added by InlineJoin #131204

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

bpintea merged 9 commits into elastic:main from bpintea:enh/inlinejoin_prune

Jul 28, 2025

Contributor

bpintea commented Jul 14, 2025

This updates PruneColumns rule to allow optimizing away the columns added by InlineJoin, dropping unnecessary agg'ing.

bpintea added 2 commits

July 14, 2025 14:38


          Allow pruning columns added by InlineJoin

7867a30

This updates PruneColumns rule to allow optimizing away the columns
added by InlineJoin, dropping unnecessary agg'ing.


          revert logging

64ecf69

bpintea added >enhancement :Analytics/ES|QL v9.1.0 labels

elasticsearchmachine added v9.2.0 Team:Analytics labels

Collaborator

elasticsearchmachine commented Jul 14, 2025

Pinging @elastic/es-analytical-engine (Team:Analytics)


          Update docs/changelog/131204.yaml

d3d1d24

Collaborator

elasticsearchmachine commented Jul 14, 2025

Hi @bpintea, I've created a changelog YAML for you.

bpintea removed the v9.1.0 label

bpintea marked this pull request as draft

July 14, 2025 15:43

bpintea added 2 commits

July 15, 2025 15:53


          Update folding past LOOKUP JOIN

62ad977


          Merge branch 'main' into enh/inlinejoin_prune

13338c3

bpintea requested review from alex-spies, astefan and luigidellaquila

July 15, 2025 15:32

bpintea marked this pull request as ready for review

July 15, 2025 15:32

astefan reviewed

View reviewed changes

Contributor

astefan left a comment

I think it's looking good, but the refactoring can be made a bit more digestible with more comments and smarter method names.
I didn't check the correctness of the optimizer rule unit tests, only looked at the queries they are using.

I have noticed that, after chasing another bug I discovered, that it's fixing another behavior as well.

In #124715, it was described as FROM kibana_sample_data_logs | EVAL timestamp=DATE_TRUNC(5 minute, @timestamp) | INLINESTATS results = count(*) by timestamp | keep results, timestamp. For the dataset used in csv-spec files:

FROM employees | WHERE still_hired == false | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | SORT emp_no

FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | WHERE still_hired == false

FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal) BY gender | keep emp_no, still_hired, total

row salary = 12300, emp_no = 5, gender = "F" | eval sal = salary/1000 | inlinestats sum(sal) by gender | keep emp_no

^ here the presence of that evaluation before inlinestats is at least one of the triggers. We do have some queries with eval before inlinestats and with keep after it which succeed. Those above do not. I recommend adding some flavors of these to the csv-spec file.

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PruneColumns.java

    
                      // track inlinestats' own aggregation output (right-hand side of the join) so that any other plan on the left-hand side of the

                      // inline join won't have its columns pruned due to the lack of "visibility" into the right hand side output/Attributes

                      var inlineJoinRightOutput = new ArrayList<Attribute>();

                      return pruneColumns(plan, plan.outputSet().asBuilder(), false);

Contributor

astefan Jul 16, 2025 •

edited

Loading

This is a harder to follow code, but with some comments it can be made easier to digest.
For one, I would add two methods; one LogicalPlan pruneColumns(LogicalPlan plan, AttributeSet.Builder used) and use it here. This first method would call pruneColumns(plan, plan.outputSet().asBuilder(), false).
The second method would be pruneInlineJoinColumns(LogicalPlan plan, AttributeSet.Builder used) and call pruneColumns(plan, plan.outputSet().asBuilder(), true).

Contributor Author

bpintea Jul 28, 2025

I liked the uniformity of the previous namings, but I realise that it doesn't make the flow obvious at call sites, so I've renamed the methods to make it clearer what happens with InlineJoin instances. But happy to iterate if still unclear.

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PruneColumns.java Outdated

    
                                  p = aggregate.with(aggregate.groupings(), remaining);

                              }

                          } else {

                              if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality

Contributor

astefan Jul 16, 2025

I don't understand this comment tbh // not expecting high groups cardinality

Contributor Author

bpintea Jul 28, 2025

The editor suggests the use of HashSets for faster inclusion check. But I think that'd be more costly, for most of the cases, since typically queries won't aggregate by hundred of groups.

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PruneColumns.java Outdated

    
                      var remaining = pruneUnusedAndAddReferences(aggregate.aggregates(), used);

                      if (remaining != null) {

Contributor

astefan Jul 16, 2025

What if you checked here if (remaining == null) return p;? to have a smaller if branch?

Contributor Author

bpintea Jul 28, 2025

I typically prefer this style (you suggest) too. But I think generally the code base uses the one I used. However, I applied your suggestion.

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PruneColumns.java Outdated

    
                              if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality

                                  // It's an INLINEJOIN and all remaining attributes are groupings, which are already part of the IJ output (from the

                                  // left-hand side).

                                  if (aggregate.child() instanceof StubRelation stub) {

Contributor

astefan Jul 16, 2025 •

edited

Loading

I think this condition will need an update when we start supporting filters on inlinestats aggs... if that guess is valid, I think a TODO is in order here. If you will add that TODO, for some (all, most I hope) I added // TODO Inlinestats:.... to find them quicker later.

Contributor Author

bpintea Jul 28, 2025

Added the TODO (as used in EsqlSession too).

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PruneColumns.java Outdated

    
                      return p;

                  }

                  private static LogicalPlan pruneColumns(Project project, AttributeSet.Builder used) {

Contributor

astefan Jul 16, 2025

I think this is the meat of the PR, right?

Contributor Author

bpintea Jul 28, 2025

Handling Project is part of it, but the main change is calling (the now only method called) pruneColumns() within the (now renamed) pruneColumnsInInlineJoin() and using updated used set, since INLINEJOIN can be followed by any other command and each of these will need to have columns inspected for potential pruning, but aware of the fact that the pruning is done following INLINEJOIN; this "awareness" allows dropping the INLINEJOIN node when its output is entirely contained within the left-hand side of the join already.

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PruneColumns.java

    
                          p = remaining.isEmpty() || remaining.stream().allMatch(FieldAttribute.class::isInstance)

                              ? emptyLocalRelation(project)

                              : new Project(project.source(), project.child(), remaining);

                      } else if (project.output().stream().allMatch(FieldAttribute.class::isInstance)) {

Contributor

astefan Jul 16, 2025

Why are we interested in this aspect? (the fact that the projection is made of fields only)

Contributor Author

bpintea Jul 28, 2025

The explanation is the one given in the previous comment -- I've added a code comment too.

bpintea added 4 commits

July 28, 2025 11:57


          Merge branch 'main' into enh/inlinejoin_prune

4f004e6


          Review comments

e824c30


          Merge branch 'main' into enh/inlinejoin_prune

458b94a


          bump inlinestats cap version

c458cc3

astefan approved these changes

View reviewed changes

Contributor

astefan left a comment

LGTM

Contributor Author

bpintea commented Jul 28, 2025

Thanks, Andrei!

bpintea merged commit d3a07f7 into elastic:main

33 checks passed

bpintea deleted the enh/inlinejoin_prune branch

July 28, 2025 15:19

szybia added a commit to szybia/elasticsearch that referenced this pull request


          Merge remote-tracking branch 'upstream/main' into component-templates…

ef8ec7a

…-tracking

* upstream/main: (26 commits)
  Add release notes for v9.1.0 release (elastic#131953)
  Unmute multi_node generative tests (elastic#132021)
  Avoid re-enqueueing merge tasks (elastic#132020)
  Fix file entitlements for shared data dir (elastic#131748)
  ES|QL brute force l2_norm vector function (elastic#132025)
  Make ES|QL SAMPLE not a pipeline breaker (elastic#132014)
  Speed up tail computation in MemorySegmentES91OSQVectorsScorer (elastic#132001)
  Remove deprecated usages in `TransportPutFollowAction` (elastic#132038)
  Simulate impact of shard movement using shard-level write load (elastic#131406)
  Remove RemoteClusterService.getConnections() method (elastic#131948)
  Fix off by one in ValuesBytesRefAggregator (elastic#132032)
  Use unicode strings in data generation by default (elastic#132028)
  Adding index.refresh_interval as a data stream setting (elastic#131482)
  [ES|QL] Add more Min/MaxOverTime CSV tests (elastic#131070)
  Restrict remote ENRICH after FORK (elastic#131945)
  Fix decoding of non-ascii field names in ignored source (elastic#132018)
  [docs] Use centrally maintained version variables (elastic#131939)
  Configurable Inference timeout during Query time (elastic#131551)
  ESQL: Allow pruning columns added by InlineJoin (elastic#131204)
  ESQL: Fail `profile` on text response formats (elastic#128627)
  ...

bpintea mentioned this pull request

ESQL: Enable multiple INLINESTATS + remaining tests #124715

Open

7 tasks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL >enhancement Team:Analytics v9.2.0