Skip to content

Conversation

@bpintea
Copy link
Contributor

@bpintea bpintea commented Jul 14, 2025

This updates PruneColumns rule to allow optimizing away the columns added by InlineJoin, dropping unnecessary agg'ing.

bpintea added 2 commits July 14, 2025 14:38
This updates PruneColumns rule to allow optimizing away the columns
added by InlineJoin, dropping unnecessary agg'ing.
@elasticsearchmachine elasticsearchmachine added v9.2.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Jul 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @bpintea, I've created a changelog YAML for you.

@bpintea bpintea removed the v9.1.0 label Jul 14, 2025
@bpintea bpintea marked this pull request as draft July 14, 2025 15:43
@bpintea bpintea marked this pull request as ready for review July 15, 2025 15:32
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's looking good, but the refactoring can be made a bit more digestible with more comments and smarter method names.
I didn't check the correctness of the optimizer rule unit tests, only looked at the queries they are using.

I have noticed that, after chasing another bug I discovered, that it's fixing another behavior as well.

In #124715, it was described as FROM kibana_sample_data_logs | EVAL timestamp=DATE_TRUNC(5 minute, @timestamp) | INLINESTATS results = count(*) by timestamp | keep results, timestamp. For the dataset used in csv-spec files:

FROM employees | WHERE still_hired == false | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | SORT emp_no

FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal), count=COUNT(*) BY gender | keep emp_no, still_hired, total, count | WHERE still_hired == false

FROM employees | EVAL sal = salary/1000 | INLINESTATS total = SUM(sal) BY gender | keep emp_no, still_hired, total

row salary = 12300, emp_no = 5, gender = "F" | eval sal = salary/1000 | inlinestats sum(sal) by gender | keep emp_no

^ here the presence of that evaluation before inlinestats is at least one of the triggers. We do have some queries with eval before inlinestats and with keep after it which succeed. Those above do not. I recommend adding some flavors of these to the csv-spec file.

// track inlinestats' own aggregation output (right-hand side of the join) so that any other plan on the left-hand side of the
// inline join won't have its columns pruned due to the lack of "visibility" into the right hand side output/Attributes
var inlineJoinRightOutput = new ArrayList<Attribute>();
return pruneColumns(plan, plan.outputSet().asBuilder(), false);
Copy link
Contributor

@astefan astefan Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a harder to follow code, but with some comments it can be made easier to digest.
For one, I would add two methods; one LogicalPlan pruneColumns(LogicalPlan plan, AttributeSet.Builder used) and use it here. This first method would call pruneColumns(plan, plan.outputSet().asBuilder(), false).
The second method would be pruneInlineJoinColumns(LogicalPlan plan, AttributeSet.Builder used) and call pruneColumns(plan, plan.outputSet().asBuilder(), true).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the uniformity of the previous namings, but I realise that it doesn't make the flow obvious at call sites, so I've renamed the methods to make it clearer what happens with InlineJoin instances. But happy to iterate if still unclear.

p = aggregate.with(aggregate.groupings(), remaining);
}
} else {
if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment tbh // not expecting high groups cardinality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The editor suggests the use of HashSets for faster inclusion check. But I think that'd be more costly, for most of the cases, since typically queries won't aggregate by hundred of groups.


var remaining = pruneUnusedAndAddReferences(aggregate.aggregates(), used);

if (remaining != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you checked here if (remaining == null) return p;? to have a smaller if branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typically prefer this style (you suggest) too. But I think generally the code base uses the one I used. However, I applied your suggestion.

if (inlineJoin && aggregate.groupings().containsAll(remaining)) { // not expecting high groups cardinality
// It's an INLINEJOIN and all remaining attributes are groupings, which are already part of the IJ output (from the
// left-hand side).
if (aggregate.child() instanceof StubRelation stub) {
Copy link
Contributor

@astefan astefan Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this condition will need an update when we start supporting filters on inlinestats aggs... if that guess is valid, I think a TODO is in order here. If you will add that TODO, for some (all, most I hope) I added // TODO Inlinestats:.... to find them quicker later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the TODO (as used in EsqlSession too).

return p;
}

private static LogicalPlan pruneColumns(Project project, AttributeSet.Builder used) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the meat of the PR, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling Project is part of it, but the main change is calling (the now only method called) pruneColumns() within the (now renamed) pruneColumnsInInlineJoin() and using updated used set, since INLINEJOIN can be followed by any other command and each of these will need to have columns inspected for potential pruning, but aware of the fact that the pruning is done following INLINEJOIN; this "awareness" allows dropping the INLINEJOIN node when its output is entirely contained within the left-hand side of the join already.

p = remaining.isEmpty() || remaining.stream().allMatch(FieldAttribute.class::isInstance)
? emptyLocalRelation(project)
: new Project(project.source(), project.child(), remaining);
} else if (project.output().stream().allMatch(FieldAttribute.class::isInstance)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we interested in this aspect? (the fact that the projection is made of fields only)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explanation is the one given in the previous comment -- I've added a code comment too.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bpintea
Copy link
Contributor Author

bpintea commented Jul 28, 2025

Thanks, Andrei!

@bpintea bpintea merged commit d3a07f7 into elastic:main Jul 28, 2025
33 checks passed
@bpintea bpintea deleted the enh/inlinejoin_prune branch July 28, 2025 15:19
szybia added a commit to szybia/elasticsearch that referenced this pull request Jul 29, 2025
…-tracking

* upstream/main: (26 commits)
  Add release notes for v9.1.0 release (elastic#131953)
  Unmute multi_node generative tests (elastic#132021)
  Avoid re-enqueueing merge tasks (elastic#132020)
  Fix file entitlements for shared data dir (elastic#131748)
  ES|QL brute force l2_norm vector function (elastic#132025)
  Make ES|QL SAMPLE not a pipeline breaker (elastic#132014)
  Speed up tail computation in MemorySegmentES91OSQVectorsScorer (elastic#132001)
  Remove deprecated usages in `TransportPutFollowAction` (elastic#132038)
  Simulate impact of shard movement using shard-level write load (elastic#131406)
  Remove RemoteClusterService.getConnections() method (elastic#131948)
  Fix off by one in ValuesBytesRefAggregator (elastic#132032)
  Use unicode strings in data generation by default (elastic#132028)
  Adding index.refresh_interval as a data stream setting (elastic#131482)
  [ES|QL] Add more Min/MaxOverTime CSV tests (elastic#131070)
  Restrict remote ENRICH after FORK (elastic#131945)
  Fix decoding of non-ascii field names in ignored source (elastic#132018)
  [docs] Use centrally maintained version variables (elastic#131939)
  Configurable Inference timeout during Query time (elastic#131551)
  ESQL: Allow pruning columns added by InlineJoin (elastic#131204)
  ESQL: Fail `profile` on text response formats (elastic#128627)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants