
Conversation

@GalLalouche (Contributor) commented Aug 12, 2025

This PR adds a late(r) materialization for TopN queries, such that the materialization happens in the "node_reduce" phase instead of during the "data" phase.

For example, if the limit is 20 and each data node spawns 10 workers, we would only read the additional columns (i.e., ones not needed to compute the TopN itself) for 20 rows, instead of for 200 (10 workers × 20 rows each). To support this, the reducer node maintains a global list of all shard contexts used by its individual data workers (although some of those might be closed if they are no longer needed, thanks to #129454).

There is some additional book-keeping involved, since previously, every data node held a local list of shard contexts, and used its local indices to access it. To avoid changing too much (this local-index logic is spread throughout much of the code!), a new global index is introduced, which replaces the local index after all the rows are merged together in the reduce phase's TopN.
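To make the book-keeping a bit more concrete, here is a rough, hypothetical sketch of the remapping idea described above (the class and method names are made up for illustration; this is not the PR's actual code):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration: each data worker's local shard contexts are appended
    // to one global list owned by the reducer, and local indices become global ones
    // by adding the worker's offset.
    class GlobalShardContextRegistry {
        private final List<Object> globalContexts = new ArrayList<>();

        // Registers one worker's local shard contexts; the returned offset turns that
        // worker's local index into a global index (globalIndex = localIndex + offset).
        synchronized int register(List<?> localContexts) {
            int offset = globalContexts.size();
            globalContexts.addAll(localContexts);
            return offset;
        }

        // After the reduce-phase TopN has merged all rows, only global indices are used
        // to look up the shard context needed to fetch the remaining columns.
        synchronized Object resolve(int globalIndex) {
            return globalContexts.get(globalIndex);
        }
    }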

@GalLalouche GalLalouche requested a review from nik9000 August 19, 2025 13:24
@alex-spies (Contributor) left a comment:

Heya, I'm working my way through the changes, focusing mostly on the planning aspect.

Do we have at least some smoke tests that confirm that, with the pragma disabled, our reduction and data node plans remain the same as before this PR or, at least, remain correct? That'd be fantastic, because the changes to the reduction planning are substantial enough that it's hard to confirm that they're safe by review alone. (Of course, our ITs do cover a lot of ground, so there's no reason to be overly paranoid, but still.)


public static final Setting<Boolean> NODE_LEVEL_REDUCTION = Setting.boolSetting("node_level_reduction", true);

public static final Setting<Boolean> REDUCTION_LATE_MATERIALIZATION = Setting.boolSetting("reduction_late_materialization", false);
Contributor: ++

Member:
Should this be a pragma or in PlannerSettings? With PlannerSettings it's a cluster-level setting, so we can disable it in serverless, and anyone with their own cluster could disable it rather than requiring it on every request.
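(For illustration, a cluster-level flag along these lines could be declared roughly as follows; the fully-qualified key and the property flags here are assumptions for the sketch, not the PR's actual code.)

    import org.elasticsearch.common.settings.Setting;

    // Hypothetical sketch: a dynamic, node-scoped cluster setting instead of a per-request pragma.
    public static final Setting<Boolean> REDUCTION_LATE_MATERIALIZATION = Setting.boolSetting(
        "esql.reduction_late_materialization",  // assumed key; the real key may differ
        false,                                  // off by default until the feature is ready
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );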

Contributor Author: Moved to PlannerSettings.

Contributor:

@GalLalouche and I came up with the pragma to have a way to hide the feature until we're ready to switch it on in a separate PR.

Do we need a setting for this? For the purpose stated above, it'd even suffice to enable the behavior only in SNAPSHOT builds, in principle.

Of course, if we find it useful to keep this setting around, it's fine this way.

@nik9000 (Member) left a comment:

Left a few final comments. I think we're good except for some extra logging questions. We have to be careful not to log a zillion times on each failure, or we won't be able to hear ourselves think when production has problems. OTOH, I haven't traced through to be sure that the logs are bad or wrong - just that their level should be lowered, and we should double-check whether we want to keep them.

I'll give this another read soon to be sure, but it looks right to me.



@nik9000 (Member) left a comment:

Last tranche of reviews. It's all "javadoc please" and "change the logging level" stuff.

For the log things, I think what would make me most comfortable is to make them all debug, and then make follow-up PRs to change their level, where we can talk about that.

@GalLalouche (Contributor Author) commented Oct 5, 2025

> Do we have at least some smoke tests that confirm that, with the pragma disabled, our reduction and data node plans remain the same as before this PR or, at least, remain correct?

@alex-spies There is a test in EsqlActionTaskIT that checks the plans with the feature turned on and off.

@alex-spies (Contributor) left a comment:

Made another pass, close to finishing the whole review.



boolean splitTopN
) {
PhysicalPlan source = new ExchangeSourceExec(originalPlan.source(), originalPlan.output(), originalPlan.isIntermediateAgg());
if (splitTopN == false && runNodeLevelReduction == false) {
Contributor:

Shouldn't that just be

Suggested change:
- if (splitTopN == false && runNodeLevelReduction == false) {
+ if (runNodeLevelReduction == false) {

?

I guess the case where runNodeLevelReduction == false and splitTopN == true is invalid to begin with, but if we were to end up in this case, the current code will happily apply a node-level reduction for TopN even when it's disabled per runNodeLevelReduction, no?

Contributor Author:

> I guess the case where runNodeLevelReduction == false and splitTopN == true is invalid to begin with

Not really, these are two separate features, at least the way it's designed. We can make one dependent on the other, of course, but do we want to? (This was actually how I implemented it before adding the query pragma/planner setting, and it also made the actual handling of these flags more annoying, due to the way runNodeLevelReduction is modified 😕.)

> but if we were to end up in this case, the current code will happily apply a node-level reduction for TopN even when it's disabled per runNodeLevelReduction, no?

Which is fine, since it's two different features.

Contributor Author:

As discussed offline, I ended up replacing this with a snapshot check, and this will depend on runNodeLevelReduction.
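(A minimal sketch of what that resolution could look like, assuming the flag names from the snippet above and Elasticsearch's Build.current().isSnapshot() check; illustrative only, the merged code may differ.)

    import org.elasticsearch.Build;

    final class LateMaterializationGate {
        // Late-materializing TopN reduction is only considered when node-level
        // reduction is enabled, and (for now) only in SNAPSHOT builds.
        static boolean splitTopN(boolean runNodeLevelReduction) {
            return runNodeLevelReduction && Build.current().isSnapshot();
        }
    }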

@alex-spies (Contributor) left a comment:

Nearly there!

@alex-spies (Contributor) left a comment:

Ok, all done now.

From a query planning POV, this is A-OK. Thanks a lot for the iterations! I'm much more confident that this will hold up as the planner evolves, and it's much more defensive than it originally was. Really nice.

I left some comments, most of which are rather minor. Please consider this unblocked from my end, but it'd be nice if we could do the following (in follow-ups, of course, where doing it here would increase the scope too much):

  • Double check if we really need a new setting (#132757 (comment)); let's discuss this offline.
  • Add more tests: it'd be nice to have some optimizer tests for the reduction plan + data plan together, as currently our optimizer tests don't account for node-level reduction at all.
  • Clarify the concurrency and ownership of the ComputeSearchContexts (#132757 (comment) and #132757 (comment)). I understand these are shared between data and reduce drivers, but if we can't decRef contexts when running the reduce driver, I think we didn't properly take ownership of the reference to begin with, and this might become quite complicated going forward (a rough ownership sketch follows after this list).
  • Double check + test if we correctly perform the row estimation here
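(On the ownership point above, a minimal sketch of the usual ref-counting convention, using Elasticsearch's RefCounted interface; the helper and its names are hypothetical, not the PR's code.)

    import org.elasticsearch.core.RefCounted;

    // Hypothetical helper: whoever hands a shard context to the reduce driver takes a
    // reference first, and the reduce driver releases it when done, so contexts can be
    // decRef'd safely no matter which driver finishes last.
    final class SharedContextHandoff {
        static <T extends RefCounted> T acquireForReduceDriver(T shardContext) {
            shardContext.incRef();  // the reduce driver now owns one reference
            return shardContext;
        }

        static void releaseFromReduceDriver(RefCounted shardContext) {
            shardContext.decRef();  // releases (and possibly closes) only this reference
        }
    }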

@nik9000 (Member) left a comment:

I'm happy with it. Let's get Alex's last few comments solved and bring this thing in for a landing. We should get some rally benchmarks out of this. It's been a lot of work. We might get this for free from the nightly, but once you are good and ready to click that merge button, I think you should try and find the rally tracks that we run that'll benefit from this, so we can watch them.

@GalLalouche (Contributor Author) left a comment:

Thanks for the great in-depth review, @alex-spies! I've addressed everything, although there is still the issue of estimateRowSize, which is waiting for @nik9000's feedback, and the question of the dependence between the feature flags (runOnNodeReduce and reduceLateMaterialization, or whatever we call it).

);
for (String q : queries) {
- QueryPragmas pragmas = randomPragmas();
+ var pragmas = randomPragmas();
Contributor Author:

@nik9000 FYI. I remember we discussed this, though I don't remember the exact solution we agreed on (if we did).


FROM dense_vector
- | EVAL k = v_l2_norm(bit_vector, [1]) // workaround to enable fetching dense_vector
+ | EVAL k = v_l2_norm(bit_vector, [1,2]) // workaround to enable fetching dense_vector
Contributor Author:

See #136365.

@GalLalouche GalLalouche enabled auto-merge (squash) October 12, 2025 11:34
@alex-spies alex-spies self-assigned this Oct 13, 2025
@GalLalouche GalLalouche merged commit 0a7d113 into elastic:main Oct 13, 2025
34 checks passed
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 13, 2025

Labels: :Analytics/ES|QL (AKA ESQL), >feature, Team:Analytics (meta label for the analytical engine team, ESQL/Aggs/Geo), v9.3.0
