Conversation

@smalyshev (Contributor) commented Sep 4, 2025

Make verifiers aware of whether we are verifying a general or a local plan. This can be useful in the future. It also removes the skipRemoteEnrichVerification parameter, which is only needed for an ad-hoc fix and is not required for the main logic.

Currently this check should never fail, but when we refactor Enrich handling we want to be sure that we don't accidentally break the semantics and perform a remote enrich on the coordinator side.

Follow-up: add a mode flag and ExecutesOn to LookupJoinExec.

@smalyshev smalyshev requested a review from alex-spies September 4, 2025 21:45
@smalyshev smalyshev marked this pull request as ready for review September 4, 2025 21:46
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 4, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-analytical-engine (Team:Analytics)

@smalyshev smalyshev marked this pull request as draft September 4, 2025 21:49
@smalyshev smalyshev marked this pull request as ready for review September 4, 2025 21:53
@alex-spies (Contributor) left a comment:

Yes, I think this makes sense! I think it's not entirely correct in case of upstream lookup joins, though, see below.

public void postPhysicalOptimizationVerification(Failures failures) {
    if (mode == Enrich.Mode.REMOTE) {
        // check that there is no FragmentExec in the child plan - that would mean we're on the wrong side of the remote boundary
        child().forEachDown(FragmentExec.class, f -> {
@alex-spies (Contributor) commented Sep 5, 2025:

Heya, you couldn't really have known this, but we're in the process of improving the planning of lookup joins. We made LookupJoinExec generally contain a FragmentExec in the right-hand branch. So this check is wrong in the case of upstream (remote) lookup joins.

I'd add a test for this case, and I think we need to go and ignore fragments that belong to lookup joins. One way to do that is to additionally traverse the fragment and ensure its EsRelation is not in lookup mode.
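A minimal sketch of what that extra traversal could look like. The `Plan`, `Fragment`, and `Relation` types below are hypothetical stand-ins for the ES|QL plan classes, not the real ones; only the shape of the check (skip fragments whose relations are all in lookup mode) follows the suggestion above:

```java
import java.util.List;

// Hedged sketch: ignore FragmentExecs that belong to a lookup join by peeking
// into the fragment and requiring its relation(s) to be in lookup mode.
// All types here are minimal stand-ins, not the actual ES|QL classes.
public class RemoteEnrichFragmentCheck {
    interface Plan { List<Plan> children(); }

    record Relation(boolean lookupMode) implements Plan {
        public List<Plan> children() { return List.of(); }
    }

    record Fragment(Plan fragment) implements Plan {
        public List<Plan> children() { return List.of(fragment); }
    }

    record Node(List<Plan> kids) implements Plan {
        public List<Plan> children() { return kids; }
    }

    // A fragment "belongs to" a lookup join when every relation inside it is
    // a lookup-mode relation (e.g. EsRelation[...][LOOKUP] in the plan dump).
    static boolean isLookupFragment(Fragment f) {
        return relations(f.fragment()).stream().allMatch(Relation::lookupMode);
    }

    static List<Relation> relations(Plan p) {
        if (p instanceof Relation r) return List.of(r);
        return p.children().stream().flatMap(c -> relations(c).stream()).toList();
    }

    // Verifier condition: a remote ENRICH is misplaced only if the plan below
    // it contains a fragment that is NOT a lookup-join fragment.
    public static boolean hasNonLookupFragment(Plan p) {
        if (p instanceof Fragment f) return isLookupFragment(f) == false;
        return p.children().stream().anyMatch(RemoteEnrichFragmentCheck::hasNonLookupFragment);
    }
}
```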

Contributor:

Alternatively, you could extract the "leftmost leaf" by calling collectLeaves and checking only the first leaf. This is the main source of the plan if the plan is made only from unary and binary execs, because our binary execs have the "special" side on the right.

I also realized there can be a MergeExec which indicates that we're after a FORK. That'd also be wrong for a remote enrich as the merge part of a fork is executed on the coordinator.

Contributor Author:

But if we see LookupJoinExec, aren't we on the coordinator side already anyway?

> I also realized there can be a MergeExec which indicates that we're after a FORK

Can there be MergeExec without fragment? From what I'm seeing, both branches of the MergeExec eventually have fragments, so we'd still reach it anyway.

Contributor Author:

In fact, I am starting to suspect we should never see EnrichExec with remote mode in the global plan regardless of anything else (though we might in the local plan), but I am not 100% sure yet, and I'm also not sure how I would even know whether it is a local or a global plan. But checking for FragmentExec seems to be a reasonable proxy for "is this a global plan?", no?

@alex-spies (Contributor) commented Sep 8, 2025:

> But if we see LookupJoinExec, aren't we on the coordinator side already anyway?

No. Here's an example in our tests.

> Can there be MergeExec without fragment?

I'm not super familiar with FORK/Merge but I see from Mapper.java that each branch is just planned normally. There are plans without fragments, like ROW.

Which reminds me that I don't know how we handle remote ENRICH with ROW. Pathological example, though :)

@alex-spies (Contributor) commented Sep 8, 2025:

> But checking for FragmentExec seems to be a reasonable proxy for "is this a global plan?", no?

It used to, but we have more complex plans now. Lookup joins contain fragments in the right branch.

I think it's more reliable to check for a local plan by walking the plan down to the leaves and checking that the leaf is an EsQueryExec. In case of joins or merges there will be more than one leaf, but I think the best current condition is "there is exactly one EsQueryExec, and it's the leftmost leaf".

If our plans grow in complexity, we might have to explicitly mark them as data node plan, lookup index plan, coordinator plan etc. This is knowledge that is really easier to write into the plan than reverse engineer from its shape.
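The "exactly one EsQueryExec, and it's the leftmost leaf" heuristic could be sketched like this. The node types are hypothetical stand-ins; only collectLeaves mirrors a method name actually mentioned in this thread:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the "walk to the leaves" heuristic: a plan looks like a
// data node (local) plan iff it contains exactly one EsQueryExec and that
// EsQueryExec is the leftmost leaf. Node types are minimal stand-ins.
public class DataNodePlanCheck {
    interface Plan { List<Plan> children(); }

    record EsQueryExec() implements Plan {
        public List<Plan> children() { return List.of(); }
    }

    record OtherLeaf() implements Plan {  // e.g. a lookup fragment or ROW source
        public List<Plan> children() { return List.of(); }
    }

    record BinaryExec(Plan left, Plan right) implements Plan {
        // binary execs keep the "special" side (lookup index, fork branch) on the right
        public List<Plan> children() { return List.of(left, right); }
    }

    // Depth-first, left-to-right leaf collection - approximates collectLeaves().
    static void collectLeaves(Plan p, List<Plan> out) {
        if (p.children().isEmpty()) {
            out.add(p);
            return;
        }
        for (Plan c : p.children()) {
            collectLeaves(c, out);
        }
    }

    public static boolean looksLikeDataNodePlan(Plan p) {
        List<Plan> leaves = new ArrayList<>();
        collectLeaves(p, leaves);
        long esQueries = leaves.stream().filter(l -> l instanceof EsQueryExec).count();
        return esQueries == 1 && leaves.get(0) instanceof EsQueryExec;
    }
}
```

For a lookup join with the lookup index on the right, the leftmost leaf is the EsQueryExec and the check passes; a plan whose main source is not an EsQueryExec fails it.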

Contributor Author:

Remote Enrich is not allowed after FORK right now, so MergeExec cannot appear under a remote Enrich, I think.

Contributor Author:

I tried this query:

FROM sample_data
| WHERE message == "Connected to 10.1.0.1" OR message == "Connected to 10.1.0.2"
| EVAL language_code = "1", client_ip=to_string(client_ip)
| LOOKUP JOIN clientips_lookup ON client_ip
| ENRICH _remote:languages_policy ON language_code

and I get this plan:

LimitExec[1000[INTEGER],null]
\_ExchangeExec[[],false]
  \_FragmentExec[filter=null, estimatedRowSize=0, reducer=[], fragment=[<>
Enrich[REMOTE,languages_policy[KEYWORD],language_code{r}#180,{"match":{"indices":[],"match_field":"language_code","enric
h_fields":["language_name"]}},{=.enrich-languages_policy-1750721426622},[language_name{r}#194]]
\_Limit[1000[INTEGER],true]
  \_Join[LEFT,[client_ip{r}#183],[client_ip{r}#183],[client_ip{f}#190]]
    |_Eval[[1[KEYWORD] AS language_code#180, TOSTRING(client_ip{f}#188) AS client_ip#183]]
    | \_Limit[1000[INTEGER],false]
    |   \_Filter[IN(Connected to 10.1.0.1[KEYWORD],Connected to 10.1.0.2[KEYWORD],message{f}#189)]
    |     \_EsRelation[sample_data][@timestamp{f}#186, client_ip{f}#188, event_duration..]
    \_EsRelation[clientips_lookup][LOOKUP][client_ip{f}#190, env{f}#191]<>]]

Which does not have any fragments inside the join (and in fact doesn't even have an EnrichExec). And this particular verifier does not seem to be called on the local part of the plan? Are there some unmerged parts that I need to test with?

Contributor:

The plan you posted is the coordinator physical plan. There is also the data node physical plan. The physical verifier runs against both.

I checked out your branch and ran this query, then had a look at the trace logs. It shows a LookupJoinExec with a FragmentExec as right child:

ExchangeSinkExec[[@timestamp{f}#30, event_duration{f}#31, message{f}#33, language_code{r}#24, client_ip{r}#27, env{f}#35, language_name{r}#38],false]
\_ProjectExec[[@timestamp{f}#30, event_duration{f}#31, message{f}#33, language_code{r}#24, client_ip{r}#27, env{f}#35, language_name{r}#38]]
  \_FieldExtractExec[@timestamp{f}#30, event_duration{f}#31, message{f}#..]<[],[]>
    \_EnrichExec[REMOTE,match,language_code{r}#24,languages_policy,language_code,{=.enrich-languages_policy-1757401858136},[language_
name{r}#38]]
      \_LimitExec[1000[INTEGER],170]
        \_LookupJoinExec[[client_ip{r}#27],[client_ip{f}#34],[env{f}#35]]
          |_EvalExec[[1[KEYWORD] AS language_code#24, TOSTRING(client_ip{f}#32) AS client_ip#27]]
          | \_FieldExtractExec[client_ip{f}#32]<[],[]>
          |   \_EsQueryExec[sample_data], indexMode[standard], [_doc{f}#39], limit[1000], sort[] estimatedRowSize[286] queryBuilderAndTags [[QueryBuilderAndTags{queryBuilder=[{...}], tags=[]}]]
          \_FragmentExec[filter=null, estimatedRowSize=0, reducer=[], fragment=[<>
EsRelation[clientips_lookup][LOOKUP][client_ip{f}#34, env{f}#35]<>]]]]]

@alex-spies (Contributor) left a comment:

I like where this is going, but I think the current PR is not fully correct yet (see below). I left a suggestion. Let me know if this works for you.

@smalyshev smalyshev changed the title Add check that remote enrich stays on remote side Refactor verifiers and add remote check Sep 9, 2025
    return false;

boolean hasRemoteEnrich(LogicalPlan optimizedPlan) {
    var enriches = optimizedPlan.collectFirstChildren(Enrich.class::isInstance);
    return enriches.isEmpty() == false && ((Enrich) enriches.get(0)).mode() == Enrich.Mode.REMOTE;
Contributor:

Could you please help me understand why we're only checking the first enrich?

Contributor Author:

Tbh I am not sure. But I haven't changed that part, it's exactly the same as it was before. @bpintea may know more about this?

@bpintea (Contributor) commented Sep 10, 2025:

https://www.elastic.co/docs/reference/query-languages/esql/esql-cross-clusters#esql-multi-enrich states: "A _remote enrich command can’t be executed after a _coordinator enrich command."
This is checked in Enrich#checkForPlansForbiddenBeforeRemoteEnrich.

@smalyshev, it might be worth adding a (now missing) comment about why the [0] is chosen to be tried as a REMOTE enrich. 🙏

Contributor Author:

It could be ANY though, no? ANY does not conflict with other types.

Contributor:

This is confusing, but I think we only skip if the topmost ENRICH in the leftmost branch is remote - this is the only situation when the remote enrich hack can lead to shadowing issues that I can think of, because then the remote enrich is the top enrich node in the local plan.

We could skip more widely by looking for any remote enriches; but I think we want to fix the remote enrich hack soon, anyway, so maybe it's better to invest the effort in the real fix.

@alex-spies (Contributor) left a comment:

Very nice, thanks Stas!


if (p instanceof ExecutesOn ex && ex.executesOn() == ExecutesOn.ExecuteLocation.REMOTE) {
// This check applies only for general physical plans (isLocal == false)
if (isLocal == false && p instanceof ExecutesOn ex && ex.executesOn() == ExecutesOn.ExecuteLocation.REMOTE) {
Contributor:

We could also add a check for the converse - after physical optimization on the remote, there shouldn't be any coordinator-only nodes in the plan.
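A minimal sketch of that converse check, using an ExecutesOn-style marker as discussed in this PR. The types below are hypothetical stand-ins, not the actual ES|QL interfaces:

```java
import java.util.List;

// Hedged sketch of the converse check: after physical optimization of the
// remote (data node) plan, no node may be marked coordinator-only. The
// marker mirrors the ExecutesOn idea from this PR, but the types are stand-ins.
public class CoordinatorOnlyCheck {
    enum ExecuteLocation { COORDINATOR, REMOTE, ANY }

    interface Plan {
        List<Plan> children();
        default ExecuteLocation executesOn() { return ExecuteLocation.ANY; }
    }

    record Exec(ExecuteLocation loc, List<Plan> kids) implements Plan {
        public List<Plan> children() { return kids; }
        public ExecuteLocation executesOn() { return loc; }
    }

    // Run this only when verifying the local/remote plan (isLocal == true):
    // any coordinator-only node there means the plan is wrong.
    public static boolean hasCoordinatorOnlyNode(Plan p) {
        if (p.executesOn() == ExecuteLocation.COORDINATOR) return true;
        return p.children().stream().anyMatch(CoordinatorOnlyCheck::hasCoordinatorOnlyNode);
    }
}
```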


private LogicalVerifier() {}

public static LogicalVerifier getLocalVerifier() {
    return new LogicalVerifier(true);
Contributor:

nit: instead of re-instantiating this all the time, we could hand out a singleton. Not sure it makes a difference, though.

We could also make these methods on PostOptimizationPhasePlanVerifier, as they work the same for the logical and physical verifiers.
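The singleton suggestion could be sketched as follows. The `Verifier` class below is a hypothetical stand-in for LogicalVerifier/PhysicalVerifier, just illustrating one pre-built instance per flavour instead of a fresh object per query:

```java
// Hedged sketch of the singleton suggestion: hand out one pre-built instance
// per flavour (local vs. coordinator). Safe only because the verifier holds
// no per-query state. "Verifier" is a stand-in, not the real ES|QL class.
public class VerifierSingletons {
    public static final class Verifier {
        public final boolean isLocal;

        private Verifier(boolean isLocal) {  // private: no ad-hoc instances
            this.isLocal = isLocal;
        }

        // shared, eagerly-built instances for both roles a node can play
        public static final Verifier LOCAL = new Verifier(true);
        public static final Verifier COORDINATOR = new Verifier(false);
    }
}
```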

Contributor:

> instead of re-instantiating this all the time, we could hand out a singleton.

+1. I guess we'll want this for both flavours (logical, physical).

> Not sure it makes a difference, though.

Maybe not performance-wise, but still has a "better feel". Besides, a node can play both roles, so both types (local/coordinator) would eventually be used.

Contributor Author:

Not sure why the variable it gets assigned to is not static anyway? Probably we should make it so?

    return new LogicalVerifier(true);
}

public static LogicalVerifier getGeneralVerifier() {
Contributor:

I think so far we only have a "local" thing (verifier, mapper, optimizer etc.) and the unqualified other variant, which is the coordinator one. Let's avoid introducing a notion of a "general" thing (which wouldn't be "general", as it wouldn't apply locally :) ).
We should use "coordinator" wherever we need to denote a non-local execution.

Contributor Author:

Well, maybe "global" because it applies to the whole plan?

Contributor:

> Well, maybe "global" because it applies to the whole plan?

My thinking is that there isn't a "global plan", executed "globally, everywhere": there's a coordinator plan and one or more (as the pipelines break) node plans, no?



abstract boolean skipVerification(P optimizedPlan, boolean skipRemoteEnrichVerification);

// This is a temporary workaround to skip verification when there is a remote enrich, due to a bug
abstract boolean hasRemoteEnrich(P optimizedPlan);
Contributor:

Nit: hasRemoteEnrich() is a bit of a misleading name, since it's not actually a property of the verifier, but of the thing the verifier verifies. And this method's sole role is to stop/skip the verification. skipVerification() could be repurposed for other cases, but even if we'll remove it after we fix the bug[*], I'd find something like "skipOnRemoteEnrich()" (or some other variation) a bit more suggestive.

[*] "due to a bug" should contain a pointer to announced bug. We could add just a (#118531) if the URL is too long.

    assertCCSExecutionInfoDetails(executionInfo);
}

// No renames, no KEEP
Contributor:

Might be good to make a new test out of this, possibly adding a reference to why we add this test (optional). And/or detail why KEEP is "detrimental" to the test. 🙏


public final class LogicalVerifier extends PostOptimizationPhasePlanVerifier<LogicalPlan> {

    public static final LogicalVerifier INSTANCE = new LogicalVerifier();
Contributor:

When working on micro-optimizations I remember seeing some cost associated with initializing the lists of rules in various optimizers. Here LogicalVerifier looks much simpler, but I wonder if we could reuse a single static object for the local verifier and one for the general one? This would avoid creating new instances for each query execution. I acknowledge it is much cheaper here than the actual rules initialization elsewhere.

@smalyshev smalyshev marked this pull request as draft October 3, 2025 22:08
@smalyshev (Contributor Author) commented:

This is absorbed into #134967

@smalyshev smalyshev closed this Oct 8, 2025
Labels
:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0