Improve shutdown method to avoid test failures #122554

smalyshev · 2025-02-13T23:19:35Z

elasticsearchmachine · 2025-02-13T23:20:42Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2025-02-13T23:20:42Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

dnhatn · 2025-02-13T23:26:41Z

...internalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterCancellationIT.java

        assertTrue(SimplePauseFieldPlugin.startEmitting.await(30, TimeUnit.SECONDS));
        SimplePauseFieldPlugin.allowEmitting.countDown();
        cluster(REMOTE_CLUSTER).close();
+        removeClusterFromGroup(REMOTE_CLUSTER);


Can we use something other than cluster(REMOTE_CLUSTER).close() (line 308) in the test? I suspect that is the problem.

Something other like what? I am open to suggestions. The test needs that cluster to go away. I tried to do removeRemoteCluster() but that doesn't work. And also that would be a different case anyway - what I am testing is what happens if remote just goes away in the middle.

what I am testing is what happens if remote just goes away in the middle.

Can we just disconnect from the remote cluster?

Well, just disconnect would be different because something might just reconnect back. Even if the code wouldn't do it now, it might later. Also, I am not sure how to properly do a disconnect in the middle of the request either.

We won't be using the new connections in ES|QL, even if we reconnect. If you replace cluster(REMOTE_CLUSTER).close() with something like below, it should work. Please let me know if you need more help on this.

for (var transportService : cluster(LOCAL_CLUSTER).getInstances(TransportService.class)) { transportService.getRemoteClusterService().getRemoteClusterConnection(REMOTE_CLUSTER).close(); }

Can't we use an actual network disruption here instead of closing the connection? That seems quite brittle, might work now but it seems rather invasive manually killing connections. I don't think we do that anywhere else or do we?

Well, the whole notion that connection drop is survivable is somewhat new I think - before, if any connection drops, it fails the whole thing, but here it still returns proper results, so it's a bit different. I like the idea to add a test for network disruption too, but I am not sure how to do it - @original-brownbear do you have suggestions/examples?

Maybe something along the lines of what we use in CCR would be more canonical:

for (TransportService transportService : getFollowerCluster().getDataOrMasterNodeInstances(TransportService.class)) { MockTransportService mockTransportService = (MockTransportService) transportService; transportServices.add(mockTransportService); mockTransportService.addSendBehavior((connection, requestId, action, request, options) -> { if (action.equals("actionYouCareAbout")) { connection.getNode(); // or maybe simply throw new ClosedChannelException(); } });

You could make the connection permanently fail on sending on either or both sides. That allows you to deterministically turn things on and off via either some flag that you switch or removing the send behavior.
That might be a bit more robust. You could obviously do it simpler than in my example and just fail every send :)

I probably can't fail every send because this test if for failure in runtime, and for the query to get to the runtime it needs to do some stuff first in planning time which needs to succeed. That's why we need pause fields here (otherwise we could just try to connect to non-existing cluster or something like that, but that would fail too early to test the runtime code). But failing per action could work, possibly, I will look into that. I think though these might be two separate tests because throwing inside connection service, strictly speaking, is not the same as connection going down - connection can go down in various places and the question is whether this would always lead to the same code path. E.g. what if it breaks not when sending but when having sent the request and waiting for the answer? Etc. I wonder if this all is covered by that code pattern.

You could simulate sending but failing to receive the response by adding a send behavior on the receiving cluster/node ? :) That seems easiest. The beauty of doing it with these hooks is that it's very deterministic and you might not even need that script pausing as you point out :). Feel free to ping me on another channel if you want help with that, I'm still around for a bit today

Remove cluster from group after shutting down

49808d2

smalyshev requested review from dnhatn and quux00 February 13, 2025 23:19

smalyshev added :Analytics/ES|QL AKA ESQL :Search Foundations/CCS v9.1.0 v8.19.0 >test Issues or PRs that are addressing/adding tests labels Feb 13, 2025

smalyshev marked this pull request as ready for review February 13, 2025 23:20

elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Feb 13, 2025

dnhatn reviewed Feb 13, 2025

View reviewed changes

Refactor test with different disconnect code

e8c8553

smalyshev force-pushed the fix-122336 branch from bca9d00 to e8c8553 Compare February 14, 2025 19:06

smalyshev changed the title ~~Remove cluster from group after shutting down~~ Improve shutdown method to avoid test failures Feb 14, 2025

Merge branch 'main' into fix-122336

939753c

alex-spies removed :Analytics/ES|QL AKA ESQL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Feb 17, 2025

smalyshev closed this Feb 20, 2025

javanna removed v8.19.0 v9.1.0 labels Feb 21, 2025

Improve shutdown method to avoid test failures #122554

Improve shutdown method to avoid test failures #122554

Uh oh!

Conversation

smalyshev commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Feb 13, 2025

Uh oh!

elasticsearchmachine commented Feb 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

smalyshev Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

smalyshev commented Feb 13, 2025 •

edited

Loading

smalyshev Feb 13, 2025 •

edited

Loading