Skip to content

Conversation

@smalyshev
Copy link
Contributor

@smalyshev smalyshev commented Feb 13, 2025

Fixes #122336

@smalyshev smalyshev requested review from dnhatn and quux00 February 13, 2025 23:19
@smalyshev smalyshev added :Analytics/ES|QL AKA ESQL :Search Foundations/CCS v9.1.0 v8.19.0 >test Issues or PRs that are addressing/adding tests labels Feb 13, 2025
@smalyshev smalyshev marked this pull request as ready for review February 13, 2025 23:20
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Feb 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

assertTrue(SimplePauseFieldPlugin.startEmitting.await(30, TimeUnit.SECONDS));
SimplePauseFieldPlugin.allowEmitting.countDown();
cluster(REMOTE_CLUSTER).close();
removeClusterFromGroup(REMOTE_CLUSTER);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something other than cluster(REMOTE_CLUSTER).close() (line 308) in the test? I suspect that is the problem.

Copy link
Contributor Author

@smalyshev smalyshev Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something other like what? I am open to suggestions. The test needs that cluster to go away. I tried to do removeRemoteCluster() but that doesn't work. And also that would be a different case anyway - what I am testing is what happens if remote just goes away in the middle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I am testing is what happens if remote just goes away in the middle.

Can we just disconnect from the remote cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, just disconnect would be different because something might just reconnect back. Even if the code wouldn't do it now, it might later. Also, I am not sure how to properly do a disconnect in the middle of the request either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't be using the new connections in ES|QL, even if we reconnect. If you replace cluster(REMOTE_CLUSTER).close() with something like below, it should work. Please let me know if you need more help on this.

for (var transportService : cluster(LOCAL_CLUSTER).getInstances(TransportService.class)) {
    transportService.getRemoteClusterService().getRemoteClusterConnection(REMOTE_CLUSTER).close();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use an actual network disruption here instead of closing the connection? That seems quite brittle, might work now but it seems rather invasive manually killing connections. I don't think we do that anywhere else or do we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the whole notion that connection drop is survivable is somewhat new I think - before, if any connection drops, it fails the whole thing, but here it still returns proper results, so it's a bit different. I like the idea to add a test for network disruption too, but I am not sure how to do it - @original-brownbear do you have suggestions/examples?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something along the lines of what we use in CCR would be more canonical:


        for (TransportService transportService : getFollowerCluster().getDataOrMasterNodeInstances(TransportService.class)) {
            MockTransportService mockTransportService = (MockTransportService) transportService;
            transportServices.add(mockTransportService);
            mockTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
                if (action.equals("actionYouCareAbout")) {
                    connection.getNode();
                    // or maybe simply
                    throw new ClosedChannelException();
                }
            });

You could make the connection permanently fail on sending on either or both sides. That allows you to deterministically turn things on and off via either some flag that you switch or removing the send behavior.
That might be a bit more robust. You could obviously do it simpler than in my example and just fail every send :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably can't fail every send because this test if for failure in runtime, and for the query to get to the runtime it needs to do some stuff first in planning time which needs to succeed. That's why we need pause fields here (otherwise we could just try to connect to non-existing cluster or something like that, but that would fail too early to test the runtime code). But failing per action could work, possibly, I will look into that. I think though these might be two separate tests because throwing inside connection service, strictly speaking, is not the same as connection going down - connection can go down in various places and the question is whether this would always lead to the same code path. E.g. what if it breaks not when sending but when having sent the request and waiting for the answer? Etc. I wonder if this all is covered by that code pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could simulate sending but failing to receive the response by adding a send behavior on the receiving cluster/node ? :) That seems easiest. The beauty of doing it with these hooks is that it's very deterministic and you might not even need that script pausing as you point out :). Feel free to ping me on another channel if you want help with that, I'm still around for a bit today

@smalyshev smalyshev changed the title Remove cluster from group after shutting down Improve shutdown method to avoid test failures Feb 14, 2025
@alex-spies alex-spies removed :Analytics/ES|QL AKA ESQL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Feb 17, 2025
@smalyshev smalyshev closed this Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Search Foundations/CCS Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch >test Issues or PRs that are addressing/adding tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] CrossClusterCancellationIT testCloseSkipUnavailable failing

6 participants