Improve shutdown method to avoid test failures #122554
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
Pinging @elastic/es-search-foundations (Team:Search Foundations)
assertTrue(SimplePauseFieldPlugin.startEmitting.await(30, TimeUnit.SECONDS));
SimplePauseFieldPlugin.allowEmitting.countDown();
cluster(REMOTE_CLUSTER).close();
removeClusterFromGroup(REMOTE_CLUSTER);
Can we use something other than cluster(REMOTE_CLUSTER).close() (line 308) in the test? I suspect that is the problem.
Something else, like what? I am open to suggestions. The test needs that cluster to go away. I tried removeRemoteCluster(), but that doesn't work. And that would be a different case anyway - what I am testing is what happens if the remote just goes away in the middle.
> what I am testing is what happens if the remote just goes away in the middle.
Can we just disconnect from the remote cluster?
Well, just disconnecting would be different, because something might just reconnect. Even if the code wouldn't do it now, it might later. Also, I am not sure how to properly do a disconnect in the middle of the request either.
We won't be using the new connections in ES|QL, even if we reconnect. If you replace cluster(REMOTE_CLUSTER).close() with something like below, it should work. Please let me know if you need more help on this.
for (var transportService : cluster(LOCAL_CLUSTER).getInstances(TransportService.class)) {
    transportService.getRemoteClusterService().getRemoteClusterConnection(REMOTE_CLUSTER).close();
}
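In the test that would slot in roughly like this (a sketch, keeping the existing pause-plugin sequencing and only swapping out the close call):
assertTrue(SimplePauseFieldPlugin.startEmitting.await(30, TimeUnit.SECONDS));
SimplePauseFieldPlugin.allowEmitting.countDown();
// instead of cluster(REMOTE_CLUSTER).close(), drop only the remote connections:
for (var transportService : cluster(LOCAL_CLUSTER).getInstances(TransportService.class)) {
    transportService.getRemoteClusterService().getRemoteClusterConnection(REMOTE_CLUSTER).close();
}
removeClusterFromGroup(REMOTE_CLUSTER);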
Can't we use an actual network disruption here instead of closing the connection? That seems quite brittle - it might work now, but manually killing connections seems rather invasive. I don't think we do that anywhere else, or do we?
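For reference, this is roughly how a disruption is set up in single-cluster tests (node names are illustrative, and whether the same scheme can span the local and remote test clusters here is an open question):
// sketch: cut the link between two nodes of one test cluster
NetworkDisruption disruption = new NetworkDisruption(
    new NetworkDisruption.TwoPartitions(Set.of("local_node"), Set.of("remote_node")),
    NetworkDisruption.DISCONNECT
);
internalCluster().setDisruptionScheme(disruption);
disruption.startDisrupting();
// ... run the query while the link is down ...
disruption.stopDisrupting();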
Well, the whole notion that a connection drop is survivable is somewhat new, I think - before, if any connection dropped, it failed the whole thing, but here it still returns proper results, so it's a bit different. I like the idea of adding a test for network disruption too, but I am not sure how to do it - @original-brownbear, do you have suggestions/examples?
Maybe something along the lines of what we use in CCR would be more canonical:
for (TransportService transportService : getFollowerCluster().getDataOrMasterNodeInstances(TransportService.class)) {
    MockTransportService mockTransportService = (MockTransportService) transportService;
    transportServices.add(mockTransportService);
    mockTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
        if (action.equals("actionYouCareAbout")) {
            // e.g. throw new NodeDisconnectedException(connection.getNode(), action);
            // or maybe simply
            throw new ClosedChannelException();
        }
        connection.sendRequest(requestId, action, request, options);
    });
}
You could make the connection permanently fail on sending, on either or both sides. That lets you deterministically turn things on and off, either via some flag that you switch or by removing the send behavior.
That might be a bit more robust. You could obviously do it more simply than in my example and just fail every send :)
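And for the turning-off part, clearing the rules again (e.g. in teardown) is enough - a sketch, using the transportServices list collected above:
// restore normal transport behavior once the failure window is over
for (MockTransportService mockTransportService : transportServices) {
    mockTransportService.clearAllRules();
}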
I probably can't fail every send, because this test is for a failure at runtime, and for the query to get to runtime it needs to do some stuff first at planning time, which needs to succeed. That's why we need pause fields here (otherwise we could just try to connect to a non-existing cluster or something like that, but that would fail too early to test the runtime code). But failing per action could work, possibly; I will look into that. I think, though, these might be two separate tests, because throwing inside the connection service, strictly speaking, is not the same as the connection going down - a connection can go down in various places, and the question is whether this would always lead to the same code path. E.g. what if it breaks not when sending, but after having sent the request, while waiting for the answer? Etc. I wonder if this is all covered by that code pattern.
You could simulate sending but failing to receive the response by adding a send behavior on the receiving cluster/node? :) That seems easiest. The beauty of doing it with these hooks is that it's very deterministic, and you might not even need that script pausing, as you point out :). Feel free to ping me on another channel if you want help with that; I'm still around for a bit today.
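Roughly like this, for example (action name is illustrative; I'm showing the request-handling hook on the remote side here, a send behavior there would work similarly):
for (TransportService transportService : cluster(REMOTE_CLUSTER).getInstances(TransportService.class)) {
    MockTransportService mockTransportService = (MockTransportService) transportService;
    // the request makes it to the remote side, but the local caller only ever
    // sees a failure instead of a real response
    mockTransportService.addRequestHandlingBehavior("actionYouCareAbout", (handler, request, channel, task) -> {
        channel.sendResponse(new ClosedChannelException());
    });
}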
Force-pushed from bca9d00 to e8c8553.
Fixes #122336