Reduce usage of transport-layer timeouts #132713
Conversation
Transport-layer timeouts are kinda trappy, particularly noting that they do not (reliably) cancel the remote task or perform other necessary cleanup. Really such behaviour should be the responsibility of the caller rather than the transport layer itself.

This commit introduces an `ActionListener#addTimeout` utility to allow adding timeout wrappers to arbitrary listeners, and uses it to replace several transport-layer timeouts for requests that have no cancellation functionality anyway.

Relates elastic#123568
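For illustration, here is a minimal, self-contained sketch of what such a timeout-wrapping utility can look like. The names and signatures here (`ActionListener`, `addTimeout`, the `ScheduledExecutorService` parameter) are simplified stand-ins, not the actual Elasticsearch API, which integrates with the node's thread pool and listener machinery.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified stand-in for a listener-level timeout utility.
public class TimeoutListenerSketch {

    // Minimal listener interface, modelled loosely on Elasticsearch's ActionListener.
    interface ActionListener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Wrap `delegate` so that whichever of response/failure/timeout happens first
    // wins; later completions are dropped, and a timely completion cancels the timer.
    static <T> ActionListener<T> addTimeout(
            ActionListener<T> delegate, ScheduledExecutorService scheduler, long timeoutMillis) {
        AtomicBoolean done = new AtomicBoolean();
        ScheduledFuture<?> timeout = scheduler.schedule(() -> {
            if (done.compareAndSet(false, true)) {
                delegate.onFailure(new TimeoutException("listener timed out after " + timeoutMillis + "ms"));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        return new ActionListener<T>() {
            @Override public void onResponse(T response) {
                if (done.compareAndSet(false, true)) {
                    timeout.cancel(false);
                    delegate.onResponse(response);
                }
            }
            @Override public void onFailure(Exception e) {
                if (done.compareAndSet(false, true)) {
                    timeout.cancel(false);
                    delegate.onFailure(e);
                }
            }
        };
    }
}
```

Note that a wrapper like this only completes the local listener on timeout; cancelling the remote work remains the caller's responsibility, which is exactly the point the PR makes.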
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
One slight difference in this approach is that timing out in the transport layer completes (and thus releases) the
```diff
 nodeRequest,
-TransportRequestOptions.timeout(request.getTimeout()),
+TransportRequestOptions.EMPTY,
 new ActionListenerResponseHandler<>(listener, GetTaskResponse::new, EsExecutors.DIRECT_EXECUTOR_SERVICE)
```
Will this parameter ultimately be removed?
It is also used to select between the different channel types (see `org.elasticsearch.transport.TransportRequestOptions.Type`), so no, we still need it. But the goal is to remove `org.elasticsearch.transport.TransportRequestOptions#timeout` once it's unused.
NB it also eliminates log messages of the form
Thanks Josh, I'm going to keep this one off of my ready-to-merge list for now until we've had a chance to discuss it as a team.
If the response never comes back, how do we clean up the stale `ResponseHandler`?
Same way we deal with the vast majority of requests that have no transport-level timeout: eventually the connection will close for some reason (nothing lasts forever) and that completes any outstanding handlers with a TBC. If that didn't happen it'd be a pretty serious bug; we already aren't relying on these timeouts to avoid leaking handlers like this.
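The cleanup path described here can be illustrated with a small sketch (hypothetical names, not the real `TransportService` internals): pending handlers live in a map keyed by request id, and closing the connection exceptionally completes every handler that is still outstanding, so nothing leaks even without per-request timeouts.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-connection pending-handler bookkeeping; the real
// transport code differs in detail, but the cleanup idea is the same.
public class PendingHandlersSketch {

    interface ResponseHandler {
        void handleResponse(String response);
        void handleException(Exception e);
    }

    private final Map<Long, ResponseHandler> pending = new ConcurrentHashMap<>();

    void register(long requestId, ResponseHandler handler) {
        pending.put(requestId, handler);
    }

    void handleResponse(long requestId, String response) {
        ResponseHandler handler = pending.remove(requestId);
        if (handler != null) {
            handler.handleResponse(response);
        }
    }

    // Connection dropped: every still-outstanding handler is completed
    // exceptionally, so handlers cannot leak even without per-request timeouts.
    void onConnectionClosed() {
        for (Iterator<ResponseHandler> it = pending.values().iterator(); it.hasNext(); ) {
            ResponseHandler handler = it.next();
            it.remove();
            handler.handleException(new Exception("node disconnected"));
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```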
What do you think about doing a similar thing but keeping it at the transport layer? I think the benefit is handling most changes in one place (
The trouble is that callers still have to concern themselves with cancellation. If they don't handle cancellation properly, the receiving node will just keep on processing the request even though the sending node has given up and moved on. That's how many callers behave today anyway; indeed all the callers touched by this PR do exactly that. The transport-layer timeout implementation pretty much encourages this sort of mistake. I'm ok with callers deciding that this is what they want, I guess, but there's no need to encode this antipattern in the transport layer.

Timeouts are usually an end-to-end thing. We don't need individual node-level requests to time out, we need the overall request to time out and cancel any remaining child tasks in reaction.

Note that the transport-layer timeouts were implemented some time before task cancellation. If we had done cancellation first, I do not think we'd have added transport-layer timeouts. Note also that it would make the transport layer quite a bit simpler if we could tear out everything to do with these timeouts.
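The end-to-end pattern argued for here, timing out at the caller and cancelling the remaining remote work as a consequence, can be sketched as follows. All names in this sketch (`RemoteTask`, `awaitWithCancellation`) are hypothetical; in Elasticsearch the cancellation side is the job of the task-management framework, not the transport layer.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical sketch: the caller owns the timeout, and on timeout it cancels
// the remote work instead of silently abandoning it.
public class EndToEndTimeoutSketch {

    // Handle for cancelling work that is running remotely.
    interface RemoteTask {
        void cancel(String reason);
    }

    // `send` starts the remote work, arranges for the supplied future to be
    // completed with the eventual response, and returns a cancellation handle.
    static String awaitWithCancellation(
            Function<CompletableFuture<String>, RemoteTask> send,
            long timeoutMillis) throws Exception {
        CompletableFuture<String> future = new CompletableFuture<>();
        RemoteTask task = send.apply(future);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The end-to-end part: don't just give up locally, clean up remotely too.
            task.cancel("caller timed out after " + timeoutMillis + "ms");
            throw e;
        }
    }
}
```

The design point is that the timeout and the cancellation live in the same place, at the caller, rather than the transport layer timing out a single hop while the remote node keeps working.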
There's really no need to time out so enthusiastically here, we can wait for as long as it takes to receive a list of other discovery targets from the remote peer. This commit removes the timeout, deferring failure detection down to the TCP level (e.g. keepalives) as managed by the OS. Relates elastic#132713, elastic#123568