
Conversation

@DaveCTurner
Contributor

Transport-layer timeouts are kinda trappy, particularly noting that they
do not (reliably) cancel the remote task or perform other necessary
cleanup. Really such behaviour should be the responsibility of the
caller rather than the transport layer itself.

This commit introduces an `ActionListener#addTimeout` utility to allow
adding timeout wrappers to arbitrary listeners, and uses it to replace
several transport-layer timeouts for requests that have no cancellation
functionality anyway.

Relates #123568
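
For illustration only, here is a minimal, self-contained sketch of the caller-level timeout idea. The names (`Listener`, `TimeoutWrappedListener`, the bare `RuntimeException`, the `ScheduledExecutorService` scheduling) are hypothetical and are not the actual `ActionListener#addTimeout` implementation; the point is that whichever of response, failure, or timeout happens first wins, and the wrapper then drops its reference to the underlying listener.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an ActionListener-style callback; not the Elasticsearch interface.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);
}

// Whichever of {response, failure, timeout} happens first wins; the losing outcome is
// silently dropped. Once completed, the wrapper releases its reference to the delegate,
// so a handler that remains registered awaiting the wire response no longer pins the
// caller's listener. (The real utility presumably uses the cluster's scheduler and an
// ElasticsearchTimeoutException rather than a bare RuntimeException.)
final class TimeoutWrappedListener<T> implements Listener<T> {
    private final AtomicReference<Listener<T>> delegate;

    private TimeoutWrappedListener(Listener<T> delegate) {
        this.delegate = new AtomicReference<>(delegate);
    }

    static <T> Listener<T> addTimeout(Listener<T> inner, long timeoutMillis, ScheduledExecutorService scheduler) {
        TimeoutWrappedListener<T> wrapped = new TimeoutWrappedListener<>(inner);
        scheduler.schedule(
            () -> wrapped.onFailure(new RuntimeException("timed out after " + timeoutMillis + "ms")),
            timeoutMillis,
            TimeUnit.MILLISECONDS
        );
        return wrapped;
    }

    @Override
    public void onResponse(T response) {
        Listener<T> inner = delegate.getAndSet(null); // complete at most once
        if (inner != null) {
            inner.onResponse(response);
        }
    }

    @Override
    public void onFailure(Exception e) {
        Listener<T> inner = delegate.getAndSet(null);
        if (inner != null) {
            inner.onFailure(e);
        }
    }
}
```

In this sketch a late response (or the losing timeout) is simply ignored, which is the point where any "response arrived after the timeout" logging would otherwise hook in.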

@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Network Http and internode communication implementations v9.2.0 labels Aug 12, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Aug 12, 2025
@DaveCTurner
Contributor Author

One slight difference in this approach is that timing out in the transport layer completes (and thus releases) the TransportResponseHandler itself, whereas with a caller-level timeout we keep hold of the entry in ResponseHandlers awaiting the response on the wire. The retained handler itself will hold a reference to the ElasticsearchTimeoutException, but no references to the underlying listener. Thus there's a little more memory overhead, but on the other hand it eliminates all the mysterious `Transport response handler not found of id [xxx]` log messages.

joshua-adams-1 previously approved these changes Aug 12, 2025
    nodeRequest,
-   TransportRequestOptions.timeout(request.getTimeout()),
+   TransportRequestOptions.EMPTY,
    new ActionListenerResponseHandler<>(listener, GetTaskResponse::new, EsExecutors.DIRECT_EXECUTOR_SERVICE)
Contributor

Will this parameter ultimately be removed?

Contributor Author

It is also used to select between the different channel types (see `org.elasticsearch.transport.TransportRequestOptions.Type`), so no, we still need it. But the goal is to remove `org.elasticsearch.transport.TransportRequestOptions#timeout` once it's unused.
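
To make that concrete, a simplified, hypothetical model of what the options carry (not the real class; the real channel types live in `org.elasticsearch.transport.TransportRequestOptions.Type`):

```java
import java.time.Duration;

// Hypothetical, simplified model of the options passed alongside a transport request: the
// timeout field is the part this PR aims to retire once unused, while the channel-type
// selection is why the options parameter itself still has to exist.
final class RequestOptionsSketch {
    // Loosely mirrors org.elasticsearch.transport.TransportRequestOptions.Type.
    enum ChannelType { BULK, PING, RECOVERY, REG, STATE }

    static final RequestOptionsSketch EMPTY = new RequestOptionsSketch(null, ChannelType.REG);

    final Duration timeout;        // nullable; the legacy transport-layer timeout
    final ChannelType channelType; // still needed to pick which wire channel carries the request

    RequestOptionsSketch(Duration timeout, ChannelType channelType) {
        this.timeout = timeout;
        this.channelType = channelType;
    }
}
```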

@DaveCTurner
Contributor Author

> it eliminates all the mysterious `Transport response handler not found of id [xxx]` log messages.

NB it also eliminates log messages of the form "Received response for a request that has timed out, sent [{}/{}ms] ago, timed out [{}/{}ms] ago, action [{}], node [{}], id [{}]". I'm in two minds about this. On the one hand, this isn't really very useful to know - it's up to the caller to deal with the timeout as it sees fit, and if that means logging something then it can do so. But on the other hand it isn't totally useless to see that a bunch of different requests to a particular node are all taking longer than expected. We could reinstate this.

@DaveCTurner DaveCTurner dismissed joshua-adams-1’s stale review August 12, 2025 10:41

Thanks Josh, I'm going to keep this one off of my ready-to-merge list for now until we've had a chance to discuss it as a team.

@ywangd
Member

ywangd commented Aug 13, 2025

> a caller-level timeout we keep hold of the entry in ResponseHandlers awaiting the response on the wire

If the response never comes back, how do we clean up the stale ResponseHandler?

@DaveCTurner
Contributor Author

DaveCTurner commented Aug 13, 2025

> If the response never comes back, how do we clean up the stale ResponseHandler?

Same way we deal with the vast majority of requests that have no transport-level timeout: eventually the connection will close for some reason (nothing lasts forever) and that completes any outstanding handlers with a NodeDisconnectedException.

To be clear, that would be a pretty serious bug; we already aren't relying on these timeouts to avoid leaking handlers like this.
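
As a rough, hypothetical sketch of that cleanup path (plain Java, not the actual ResponseHandlers class): pending handlers are keyed by request id, and closing the connection fails whatever is still outstanding, which is what surfaces as a NodeDisconnectedException in the real code.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: every in-flight request registers a handler under its id; a normal
// response completes it, and closing the connection fails everything still pending, so no
// transport-level timeout is needed to avoid leaking handlers.
final class PendingResponseHandlers {
    private final Map<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

    CompletableFuture<byte[]> register(long requestId) {
        CompletableFuture<byte[]> handler = new CompletableFuture<>();
        pending.put(requestId, handler);
        return handler;
    }

    void handleResponse(long requestId, byte[] responseBytes) {
        CompletableFuture<byte[]> handler = pending.remove(requestId);
        if (handler != null) {
            handler.complete(responseBytes);
        }
        // else: the request already completed some other way; nothing to do here.
    }

    void onConnectionClosed(Exception disconnectCause) {
        // Fail every handler that was still awaiting a response on the now-closed connection.
        for (Long requestId : pending.keySet()) {
            CompletableFuture<byte[]> handler = pending.remove(requestId);
            if (handler != null) {
                handler.completeExceptionally(disconnectCause);
            }
        }
    }
}
```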

@ywangd
Member

ywangd commented Aug 13, 2025

What do you think about doing a similar thing but keeping it at the transport layer? I think the benefit is handling most changes in one place (TransportService) so that callers do not need to concern themselves with common logic such as cancellation. Or do you see any timeout at the transport layer, regardless of its behaviour, as an issue in its own right?

@DaveCTurner
Contributor Author

> callers do not need to concern themselves with common logic such as cancellation

The trouble is that callers still have to concern themselves with cancellation. If they don't handle cancellation properly, the receiving node will just keep on processing the request even though the sending node has given up and moved on. That's how many callers behave today anyway - indeed all the callers touched by this PR do exactly that. The transport-layer timeout implementation pretty much encourages this sort of mistake. I'm ok with callers deciding that this is what they want, I guess, but there's no need to encode this antipattern in the transport layer.

Timeouts are usually an end-to-end thing. We don't need individual node-level requests to time out; we need the overall request to time out and cancel any remaining child tasks in reaction.

Note that the transport-layer timeouts were implemented some time before task cancellation. If we had done cancellation first, I do not think we'd have added transport-layer timeouts. Note also that it would make the transport layer quite a bit simpler if we could tear out everything to do with these timeouts.
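
For illustration, a minimal sketch of that end-to-end pattern with hypothetical names (`cancelRemainingChildTasks` is a stand-in for task cancellation, not an Elasticsearch API): one timeout on the overall operation which, if it fires first, cancels the remaining work and then reports the failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: the timeout applies to the whole operation, not to each node-level
// request, and firing it is what triggers cancellation of whatever is still running.
final class EndToEndTimeoutSketch {

    static CompletableFuture<String> withEndToEndTimeout(
        CompletableFuture<String> operation,
        Runnable cancelRemainingChildTasks,
        long timeoutMillis,
        ScheduledExecutorService scheduler
    ) {
        CompletableFuture<String> result = new CompletableFuture<>();

        // Normal completion (success or failure) simply propagates to the caller.
        operation.whenComplete((response, failure) -> {
            if (failure != null) {
                result.completeExceptionally(failure);
            } else {
                result.complete(response);
            }
        });

        // If the overall timeout wins the race, cancel the remaining child work so the
        // receiving nodes do not keep processing a request nobody is waiting for.
        scheduler.schedule(() -> {
            if (result.completeExceptionally(new TimeoutException("operation timed out"))) {
                cancelRemainingChildTasks.run();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);

        return result;
    }
}
```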

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 9, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates elastic#132713, elastic#123568
DaveCTurner added a commit that referenced this pull request Sep 15, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates #132713, #123568
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates elastic#132713, elastic#123568
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates elastic#132713, elastic#123568