
Conversation

@DaveCTurner
Contributor

Transport-layer timeouts are kinda trappy, particularly noting that they
do not (reliably) cancel the remote task or perform other necessary
cleanup. Really such behaviour should be the responsibility of the
caller rather than the transport layer itself.

This commit introduces an `ActionListener#addTimeout` utility to allow
adding timeout wrappers to arbitrary listeners, and uses it to replace
several transport-layer timeouts for requests that have no cancellation
functionality anyway.

Relates #123568
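
For illustration only, here is a minimal, self-contained sketch of the caller-level timeout idea. The names (`Listener`, `TimeoutWrappedListener`, the bare `RuntimeException`, the `ScheduledExecutorService` scheduling) are hypothetical and are not the actual `ActionListener#addTimeout` implementation; the point is that whichever of response, failure, or timeout happens first wins, and the wrapper then drops its reference to the underlying listener.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an ActionListener-style callback; not the Elasticsearch interface.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);
}

// Whichever of {response, failure, timeout} happens first wins; the losing outcome is
// silently dropped. Once completed, the wrapper releases its reference to the delegate,
// so a handler that remains registered awaiting the wire response no longer pins the
// caller's listener. (The real utility presumably uses the cluster's scheduler and an
// ElasticsearchTimeoutException rather than a bare RuntimeException.)
final class TimeoutWrappedListener<T> implements Listener<T> {
    private final AtomicReference<Listener<T>> delegate;

    private TimeoutWrappedListener(Listener<T> delegate) {
        this.delegate = new AtomicReference<>(delegate);
    }

    static <T> Listener<T> addTimeout(Listener<T> inner, long timeoutMillis, ScheduledExecutorService scheduler) {
        TimeoutWrappedListener<T> wrapped = new TimeoutWrappedListener<>(inner);
        scheduler.schedule(
            () -> wrapped.onFailure(new RuntimeException("timed out after " + timeoutMillis + "ms")),
            timeoutMillis,
            TimeUnit.MILLISECONDS
        );
        return wrapped;
    }

    @Override
    public void onResponse(T response) {
        Listener<T> inner = delegate.getAndSet(null); // complete at most once
        if (inner != null) {
            inner.onResponse(response);
        }
    }

    @Override
    public void onFailure(Exception e) {
        Listener<T> inner = delegate.getAndSet(null);
        if (inner != null) {
            inner.onFailure(e);
        }
    }
}
```

In this sketch a late response (or the losing timeout) is simply ignored, which is the point where any "response arrived after the timeout" logging would otherwise hook in.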

@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Network Http and internode communication implementations v9.2.0 labels Aug 12, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Aug 12, 2025
@DaveCTurner
Contributor Author

One slight difference in this approach is that timing out in the transport layer completes (and thus releases) the TransportResponseHandler itself, whereas with a caller-level timeout we keep hold of the entry in ResponseHandlers awaiting the response on the wire. The retained handler itself will hold a reference to the ElasticsearchTimeoutException, but no references to the underlying listener. Thus there's a little more memory overhead, but on the other hand it eliminates all the mysterious `Transport response handler not found of id [xxx]` log messages.

joshua-adams-1 previously approved these changes Aug 12, 2025
    nodeRequest,
-   TransportRequestOptions.timeout(request.getTimeout()),
+   TransportRequestOptions.EMPTY,
    new ActionListenerResponseHandler<>(listener, GetTaskResponse::new, EsExecutors.DIRECT_EXECUTOR_SERVICE)
Contributor

Will this parameter ultimately be removed?

Contributor Author

It is also used to select between the different channel types (see `org.elasticsearch.transport.TransportRequestOptions.Type`), so no, we still need it. But the goal is to remove `org.elasticsearch.transport.TransportRequestOptions#timeout` once it's unused.
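
To make that concrete, a simplified, hypothetical model of what the options carry (not the real class; the real channel types live in `org.elasticsearch.transport.TransportRequestOptions.Type`):

```java
import java.time.Duration;

// Hypothetical, simplified model of the options passed alongside a transport request: the
// timeout field is the part this PR aims to retire once unused, while the channel-type
// selection is why the options parameter itself still has to exist.
final class RequestOptionsSketch {
    // Loosely mirrors org.elasticsearch.transport.TransportRequestOptions.Type.
    enum ChannelType { BULK, PING, RECOVERY, REG, STATE }

    static final RequestOptionsSketch EMPTY = new RequestOptionsSketch(null, ChannelType.REG);

    final Duration timeout;        // nullable; the legacy transport-layer timeout
    final ChannelType channelType; // still needed to pick which wire channel carries the request

    RequestOptionsSketch(Duration timeout, ChannelType channelType) {
        this.timeout = timeout;
        this.channelType = channelType;
    }
}
```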

@DaveCTurner
Contributor Author

> it eliminates all the mysterious `Transport response handler not found of id [xxx]` log messages.

NB it also eliminates log messages of the form "Received response for a request that has timed out, sent [{}/{}ms] ago, timed out [{}/{}ms] ago, action [{}], node [{}], id [{}]". I'm in two minds about this. On the one hand, this isn't really very useful to know - it's up to the caller to deal with the timeout as it sees fit, and if that means logging something then it can do so. But on the other hand it isn't totally useless to see that a bunch of different requests to a particular node are all taking longer than expected. We could reinstate this.

@DaveCTurner DaveCTurner dismissed joshua-adams-1’s stale review August 12, 2025 10:41

Thanks Josh, I'm going to keep this one off of my ready-to-merge list for now until we've had a chance to discuss it as a team.

@ywangd
Member

ywangd commented Aug 13, 2025

> a caller-level timeout we keep hold of the entry in ResponseHandlers awaiting the response on the wire

If the response never comes back, how do we clean up the stale ResponseHandler?

@DaveCTurner
Contributor Author

DaveCTurner commented Aug 13, 2025

> If the response never comes back, how do we clean up the stale ResponseHandler?

Same way we deal with the vast majority of requests that have no transport-level timeout: eventually the connection will close for some reason (nothing lasts forever) and that completes any outstanding handlers with a NodeDisconnectedException.

To be clear, that would be a pretty serious bug; we already aren't relying on these timeouts to avoid leaking handlers like this.
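
As a rough, hypothetical sketch of that cleanup path (plain Java, not the actual ResponseHandlers class): pending handlers are keyed by request id, and closing the connection fails whatever is still outstanding, which is what surfaces as a NodeDisconnectedException in the real code.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: every in-flight request registers a handler under its id; a normal
// response completes it, and closing the connection fails everything still pending, so no
// transport-level timeout is needed to avoid leaking handlers.
final class PendingResponseHandlers {
    private final Map<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

    CompletableFuture<byte[]> register(long requestId) {
        CompletableFuture<byte[]> handler = new CompletableFuture<>();
        pending.put(requestId, handler);
        return handler;
    }

    void handleResponse(long requestId, byte[] responseBytes) {
        CompletableFuture<byte[]> handler = pending.remove(requestId);
        if (handler != null) {
            handler.complete(responseBytes);
        }
        // else: the request already completed some other way; nothing to do here.
    }

    void onConnectionClosed(Exception disconnectCause) {
        // Fail every handler that was still awaiting a response on the now-closed connection.
        for (Long requestId : pending.keySet()) {
            CompletableFuture<byte[]> handler = pending.remove(requestId);
            if (handler != null) {
                handler.completeExceptionally(disconnectCause);
            }
        }
    }
}
```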

@ywangd
Member

ywangd commented Aug 13, 2025

What do you think about doing a similar thing but keeping it at the transport layer? I think the benefit is handling most changes in one place (TransportService) so that callers do not need to concern themselves with common logic such as cancellation. Or do you see any timeout at the transport layer, regardless of its behaviour, as an issue in its own right?

@DaveCTurner
Contributor Author

> callers do not need to concern themselves with common logic such as cancellation

The trouble is that callers still have to concern themselves with cancellation. If they don't handle cancellation properly, the receiving node will just keep on processing the request even though the sending node has given up and moved on. That's how many callers behave today anyway - indeed all the callers touched by this PR do exactly that. The transport-layer timeout implementation pretty much encourages this sort of mistake. I'm ok with callers deciding that this is what they want, I guess, but there's no need to encode this antipattern in the transport layer.

Timeouts are usually an end-to-end thing. We don't need individual node-level requests to time out; we need the overall request to time out and cancel any remaining child tasks in reaction.

Note that the transport-layer timeouts were implemented some time before task cancellation. If we had done cancellation first, I do not think we'd have added transport-layer timeouts. Note also that it would make the transport layer quite a bit simpler if we could tear out everything to do with these timeouts.
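
For illustration, a minimal sketch of that end-to-end pattern with hypothetical names (`cancelRemainingChildTasks` is a stand-in for task cancellation, not an Elasticsearch API): one timeout on the overall operation which, if it fires first, cancels the remaining work and then reports the failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: the timeout applies to the whole operation, not to each node-level
// request, and firing it is what triggers cancellation of whatever is still running.
final class EndToEndTimeoutSketch {

    static CompletableFuture<String> withEndToEndTimeout(
        CompletableFuture<String> operation,
        Runnable cancelRemainingChildTasks,
        long timeoutMillis,
        ScheduledExecutorService scheduler
    ) {
        CompletableFuture<String> result = new CompletableFuture<>();

        // Normal completion (success or failure) simply propagates to the caller.
        operation.whenComplete((response, failure) -> {
            if (failure != null) {
                result.completeExceptionally(failure);
            } else {
                result.complete(response);
            }
        });

        // If the overall timeout wins the race, cancel the remaining child work so the
        // receiving nodes do not keep processing a request nobody is waiting for.
        scheduler.schedule(() -> {
            if (result.completeExceptionally(new TimeoutException("operation timed out"))) {
                cancelRemainingChildTasks.run();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);

        return result;
    }
}
```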

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 9, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates elastic#132713, elastic#123568
DaveCTurner added a commit that referenced this pull request Sep 15, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates #132713, #123568
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Sep 17, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates elastic#132713, elastic#123568
gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 18, 2025
There's really no need to time out so enthusiastically here, we can wait
for as long as it takes to receive a list of other discovery targets
from the remote peer. This commit removes the timeout, deferring failure
detection down to the TCP level (e.g. keepalives) as managed by the OS.

Relates elastic#132713, elastic#123568