
Conversation

DaveCTurner
Contributor

@DaveCTurner DaveCTurner commented Oct 10, 2025

Introduces a per-event-loop limit on the number of TLS handshakes
running at once. When at the limit, subsequent TLS handshakes are
delayed and processed in LIFO order.

Closes ES-12457
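The deferral behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual Elasticsearch implementation; the class and method names here are invented for the sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a per-event-loop handshake throttle: at most maxInFlight
// handshakes run at once, and excess handshakes are deferred and resumed
// in LIFO order. Intended to be confined to a single event-loop thread,
// hence no synchronization.
class HandshakeThrottle {
    private final int maxInFlight;
    private int inFlight = 0;
    private final Deque<Runnable> deferred = new ArrayDeque<>(); // push/poll at head = LIFO

    HandshakeThrottle(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Called on the event loop when a new handshake arrives.
    void submit(Runnable startHandshake) {
        if (inFlight < maxInFlight) {
            inFlight++;
            startHandshake.run();
        } else {
            deferred.push(startHandshake); // newest deferral goes to the front
        }
    }

    // Called on the event loop when a handshake completes (or fails).
    void onComplete() {
        Runnable next = deferred.poll(); // pops the most recently deferred handshake
        if (next != null) {
            next.run(); // in-flight count unchanged: one finished, one starts
        } else {
            inFlight--;
        }
    }
}
```

The point of LIFO rather than FIFO ordering is that under sustained overload the most recently arrived handshakes are the ones whose clients are least likely to have already given up, so running them first wastes the least work on connections that will time out anyway.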

@DaveCTurner DaveCTurner requested a review from mhl-b October 10, 2025 12:54
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Network Http and internode communication implementations v9.3.0 labels Oct 10, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 10, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner removed the request for review from mhl-b October 10, 2025 13:04
@DaveCTurner
Contributor Author

Oh well CI didn't like this very much at all :/

@DaveCTurner DaveCTurner marked this pull request as draft October 10, 2025 13:05
@mhl-b
Contributor

mhl-b commented Oct 10, 2025

I think using a throttler per event loop is smart.

But an absolute number of in-flight handshakes looks hard to tune. Ultimately we want handshakes not to exhaust the CPU (stating the obvious), so the permissible number of handshakes derives from CPU usage. Why not use a metric that is a better proxy for CPU and requires less tuning? For example, how much time we spent handshaking in the last N seconds.

I think there are too many factors that can impact the number of connections: hardware, JDK version, CPU used for other needs, etc.
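The alternative being suggested here, admitting handshakes against a time budget over a recent window rather than an absolute in-flight count, might look something like the following. This is a hypothetical sketch; all names, parameters, and the window/budget scheme are invented for illustration and nothing like this appears in the PR:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window budget: allow a new handshake only while the
// total time spent handshaking within the last windowNanos is below
// budgetNanos. Like the in-flight counter, this would be per event loop.
class HandshakeTimeBudget {
    private final long windowNanos;
    private final long budgetNanos; // max handshake time allowed per window
    private long spentInWindow = 0;
    private final Deque<long[]> samples = new ArrayDeque<>(); // {endTimeNanos, durationNanos}

    HandshakeTimeBudget(long windowNanos, long budgetNanos) {
        this.windowNanos = windowNanos;
        this.budgetNanos = budgetNanos;
    }

    // Record the measured cost of a completed handshake.
    void record(long nowNanos, long handshakeDurationNanos) {
        samples.addLast(new long[] { nowNanos, handshakeDurationNanos });
        spentInWindow += handshakeDurationNanos;
        expire(nowNanos);
    }

    // May a new handshake start now, or should it be deferred?
    boolean mayStartHandshake(long nowNanos) {
        expire(nowNanos);
        return spentInWindow < budgetNanos;
    }

    private void expire(long nowNanos) {
        while (!samples.isEmpty() && samples.peekFirst()[0] < nowNanos - windowNanos) {
            spentInWindow -= samples.pollFirst()[1];
        }
    }
}
```

The appeal of this shape is that the budget is expressed directly in CPU time, so it would not need retuning for faster or slower hardware; the cost is that it needs per-handshake timing and a window parameter of its own.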

@DaveCTurner DaveCTurner marked this pull request as ready for review October 13, 2025 10:52
@DaveCTurner
Copy link
Contributor Author

Ultimately we want handshakes not to exhaust the CPU (stating the obvious)

It's not obvious, and indeed that's not actually the goal at all. We want to avoid the situation where we're doing work on TLS handshakes that ultimately time out, because that work is a pointless waste of CPU. I added a comment in ead6075 to describe things in more detail, and to give some justification behind the defaults I settled on.

You're right that the actual numbers depend on all sorts of factors but a maximum handshake rate of 400Hz seems like a reasonable starting point. I can see some value in auto-tuning: these defaults assume we can achieve a handshake rate of at least 100Hz, but if the CPU is more than 75% busy with other things then that isn't a valid assumption. With a handshake rate lower than 100Hz we will start to hit client timeouts on the 1000 in_progress handshakes which means we're back into wasting-CPU territory.

My hunch is that this isn't a case which happens (ignoring bugs that just clog up the transport worker thread which are a separate thing) and it adds quite some complexity so I'd rather wait until we see evidence it's needed before doing it.

I could however be convinced that we want a different split than 1000+1000 by default. For instance if we went with a 200+1800 split then we'd be able to cope with a 20Hz average handshake rate without any in_progress handshakes timing out.
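For illustration, the arithmetic behind these splits can be written out. The handshake timeout itself is not stated in the thread; a value of roughly 10 seconds is assumed below only because it makes both quoted figures (100Hz for 1000 in_progress slots, 20Hz for 200 slots) come out consistently, so treat it as inferred rather than factual:

```java
// Back-of-envelope sketch of the in_progress split arithmetic. If up to
// inProgressLimit handshakes are admitted concurrently, the oldest one waits
// up to inProgressLimit / rate seconds for service, so to finish before the
// (assumed ~10s) client timeout we need rate >= inProgressLimit / timeout.
class HandshakeBudgetMath {
    static long minSustainableRateHz(long inProgressLimit, long timeoutSeconds) {
        return inProgressLimit / timeoutSeconds;
    }
}
```

Under that assumption, shrinking the in_progress pool from 1000 to 200 lowers the handshake rate the node must sustain before admitted handshakes start timing out, at the price of deferring more handshakes up front.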

@DaveCTurner DaveCTurner requested a review from mhl-b October 13, 2025 11:11