
Conversation

DaveCTurner
Contributor

@DaveCTurner DaveCTurner commented Oct 10, 2025

Introduces a per-event-loop limit on the number of TLS handshakes
running at once. When at the limit, subsequent TLS handshakes are
delayed and processed in LIFO order.

Closes ES-12457
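The deferral behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual Elasticsearch implementation; the class and method names here are invented for the sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a per-event-loop handshake throttle: at most maxInFlight
// handshakes run at once, and excess handshakes are deferred and resumed
// in LIFO order. Intended to be confined to a single event-loop thread,
// hence no synchronization.
class HandshakeThrottle {
    private final int maxInFlight;
    private int inFlight = 0;
    private final Deque<Runnable> deferred = new ArrayDeque<>(); // push/poll at head = LIFO

    HandshakeThrottle(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Called on the event loop when a new handshake arrives.
    void submit(Runnable startHandshake) {
        if (inFlight < maxInFlight) {
            inFlight++;
            startHandshake.run();
        } else {
            deferred.push(startHandshake); // newest deferral goes to the front
        }
    }

    // Called on the event loop when a handshake completes (or fails).
    void onComplete() {
        Runnable next = deferred.poll(); // pops the most recently deferred handshake
        if (next != null) {
            next.run(); // in-flight count unchanged: one finished, one starts
        } else {
            inFlight--;
        }
    }
}
```

The point of LIFO rather than FIFO ordering is that under sustained overload the most recently arrived handshakes are the ones whose clients are least likely to have already given up, so running them first wastes the least work on connections that will time out anyway.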

@DaveCTurner DaveCTurner requested a review from mhl-b October 10, 2025 12:54
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Network Http and internode communication implementations v9.3.0 labels Oct 10, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 10, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner removed the request for review from mhl-b October 10, 2025 13:04
@DaveCTurner
Contributor Author

Oh well CI didn't like this very much at all :/

@DaveCTurner DaveCTurner marked this pull request as draft October 10, 2025 13:05
@mhl-b
Contributor

mhl-b commented Oct 10, 2025

I think using a throttler per event loop is smart.

But an absolute number of in-flight handshakes looks hard to tune. Ultimately we want handshakes not to exhaust the CPU (stating the obvious), so the permissible number of handshakes derives from CPU usage. Why not use a metric that is a better proxy for CPU and requires less tuning? For example, how much time we spent handshaking in the last N seconds.

I think there are too many factors that can impact the number of connections: hardware, JDK version, CPU used for other needs, etc.
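The alternative being suggested here, admitting handshakes against a time budget over a recent window rather than an absolute in-flight count, might look something like the following. This is a hypothetical sketch; all names, parameters, and the window/budget scheme are invented for illustration and nothing like this appears in the PR:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window budget: allow a new handshake only while the
// total time spent handshaking within the last windowNanos is below
// budgetNanos. Like the in-flight counter, this would be per event loop.
class HandshakeTimeBudget {
    private final long windowNanos;
    private final long budgetNanos; // max handshake time allowed per window
    private long spentInWindow = 0;
    private final Deque<long[]> samples = new ArrayDeque<>(); // {endTimeNanos, durationNanos}

    HandshakeTimeBudget(long windowNanos, long budgetNanos) {
        this.windowNanos = windowNanos;
        this.budgetNanos = budgetNanos;
    }

    // Record the measured cost of a completed handshake.
    void record(long nowNanos, long handshakeDurationNanos) {
        samples.addLast(new long[] { nowNanos, handshakeDurationNanos });
        spentInWindow += handshakeDurationNanos;
        expire(nowNanos);
    }

    // May a new handshake start now, or should it be deferred?
    boolean mayStartHandshake(long nowNanos) {
        expire(nowNanos);
        return spentInWindow < budgetNanos;
    }

    private void expire(long nowNanos) {
        while (!samples.isEmpty() && samples.peekFirst()[0] < nowNanos - windowNanos) {
            spentInWindow -= samples.pollFirst()[1];
        }
    }
}
```

The appeal of this shape is that the budget is expressed directly in CPU time, so it would not need retuning for faster or slower hardware; the cost is that it needs per-handshake timing and a window parameter of its own.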

@DaveCTurner DaveCTurner marked this pull request as ready for review October 13, 2025 10:52
@DaveCTurner
Copy link
Contributor Author

Ultimately we want handshakes not to exhaust the CPU (stating the obvious)

It's not obvious, and indeed that's not actually the goal at all. We want to avoid the situation where we're doing work on TLS handshakes that ultimately time out, because that work is a pointless waste of CPU. I added a comment in ead6075 to describe things in more detail, and to give some justification behind the defaults I settled on.

You're right that the actual numbers depend on all sorts of factors but a maximum handshake rate of 400Hz seems like a reasonable starting point. I can see some value in auto-tuning: these defaults assume we can achieve a handshake rate of at least 100Hz, but if the CPU is more than 75% busy with other things then that isn't a valid assumption. With a handshake rate lower than 100Hz we will start to hit client timeouts on the 1000 in_progress handshakes which means we're back into wasting-CPU territory.

My hunch is that this isn't a case which happens (ignoring bugs that just clog up the transport worker thread which are a separate thing) and it adds quite some complexity so I'd rather wait until we see evidence it's needed before doing it.

I could however be convinced that we want a different split than 1000+1000 by default. For instance if we went with a 200+1800 split then we'd be able to cope with a 20Hz average handshake rate without any in_progress handshakes timing out.
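For illustration, the arithmetic behind these splits can be written out. The handshake timeout itself is not stated in the thread; a value of roughly 10 seconds is assumed below only because it makes both quoted figures (100Hz for 1000 in_progress slots, 20Hz for 200 slots) come out consistently, so treat it as inferred rather than factual:

```java
// Back-of-envelope sketch of the in_progress split arithmetic. If up to
// inProgressLimit handshakes are admitted concurrently, the oldest one waits
// up to inProgressLimit / rate seconds for service, so to finish before the
// (assumed ~10s) client timeout we need rate >= inProgressLimit / timeout.
class HandshakeBudgetMath {
    static long minSustainableRateHz(long inProgressLimit, long timeoutSeconds) {
        return inProgressLimit / timeoutSeconds;
    }
}
```

Under that assumption, shrinking the in_progress pool from 1000 to 200 lowers the handshake rate the node must sustain before admitted handshakes start timing out, at the price of deferring more handshakes up front.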

@DaveCTurner DaveCTurner requested a review from mhl-b October 13, 2025 11:11