Limit concurrent TLS handshakes #136386
Introduces a per-event-loop limit on the number of TLS handshakes running at once. When at the limit, subsequent TLS handshakes are delayed and processed in LIFO order.
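The throttling described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name `HandshakeThrottle` and its methods are hypothetical, the stack of delayed handshakes is left unbounded for brevity, and it assumes every call happens on the owning event-loop thread, so it does no locking.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a per-event-loop handshake throttle.
 * Assumption: all methods are called from a single event-loop thread.
 */
final class HandshakeThrottle {
    private final int maxInProgress;
    private int inProgress = 0;
    // Used as a stack: push() adds at the head and poll() removes from the head,
    // so the most recently delayed handshake is started first (LIFO).
    private final Deque<Runnable> delayed = new ArrayDeque<>();

    HandshakeThrottle(int maxInProgress) {
        if (maxInProgress < 1) {
            throw new IllegalArgumentException("maxInProgress must be at least 1");
        }
        this.maxInProgress = maxInProgress;
    }

    /** Starts the handshake now if below the limit, otherwise delays it. */
    void submit(Runnable startHandshake) {
        if (inProgress < maxInProgress) {
            inProgress++;
            startHandshake.run();
        } else {
            delayed.push(startHandshake);
        }
    }

    /** Must be called once for every started handshake that finishes or fails. */
    void onHandshakeComplete() {
        inProgress--;
        Runnable next = delayed.poll(); // null when nothing is waiting
        if (next != null) {
            inProgress++;
            next.run();
        }
    }

    int inProgress() {
        return inProgress;
    }

    int delayedCount() {
        return delayed.size();
    }
}
```

With a limit of 1 and handshakes `a`, `b`, `c` submitted in that order, `a` starts immediately, and as each one completes the sketch starts `c` before `b`. LIFO favours the newest connection attempts, which are the ones least likely to have already timed out on the client side.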
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Hi @DaveCTurner, I've created a changelog YAML for you.
Oh well CI didn't like this very much at all :/
I think using a throttler per event loop is smart, but an absolute number of in-flight handshakes looks hard to tune. Ultimately we want handshakes not to exhaust the CPU (stating the obvious), so the right number of handshakes derives from CPU usage. Why not use a proxy for CPU usage that requires less tuning, for example how much time we spent handshaking in the last N seconds? Too many factors can affect the number of connections: hardware, JDK version, CPU used for other needs, etc.
It's not obvious, and indeed that's not actually the goal at all. We want to avoid the situation where we're doing work on TLS handshakes that ultimately time out, because that work is a pointless waste of CPU. I added a comment in ead6075 to describe things in more detail, and to give some justification behind the defaults I settled on. You're right that the actual numbers depend on all sorts of factors, but a maximum handshake rate of 400Hz seems like a reasonable starting point.
I can see some value in auto-tuning: these defaults assume we can achieve a handshake rate of at least 100Hz, but if the CPU is more than 75% busy with other things then that isn't a valid assumption. With a handshake rate lower than 100Hz we will start to hit client timeouts on the 1000
My hunch is that this isn't a case which happens (ignoring bugs that just clog up the transport worker thread, which are a separate thing), and it adds quite some complexity, so I'd rather wait until we see evidence it's needed before doing it.
I could however be convinced that we want a different split than 1000+1000 by default. For instance, if we went with a 200+1800 split then we'd be able to cope with a 20Hz average handshake rate without any
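The rates quoted above can be checked with a small calculation. This sketch assumes the first number in each split is the limit on in-progress handshakes, and that the time to work through a full set of them is simply that limit divided by the sustained handshake rate. Both the interpretation and the helper name `HandshakeBudget.drainSeconds` are assumptions made for illustration, not something stated in the thread.

```java
/** Hypothetical helper for the back-of-envelope arithmetic; not part of this PR. */
final class HandshakeBudget {
    private HandshakeBudget() {}

    /**
     * Seconds needed to complete {@code inProgressLimit} handshakes at a
     * sustained rate of {@code rateHz} handshakes per second.
     */
    static double drainSeconds(int inProgressLimit, double rateHz) {
        if (inProgressLimit < 0 || rateHz <= 0) {
            throw new IllegalArgumentException("limit must be >= 0 and rate must be > 0");
        }
        return inProgressLimit / rateHz;
    }
}
```

Under these assumptions `drainSeconds(1000, 100)` and `drainSeconds(200, 20)` both give 10 seconds, which would explain why a 200+1800 split tolerates a 20Hz rate in the same way that 1000+1000 tolerates 100Hz. The implied 10-second client-side budget is an inference from the numbers, not a figure given in the discussion.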
Closes ES-12457