connection_status deadlock on poisoning connection#446
connection_status deadlock on poisoning connection#446Keruspe merged 3 commits intoamqp-rs:lapin-3.xfrom
Conversation
|
I can forward port this to main once reviewed, merged here. This seemed like the shortest path to getting it fixed in a release. |
|
Ouch, let me dig a little into it, it's weird that I haven't been beaten up by this, hopefully tonight |
|
It happens for me when I add some custom TLS logic to the connector callback. If something happens that results in an IO error (bad cert, connection refused, etc), the thread stays alive forever, causing memory and thread count to continually increase as I retry the connection. |
|
|
Did confirm the issue goes away with this changeset |
|
Thanks a lot for the detailed investigation, will make the reviewing much easier. I always merge N-1 branch into N btw, so no need to do another PR on main, this one will get there too |
|
Ok, I get it now, after reading the code. We've actually even already been hit by this one in the past, which makes it even more embarrassing and which definitely shows this part should be reworked, as you can see in one of the method from connection_status.... |
|
Yeah, I saw that function but didn't use it so I could make sure poison still held the lock until after the reject call finished, wasn't sure about assumptions elsewhere. No hurt feelings on my end if you want to refactor or implement differently. |
|
What I don't get is why it works in my tests, would you happen to have a minimal reproducer? |
|
Hrm, do the tests verify that the IoLoop thread goes away? the deadlock doesn't block the connection attempt, it just leaves a blocked thread behind forever. |
|
What would you think, instead of dropping the |
Riiiight, I get it now, it's a deadlock... not locking the application, but locking a ~defunkt thread, which is why it went unnoticed |
Yeah, in a long-running application it causes unbounded memory growth if there are any networking issues or periodic downtime, etc. |
|
This seems to repro it: Main.rs: Cargo.toml |
|
Thread count of process is continually growing |
|
Thanks a lot, expect a 3.x release in a matter of minutes. Do you need a 4.x too, or can it wait a couple of days? |
|
We're sticking to released builds so will be on 3.7.x until 4.0 is finalized, ty! 😃 |
If a connection fails, the IoLoop attempts to poison it, unfortunately the current implementation causes a deadlock.
This change refactors the poison call to explicitly drop the lock before the connection.