
connection_status deadlock on poisoning connection#446

Merged
Keruspe merged 3 commits into amqp-rs:lapin-3.x from brosander:connection-failure-deadlock
Nov 5, 2025

Conversation

@brosander
Contributor

If a connection fails, the IoLoop attempts to poison it; unfortunately, the current implementation causes a deadlock.

This change refactors the poison call to explicitly drop the lock before the connection.

@brosander
Contributor Author

I can forward-port this to main once it's reviewed and merged here. This seemed like the shortest path to getting it fixed in a release.

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

Ouch, let me dig a little into it; it's weird that I haven't been bitten by this. Hopefully tonight.

@brosander
Contributor Author

It happens for me when I add some custom TLS logic to the connector callback.

If something happens that results in an IO error (bad cert, connection refused, etc), the thread stays alive forever, causing memory and thread count to continually increase as I retry the connection.

@brosander
Contributor Author

    #0 0x00007f965371d0dd in syscall () from /usr/lib/libc.so.6
    #1 0x0000563e8e923fa6 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:73
    #2 std::sys::sync::mutex::futex::Mutex::lock_contended () at library/std/src/sys/sync/mutex/futex.rs:61
    #3 0x0000563e8e234ea9 in std::sys::sync::mutex::futex::Mutex::lock (self=0x7f96509f1310) at /usr/local/rustup/toolchains/1.91.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/sync/mutex/futex.rs:33
    #4 std::sync::poison::mutex::Mutex<lapin::connection_status::Inner>::lock<lapin::connection_status::Inner> (self=0x7f96509f1310) at /usr/local/rustup/toolchains/1.91.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/poison/mutex.rs:489
    #5 lapin::connection_status::ConnectionStatus::lock_inner (self=<optimized out>) at src/connection_status.rs:123
    #6 lapin::connection_status::ConnectionStatus::auto_close (self=0x7f95f88a71d0) at src/connection_status.rs:119
    #7 lapin::connection_closer::{impl#1}::drop (self=0x7f95f88a71d0) at src/connection_closer.rs:30
    #8 0x0000563e8e1ef615 in core::ptr::drop_in_place<lapin::connection_closer::ConnectionCloser> () at /usr/local/rustup/toolchains/1.91.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:804
    #9 alloc::sync::Arc<lapin::connection_closer::ConnectionCloser, alloc::alloc::Global>::drop_slow<lapin::connection_closer::ConnectionCloser, alloc::alloc::Global> (self=<optimized out>) at /usr/local/rustup/toolchains/1.91.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/sync.rs:1942
    #10 0x0000563e8e23e101 in core::ptr::drop_in_place<core::option::Option<lapin::connection::Connection>> () at /usr/local/rustup/toolchains/1.91.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:804
    #11 lapin::connection_status::Inner::poison (self=<optimized out>, err=...) at src/connection_status.rs:268
    #12 lapin::connection_status::ConnectionStatus::poison (self=<optimized out>, err=...) at src/connection_status.rs:115
    #13 0x0000563e8e21b751 in lapin::io_loop::{impl#0}::start::{closure#0}::{closure#0} (err=0x7f9595af8d90) at src/io_loop.rs:188
    #14 core::result::Result<tcp_stream::TcpStream, lapin::error::Error>::inspect_err<tcp_stream::TcpStream, lapin::error::Error, lapin::io_loop::{impl#0}::start::{closure#0}::{closure_env#0}> (self=..., f=...) at /usr/local/rustup/toolchains/1.91.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1014

@brosander
Contributor Author

brosander commented Nov 5, 2025

Confirmed the issue goes away with this changeset.

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

Thanks a lot for the detailed investigation, it will make the review much easier.

By the way, I always merge the N-1 branch into N, so no need to do another PR on main; this one will get there too.

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

Ok, I get it now, after reading the code.

We've actually already been hit by this one in the past, which makes it even more embarrassing and definitely shows this part should be reworked, as you can see in one of the methods from connection_status...

    pub(crate) fn connection_resolver(&self) -> Option<PromiseResolver<Connection>> {
        let resolver = self.lock_inner().connection_resolver();
        // We carry the Connection here to drop the lock() above before dropping the Connection
        resolver.map(|(resolver, _connection)| resolver.resolver_in)
    }

@brosander
Contributor Author

Yeah, I saw that function but didn't use it, so I could make sure poison still held the lock until after the reject call finished; I wasn't sure about assumptions elsewhere. No hurt feelings on my end if you want to refactor or implement this differently.

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

What I don't get is why it works in my tests; would you happen to have a minimal reproducer?

@brosander
Contributor Author

Hrm, do the tests verify that the IoLoop thread goes away? The deadlock doesn't block the connection attempt; it just leaves a blocked thread behind forever.

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

What would you think about, instead of dropping the Inner::poison method, making it set the poison first and then return the output of the connection_resolver method?
Then at the call site in ConnectionStatus::poison, having something like let (resolver, _connection) = self.lock_inner().poison(err); which would automatically drop the lock while keeping a reference to the connection (with a comment similar to the one from the other function).

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

> Hrm, do the tests verify that the IoLoop thread goes away? the deadlock doesn't block the connection attempt, it just leaves a blocked thread behind forever.

Riiiight, I get it now, it's a deadlock... not locking the application, but locking a ~defunct thread, which is why it went unnoticed.

@brosander
Contributor Author

> Hrm, do the tests verify that the IoLoop thread goes away? the deadlock doesn't block the connection attempt, it just leaves a blocked thread behind forever.

> Riiiight, I get it now, it's a deadlock... not locking the application, but locking a ~defunct thread, which is why it went unnoticed.

Yeah, in a long-running application it causes unbounded memory growth if there are any networking issues, periodic downtime, etc.

@brosander
Contributor Author

This seems to repro it:

main.rs:

    use std::time::Duration;

    use lapin::{Connection, ConnectionProperties};
    use tokio::time::sleep;

    #[tokio::main(worker_threads = 4)]
    async fn main() {
        loop {
            match Connection::connect("amqps://google.com:80", ConnectionProperties::default()).await {
                Ok(_) => panic!("google shouldn't let us connect"),
                Err(e) => {
                    eprintln!("expected error: {e:?}");
                    sleep(Duration::from_secs(1)).await;
                }
            }
        }
    }

Cargo.toml

    [package]
    name = "deadlock-repro"
    version = "0.1.0"
    edition = "2024"

    [dependencies]
    lapin = "3.7.1"
    tokio = { version = "1.48.0", features = ["rt-multi-thread", "macros", "time"] }

@brosander
Contributor Author

The thread count of the process is continually growing.

@Keruspe
Collaborator

Keruspe commented Nov 5, 2025

Thanks a lot, expect a 3.x release in a matter of minutes. Do you need a 4.x too, or can it wait a couple of days?

@brosander
Contributor Author

We're sticking to released builds so will be on 3.7.x until 4.0 is finalized, ty! 😃

@Keruspe Keruspe merged commit c29fdbc into amqp-rs:lapin-3.x Nov 5, 2025
9 checks passed
