Deadlock (?) between garbage collection and RcvQueue worker thread termination #83

@jrudolph

We observe a situation where UDT completely hangs with many threads stuck waiting for the m_ControlLock.

At this point the lock is held by the garbage collection thread (in checkBrokenSockets), which is blocked in pthread_join waiting for an RcvQueue worker thread to terminate:

(gdb) bt
#0  0x00007f5b9f593ef7 in pthread_join (threadid=140028744247040, thread_return=0x0) at pthread_join.c:92
#1  0x00007f5b5c3b6221 in CRcvQueue::~CRcvQueue() () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#2  0x00007f5b5c39b0bd in CUDTUnited::removeSocket(int) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#3  0x00007f5b5c39baa2 in CUDTUnited::checkBrokenSockets() () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#4  0x00007f5b5c39bc64 in CUDTUnited::garbageCollect(void*) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#5  0x00007f5b9f592dc5 in start_thread (arg=0x7f5b17fff700) at pthread_create.c:308
#6  0x00007f5b9eea628d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) frame 0
#0  0x00007f5b9f593ef7 in pthread_join (threadid=140028744247040, thread_return=0x0) at pthread_join.c:92
92      lll_wait_tid (pd->tid);
(gdb) print pd->tid
$3 = 17122

The worker thread being joined (LWP 17122, matching pd->tid above) seems to be stuck in recvmsg:

Thread 7 (Thread 0x7f5afb8f2700 (LWP 17122)):
#0  0x00007f5b9f59967d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5b5c3a0b2b in CChannel::recvfrom(sockaddr*, CPacket&) const () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#2  0x00007f5b5c3b6fee in CRcvQueue::worker(void*) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#3  0x00007f5b9f592dc5 in start_thread (arg=0x7f5afb8f2700) at pthread_create.c:308
#4  0x00007f5b9eea628d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

This doesn't seem to be a classic deadlock. It may be more of a problem with the blocking recvmsg call.
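To illustrate the suspicion about the blocking call, here is a minimal sketch. It is not UDT's code: `worker`, `join_worker_with_recv_timeout` and the `closing` flag are hypothetical names, and it assumes a worker loop that only re-checks a shutdown flag between receives. I have not verified whether UDT's worker runs without a receive timeout in this situation. With `SO_RCVTIMEO` set, `recvmsg` returns periodically and the join completes. Without it, and with no incoming datagram, the join could wait indefinitely, as in the backtraces above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical stand-in for a receive-queue worker; not UDT's actual code.
static void worker(int fd, const std::atomic<bool>& closing) {
    char buf[1500];
    while (!closing.load()) {
        iovec iov{buf, sizeof(buf)};
        msghdr mh{};
        mh.msg_iov = &iov;
        mh.msg_iovlen = 1;
        // With SO_RCVTIMEO set, this returns -1 (EAGAIN/EWOULDBLOCK) after the
        // timeout, so the loop gets a chance to re-check `closing`. Without a
        // timeout the call can block indefinitely if no datagram arrives.
        (void)recvmsg(fd, &mh, 0);
    }
}

// Returns true if the worker could be joined after the closing flag was set,
// false if socket setup failed.
bool join_worker_with_recv_timeout(int timeout_ms) {
    assert(timeout_ms > 0);  // a zero timeval would mean "no timeout"

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return false;
    }

    timeval tv{};
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
        close(fd);
        return false;
    }

    std::atomic<bool> closing{false};
    std::thread t(worker, fd, std::cref(closing));

    // Give the worker time to enter recvmsg before requesting shutdown.
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    closing.store(true);

    t.join();  // plays the role of pthread_join in the destructor frame above
    close(fd);
    return true;
}
```

Calling `join_worker_with_recv_timeout(50)` should return within roughly one timeout period. The design point is only that the joining thread depends on the receive call waking up by itself, because nothing else in this sketch interrupts it.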

Does anyone have an idea how this could happen?
