Skip to content

Commit c0fedb5

Browse files
committed
mds/MDSDaemon: unlock mds_lock while shutting down Beacon and others
This fixes a deadlock bug during MDS shutdown: - the "signal_handler" thread receives the shutdown signal and invokes MDSDaemon::suicide() while holding `mds_lock` - MDSDaemon::suicide() invokes Beacon::send_and_wait() while still holding `mds_lock` - meanwhile, all "ms_dispatch" threads get stuck waiting for `mds_lock`, for example in MDCache::upkeep_main() or MDSDaemon::ms_dispatch2() - Beacon::send_and_wait() waits for a `MSG_MDS_BEACON` packet to be dispatched (via `cvar` with a timeout) At this point, even if a `MSG_MDS_BEACON` packet is received by one of the worker threads, they will put it in the `DispatchQueue`, but no dispatcher thread will be able to handle it because they are all stuck. The cvar.wait_for() call in Beacon::send_and_wait() will therefore time out and the `MSG_MDS_BEACON` will never be processed. The proper solution is to unlock `mds_lock` to avoid the dispatchers from getting stuck. And in general, we should be holding a lock strictly only when it is needed and never do blocking calls while holding a lock. Fixes: https://tracker.ceph.com/issues/68760 Signed-off-by: Max Kellermann <[email protected]>
1 parent 5b828b6 commit c0fedb5

File tree

1 file changed

+9
-0
lines changed

1 file changed

+9
-0
lines changed

src/mds/MDSDaemon.cc

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -923,6 +923,13 @@ void MDSDaemon::suicide()
923923
// to wait for us to go laggy. Only do this if we're actually in the MDSMap,
924924
// because otherwise the MDSMonitor will drop our message.
925925
beacon.set_want_state(*mdsmap, MDSMap::STATE_DNE);
926+
927+
/* Unlock the mds_lock while waiting for beacon ACK to avoid a
928+
* deadlock with the dispatcher thread which may try to acquire
929+
* mds_lock, preventing it from receiving the beacon ACK.
930+
*/
931+
mds_lock.unlock();
932+
926933
if (!mdsmap->is_dne_gid(mds_gid_t(monc->get_global_id()))) {
927934
beacon.send_and_wait(1);
928935
}
@@ -931,6 +938,8 @@ void MDSDaemon::suicide()
931938
if (mgrc.is_initialized())
932939
mgrc.shutdown();
933940

941+
mds_lock.lock();
942+
934943
if (mds_rank) {
935944
mds_rank->shutdown();
936945
} else {

0 commit comments

Comments
 (0)