Skip to content

Commit e4ecd60

Browse files
committed
fix queue locking behavior when creating event lists
This fixes a race when creating queue events without acquiring the appropriate locks, and also fixes a deadlock when acquiring multiple queue locks. The deadlock scenario is roughly this: queue1 = ...; queue2 = ...; for (;;) { queue1_event = urEnqueueKernelLaunch(queue1, ...); // lock order: queue2->Mutex, queue1->Mutex urEnqueueEventsWait(queue2, 1, [queue1_event], nullptr); } T2: for (;;) { queue2_event = urEnqueueKernelLaunch(queue2, ...); // lock order: queue1->Mutex, queue2->Mutex urEnqueueEventsWait(queue1, 1, [queue2_event], nullptr); } The solution in this patch is the simplest possible fix I managed to figure out, but I strongly believe this code requires a substantial refactor to better structure the synchronization logic. Right now it's a minefield.
1 parent 7fcfe3a commit e4ecd60

File tree

1 file changed

+25
-11
lines changed

1 file changed

+25
-11
lines changed

source/adapters/level_zero/event.cpp

Lines changed: 25 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1261,16 +1261,26 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
12611261
}
12621262

12631263
auto Queue = EventList[I]->UrQueue;
1264-
if (Queue) {
1265-
// The caller of createAndRetainUrZeEventList must already hold
1266-
// a lock of the CurQueue. Additionally lock the Queue if it
1267-
// is different from CurQueue.
1268-
// TODO: rework this to avoid deadlock when another thread is
1269-
// locking the same queues but in a different order.
1270-
auto Lock = ((Queue == CurQueue)
1271-
? std::unique_lock<ur_shared_mutex>()
1272-
: std::unique_lock<ur_shared_mutex>(Queue->Mutex));
12731264

1265+
auto CurQueueDevice = CurQueue->Device;
1266+
std::optional<std::unique_lock<ur_shared_mutex>> QueueLock =
1267+
std::nullopt;
1268+
// The caller of createAndRetainUrZeEventList must already hold
1269+
// a lock of the CurQueue. However, if the CurQueue is different
1270+
// then the Event's Queue, we need to drop that lock and
1271+
// acquire the Event's Queue lock. This is done to avoid a lock
1272+
// ordering issue.
1273+
// For the rest of this scope, CurQueue cannot be accessed.
1274+
// TODO: This solution is very error-prone. This requires a refactor
1275+
// to either have fine-granularity locks inside of the queues or
1276+
// to move any operations on queues other than CurQueue out
1277+
// of this scope.
1278+
if (Queue && Queue != CurQueue) {
1279+
CurQueue->Mutex.unlock();
1280+
QueueLock = std::unique_lock<ur_shared_mutex>(Queue->Mutex);
1281+
}
1282+
1283+
if (Queue) {
12741284
// If the event that is going to be waited is in an open batch
12751285
// different from where this next command is going to be added,
12761286
// then we have to force execute of that open command-list
@@ -1313,7 +1323,7 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
13131323
}
13141324

13151325
ur_command_list_ptr_t CommandList;
1316-
if (Queue && Queue->Device != CurQueue->Device) {
1326+
if (Queue && Queue->Device != CurQueueDevice) {
13171327
// Get a command list prior to acquiring an event lock.
13181328
// This prevents a potential deadlock with recursive
13191329
// event locks.
@@ -1323,7 +1333,7 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
13231333

13241334
std::shared_lock<ur_shared_mutex> Lock(EventList[I]->Mutex);
13251335

1326-
if (Queue && Queue->Device != CurQueue->Device &&
1336+
if (Queue && Queue->Device != CurQueueDevice &&
13271337
!EventList[I]->IsMultiDevice) {
13281338
ze_event_handle_t MultiDeviceZeEvent = nullptr;
13291339
ur_event_handle_t MultiDeviceEvent;
@@ -1357,6 +1367,10 @@ ur_result_t _ur_ze_event_list_t::createAndRetainUrZeEventList(
13571367
this->UrEventList[TmpListLength]->RefCount.increment();
13581368
}
13591369

1370+
if (QueueLock.has_value()) {
1371+
QueueLock.reset();
1372+
CurQueue->Mutex.lock();
1373+
}
13601374
TmpListLength += 1;
13611375
}
13621376
}

0 commit comments

Comments
 (0)