Skip to content

Commit 87651f3

Browse files
matt-auldThomas Hellström
authored andcommitted
drm/xe/guc_submit: fix race around suspend_pending
Currently in some testcases we can trigger: xe 0000:03:00.0: [drm] Assertion `exec_queue_destroyed(q)` failed! .... WARNING: CPU: 18 PID: 2640 at drivers/gpu/drm/xe/xe_guc_submit.c:1826 xe_guc_sched_done_handler+0xa54/0xef0 [xe] xe 0000:03:00.0: [drm] *ERROR* GT1: DEREGISTER_DONE: Unexpected engine state 0x00a1, guc_id=57 Looking at a snippet of corresponding ftrace for this GuC id we can see: 162.673311: xe_sched_msg_add: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3 162.673317: xe_sched_msg_recv: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3 162.673319: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0 162.674089: xe_exec_queue_kill: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0 162.674108: xe_exec_queue_close: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0 162.674488: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0 162.678452: xe_exec_queue_deregister: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa1, flags=0x0 It looks like we try to suspend the queue (opcode=3), setting suspend_pending and triggering a disable_scheduling. The user then closes the queue. However the close will also forcefully signal the suspend fence after killing the queue, later when the G2H response for disable_scheduling comes back we have now cleared suspend_pending when signalling the suspend fence, so the disable_scheduling now incorrectly tries to also deregister the queue. This leads to warnings since the queue has yet to even be marked for destruction. We also seem to trigger errors later with trying to double unregister the same queue. To fix this tweak the ordering when handling the response to ensure we don't race with a disable_scheduling that didn't actually intend to perform an unregister. The destruction path should now also correctly wait for any pending_disable before marking as destroyed. Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3371 Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit f161809) Signed-off-by: Thomas Hellström <[email protected]>
1 parent 6965f91 commit 87651f3

File tree

1 file changed

+15
-2
lines changed

1 file changed

+15
-2
lines changed

drivers/gpu/drm/xe/xe_guc_submit.c

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1871,16 +1871,29 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
18711871
xe_gt_assert(guc_to_gt(guc), runnable_state == 0);
18721872
xe_gt_assert(guc_to_gt(guc), exec_queue_pending_disable(q));
18731873

1874-
clear_exec_queue_pending_disable(q);
18751874
if (q->guc->suspend_pending) {
18761875
suspend_fence_signal(q);
1876+
clear_exec_queue_pending_disable(q);
18771877
} else {
18781878
if (exec_queue_banned(q) || check_timeout) {
18791879
smp_wmb();
18801880
wake_up_all(&guc->ct.wq);
18811881
}
1882-
if (!check_timeout)
1882+
if (!check_timeout && exec_queue_destroyed(q)) {
1883+
/*
1884+
* Make sure to clear the pending_disable only
1885+
* after sampling the destroyed state. We want
1886+
* to ensure we don't trigger the unregister too
1887+
* early with something intending to only
1888+
* disable scheduling. The caller doing the
1889+
* destroy must wait for an ongoing
1890+
* pending_disable before marking as destroyed.
1891+
*/
1892+
clear_exec_queue_pending_disable(q);
18831893
deregister_exec_queue(guc, q);
1894+
} else {
1895+
clear_exec_queue_pending_disable(q);
1896+
}
18841897
}
18851898
}
18861899
}

0 commit comments

Comments
 (0)