Skip to content

Commit 0e00a88

Browse files
drm/i915/guc: Avoid circular locking issue on busyness flush
Avoid the following lockdep complaint: <4> [298.856498] ====================================================== <4> [298.856500] WARNING: possible circular locking dependency detected <4> [298.856503] 6.7.0-rc5-CI_DRM_14017-g58ac4ffc75b6+ #1 Tainted: G N <4> [298.856505] ------------------------------------------------------ <4> [298.856507] kworker/4:1H/190 is trying to acquire lock: <4> [298.856509] ffff8881103e9978 (&gt->reset.backoff_srcu){++++}-{0:0}, at: _intel_gt_reset_lock+0x35/0x380 [i915] <4> [298.856661] but task is already holding lock: <4> [298.856663] ffffc900013f7e58 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x264/0x530 <4> [298.856671] which lock already depends on the new lock. The complaint is not actually valid. The busyness worker thread does indeed hold the worker lock and then attempt to acquire the reset lock (which may have happened in reverse order elsewhere). However, it does so with a trylock that exits if the reset lock is not available (specifically to prevent this and other similar deadlocks). Unfortunately, lockdep does not understand the trylock semantics (the lock is an i915 specific custom implementation for resets). Not doing a synchronous flush of the worker thread when a reset is in progress resolves the lockdep splat by never even attempting to grab the lock in this particular scenario. There are situatons where a synchronous cancel is required, however. So, always do the synchronous cancel if not in reset. And add an extra synchronous cancel to the end of the reset flow to account for when a reset is occurring at driver shutdown and the cancel is no longer synchronous but could lead to unallocated memory accesses if the worker is not stopped. Signed-off-by: Zhanjun Dong <[email protected]> Signed-off-by: John Harrison <[email protected]> Cc: Andi Shyti <[email protected]> Cc: Daniel Vetter <[email protected]> Reviewed-by: Andi Shyti <[email protected]> Acked-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
1 parent 2f2cc53 commit 0e00a88

File tree

2 files changed

+48
-2
lines changed

2 files changed

+48
-2
lines changed

drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

Lines changed: 47 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1375,7 +1375,45 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
13751375

13761376
static void guc_cancel_busyness_worker(struct intel_guc *guc)
13771377
{
1378-
cancel_delayed_work_sync(&guc->timestamp.work);
1378+
/*
1379+
* There are many different call stacks that can get here. Some of them
1380+
* hold the reset mutex. The busyness worker also attempts to acquire the
1381+
* reset mutex. Synchronously flushing a worker thread requires acquiring
1382+
* the worker mutex. Lockdep sees this as a conflict. It thinks that the
1383+
* flush can deadlock because it holds the worker mutex while waiting for
1384+
* the reset mutex, but another thread is holding the reset mutex and might
1385+
* attempt to use other worker functions.
1386+
*
1387+
* In practice, this scenario does not exist because the busyness worker
1388+
* does not block waiting for the reset mutex. It does a try-lock on it and
1389+
* immediately exits if the lock is already held. Unfortunately, the mutex
1390+
* in question (I915_RESET_BACKOFF) is an i915 implementation which has lockdep
1391+
* annotation but not to the extent of explaining the 'might lock' is also a
1392+
* 'does not need to lock'. So one option would be to add more complex lockdep
1393+
* annotations to ignore the issue (if at all possible). A simpler option is to
1394+
* just not flush synchronously when a rest in progress. Given that the worker
1395+
* will just early exit and re-schedule itself anyway, there is no advantage
1396+
* to running it immediately.
1397+
*
1398+
* If a reset is not in progress, then the synchronous flush may be required.
1399+
* As noted many call stacks lead here, some during suspend and driver unload
1400+
* which do require a synchronous flush to make sure the worker is stopped
1401+
* before memory is freed.
1402+
*
1403+
* Trying to pass a 'need_sync' or 'in_reset' flag all the way down through
1404+
* every possible call stack is unfeasible. It would be too intrusive to many
1405+
* areas that really don't care about the GuC backend. However, there is the
1406+
* 'reset_in_progress' flag available, so just use that.
1407+
*
1408+
* And note that in the case of a reset occurring during driver unload
1409+
* (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set
1410+
* is fine because there is another cancel in _finish (when the reset flag is
1411+
* not).
1412+
*/
1413+
if (guc_to_gt(guc)->uc.reset_in_progress)
1414+
cancel_delayed_work(&guc->timestamp.work);
1415+
else
1416+
cancel_delayed_work_sync(&guc->timestamp.work);
13791417
}
13801418

13811419
static void __reset_guc_busyness_stats(struct intel_guc *guc)
@@ -1966,8 +2004,16 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc)
19662004

19672005
void intel_guc_submission_reset_finish(struct intel_guc *guc)
19682006
{
2007+
/*
2008+
* Ensure the busyness worker gets cancelled even on a fatal wedge.
2009+
* Note that reset_prepare is not allowed to because it confuses lockdep.
2010+
*/
2011+
if (guc_submission_initialized(guc))
2012+
guc_cancel_busyness_worker(guc);
2013+
19692014
/* Reset called during driver load or during wedge? */
19702015
if (unlikely(!guc_submission_initialized(guc) ||
2016+
!intel_guc_is_fw_running(guc) ||
19712017
intel_gt_is_wedged(guc_to_gt(guc)))) {
19722018
return;
19732019
}

drivers/gpu/drm/i915/gt/uc/intel_uc.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -640,7 +640,7 @@ void intel_uc_reset_finish(struct intel_uc *uc)
640640
uc->reset_in_progress = false;
641641

642642
/* Firmware expected to be running when this function is called */
643-
if (intel_guc_is_fw_running(guc) && intel_uc_uses_guc_submission(uc))
643+
if (intel_uc_uses_guc_submission(uc))
644644
intel_guc_submission_reset_finish(guc);
645645
}
646646

0 commit comments

Comments
 (0)