Commit aab4707

drm/i915/gt: Harden the heartbeat against a stuck driver
If the driver gets stuck holding the kernel timeline, we cannot issue a heartbeat and so fail to discover that the driver is indeed stuck and do not issue a GPU reset (which would hopefully unstick the driver!). Switch to using a trylock so that we can query if the heartbeat's timeline mutex is locked elsewhere, and then use the timer to probe if it remains stuck at the same spot for consecutive heartbeats, indicating that the mutex has not been released and the engine has not progressed.

Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Mika Kuoppala <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
1 parent 680c45c commit aab4707

2 files changed, +13 -2 lines changed
drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c

Lines changed: 12 additions & 2 deletions
@@ -65,6 +65,7 @@ static void heartbeat(struct work_struct *wrk)
 		container_of(wrk, typeof(*engine), heartbeat.work.work);
 	struct intel_context *ce = engine->kernel_context;
 	struct i915_request *rq;
+	unsigned long serial;

 	/* Just in case everything has gone horribly wrong, give it a kick */
 	intel_engine_flush_submission(engine);
@@ -122,10 +123,19 @@ static void heartbeat(struct work_struct *wrk)
 		goto out;
 	}

-	if (engine->wakeref_serial == engine->serial)
+	serial = READ_ONCE(engine->serial);
+	if (engine->wakeref_serial == serial)
 		goto out;

-	mutex_lock(&ce->timeline->mutex);
+	if (!mutex_trylock(&ce->timeline->mutex)) {
+		/* Unable to lock the kernel timeline, is the engine stuck? */
+		if (xchg(&engine->heartbeat.blocked, serial) == serial)
+			intel_gt_handle_error(engine->gt, engine->mask,
+					      I915_ERROR_CAPTURE,
+					      "no heartbeat on %s",
+					      engine->name);
+		goto out;
+	}

 	intel_context_enter(ce);
 	rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);

drivers/gpu/drm/i915/gt/intel_engine_types.h

Lines changed: 1 addition & 0 deletions
@@ -348,6 +348,7 @@ struct intel_engine_cs {
 	struct {
 		struct delayed_work work;
 		struct i915_request *systole;
+		unsigned long blocked;
 	} heartbeat;

 	unsigned long serial;
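
For readers who want to experiment with the detection logic outside the kernel, here is a minimal user-space sketch of the pattern the commit message describes: try the lock, and if it is busy, remember the serial seen; a second consecutive failure at the same serial is treated as "stuck". This is an illustration under assumptions, not the driver code. C11 atomics stand in for the kernel's mutex_trylock(), READ_ONCE() and xchg(), and the names engine_model, timeline_trylock, timeline_unlock and heartbeat_tick are hypothetical. The reset is only counted, where the patch calls intel_gt_handle_error().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the engine fields the patch touches. */
struct engine_model {
	atomic_flag timeline_locked;  /* toy replacement for ce->timeline->mutex */
	atomic_ulong serial;          /* advances when the engine makes progress */
	atomic_ulong blocked;         /* serial recorded by the last blocked heartbeat */
	unsigned long wakeref_serial; /* serial at which the engine last idled */
	int resets;                   /* number of simulated resets */
};

static void engine_model_init(struct engine_model *e)
{
	atomic_flag_clear(&e->timeline_locked);
	atomic_init(&e->serial, 1UL); /* start at 1 so that 0 means "never blocked" */
	atomic_init(&e->blocked, 0UL);
	e->wakeref_serial = 0;
	e->resets = 0;
}

/* Returns true if the lock was taken, false if it was already held. */
static bool timeline_trylock(struct engine_model *e)
{
	return !atomic_flag_test_and_set(&e->timeline_locked);
}

static void timeline_unlock(struct engine_model *e)
{
	atomic_flag_clear(&e->timeline_locked);
}

/* One heartbeat period. Returns true if a (simulated) reset was requested. */
static bool heartbeat_tick(struct engine_model *e)
{
	/* Sample the serial once so both comparisons below see the same value. */
	unsigned long serial = atomic_load(&e->serial);

	if (e->wakeref_serial == serial)
		return false; /* nothing submitted since the engine idled */

	if (!timeline_trylock(e)) {
		/* Busy: swap in the current serial and compare with the previous one. */
		if (atomic_exchange(&e->blocked, serial) == serial) {
			e->resets++; /* same serial on consecutive ticks: assume stuck */
			return true;
		}
		return false; /* first failure at this serial: wait for the next tick */
	}

	/* Lock acquired: a real heartbeat would build and submit its request here. */
	timeline_unlock(e);
	return false;
}
```

In this model a single blocked tick never requests a reset, because the exchange returns the previously recorded serial, which differs from the current one. Only when the next tick fails to take the lock and still observes the same serial does the reset path run, which mirrors the "consecutive heartbeats" wording in the commit message.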
