Commit 408c46b

Tomer Tayar authored and ogabbay committed
habanalabs: print context refcount value if hard reset fails
Failing to kill a user process during a hard reset can be due to a reference to the user context which isn't released. To make it easier to understand if this is the reason for the failure and not something else, add a print of the context refcount value.

Signed-off-by: Tomer Tayar <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>
1 parent 0abcae8 commit 408c46b

1 file changed, +15 -3 lines changed

drivers/misc/habanalabs/common/device.c

Lines changed: 15 additions & 3 deletions
@@ -696,10 +696,22 @@ static void device_hard_reset_pending(struct work_struct *work)
 	flags = device_reset_work->flags | HL_DRV_RESET_FROM_RESET_THR;
 
 	rc = hl_device_reset(hdev, flags);
+
 	if ((rc == -EBUSY) && !hdev->device_fini_pending) {
-		dev_info(hdev->dev,
-			"Could not reset device. will try again in %u seconds",
-			HL_PENDING_RESET_PER_SEC);
+		struct hl_ctx *ctx = hl_get_compute_ctx(hdev);
+
+		if (ctx) {
+			/* The read refcount value should subtracted by one, because the read is
+			 * protected with hl_get_compute_ctx().
+			 */
+			dev_info(hdev->dev,
+				"Could not reset device (compute_ctx refcount %u). will try again in %u seconds",
+				kref_read(&ctx->refcount) - 1, HL_PENDING_RESET_PER_SEC);
+			hl_ctx_put(ctx);
+		} else {
+			dev_info(hdev->dev, "Could not reset device. will try again in %u seconds",
+				HL_PENDING_RESET_PER_SEC);
+		}
 
 		queue_delayed_work(hdev->reset_wq, &device_reset_work->reset_work,
 				msecs_to_jiffies(HL_PENDING_RESET_PER_SEC * 1000));
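
The "minus one" adjustment in the hunk above can be illustrated outside the kernel. The following is a minimal userspace sketch, not driver code: struct demo_ctx and the demo_ctx_* helpers are hypothetical stand-ins for struct hl_ctx, hl_get_compute_ctx(), and hl_ctx_put(), and a C11 atomic counter stands in for the kref. It assumes, as the code comment suggests, that the lookup takes a temporary reference which must be excluded from the reported value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's context object; not the real layout. */
struct demo_ctx {
	atomic_uint refcount;
};

/* Take a temporary reference (assumed to mirror what the lookup helper does). */
static struct demo_ctx *demo_ctx_get(struct demo_ctx *ctx)
{
	if (ctx == NULL)
		return NULL;
	atomic_fetch_add(&ctx->refcount, 1);
	return ctx;
}

/* Drop a reference taken with demo_ctx_get(). */
static void demo_ctx_put(struct demo_ctx *ctx)
{
	unsigned int old = atomic_fetch_sub(&ctx->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	(void)old;
}

/*
 * Report how many references other holders have on the context.
 * The counter is read while our own temporary reference is held,
 * so that reference is subtracted before returning the value.
 */
static unsigned int demo_ctx_external_refs(struct demo_ctx *ctx)
{
	struct demo_ctx *held = demo_ctx_get(ctx);
	unsigned int refs;

	if (held == NULL)
		return 0;

	refs = atomic_load(&held->refcount) - 1;
	demo_ctx_put(held);
	return refs;
}
```

With a context whose counter was initialised to 2, demo_ctx_external_refs() reports 2 and leaves the counter unchanged, which is the value a reader of the log message would want to see.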
