Skip to content

Commit a9f4fa3

Browse files
matt-auldrodrigovivi
authored andcommitted
drm/xe/userptr: fix EFAULT handling
Currently we treat EFAULT from hmm_range_fault() as a non-fatal error when called from xe_vm_userptr_pin() with the idea that we want to avoid killing the entire vm and chucking an error, under the assumption that the user just did an unmap or something, and has no intention of actually touching that memory from the GPU. At this point we have already zapped the PTEs so any access should generate a page fault, and if the pin fails there also it will then become fatal. However it looks like it's possible for the userptr vma to still be on the rebind list in preempt_rebind_work_func(), if we had to retry the pin again due to something happening in the caller before we did the rebind step, but in the meantime needing to re-validate the userptr and this time hitting the EFAULT. This explains an internal user report of hitting: [ 191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe] [ 191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738690] Call Trace: [ 191.738692] <TASK> [ 191.738694] ? show_regs+0x69/0x80 [ 191.738698] ? __warn+0x93/0x1a0 [ 191.738703] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738759] ? report_bug+0x18f/0x1a0 [ 191.738764] ? handle_bug+0x63/0xa0 [ 191.738767] ? exc_invalid_op+0x19/0x70 [ 191.738770] ? asm_exc_invalid_op+0x1b/0x20 [ 191.738777] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738834] ? ret_from_fork_asm+0x1a/0x30 [ 191.738849] bind_op_prepare+0x105/0x7b0 [xe] [ 191.738906] ? dma_resv_reserve_fences+0x301/0x380 [ 191.738912] xe_pt_update_ops_prepare+0x28c/0x4b0 [xe] [ 191.738966] ? kmemleak_alloc+0x4b/0x80 [ 191.738973] ops_execute+0x188/0x9d0 [xe] [ 191.739036] xe_vm_rebind+0x4ce/0x5a0 [xe] [ 191.739098] ? trace_hardirqs_on+0x4d/0x60 [ 191.739112] preempt_rebind_work_func+0x76f/0xd00 [xe] Followed by NPD, when running some workload, since the sg was never actually populated but the vma is still marked for rebind when it should be skipped for this special EFAULT case. This is confirmed to fix the user report. v2 (MattB): - Move earlier. v3 (MattB): - Update the commit message to make it clear that this indeed fixes the issue. Fixes: 521db22 ("drm/xe: Invalidate userptr VMA on page pin fault") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Thomas Hellström <[email protected]> Cc: <[email protected]> # v6.10+ Reviewed-by: Matthew Brost <[email protected]> Reviewed-by: Thomas Hellström <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 6b93cb9) Signed-off-by: Rodrigo Vivi <[email protected]>
1 parent e043dc1 commit a9f4fa3

File tree

1 file changed

+12
-0
lines changed

1 file changed

+12
-0
lines changed

drivers/gpu/drm/xe/xe_vm.c

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -681,6 +681,18 @@ int xe_vm_userptr_pin(struct xe_vm *vm)
681681
err = xe_vma_userptr_pin_pages(uvma);
682682
if (err == -EFAULT) {
683683
list_del_init(&uvma->userptr.repin_link);
684+
/*
685+
* We might have already done the pin once already, but
686+
* then had to retry before the re-bind happened, due
687+
* some other condition in the caller, but in the
688+
* meantime the userptr got dinged by the notifier such
689+
* that we need to revalidate here, but this time we hit
690+
* the EFAULT. In such a case make sure we remove
691+
* ourselves from the rebind list to avoid going down in
692+
* flames.
693+
*/
694+
if (!list_empty(&uvma->vma.combined_links.rebind))
695+
list_del_init(&uvma->vma.combined_links.rebind);
684696

685697
/* Wait for pending binds */
686698
xe_vm_lock(vm, false);

0 commit comments

Comments
 (0)