Commit f844732

EmilyDeng666 authored and alexdeucher committed
drm/amdgpu: Fix the race condition for draining retry fault

Issue:
In the scenario where svm_range_restore_pages is called, but
svms->checkpoint_ts has not been set and the retry fault has not been
drained, svm_range_unmap_from_cpu is triggered and calls svm_range_free.
Meanwhile, svm_range_restore_pages continues execution and reaches
svm_range_from_addr. This results in a "failed to find prange..." error,
causing the page recovery to fail.

How to fix:
Move the timestamp check code under the protection of svms->lock.

v2:
Make sure all the right locks are released before going out.

v3:
Directly goto out_unlock_svms, and return -EAGAIN.

v4:
Refine code.

Signed-off-by: Emily Deng <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

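
The ordering described above can be illustrated with a small userspace model. This is only a sketch, not the driver code: every name in it (model_svms, model_unmap, model_restore, fault_is_stale) is hypothetical, the timestamp comparison ignores wrap-around, and it assumes for illustration that the unmap side publishes the checkpoint and removes the range while holding the same lock. What it shows is the shape of the fix: the stale-fault check runs inside the critical section, a stale fault returns -EAGAIN through the unlock path, and the fault counter is skipped for that case.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_svms {
	pthread_mutex_t lock;    /* stands in for svms->lock */
	uint64_t checkpoint_ts;  /* 0 means no drain is pending */
	bool range_present;      /* stands in for the prange lookup result */
	atomic_uint fault_count; /* stands in for svm_range_count_fault() */
};

static void model_init(struct model_svms *s)
{
	assert(s != NULL);
	pthread_mutex_init(&s->lock, NULL);
	s->checkpoint_ts = 0;
	s->range_present = true;
	atomic_init(&s->fault_count, 0);
}

/* Simplified ordering test (assumption: no timestamp wrap-around). */
static bool fault_is_stale(uint64_t ts, uint64_t checkpoint_ts)
{
	return ts <= checkpoint_ts;
}

/* Unmap side (assumed behaviour): set the checkpoint and drop the range
 * in one critical section. */
static void model_unmap(struct model_svms *s, uint64_t now_ts)
{
	assert(s != NULL);
	pthread_mutex_lock(&s->lock);
	s->checkpoint_ts = now_ts;
	s->range_present = false;
	pthread_mutex_unlock(&s->lock);
}

/* Restore side: the checkpoint test happens only after the lock is taken,
 * so it cannot observe a checkpoint that is about to change under it. */
static int model_restore(struct model_svms *s, uint64_t ts)
{
	int r = 0;

	assert(s != NULL);
	pthread_mutex_lock(&s->lock);

	if (s->checkpoint_ts != 0) {
		if (fault_is_stale(ts, s->checkpoint_ts)) {
			r = -EAGAIN;	/* drop the fault, leave through the unlock path */
			goto out_unlock;
		}
		s->checkpoint_ts = 0;	/* fault is newer than the checkpoint */
	}

	if (!s->range_present) {
		r = -EFAULT;		/* models "failed to find prange" */
		goto out_unlock;
	}
	/* the actual page recovery would happen here */

out_unlock:
	pthread_mutex_unlock(&s->lock);
	if (r != -EAGAIN)
		atomic_fetch_add(&s->fault_count, 1);
	return r;
}
```

In this model a fault stamped before the checkpoint is dropped without touching the range lookup, which is the failure the commit message describes avoiding.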
1 parent b9e75bc commit f844732

1 file changed, 17 additions and 14 deletions
drivers/gpu/drm/amd/amdkfd/kfd_svm.c

@@ -3009,19 +3009,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 		goto out;
 	}
 
-	/* check if this page fault time stamp is before svms->checkpoint_ts */
-	if (svms->checkpoint_ts[gpuidx] != 0) {
-		if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
-			pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
-			r = 0;
-			goto out;
-		} else
-			/* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
-			 * to zero to avoid following ts wrap around give wrong comparing
-			 */
-			svms->checkpoint_ts[gpuidx] = 0;
-	}
-
 	if (!p->xnack_enabled) {
 		pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
 		r = -EFAULT;
@@ -3041,6 +3028,21 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	mmap_read_lock(mm);
 retry_write_locked:
 	mutex_lock(&svms->lock);
+
+	/* check if this page fault time stamp is before svms->checkpoint_ts */
+	if (svms->checkpoint_ts[gpuidx] != 0) {
+		if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
+			pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
+			r = -EAGAIN;
+			goto out_unlock_svms;
+		} else {
+			/* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
+			 * to zero to avoid following ts wrap around give wrong comparing
+			 */
+			svms->checkpoint_ts[gpuidx] = 0;
+		}
+	}
+
 	prange = svm_range_from_addr(svms, addr, NULL);
 	if (!prange) {
 		pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
@@ -3166,7 +3168,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	mutex_unlock(&svms->lock);
 	mmap_read_unlock(mm);
 
-	svm_range_count_fault(node, p, gpuidx);
+	if (r != -EAGAIN)
+		svm_range_count_fault(node, p, gpuidx);
 
 	mmput(mm);
 out:
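
The comment in the moved block mentions timestamp wrap-around, which is why the checkpoint is reset to zero once a newer fault is seen. A minimal sketch of a wrap-safe "at or after" comparison follows. It assumes a 48-bit wrapping counter and assumes the argument order means "the second argument is at or after the first", which is how the diff's comment reads; the names ts_after_or_equal, should_drop_fault and TS_BITS are hypothetical and this is not presented as the kernel's amdgpu_ih_ts_after_or_equal implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TS_BITS 48	/* assumed width of the wrapping timestamp counter */

/* True if t2 is at or after t1, treating both as TS_BITS-wide counters
 * that wrap around. */
static bool ts_after_or_equal(uint64_t t1, uint64_t t2)
{
	/* Unsigned subtraction wraps safely; shifting moves the top bit of
	 * the TS_BITS-wide difference into the sign bit of a 64-bit value. */
	uint64_t diff = (t2 - t1) << (64 - TS_BITS);

	return (int64_t)diff >= 0;
}

/* Mirrors the shape of the check in the diff: a zero checkpoint means no
 * drain is pending, otherwise drop faults stamped at or before it. */
static bool should_drop_fault(uint64_t fault_ts, uint64_t checkpoint_ts)
{
	return checkpoint_ts != 0 &&
	       ts_after_or_equal(fault_ts, checkpoint_ts);
}

/* Small usage example covering the plain and the wrapped case. */
static void ts_self_check(void)
{
	const uint64_t max48 = (UINT64_C(1) << TS_BITS) - 1;

	assert(ts_after_or_equal(5, 10));	/* 10 is after 5 */
	assert(!ts_after_or_equal(10, 5));	/* 5 is before 10 */
	assert(ts_after_or_equal(max48, 1));	/* 1 is after max48 once wrapped */
	assert(!should_drop_fault(5, 0));	/* no checkpoint set */
}
```

A comparison of this kind is only meaningful while the two timestamps are less than half the counter range apart, so a checkpoint left in place indefinitely would eventually compare incorrectly against much later faults.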
