Commit 0b9ebd7

John Clements authored and alexdeucher committed
drm/amdgpu: resolve mGPU RAS query instability
Upon receiving an uncorrectable error, query every GPU node for RAS errors.

Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
1 parent dec7880 commit 0b9ebd7

1 file changed: 15 additions, 5 deletions

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c

Lines changed: 15 additions & 5 deletions
@@ -1424,12 +1424,22 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 {
 	struct amdgpu_ras *ras =
 		container_of(work, struct amdgpu_ras, recovery_work);
+	struct amdgpu_device *remote_adev = NULL;
+	struct amdgpu_device *adev = ras->adev;
+	struct list_head device_list, *device_list_handle = NULL;
+	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false);
+
+	/* Build list of devices to query RAS related errors */
+	if (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
+		device_list_handle = &hive->device_list;
+	} else {
+		list_add_tail(&adev->gmc.xgmi.head, &device_list);
+		device_list_handle = &device_list;
+	}
 
-	/*
-	 * Query and print non zero error counter per IP block for
-	 * awareness before recovering GPU.
-	 */
-	amdgpu_ras_log_on_err_counter(ras->adev);
+	list_for_each_entry(remote_adev, device_list_handle, gmc.xgmi.head) {
+		amdgpu_ras_log_on_err_counter(remote_adev);
+	}
 
 	if (amdgpu_device_should_recover_gpu(ras->adev))
 		amdgpu_device_gpu_recover(ras->adev, 0);
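
The pattern in this hunk (walk the hive's device list when the GPU belongs to a multi-node XGMI hive, otherwise walk a one-element local list) can be tried outside the kernel with a minimal userspace sketch. Everything below is hypothetical and for illustration only: `list_node`, `fake_dev`, `fake_hive`, `query_errors` and `query_all_nodes` are stand-ins, not amdgpu or `<linux/list.h>` symbols. The sketch initializes its local list head before adding to it. The hunk above does not show an equivalent initialization of `device_list`, and in the kernel list API a head is normally set up with `INIT_LIST_HEAD` before `list_add_tail` is called on it.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's intrusive list. */
struct list_node {
	struct list_node *prev, *next;
};

/* Make an empty circular list: the head points at itself. */
static void list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

/* Link node in just before head, i.e. at the tail of the list. */
static void list_push_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical device: only the fields the pattern needs. */
struct fake_dev {
	int id;
	int num_physical_nodes;	/* more than 1 means part of a multi-node hive */
	int queried;		/* how many times errors were queried */
	struct list_node link;	/* membership in a device list */
};

/* Hypothetical hive: owns the list of all member devices. */
struct fake_hive {
	struct list_node device_list;
};

/* Recover the containing device from its embedded list node. */
#define node_to_dev(ptr) \
	((struct fake_dev *)((char *)(ptr) - offsetof(struct fake_dev, link)))

/* Stand-in for the per-device error counter query. */
static void query_errors(struct fake_dev *dev)
{
	dev->queried++;
}

/* Query every device reachable from dev; returns how many were queried. */
int query_all_nodes(struct fake_dev *dev, struct fake_hive *hive)
{
	struct list_node local, *handle, *pos;
	int count = 0;

	if (hive && dev->num_physical_nodes > 1) {
		/* Multi-node case: reuse the hive's existing list. */
		handle = &hive->device_list;
	} else {
		/* Single-node case: build a one-element list on the stack. */
		list_init(&local);
		list_push_tail(&dev->link, &local);
		handle = &local;
	}

	for (pos = handle->next; pos != handle; pos = pos->next) {
		query_errors(node_to_dev(pos));
		count++;
	}

	/* Sketch-only tidy-up: do not leave dev linked to a dead stack head. */
	if (handle == &local)
		list_init(&dev->link);

	return count;
}
```

With two devices linked into a `fake_hive`, calling `query_all_nodes` on either one queries both; with a standalone device and a `NULL` hive, only that device is queried. Using a single list handle for both cases keeps the query loop identical, which is the same design choice the hunk makes with `device_list_handle`.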
