Commit e23300d

YiPeng Chai authored and alexdeucher committed
drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed

The problem case is as follows:
1. GPU A triggers a gpu ras reset, and GPU A drives
   GPU B to also perform a gpu ras reset.
2. After gpu B ras reset started, gpu B queried a DE
   data. Since the DE data was queried in the ras reset
   thread instead of the page retirement thread, bad
   page retirement work would not be triggered. Then
   even if all gpu resets are completed, the bad pages
   will be cached in RAM until GPU B's bad page retirement
   work is triggered again and then saved to eeprom.

This patch can save the bad pages to eeprom in time
after gpu ras reset is completed.

v2:
  1. Add the above description to code comments.
  2. Reuse existing function.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

1 parent c047069 commit e23300d

2 files changed, 23 insertions(+), 1 deletion(-)

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c

Lines changed: 5 additions & 1 deletion
@@ -2934,8 +2934,12 @@ static void amdgpu_ras_do_page_retirement(struct work_struct *work)
 	struct ras_err_data err_data;
 	unsigned long err_cnt;

-	if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
+	/* If gpu reset is ongoing, delay retiring the bad pages */
+	if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
+		amdgpu_ras_schedule_retirement_dwork(con,
+				AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3);
 		return;
+	}

 	amdgpu_ras_error_data_init(&err_data);
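
The hunk above changes the retirement handler from "bail out during reset" to "bail out and re-arm". A minimal user-space sketch of that pattern follows. It is only a model: `struct sim_ras`, `sim_do_page_retirement()` and `sim_run()` are hypothetical names invented for illustration, the loop counter stands in for the delayed-work timer, and none of this is the amdgpu implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for driver state; not the amdgpu structures. */
struct sim_ras {
	bool in_reset;         /* models amdgpu_in_reset() || amdgpu_ras_in_recovery() */
	unsigned int cached;   /* bad pages still cached in RAM */
	unsigned int saved;    /* bad pages written to the simulated EEPROM */
	unsigned int requeues; /* times the handler re-armed itself */
};

/*
 * One invocation of the simulated retirement handler.
 * Returns true when the handler deferred itself and must run again later.
 */
static bool sim_do_page_retirement(struct sim_ras *ras)
{
	if (ras->in_reset) {
		/* Reset ongoing: re-arm instead of silently returning. */
		ras->requeues++;
		return true;
	}

	/* Reset finished: flush everything cached so far. */
	ras->saved += ras->cached;
	ras->cached = 0;
	return false;
}

/*
 * Drive the handler until it completes. The simulated reset is
 * considered active for the first reset_ticks invocations.
 * Returns the tick at which the pages were finally saved.
 */
static unsigned int sim_run(struct sim_ras *ras, unsigned int reset_ticks)
{
	unsigned int tick = 0;

	for (;;) {
		ras->in_reset = tick < reset_ticks;
		if (!sim_do_page_retirement(ras))
			break;
		tick++;
	}
	return tick;
}
```

Calling `sim_run()` on a state with a few cached pages shows the handler deferring once per tick while the reset is active, then flushing the cache on the first invocation after it ends. Before the patch, the equivalent handler would have returned without re-arming, leaving the pages cached until some later event queued the work again.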

drivers/gpu/drm/amd/amdgpu/umc_v12_0.c

Lines changed: 18 additions & 0 deletions
@@ -29,6 +29,7 @@
 #include "mp/mp_13_0_6_sh_mask.h"

 #define MAX_ECC_NUM_PER_RETIREMENT  32
+#define DELAYED_TIME_FOR_GPU_RESET  1000  //ms

 static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
 					    uint32_t node_inst,
@@ -568,6 +569,23 @@ static int umc_v12_0_update_ecc_status(struct amdgpu_device *adev,

 	con->umc_ecc_log.de_queried_count++;

+	/* The problem case is as follows:
+	 * 1. GPU A triggers a gpu ras reset, and GPU A drives
+	 *    GPU B to also perform a gpu ras reset.
+	 * 2. After gpu B ras reset started, gpu B queried a DE
+	 *    data. Since the DE data was queried in the ras reset
+	 *    thread instead of the page retirement thread, bad
+	 *    page retirement work would not be triggered. Then
+	 *    even if all gpu resets are completed, the bad pages
+	 *    will be cached in RAM until GPU B's bad page retirement
+	 *    work is triggered again and then saved to eeprom.
+	 * Trigger delayed work to save the bad pages to eeprom in time
+	 * after gpu ras reset is completed.
+	 */
+	if (amdgpu_ras_in_recovery(adev))
+		schedule_delayed_work(&con->page_retirement_dwork,
+			msecs_to_jiffies(DELAYED_TIME_FOR_GPU_RESET));
+
 	return 0;
 }
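
The problem case described in the comment above can be reproduced in a small user-space model, shown here as a sketch under stated assumptions. `struct sim_gpu`, `sim_log_de()` and `sim_finish_reset()` are hypothetical names, a boolean flag stands in for the pending delayed work, and the 1000 ms delay is not modelled; the point is only to contrast the behaviour with and without the new trigger.

```c
#include <stdbool.h>

/* Hypothetical model of "GPU B"; not the amdgpu data structures. */
struct sim_gpu {
	bool in_recovery;    /* models amdgpu_ras_in_recovery() */
	bool work_pending;   /* models a queued page_retirement_dwork */
	unsigned int cached; /* bad pages cached in RAM */
	unsigned int saved;  /* bad pages saved to the simulated EEPROM */
};

/*
 * Models a DE record being logged from the reset thread.
 * With apply_fix set, delayed work is queued while in recovery,
 * mirroring the condition added by the patch.
 */
static void sim_log_de(struct sim_gpu *gpu, bool apply_fix)
{
	gpu->cached++;
	if (apply_fix && gpu->in_recovery)
		gpu->work_pending = true;
}

/*
 * Models the reset completing, after which any pending delayed
 * work runs and flushes the cached pages.
 */
static void sim_finish_reset(struct sim_gpu *gpu)
{
	gpu->in_recovery = false;
	if (gpu->work_pending) {
		gpu->saved += gpu->cached;
		gpu->cached = 0;
		gpu->work_pending = false;
	}
}
```

With `apply_fix` false, a page logged during recovery stays in `cached` after `sim_finish_reset()`, which matches the stale-cache behaviour the commit message describes. With `apply_fix` true, the same sequence ends with the page counted in `saved`.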
