
Commit d25c694

aegl authored and suryasaimadhu committed
RAS/CEC: Reduce offline page threshold for Intel systems
A large scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy of using corrected errors as a predictor of future
uncorrected errors.

Set the threshold to "2" on Intel systems. AMD guidance is that this is
not necessary for their systems.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/YulOZ/[email protected]
1 parent 1c23f9e commit d25c694

1 file changed, +8 -0 lines changed

drivers/ras/cec.c

Lines changed: 8 additions & 0 deletions
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorrectable errors
+	 * if pages with corrected errors are aggressively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
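
The effect of the lower threshold can be illustrated with a small userspace sketch. This is not the kernel code: `enum cpu_vendor`, `pick_threshold()`, `should_offline()` and `ASSUMED_DEFAULT_THRESHOLD` are hypothetical names introduced here, the default of 1023 is an assumed placeholder for whatever value `action_threshold` holds when the vendor check does not fire, and the `>=` comparison is an assumption about how the collector applies the threshold. Only the Intel value of 2 comes from the commit itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for boot_cpu_data.x86_vendor; illustration only. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/* Assumed placeholder for the pre-existing default; not taken from the diff. */
#define ASSUMED_DEFAULT_THRESHOLD 1023u

/* The value this commit assigns to action_threshold on Intel systems. */
#define INTEL_THRESHOLD 2u

/* Mirrors the vendor check added to cec_init(): only Intel gets the low value. */
static unsigned int pick_threshold(enum cpu_vendor vendor)
{
	if (vendor == VENDOR_INTEL)
		return INTEL_THRESHOLD;
	return ASSUMED_DEFAULT_THRESHOLD;
}

/*
 * Assumed policy: a page is taken offline once its corrected-error
 * count reaches the threshold.
 */
static bool should_offline(unsigned int ce_count, unsigned int threshold)
{
	return ce_count >= threshold;
}
```

Under these assumptions, a page on an Intel system would be offlined on its second corrected error, whereas a page on any other vendor would stay online until it reached the much higher default count.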
