Commit f5b6c76

Merge branch 'pci/aer'
- Initialize struct aer_err_info before using it to avoid depending on stack garbage (Bjorn Helgaas)

- Log the DPC Error Source ID only when it's actually valid (when ERR_FATAL or ERR_NONFATAL was received from a downstream device) and decode into bus/device/function (Bjorn Helgaas)

- Consolidate AER Error Source ID in one place for message consistency (Bjorn Helgaas)

- Update statistics and emit trace events early in AER logging paths, before any potential ratelimiting (Bjorn Helgaas)

- Determine AER log level once and save it so all related messages use the same level (Karolina Stolarek)

- Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable Errors

- Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs controls on interval and burst count, to avoid flooding logs and RCU stall warnings (Jon Pan-Doh)

* pci/aer:
  PCI/ERR: Remove misleading TODO regarding kernel panic
  PCI/AER: Add sysfs attributes for log ratelimits
  PCI/AER: Add ratelimits to PCI AER Documentation
  PCI/AER: Ratelimit correctable and non-fatal error logging
  PCI/AER: Simplify add_error_device()
  PCI/AER: Convert aer_get_device_error_info(), aer_print_error() to index
  PCI/AER: Rename struct aer_stats to aer_info
  PCI/AER: Reduce pci_print_aer() correctable error level to KERN_WARNING
  PCI/ERR: Add printk level to pcie_print_tlp_log()
  PCI/AER: Check log level once and remember it
  PCI/AER: Trace error event before ratelimiting
  PCI/AER: Update statistics before ratelimiting
  PCI/AER: Simplify pci_print_aer()
  PCI/AER: Initialize aer_err_info before using it
  PCI/AER: Move aer_print_source() earlier in file
  PCI/AER: Rename aer_print_port_info() to aer_print_source()
  PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc
  PCI/AER: Consolidate Error Source ID logging in aer_isr_one_error_type()
  PCI/AER: Factor COR/UNCOR error handling out from aer_isr_one_error()
  PCI/DPC: Log Error Source ID only when valid
  PCI/DPC: Initialize aer_err_info before using it
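The merge message mentions decoding the Error Source ID into bus/device/function with PCI_BUS_NUM() and related macros. As a rough userspace illustration of that decoding (not the kernel code itself), the sketch below mirrors the bit layout those macros use: bus in bits 15:8, device in bits 7:3, function in bits 2:0. The helper names src_bus(), src_dev(), src_fn() and format_source_id() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace equivalents of the kernel's PCI_BUS_NUM(), PCI_SLOT() and PCI_FUNC(). */
static inline unsigned int src_bus(uint16_t id) { return (id >> 8) & 0xff; }
static inline unsigned int src_dev(uint16_t id) { return (id >> 3) & 0x1f; }
static inline unsigned int src_fn(uint16_t id)  { return id & 0x07; }

/*
 * Format a 16-bit Source ID as "bb:dd.f" (hypothetical helper).
 * Returns the number of characters written, as reported by snprintf().
 */
static inline int format_source_id(uint16_t id, char *buf, size_t len)
{
	assert(buf != NULL && len >= 8);
	return snprintf(buf, len, "%02x:%02x.%x",
			src_bus(id), src_dev(id), src_fn(id));
}
```

For instance, a Source ID of 0x01a3 decodes to bus 0x01, device 0x14, function 3, so format_source_id() produces "01:14.3".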
2 parents 0af2f6b + b06d125 commit f5b6c76

9 files changed: +430 -171 lines changed

Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats renamed to Documentation/ABI/testing/sysfs-bus-pci-devices-aer

Lines changed: 44 additions & 0 deletions
@@ -117,3 +117,47 @@ Date: July 2018
 KernelVersion: 4.19.0
 
 Description: Total number of ERR_NONFATAL messages reported to rootport.
+
+PCIe AER ratelimits
+-------------------
+
+These attributes show up under all the devices that are AER capable.
+They represent configurable ratelimits of logs per error type.
+
+See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
+
+What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_interval_ms
+Date: May 2025
+KernelVersion: 6.16.0
+
+Description: Writing 0 disables AER correctable error log ratelimiting.
+Writing a positive value sets the ratelimit interval in ms.
+Default is DEFAULT_RATELIMIT_INTERVAL (5000 ms).
+
+What: /sys/bus/pci/devices/<dev>/aer/correctable_ratelimit_burst
+Date: May 2025
+KernelVersion: 6.16.0
+
+Description: Ratelimit burst for correctable error logs. Writing a value
+changes the number of errors (burst) allowed per interval
+before ratelimiting. Reading gets the current ratelimit
+burst. Default is DEFAULT_RATELIMIT_BURST (10).
+
+What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_interval_ms
+Date: May 2025
+KernelVersion: 6.16.0
+
+Description: Writing 0 disables AER non-fatal uncorrectable error log
+ratelimiting. Writing a positive value sets the ratelimit
+interval in ms. Default is DEFAULT_RATELIMIT_INTERVAL
+(5000 ms).
+
+What: /sys/bus/pci/devices/<dev>/aer/nonfatal_ratelimit_burst
+Date: May 2025
+KernelVersion: 6.16.0
+
+Description: Ratelimit burst for non-fatal uncorrectable error logs.
+Writing a value changes the number of errors (burst)
+allowed per interval before ratelimiting. Reading gets the
+current ratelimit burst. Default is DEFAULT_RATELIMIT_BURST
+(10).
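The attributes documented in this hunk can be read and written like any other sysfs file. The following is a minimal userspace sketch, assuming the attributes read back as plain decimal integers and that writes require root. The helper names (aer_attr_path(), read_uint_attr(), write_uint_attr()) are hypothetical, and any device address passed as `bdf` (such as "0000:00:1c.0") is only a placeholder for a real AER-capable device.

```c
#include <assert.h>
#include <stdio.h>

/* Build "/sys/bus/pci/devices/<bdf>/aer/<attr>"; returns snprintf()'s result. */
static inline int aer_attr_path(char *buf, size_t len,
				const char *bdf, const char *attr)
{
	assert(buf && bdf && attr);
	return snprintf(buf, len, "/sys/bus/pci/devices/%s/aer/%s", bdf, attr);
}

/* Read an unsigned decimal value from a file; 0 on success, -1 on failure. */
static inline int read_uint_attr(const char *path, unsigned int *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%u", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write an unsigned decimal value; 0 on success, -1 on failure. */
static inline int write_uint_attr(const char *path, unsigned int val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%u\n", val) > 0);
	/* Buffered writes reach the file at close time, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

With these helpers, disabling correctable-error log ratelimiting for one device would amount to building the path for correctable_ratelimit_interval_ms and writing 0 to it, per the Description text above.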

Documentation/PCI/pcieaer-howto.rst

Lines changed: 16 additions & 1 deletion
@@ -85,12 +85,27 @@ In the example, 'Requester ID' means the ID of the device that sent
 the error message to the Root Port. Please refer to PCIe specs for other
 fields.
 
+AER Ratelimits
+--------------
+
+Since error messages can be generated for each transaction, we may see
+large volumes of errors reported. To prevent spammy devices from flooding
+the console/stalling execution, messages are throttled by device and error
+type (correctable vs. non-fatal uncorrectable). Fatal errors, including
+DPC errors, are not ratelimited.
+
+AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
+DEFAULT_RATELIMIT_INTERVAL (5 seconds).
+
+Ratelimits are exposed in the form of sysfs attributes and configurable.
+See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
+
 AER Statistics / Counters
 -------------------------
 
 When PCIe AER errors are captured, the counters / statistics are also exposed
 in the form of sysfs attributes which are documented at
-Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
 
 Developer Guide
 ===============
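To make the burst/interval semantics from the new "AER Ratelimits" section concrete, here is a simplified userspace model of a fixed-window ratelimiter. It is a sketch under stated assumptions, not the kernel's ratelimit implementation: struct rl_state, rl_allow() and rl_count_allowed() are hypothetical names, time is passed in explicitly in milliseconds, and an interval of 0 is treated as "ratelimiting disabled", matching the sysfs description.

```c
#include <assert.h>
#include <stddef.h>

struct rl_state {
	unsigned int interval_ms;		/* 0 disables ratelimiting */
	unsigned int burst;			/* messages allowed per interval */
	unsigned long long window_start_ms;	/* start of the current window */
	unsigned int printed;			/* messages allowed in this window */
	unsigned int missed;			/* messages suppressed so far */
	int started;				/* has a window been opened yet? */
};

/* Return 1 if a message at time now_ms may be logged, 0 if it is suppressed. */
static inline int rl_allow(struct rl_state *rl, unsigned long long now_ms)
{
	assert(rl != NULL);
	if (rl->interval_ms == 0)
		return 1;

	/* Open a new window on first use or once the interval has elapsed. */
	if (!rl->started || now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
		rl->started = 1;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}
	rl->missed++;
	return 0;
}

/* Feed n events spaced step_ms apart and count how many are allowed. */
static inline unsigned int rl_count_allowed(struct rl_state *rl, unsigned int n,
					    unsigned long long start_ms,
					    unsigned long long step_ms)
{
	unsigned int i, allowed = 0;

	for (i = 0; i < n; i++)
		if (rl_allow(rl, start_ms + i * step_ms))
			allowed++;
	return allowed;
}
```

With the documented defaults (burst 10, interval 5000 ms), feeding this model 25 events spaced 100 ms apart lets the first 10 through and suppresses the remaining 15; an event arriving 5000 ms after the first opens a new window and is logged again.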

drivers/pci/pci-sysfs.c

Lines changed: 1 addition & 0 deletions
@@ -1805,6 +1805,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
 	&pcie_dev_attr_group,
 #ifdef CONFIG_PCIEAER
 	&aer_stats_attr_group,
+	&aer_attr_group,
 #endif
 #ifdef CONFIG_PCIEASPM
 	&aspm_ctrl_attr_group,

drivers/pci/pci.h

Lines changed: 9 additions & 4 deletions
@@ -587,12 +587,15 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
 
 struct aer_err_info {
 	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
+	const char *level;		/* printk level */
 
 	unsigned int id:16;
 
 	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
-	unsigned int __pad1:5;
+	unsigned int root_ratelimit_print:1;	/* 0=skip, 1=print */
+	unsigned int __pad1:4;
 	unsigned int multi_error_valid:1;
 
 	unsigned int first_error:5;
@@ -604,15 +607,16 @@ struct aer_err_info {
 	struct pcie_tlp_log tlp;	/* TLP Header */
 };
 
-int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
-void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
+int aer_get_device_error_info(struct aer_err_info *info, int i);
+void aer_print_error(struct aer_err_info *info, int i);
 
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
 		      struct pcie_tlp_log *log);
 unsigned int aer_tlp_log_len(struct pci_dev *dev, u32 aercc);
 void pcie_print_tlp_log(const struct pci_dev *dev,
-			const struct pcie_tlp_log *log, const char *pfx);
+			const struct pcie_tlp_log *log, const char *level,
+			const char *pfx);
 #endif /* CONFIG_PCIEAER */
 
 #ifdef CONFIG_PCIEPORTBUS
@@ -961,6 +965,7 @@ void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
+extern const struct attribute_group aer_attr_group;
 void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pci_aer_clear_status(struct pci_dev *dev);
 int pci_aer_raw_clear_status(struct pci_dev *dev);
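The pci.h hunks add a ratelimit_print[] array parallel to dev[] and switch aer_get_device_error_info() and aer_print_error() from taking a device pointer to taking an index. One plausible reading is that an index lets a callee reach both the device and its per-device print decision from the same info structure. The mock below illustrates that pattern only; it is not the kernel implementation. All mock_* names are hypothetical, and MAX_DEVS is an arbitrary stand-in for AER_MAX_MULTI_ERR_DEVICES, whose value does not appear in this diff.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 5	/* arbitrary stand-in for AER_MAX_MULTI_ERR_DEVICES */

struct mock_dev {
	const char *name;
};

struct mock_err_info {
	struct mock_dev *dev[MAX_DEVS];
	int ratelimit_print[MAX_DEVS];	/* parallel to dev[]: 0=skip, 1=print */
	int error_dev_num;
};

/* Record a device and its print decision in the same slot; -1 if full. */
static inline int mock_add_device(struct mock_err_info *info,
				  struct mock_dev *dev, int print)
{
	int i;

	assert(info != NULL && dev != NULL);
	i = info->error_dev_num;
	if (i >= MAX_DEVS)
		return -1;
	info->dev[i] = dev;
	info->ratelimit_print[i] = print;
	info->error_dev_num++;
	return 0;
}

/*
 * Index-based lookup: return the name of the device whose error should be
 * printed, or NULL if the index is out of range or printing is suppressed.
 */
static inline const char *mock_print_target(const struct mock_err_info *info,
					    int i)
{
	assert(info != NULL);
	if (i < 0 || i >= info->error_dev_num)
		return NULL;
	if (!info->ratelimit_print[i])
		return NULL;
	return info->dev[i]->name;
}
```

A caller would loop over 0..error_dev_num-1 and pass each index along, so the device and its ratelimit decision never need to be threaded through as separate arguments.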
