Commit 51aea96

perlun and samber authored
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

  When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
1 parent 1d69457 commit 51aea96

1 file changed: +3 -1 lines changed

_data/rules.yml

Lines changed: 3 additions & 1 deletion
@@ -271,8 +271,10 @@ groups:
     severity: info
   - name: Host OOM kill detected
     description: OOM kill detected
-    query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+    query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
     severity: warning
+    comments: |
+      When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15–20 minutes to recover, the alert should still trigger.
   - name: Host EDAC Correctable Errors detected
     description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'
     query: "(increase(node_edac_correctable_errors_total[1m]) > 0)"
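
To illustrate why the wider range matters, here is a rough Python sketch of the situation the commit message describes. It is not how Prometheus computes `increase()`: the helper `simple_increase` is a hypothetical simplification (last sample minus first sample inside the window, no extrapolation, no counter-reset handling), and the 15-second scrape interval and the 16-minute scrape gap are assumed values chosen for the example.

```python
def simple_increase(samples, now, window):
    """Simplified stand-in for PromQL increase(): last minus first sample
    in the half-open window (now - window, now].

    Real Prometheus also extrapolates and handles counter resets; both are
    ignored here. Returns None with fewer than two samples in the window.
    """
    in_window = [v for t, v in samples if now - window < t <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


# Assumed scrape history as (timestamp in seconds, counter value):
# scrapes every 15 s, then no scrapes between t=300 and t=1260 while the
# machine is out of memory and one OOM kill bumps the counter from 0 to 1.
samples = [(t, 0) for t in range(0, 301, 15)] + \
          [(t, 1) for t in range(1260, 1801, 15)]


def fires(window):
    """True if `increase(...) > 0` holds at any evaluation time (every 15 s)."""
    for now in range(0, 1801, 15):
        inc = simple_increase(samples, now, window)
        if inc is not None and inc > 0:
            return True
    return False


print(fires(60))    # 1m window
print(fires(1800))  # 30m window
```

Under these assumptions the 1m window never contains both the pre-gap and the post-gap value, so the counter jump is invisible to it, while the 30m window spans the gap and sees the change. A scrape gap longer than the range would still be missed, which is presumably why the range was chosen to exceed the 15-20 minutes mentioned in the commit message.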
