Commit ce9ed5b

add node failure alert
1 parent d806dee commit ce9ed5b

1 file changed: +5 -0 lines changed
environments/common/files/prometheus/rules/slurm.rules

Lines changed: 5 additions & 0 deletions
@@ -9,3 +9,8 @@ groups:
     expr: "slurm_nodes_down > 0\n"
     labels:
       severity: critical
+  - alert: SlurmNodeFail
+    annotations:
+      description: '{{ $value }} Slurm nodes are in fail status'
+      summary: 'At least one Slurm node is failed.'
+    expr: "slurm_nodes_fail > 0\n"
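
One way to check the new rule is a promtool unit test. The sketch below is not part of this commit. It assumes a hypothetical test file named slurm_test.yml placed next to slurm.rules, and it assumes the slurm_nodes_fail gauge is exposed without labels. Since the rule has no `for:` clause and sets no labels, the alert should fire on the first evaluation where the value is above zero.

```yaml
# slurm_test.yml (hypothetical file name, assumed to sit beside slurm.rules)
rule_files:
  - slurm.rules

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic samples at 0m, 1m, 2m, 3m: no failed nodes, then two failed nodes.
    # Assumption: the exporter publishes this gauge with no labels.
    input_series:
      - series: 'slurm_nodes_fail'
        values: '0 0 2 2'
    alert_rule_test:
      # While the gauge is 0, no alert is expected.
      - eval_time: 1m
        alertname: SlurmNodeFail
        exp_alerts: []
      # Once the gauge is 2, the alert should be firing with the rendered value.
      - eval_time: 2m
        alertname: SlurmNodeFail
        exp_alerts:
          - exp_annotations:
              description: '2 Slurm nodes are in fail status'
              summary: 'At least one Slurm node is failed.'
```

Running `promtool test rules slurm_test.yml` from the rules directory would then evaluate the rule against the synthetic series.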