The current configuration for the systemd units monitors the active state like this:
- name: systemd-services-monitoring
  rules:
  - alert: service-down-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
    labels:
      severity: page
    annotations:
      summary: Pacemaker service not running
  - alert: service-down-pacemaker
This leads to false positives during maintenance work or other tasks in which an admin stops the systemd units on purpose.
I suggest changing the monitoring rule from the active state to the failed state:
- name: systemd-services-monitoring
  rules:
  - alert: service-failed-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
    labels:
      severity: page
    annotations:
      summary: Pacemaker service could not start or is crashed.
  - alert: service-failed-pacemaker
This would produce fewer calls, because a systemd unit that is cleanly stopped for maintenance is reported as inactive rather than failed and would no longer trigger the alert.
If we go this way, we could also shorten the list of rules by using a single generic configuration like this:
- alert: HostSystemdServiceCrashed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 1m
  labels:
    severity: page
  annotations:
    description: |-
      systemd service crashed
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
    summary: Host systemd service crashed (instance {{ $labels.instance }})
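If paging on every failed unit on a host turns out to be too broad, the generic rule could be narrowed with a regex matcher on the `name` label. The following is only a sketch: the alert name `ClusterSystemdServiceFailed` and the choice of `corosync.service` next to `pacemaker.service` are assumptions made for illustration and are not taken from the current rule set.

```yaml
- alert: ClusterSystemdServiceFailed   # hypothetical alert name
  # Assumed unit list; extend the alternation with the units that should page.
  # The doubled backslash yields a literal dot in the PromQL regex.
  expr: node_systemd_unit_state{state="failed", name=~"(pacemaker|corosync)\\.service"} == 1
  for: 1m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.name }} failed on {{ $labels.instance }}"
```

This keeps the single-rule approach while limiting pages to the listed units. Like the rules above, it only works if node_exporter exposes `node_systemd_unit_state`, which requires its systemd collector to be enabled (`--collector.systemd`).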