Monitoring rules for systemd units (monitoring_srv/prometheus/rules.yml) #791

@pirat013

Description

The current configuration for the systemd unit files monitors the active state, like this:

```yaml
- name: systemd-services-monitoring
  rules:
    - alert: service-down-pacemaker
      expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
      labels:
        severity: page
      annotations:
        summary: Pacemaker service not running
```

This leads to false-positive alerts during maintenance work or other tasks where systemd units are stopped deliberately by an admin.
I would suggest changing the monitoring rule from active to failed:

```yaml
- name: systemd-services-monitoring
  rules:
    - alert: service-failed-pacemaker
      expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
      labels:
        severity: page
      annotations:
        summary: Pacemaker service could not start or has crashed.
```
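As a side note: both rules assume `node_systemd_unit_state` is being exported at all, which requires node_exporter's systemd collector (disabled by default). A minimal sketch of the required invocation:

```shell
# The systemd collector is disabled by default; enable it explicitly
# so that node_systemd_unit_state time series are exported.
node_exporter --collector.systemd
```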

This would generate fewer pages in the situation where a systemd unit is stopped for maintenance.
If we go this way, we could also think about shortening the list and using a single generic rule like this:

```yaml
- alert: HostSystemdServiceCrashed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 1m
  labels:
    severity: page
  annotations:
    description: |-
      systemd service crashed
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
    summary: Host systemd service crashed (instance {{ $labels.instance }})
```
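To guard against regressions, the generic rule could be exercised with promtool's rule unit-testing facility. A minimal sketch, assuming the rule above lives in `rules.yml`; the file name `tests.yml` and the `node1` instance label are illustrative:

```yaml
# tests.yml (hypothetical file name) -- run with: promtool test rules tests.yml
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulate pacemaker.service sitting in the failed state for several minutes.
    input_series:
      - series: 'node_systemd_unit_state{name="pacemaker.service", state="failed", instance="node1"}'
        values: '1 1 1 1 1'
    alert_rule_test:
      - eval_time: 3m   # past the 1m "for" clause, so the alert should be firing
        alertname: HostSystemdServiceCrashed
        exp_alerts:
          - exp_labels:
              severity: page
              name: pacemaker.service
              state: failed
              instance: node1
```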
