Monitoring rules for systemd units (monitoring_srv/prometheus/rules.yml) #791

@pirat013

Description

The current configuration for the systemd unit files monitors the active state, like this:

```yaml
- name: systemd-services-monitoring
  rules:
    - alert: service-down-pacemaker
      expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
      labels:
        severity: page
      annotations:
        summary: Pacemaker service not running
```

This leads to false-positive alerts during maintenance work or other tasks where systemd units are stopped deliberately by an admin.
I would suggest changing the monitoring rule from active to failed:

```yaml
- name: systemd-services-monitoring
  rules:
    - alert: service-failed-pacemaker
      expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
      labels:
        severity: page
      annotations:
        summary: Pacemaker service could not start or has crashed.
```
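As a side note: both rules assume `node_systemd_unit_state` is being exported at all, which requires node_exporter's systemd collector (disabled by default). A minimal sketch of the required invocation:

```shell
# The systemd collector is disabled by default; enable it explicitly
# so that node_systemd_unit_state time series are exported.
node_exporter --collector.systemd
```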

This would generate fewer pages in the situation where a systemd unit is stopped for maintenance.
If we go this way, we could also think about shortening the list and using a single generic rule like this:

```yaml
- alert: HostSystemdServiceCrashed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 1m
  labels:
    severity: page
  annotations:
    description: |-
      systemd service crashed
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
    summary: Host systemd service crashed (instance {{ $labels.instance }})
```
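To guard against regressions, the generic rule could be exercised with promtool's rule unit-testing facility. A minimal sketch, assuming the rule above lives in `rules.yml`; the file name `tests.yml` and the `node1` instance label are illustrative:

```yaml
# tests.yml (hypothetical file name) -- run with: promtool test rules tests.yml
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulate pacemaker.service sitting in the failed state for several minutes.
    input_series:
      - series: 'node_systemd_unit_state{name="pacemaker.service", state="failed", instance="node1"}'
        values: '1 1 1 1 1'
    alert_rule_test:
      - eval_time: 3m   # past the 1m "for" clause, so the alert should be firing
        alertname: HostSystemdServiceCrashed
        exp_alerts:
          - exp_labels:
              severity: page
              name: pacemaker.service
              state: failed
              instance: node1
```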
