Commit 15b77db

separated prometheus recording and alerting rules
1 parent 5864b56 commit 15b77db

3 files changed, +17 -12 lines changed

ansible/roles/kube_prometheus_stack/defaults/main/main.yml

Lines changed: 6 additions & 3 deletions
@@ -63,13 +63,16 @@ prometheus_external_labels:
 
 prometheus_scrape_configs: []
 
-prometheus_extra_rules: []
+prometheus_extra_recording_rules: []
+prometheus_extra_alerting_rules: []
 
 prometheus_rules:
   appliance-rules:
     groups:
-      - name: all
-        rules: "{{ prometheus_extra_rules }}"
+      - name: appliance-recording-rules
+        rules: "{{ prometheus_extra_recording_rules }}"
+      - name: appliance-alerting-rules
+        rules: "{{ prometheus_extra_alerting_rules }}"
 
 # ------------------------------------------------------------------------------------------
 grafana_image_tag: 11.2.2

docs/monitoring-and-logging.md

Lines changed: 1 addition & 1 deletion
@@ -236,7 +236,7 @@ The appliance previously used [cloudalchemy.prometheus](https://github.com/cloud
 
 See the upstream documentation for [alerting](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) and [recording](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) rules.
 
-In addition to the default recording and alerting rules set by kube-prometheus-stack, the appliances provides a default set of rules which can be found in the `prometheus_extra_rules` list in:
+In addition to the default recording and alerting rules set by kube-prometheus-stack, the appliance provides its own sets of default rules which can be found and modified in the `prometheus_extra_recording_rules` and `prometheus_extra_alerting_rules` lists in:
 
 > [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
 

environments/common/inventory/group_vars/all/prometheus.yml

Lines changed: 10 additions & 8 deletions
@@ -20,14 +20,7 @@ prometheus_scrape_configs_default:
     replacement: '${1}'
 
 prometheus_scrape_configs: "{{ prometheus_scrape_configs_default + (openondemand_scrape_configs if groups['openondemand'] | count > 0 else []) }}"
-prometheus_extra_rules:
-  - alert: SlurmNodeDown
-    annotations:
-      description: '{% raw %}{{ $value }} Slurm nodes are in down status.{% endraw %}'
-      summary: 'At least one Slurm node is down.'
-    expr: "slurm_nodes_down > 0\n"
-    labels:
-      severity: critical
+prometheus_extra_recording_rules:
   - record: node_cpu_system_seconds:record
     expr: (100 * sum by(instance)(increase(node_cpu_seconds_total{mode="system",job="node-exporter"}[60s]))) / (sum by(instance)(increase(node_cpu_seconds_total{job="node-exporter"}[60s])))
   - record: node_cpu_user_seconds:record
@@ -42,3 +35,12 @@ prometheus_extra_rules:
     expr: min by (instance) (node_cpu_scaling_frequency_hertz)
   - record: node_cpu_scaling_frequency_hertz_max:record
     expr: max by (instance) (node_cpu_scaling_frequency_hertz)
+
+prometheus_extra_alerting_rules:
+  - alert: SlurmNodeDown
+    annotations:
+      description: '{% raw %}{{ $value }} Slurm nodes are in down status.{% endraw %}'
+      summary: 'At least one Slurm node is down.'
+    expr: "slurm_nodes_down > 0\n"
+    labels:
+      severity: critical
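
Usage note: with the rules split into two lists, a site could adjust alerting without touching the recording rules. The fragment below is a minimal sketch of such an override, not part of this commit. It assumes the variable is redefined in a site-specific environment's group_vars (the location is hypothetical), and the `HighSystemCpu` alert, its threshold, and its `for` duration are illustrative. Because Ansible replaces list variables rather than merging them, the default `SlurmNodeDown` rule is repeated so that it is not lost.

```yaml
# Hypothetical site-level override of the alerting rules list.
# Only prometheus_extra_alerting_rules is redefined; the recording rules
# in prometheus_extra_recording_rules are left at their common defaults.
prometheus_extra_alerting_rules:
  # Copied from the common defaults, since redefining the list replaces it.
  - alert: SlurmNodeDown
    annotations:
      description: '{% raw %}{{ $value }} Slurm nodes are in down status.{% endraw %}'
      summary: 'At least one Slurm node is down.'
    expr: "slurm_nodes_down > 0\n"
    labels:
      severity: critical
  # Illustrative addition (an assumption): alert on the percentage produced by
  # the node_cpu_system_seconds:record recording rule shown in the diff.
  - alert: HighSystemCpu
    annotations:
      description: '{% raw %}System CPU on {{ $labels.instance }} is at {{ $value }}%.{% endraw %}'
      summary: 'Sustained high system CPU usage.'
    expr: 'node_cpu_system_seconds:record > 90'
    for: 10m
    labels:
      severity: warning
```

The `{% raw %}` wrappers follow the pattern already used in the defaults, so that Ansible leaves the Prometheus template expressions untouched.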
