Commit 27a563e

Merge pull request #766 from openstack-k8s-operators/OSPRH-20230/alertmanager-sample-alerts
Add Alertmanager sample alerts
2 parents 478dc6b + e8e35db commit 27a563e

File tree: 4 files changed, +420 -0 lines changed

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Alertmanager sample rules

This document outlines the custom Alertmanager alerting rules for monitoring an OpenStack deployment running on top of an OpenShift cluster. The sample alerts are divided into two main groups: services status alerts and nodes status alerts.

> **NOTE:** The samples provided in this document are intended as examples for guidance only. You should review and adapt them to fit the specific metrics, labels, and operational context of your environment. Thresholds for resource utilization, in particular, may need significant tuning based on your workload patterns and capacity planning.
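
For example, the `MemoryUsageWarning` rule in the nodes status sample warns when aggregate memory usage sits between 80% and 90%. A deployment that routinely runs closer to its memory limits might shift those bounds upward; the thresholds in this variant are purely illustrative and are not part of the shipped sample:

```yaml
- alert: MemoryUsageWarning
  # Illustrative thresholds only -- tune to your own capacity planning.
  expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) > 0.85 and
    (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) < 0.95
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Memory usage is high (warning)
    description: Memory usage is above 85% but below 95% for more than 10 minutes.
```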

## OpenStack Observability Services Status Alerts

This group of alerts monitors the availability of core OpenStack services. These alerts are critical because they indicate a direct impact on the functionality of the OpenStack control plane and its APIs.
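
The services status rules are not shown in this excerpt. As a rough sketch of the pattern, assuming the OpenStack services are scraped as ordinary Prometheus targets, an availability alert can key off the standard `up` metric; the rule, group, and resource names below are illustrative rather than the actual shipped sample:

```yaml
apiVersion: monitoring.rhobs/v1
kind: PrometheusRule
metadata:
  labels:
    service: metricStorage
  name: openstack-observability-services-status  # hypothetical name
  namespace: openstack
spec:
  groups:
  - name: openstack-observability.services.status
    rules:
    - alert: OpenStackServiceTargetDown
      # `up` is set to 0 by Prometheus whenever a scrape target cannot be reached.
      expr: up == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Scrape target down: {{ $labels.job }} on {{ $labels.instance }}"
        description: |
          The Prometheus target {{ $labels.instance }} (job {{ $labels.job }})
          has not been reachable for at least 5 minutes, so the corresponding
          service endpoint is likely unavailable.
```

A production rule set would typically scope such an alert to the relevant jobs or service labels rather than firing for every target in the cluster.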

## OpenStack Observability Nodes Status Alerts

This group of alerts monitors the fundamental compute and storage resources managed by the OpenStack deployment. These alerts help prevent service degradation by providing early warnings about resource exhaustion.
Lines changed: 339 additions & 0 deletions
@@ -0,0 +1,339 @@
apiVersion: monitoring.rhobs/v1
kind: PrometheusRule
metadata:
  labels:
    service: metricStorage
  name: openstack-observability-nodes-status
  namespace: openstack
spec:
  groups:
  - name: openstack-observability.nodes.status
    rules:
    # Disk usage alerts
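    # The disk rules below share a common z-score pattern: recording rules capture
    # a 5-minute rate together with its 1-hour rolling average and standard
    # deviation, and the alerts fire when the current rate deviates from that
    # average by more than 3 (warning) or, where defined, 6 (critical) standard
    # deviations for at least 10 minutes.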
    - expr: rate(node_disk_io_time_seconds_total[5m])
      record: job:iotime:rate_5m
    - expr: stddev_over_time(job:iotime:rate_5m[1h])
      record: job:iotime:rate_5m:stddev_over_time_1h
    - expr: avg_over_time(job:iotime:rate_5m[1h])
      record: job:iotime:rate_5m:avg_over_time_1h
    - alert: HighIOtimeWarning
      expr: >-
        (abs(job:iotime:rate_5m - job:iotime:rate_5m:avg_over_time_1h) /
        job:iotime:rate_5m:stddev_over_time_1h) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk I/O time is moderately abnormal"
        description: |
          The time the disk is actively spending on I/O (reads or writes) has deviated more than
          3 standard deviations from the 1-hour average. This may indicate increasing I/O load,
          slow operations, or unusual activity patterns. Alert has been active for at least 10 minutes.

    - expr: rate(node_disk_io_time_weighted_seconds_total[5m])
      record: job:weightiotime:rate_5m
    - expr: stddev_over_time(job:weightiotime:rate_5m[1h])
      record: job:weightiotime:rate_5m:stddev_over_time_1h
    - expr: avg_over_time(job:weightiotime:rate_5m[1h])
      record: job:weightiotime:rate_5m:avg_over_time_1h
    - alert: HighWeightedIOtimeWarning
      expr: >-
        (abs(job:weightiotime:rate_5m -
        job:weightiotime:rate_5m:avg_over_time_1h) /
        job:weightiotime:rate_5m:stddev_over_time_1h) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk weighted I/O time is moderately abnormal"
        description: |
          The weighted disk I/O time (which reflects total time and concurrency of disk operations)
          has deviated more than 3 standard deviations from the 1-hour average for at least 10 minutes.
          This may indicate increasing disk load, contention, or early signs of I/O saturation.

    - expr: rate(node_disk_read_time_seconds_total[5m])
      record: job:disk:time:read:rate_5m
    - expr: stddev_over_time(job:disk:time:read:rate_5m[1h])
      record: job:disk:time:read:rate_5m:stddev_over_time_1h
    - expr: avg_over_time(job:disk:time:read:rate_5m[1h])
      record: job:disk:time:read:rate_5m:avg_over_time_1h
    - alert: HighDiskReadTimeWarning
      expr: >-
        (abs(job:disk:time:read:rate_5m -
        job:disk:time:read:rate_5m:avg_over_time_1h) /
        job:disk:time:read:rate_5m:stddev_over_time_1h) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk read time is moderately abnormal"
        description: |
          The time spent on disk read operations is significantly higher or lower than the 1-hour average.
          This may indicate increasing disk latency or abnormal I/O behavior.
          Value deviates more than 3 standard deviations for at least 10 minutes.
    - alert: HighDiskReadTimeCritical
      expr: >-
        (abs(job:disk:time:read:rate_5m -
        job:disk:time:read:rate_5m:avg_over_time_1h) /
        job:disk:time:read:rate_5m:stddev_over_time_1h) > 6
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Disk read time is critically abnormal"
        description: |
          The time spent on disk read operations has deviated more than 6 standard deviations
          from the 1-hour average for at least 10 minutes. This indicates a likely performance issue,
          such as disk contention, hardware failure, or severe I/O bottleneck.

    - expr: rate(node_disk_write_time_seconds_total[5m])
      record: job:disk:time:write:rate_5m
    - expr: stddev_over_time(job:disk:time:write:rate_5m[1h])
      record: job:disk:time:write:rate_5m:stddev_over_time_1h
    - expr: avg_over_time(job:disk:time:write:rate_5m[1h])
      record: job:disk:time:write:rate_5m:avg_over_time_1h
    - alert: HighDiskWriteTimeWarning
      expr: >-
        (abs(job:disk:time:write:rate_5m -
        job:disk:time:write:rate_5m:avg_over_time_1h) /
        job:disk:time:write:rate_5m:stddev_over_time_1h) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk write time is moderately abnormal"
        description: |
          The disk is spending more time than usual on write operations.
          The write time has deviated more than 3 standard deviations from the 1-hour average,
          for at least 10 minutes. This may indicate increasing disk latency or a rising write load.
    - alert: HighDiskWriteTimeCritical
      expr: >-
        (abs(job:disk:time:write:rate_5m -
        job:disk:time:write:rate_5m:avg_over_time_1h) /
        job:disk:time:write:rate_5m:stddev_over_time_1h) > 6
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Disk write time is critically abnormal"
        description: |
          The time the disk spends writing has deviated more than 6 standard deviations
          from the 1-hour average for at least 10 minutes. This indicates a likely performance
          issue such as disk write contention, a hardware bottleneck, or a heavily loaded application.

    - expr: rate(node_disk_reads_completed_total[5m])
      record: job:disk:ops:read:rate_5m
    - expr: stddev_over_time(job:disk:ops:read:rate_5m[1h])
      record: job:disk:ops:read:rate_5m:stddev_over_time_1h
    - expr: avg_over_time(job:disk:ops:read:rate_5m[1h])
      record: job:disk:ops:read:rate_5m:avg_over_time_1h
    - alert: HighDiskReadOpsWarning
      expr: >-
        (abs(job:disk:ops:read:rate_5m -
        job:disk:ops:read:rate_5m:avg_over_time_1h) /
        job:disk:ops:read:rate_5m:stddev_over_time_1h) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk read rate is moderately abnormal"
        description: |
          The disk read operations rate has deviated more than 3 standard deviations
          from the 1-hour average for at least 10 minutes. This may indicate a change
          in workload or a potential disk performance issue.
    - alert: HighDiskReadOpsCritical
      expr: >-
        (abs(job:disk:ops:read:rate_5m -
        job:disk:ops:read:rate_5m:avg_over_time_1h) /
        job:disk:ops:read:rate_5m:stddev_over_time_1h) > 6
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Disk read rate is critically abnormal"
        description: |
          The disk read operations rate has deviated more than 6 standard deviations
          from the 1-hour average for at least 10 minutes. This is a significant anomaly
          and may indicate disk overload, hardware issues, or a misbehaving application.

    - expr: rate(node_disk_writes_completed_total[5m])
      record: job:disk:ops:write:rate_5m
    - expr: stddev_over_time(job:disk:ops:write:rate_5m[1h])
      record: job:disk:ops:write:rate_5m:stddev_over_time_1h
    - expr: avg_over_time(job:disk:ops:write:rate_5m[1h])
      record: job:disk:ops:write:rate_5m:avg_over_time_1h
    - alert: HighDiskWriteOpsWarning
      expr: >-
        (abs(job:disk:ops:write:rate_5m -
        job:disk:ops:write:rate_5m:avg_over_time_1h) /
        job:disk:ops:write:rate_5m:stddev_over_time_1h) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Disk write ops rate is moderately abnormal"
        description: |
          The number of completed disk write operations per second has deviated
          more than 3 standard deviations from the 1-hour average for at least 10 minutes.
          This may indicate unexpected disk activity, workload spikes, or application issues.
    - alert: HighDiskWriteOpsCritical
      expr: >-
        (abs(job:disk:ops:write:rate_5m -
        job:disk:ops:write:rate_5m:avg_over_time_1h) /
        job:disk:ops:write:rate_5m:stddev_over_time_1h) > 6
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Disk write ops rate is critically abnormal"
        description: |
          The number of completed disk write operations per second has deviated
          more than 6 standard deviations from the 1-hour baseline for at least 10 minutes.
          This likely indicates a serious workload anomaly, disk saturation, or malfunctioning application.

    # CPU usage alerts
    # Count of CPU cores per instance
    - record: job:cpu:count:cpu_cores_total
      expr: count by (instance) (count by (instance, cpu) (node_cpu_seconds_total))
    # CPU usage rate (excluding idle & iowait)
    - record: job:cpu:rate:core_usage_seconds
      expr: sum by (instance) (
        rate(node_cpu_seconds_total{mode!~"idle|iowait"}[5m])
        )
    # Average CPU usage per core (percentage)
    - record: job:cpu:rate:avg_core_usage_percent
      expr: job:cpu:rate:core_usage_seconds
        / job:cpu:count:cpu_cores_total
        * 100
    - alert: HighAverageCPUUsageWarning
      expr: job:cpu:rate:avg_core_usage_percent > 50 and job:cpu:rate:avg_core_usage_percent < 70
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High average CPU usage on instance {{ $labels.instance }}"
        description: |
          The average CPU usage per core is above 50% and below 70% on instance {{ $labels.instance }}
          for more than 5 minutes. This may indicate CPU saturation and could affect performance.

    - alert: HighCPUUsageCritical
      expr: job:cpu:rate:avg_core_usage_percent >= 70
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "High average CPU usage on instance {{ $labels.instance }}"
        description: |
          The average CPU usage per core is above 70% on instance {{ $labels.instance }}
          for more than 10 minutes. This may indicate CPU saturation and could affect performance.

    # Inode usage alerts
    - record: job:filesystem:inode_usage_ratio
      expr: |
        (node_filesystem_files{fstype!~"tmpfs|devtmpfs|overlay"} - node_filesystem_files_free{fstype!~"tmpfs|devtmpfs|overlay"}) / node_filesystem_files{fstype!~"tmpfs|devtmpfs|overlay"}
    - alert: InodeUsageWarning
      expr: |
        job:filesystem:inode_usage_ratio > 0.6
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Inode usage high (warning)"
        description: "Inode usage is above 60% for more than 10 minutes on {{ $labels.instance }} (mountpoint: {{ $labels.mountpoint }}, device: {{ $labels.device }})"

    # Hugepages usage alerts
    - alert: HugepagesLowWarning
      expr: (sum(node_memory_HugePages_Free) / sum(node_memory_HugePages_Total)) < 0.2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Hugepages free ratio is below 20% (warning)
        description: "The ratio of free hugepages to total hugepages is below 20% for more than 10 minutes."

    - alert: HugepagesLowCritical
      expr: (sum(node_memory_HugePages_Free) / sum(node_memory_HugePages_Total)) < 0.1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Hugepages free ratio is below 10% (critical)
        description: "The ratio of free hugepages to total hugepages is below 10% for more than 10 minutes."

    # CPU load alerts
    # Long-term load average alerts
    - alert: LoadLongTermWarning
      expr: (node_load15 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) > 0.7 and (node_load15 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) < 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Load average (15m) is high (warning)
        description: "15-minute load average is above 70% of CPU cores on {{ $labels.instance }}"

    - alert: LoadLongTermCritical
      expr: (node_load15 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) >= 0.9
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Load average (15m) is critical
        description: "15-minute load average is above 90% of CPU cores on {{ $labels.instance }}"

    # Mid-term load average alerts
    - alert: LoadMidTermWarning
      expr: (node_load5 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) > 0.7 and (node_load5 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) < 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Load average (5m) is high (warning)
        description: "5-minute load average is above 70% of CPU cores on {{ $labels.instance }}"

    - alert: LoadMidTermCritical
      expr: (node_load5 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) >= 0.9
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Load average (5m) is critical
        description: "5-minute load average is above 90% of CPU cores on {{ $labels.instance }}"

    # Short-term load average alerts
    - alert: LoadShortTermWarning
      expr: (node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) > 0.7 and (node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) < 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Load average (1m) is high (warning)
        description: "1-minute load average is above 70% of CPU cores on {{ $labels.instance }}"

    - alert: LoadShortTermCritical
      expr: (node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)) >= 0.9
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Load average (1m) is critical
        description: "1-minute load average is above 90% of CPU cores on {{ $labels.instance }}"

    # Memory usage alerts
    - alert: MemoryUsageWarning
      expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) > 0.8 and
        (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) < 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Memory usage is high (warning)
        description: Memory usage is above 80% but below 90% for more than 10 minutes.

    - alert: MemoryUsageCritical
      expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemAvailable_bytes)) / sum(node_memory_MemTotal_bytes) >= 0.9
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Memory usage is critical
        description: Memory usage is above 90% for more than 10 minutes.
