3 files changed, +92 -99 lines changed

monitoring/alerting/rules

       - title: Catalog DB Connections High
         annotations:
           description:
-            "Checks the in-use catalog database connections > 21. The default
+            Checks the in-use catalog database connections > 21. The default
             db connection pool is 22. If it reaches the limit, the rabbitmq queues starts
             to fill up with ready messages causing issues with Model Studio pipelines.
-
-            Click on the URL on how to remediate the issue."
           summary:
             The active catalog database connections > 21. If it reaches the max.
             db connections, it will impact the rabbitmq queues.
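Note on the threshold in this rule: the description gives a default pool size of 22 and an alert condition of in-use connections > 21. A minimal sketch of that relationship follows, assuming the in-use count is reported as a whole number; the function and constant names are hypothetical and not taken from the repository.

```python
# Hypothetical names for illustration; the values come from the rule's description.
DEFAULT_POOL_SIZE = 22  # default catalog db connection pool size
ALERT_THRESHOLD = 21    # the rule's condition is "in-use connections > 21"


def connections_high(in_use: float, threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when the in-use count strictly exceeds the threshold."""
    return in_use > threshold


def remaining_connections(in_use: int, pool_size: int = DEFAULT_POOL_SIZE) -> int:
    """Connections still free in the pool (never negative)."""
    return max(pool_size - in_use, 0)


if __name__ == "__main__":
    for in_use in (20, 21, 22):
        print(in_use, connections_high(in_use), remaining_connections(in_use))
```

Under the whole-number assumption, the condition only becomes true once all 22 connections are in use, so the alert signals an exhausted pool rather than an approaching limit.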
@@ -210,3 +208,92 @@ groups:
         labels:
           severity: warning
         uid: postgres_connection_utilization
+      - title: Crunchy Backrest Repo
+        annotations:
+          description: Checks to see /pgbackrest/repo1 filesystem is more than 50% full.
+          summary:
+            /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This typically
+            happens when the archived WAL logs are increasing and not being expired and
+            cleared.
+        condition: C
+        data:
+          - datasourceUid: prometheus
+            model:
+              editorMode: code
+              expr:
+                ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
+                - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
+                / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
+                * 100
+              instant: true
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: false
+              refId: A
+            refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: []
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - B
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: A
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: B
+              type: reduce
+            refId: B
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 50
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - C
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: B
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: C
+              type: threshold
+            refId: C
+            relativeTimeRange:
+              from: 600
+              to: 0
+        execErrState: Error
+        for: 5m
+        isPaused: false
+        labels: {}
+        noDataState: NoData
+        uid: abe80c6a-3add-477a-b228-f8283704570f
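Note on how the Crunchy Backrest Repo rule evaluates: query A computes the used percentage of the repo1 volume, expression B reduces that series to its last value, and expression C compares the result against 50 with a strict greater-than. The sketch below mirrors that arithmetic in Python so the threshold can be sanity-checked by hand. The function names and sample byte counts are hypothetical, and it does not model the `for: 5m` pending period or the `NoData`/`Error` states.

```python
from typing import Sequence


def used_percent(capacity_bytes: float, available_bytes: float) -> float:
    """Mirror of query A: ((capacity - available) / capacity) * 100."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return (capacity_bytes - available_bytes) / capacity_bytes * 100


def reduce_last(samples: Sequence[float]) -> float:
    """Mirror of expression B: keep only the most recent sample."""
    if not samples:
        raise ValueError("no samples to reduce")
    return samples[-1]


def breaches(value: float, threshold: float = 50) -> bool:
    """Mirror of expression C: strictly greater than the threshold."""
    return value > threshold


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Made-up samples: a 100 GiB volume whose free space shrinks over time.
    series = [used_percent(100 * GIB, free * GIB) for free in (70, 55, 40)]
    print("condition met:", breaches(reduce_last(series)))
```

Because the comparison is strict, a volume at exactly 50% used does not satisfy the condition.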

     folder: Other Alerts
     orgId: 1
     rules:
-      - title: Crunchy Backrest Repo
-        annotations:
-          description:
-            "Checks to see /pgbackrest/repo1 filesystem is more than 50% full.
-
-            Go to the URL to follow the troubleshooting steps."
-          summary:
-            /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This typically
-            happens when the archived WAL logs are increasing and not being expired and
-            cleared.
-        condition: C
-        data:
-          - datasourceUid: prometheus
-            model:
-              editorMode: code
-              expr:
-                ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
-                - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
-                / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
-                * 100
-              instant: true
-              intervalMs: 1000
-              legendFormat: __auto
-              maxDataPoints: 43200
-              range: false
-              refId: A
-            refId: A
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params: []
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - B
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: A
-              intervalMs: 1000
-              maxDataPoints: 43200
-              reducer: last
-              refId: B
-              type: reduce
-            refId: B
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params:
-                      - 50
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - C
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: B
-              intervalMs: 1000
-              maxDataPoints: 43200
-              refId: C
-              type: threshold
-            refId: C
-            relativeTimeRange:
-              from: 600
-              to: 0
-        execErrState: Error
-        for: 5m
-        isPaused: false
-        labels: {}
-        noDataState: NoData
-        uid: abe80c6a-3add-477a-b228-f8283704570f
       - title: NFS Share Usage High
         annotations:
           description: Checks if the NFS share attached to CAS is > 85% full.

@@ -99,10 +99,8 @@ groups:
       - title: RabbitMQ Ready Queue Backlog
         annotations:
           description:
-            Checks for accumulation of Rabbitmq ready messages > 10,000. It
-            could impact Model Studio pipelines. Follow the steps in the url to help
-            troubleshoot. The covers potential orphan queues and/or bottlenecking of
-            queues due to catalog service.
+            Checks for accumulation of Rabbitmq ready messages > 10,000. The covers potential orphan
+            queues and/or bottlenecking of queues due to catalog service.
           summary:
             Rabbitmq ready messages > 10,000. This means there is a large backlog
             of messages due to high activity (which can be temporary) or something has