Commit e4baf38

address review comments
1 parent 2226cd9 commit e4baf38

3 files changed: +92 -99 lines changed

monitoring/alerting/rules/database_alerts.yaml

Lines changed: 90 additions & 3 deletions
@@ -8,11 +8,9 @@ groups:
       - title: Catalog DB Connections High
         annotations:
           description:
-            "Checks the in-use catalog database connections > 21. The default
+            Checks the in-use catalog database connections > 21. The default
             db connection pool is 22. If it reaches the limit, the rabbitmq queues starts
             to fill up with ready messages causing issues with Model Studio pipelines.
-
-            Click on the URL on how to remediate the issue."
           summary:
             The active catalog database connections > 21. If it reaches the max.
             db connections, it will impact the rabbitmq queues.
@@ -210,3 +208,92 @@ groups:
         labels:
           severity: warning
         uid: postgres_connection_utilization
+      - title: Crunchy Backrest Repo
+        annotations:
+          description: Checks to see /pgbackrest/repo1 filesystem is more than 50% full.
+          summary:
+            /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This typically
+            happens when the archived WAL logs are increasing and not being expired and
+            cleared.
+        condition: C
+        data:
+          - datasourceUid: prometheus
+            model:
+              editorMode: code
+              expr:
+                ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
+                - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
+                / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
+                * 100
+              instant: true
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: false
+              refId: A
+            refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: []
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - B
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: A
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: B
+              type: reduce
+            refId: B
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 50
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - C
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: B
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: C
+              type: threshold
+            refId: C
+            relativeTimeRange:
+              from: 600
+              to: 0
+        execErrState: Error
+        for: 5m
+        isPaused: false
+        labels: {}
+        noDataState: NoData
+        uid: abe80c6a-3add-477a-b228-f8283704570f
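
Reviewer note: the Crunchy Backrest Repo rule chains three steps: query A computes the percentage of the volume in use, B reduces it to its last value, and C fires when that value is strictly greater than 50. A minimal Python sketch of the same arithmetic may help when sanity-checking the threshold. The function names `percent_used` and `is_breaching` are hypothetical and are not part of the alert rules or of Grafana.

```python
def percent_used(capacity_bytes: float, available_bytes: float) -> float:
    """Mirror query A: ((capacity - available) / capacity) * 100."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return (capacity_bytes - available_bytes) / capacity_bytes * 100


def is_breaching(
    capacity_bytes: float, available_bytes: float, threshold: float = 50.0
) -> bool:
    """Mirror expression C: true only when usage is strictly above the threshold."""
    return percent_used(capacity_bytes, available_bytes) > threshold
```

Under these assumptions a volume at exactly 50% usage does not breach, because the evaluator type is `gt`. The sketch ignores the `for: 5m` pending period, which requires the condition to hold for five minutes before the alert fires.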

monitoring/alerting/rules/other_alerts.yaml

Lines changed: 0 additions & 92 deletions
@@ -5,98 +5,6 @@ groups:
     folder: Other Alerts
     orgId: 1
     rules:
-      - title: Crunchy Backrest Repo
-        annotations:
-          description:
-            "Checks to see /pgbackrest/repo1 filesystem is more than 50% full.
-
-            Go to the URL to follow the troubleshooting steps."
-          summary:
-            /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This typically
-            happens when the archived WAL logs are increasing and not being expired and
-            cleared.
-        condition: C
-        data:
-          - datasourceUid: prometheus
-            model:
-              editorMode: code
-              expr:
-                ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
-                - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
-                / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
-                * 100
-              instant: true
-              intervalMs: 1000
-              legendFormat: __auto
-              maxDataPoints: 43200
-              range: false
-              refId: A
-            refId: A
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params: []
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - B
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: A
-              intervalMs: 1000
-              maxDataPoints: 43200
-              reducer: last
-              refId: B
-              type: reduce
-            refId: B
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params:
-                      - 50
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - C
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: B
-              intervalMs: 1000
-              maxDataPoints: 43200
-              refId: C
-              type: threshold
-            refId: C
-            relativeTimeRange:
-              from: 600
-              to: 0
-        execErrState: Error
-        for: 5m
-        isPaused: false
-        labels: {}
-        noDataState: NoData
-        uid: abe80c6a-3add-477a-b228-f8283704570f
       - title: NFS Share Usage High
         annotations:
           description: Checks if the NFS share attached to CAS is > 85% full.

monitoring/alerting/rules/viya_platform_alerts.yaml

Lines changed: 2 additions & 4 deletions
@@ -99,10 +99,8 @@ groups:
       - title: RabbitMQ Ready Queue Backlog
         annotations:
           description:
-            Checks for accumulation of Rabbitmq ready messages > 10,000. It
-            could impact Model Studio pipelines. Follow the steps in the url to help
-            troubleshoot. The covers potential orphan queues and/or bottlenecking of
-            queues due to catalog service.
+            Checks for accumulation of Rabbitmq ready messages > 10,000. The covers potential orphan
+            queues and/or bottlenecking of queues due to catalog service.
           summary:
             Rabbitmq ready messages > 10,000. This means there is a large backlog
             of messages due to high activity (which can be temporary) or something has
