       annotations:
         description:
           Checks for accumulation of Rabbitmq ready messages > 10,000. It
-          could impact Model Studio pipelines. Follow the steps in the runbook url
-          to help troubleshoot. The runbook covers potential orphan queues and/or
-          bottlenecking of queues due to catalog service.
+          could impact Model Studio pipelines.
         summary:
           Rabbitmq ready messages > 10,000. This means there is a large backlog
           of messages due to high activity (which can be temporary) or something has
@@ -85,11 +83,9 @@
     - alert: catalog-dbconn
       annotations:
         description:
-          " Checks the in-use catalog database connections > 21. The default
+          Checks the in-use catalog database connections > 21. The default
           db connection pool is 22. If it reaches the limit, the rabbitmq queues
           starts to fill up with ready messages causing issues with Model Studio pipelines.
-
-          Click on the runbook URL on how to remediate the issue."
         summary:
           The active catalog database connections > 21. If it reaches the
           max. db connections, it will impact the rabbitmq queues.
@@ -100,24 +96,16 @@ spec:
     - alert: compute-age
       annotations:
         description:
-          " It looks for compute pods > 1 day. Most likely, it is orphaned
+          It looks for compute pods > 1 day. Most likely, it is orphaned
           compute pod that is lingering. Consider killing it.
-
-          There is an airflow job that sweeps the VFL fleet regularly to look for
-          these compute pods as well for deletion."
-        summary:
-          SAS compute-server pods > 1 day old. Compute pods in VFL do not need
-          to be running longer than 1 day since there are no long running jobs.
+        summary: SAS compute-server pods > 1 day old.
       expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
       for: 5m
       labels:
         severity: warning
     - alert: crunchy-pgdata
       annotations:
-        description:
-          " Checks to see /pgdata filesystem is more than 50% full.
-
-          Go to the Runbook URL to follow the troubleshooting steps."
+        description: " Checks to see /pgdata filesystem is more than 50% full."
         summary:
           /pgdata storage > 50% full. This typically happens when the WAL
           logs are increasing and not being cleared.
@@ -132,10 +120,8 @@ spec:
     - alert: crunchy-backrest-repo
       annotations:
         description:
-          " Checks to see /pgbackrest/repo1 filesystem is more than 50%
+          Checks to see /pgbackrest/repo1 filesystem is more than 50%
           full.
-
-          Go to the Runbook URL to follow the troubleshooting steps."
         summary:
           /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This
           typically happens when the archived WAL logs are increasing and not being
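
Note on the compute-age rule: its expr turns the seconds elapsed since pod creation into days by dividing by 60, 60 and 24. The sketch below reproduces that arithmetic in Python so the unit conversion can be checked by hand. The function names and the max_age_days parameter are hypothetical and not part of the rule file. The comparison against 1 day is an assumption taken from the summary text ("> 1 day old"), since the expr line as shown in the diff carries no threshold.

```python
import time
from typing import Optional


def pod_age_days(created_ts: float, now: Optional[float] = None) -> float:
    """Mirror of (time() - kube_pod_created) / 60 / 60 / 24.

    created_ts is the pod creation time in Unix seconds, which is the
    unit the expr's division by 60/60/24 implies.
    """
    if now is None:
        now = time.time()
    # seconds -> minutes -> hours -> days, same order as the expr
    return (now - created_ts) / 60 / 60 / 24


def is_stale(created_ts: float, now: Optional[float] = None,
             max_age_days: float = 1.0) -> bool:
    """Assumed threshold check: True when the pod is older than max_age_days."""
    return pod_age_days(created_ts, now) > max_age_days
```

For example, a pod created 36 hours before `now` has an age of 1.5 days, so under the assumed 1-day threshold it would be treated as a lingering compute pod.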