Commit ce5b7bb

update refs to vfl and runbooks
1 parent b060438 commit ce5b7bb

1 file changed: +6 -20 lines changed

monitoring/rules/viya/beta-rules-viya-health.yaml

Lines changed: 6 additions & 20 deletions
@@ -39,9 +39,7 @@ spec:
   annotations:
     description:
       Checks for accumulation of Rabbitmq ready messages > 10,000. It
-      could impact Model Studio pipelines. Follow the steps in the runbook url
-      to help troubleshoot. The runbook covers potential orphan queues and/or
-      bottlenecking of queues due to catalog service.
+      could impact Model Studio pipelines.
     summary:
       Rabbitmq ready messages > 10,000. This means there is a large backlog
       of messages due to high activity (which can be temporary) or something has
@@ -85,11 +83,9 @@ spec:
 - alert: catalog-dbconn
   annotations:
     description:
-      "Checks the in-use catalog database connections > 21. The default
+      Checks the in-use catalog database connections > 21. The default
       db connection pool is 22. If it reaches the limit, the rabbitmq queues
       starts to fill up with ready messages causing issues with Model Studio pipelines.
-
-      Click on the runbook URL on how to remediate the issue."
     summary:
       The active catalog database connections > 21. If it reaches the
       max. db connections, it will impact the rabbitmq queues.
@@ -100,24 +96,16 @@ spec:
 - alert: compute-age
   annotations:
     description:
-      "It looks for compute pods > 1 day. Most likely, it is orphaned
+      It looks for compute pods > 1 day. Most likely, it is orphaned
       compute pod that is lingering. Consider killing it.
-
-      There is an airflow job that sweeps the VFL fleet regularly to look for
-      these compute pods as well for deletion."
-    summary:
-      SAS compute-server pods > 1 day old. Compute pods in VFL do not need
-      to be running longer than 1 day since there are no long running jobs.
+    summary: SAS compute-server pods > 1 day old.
   expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
   for: 5m
   labels:
     severity: warning
 - alert: crunchy-pgdata
   annotations:
-    description:
-      "Checks to see /pgdata filesystem is more than 50% full.
-
-      Go to the Runbook URL to follow the troubleshooting steps."
+    description: "Checks to see /pgdata filesystem is more than 50% full."
     summary:
       /pgdata storage > 50% full. This typically happens when the WAL
       logs are increasing and not being cleared.
@@ -132,10 +120,8 @@ spec:
 - alert: crunchy-backrest-repo
   annotations:
     description:
-      "Checks to see /pgbackrest/repo1 filesystem is more than 50%
+      Checks to see /pgbackrest/repo1 filesystem is more than 50%
       full.
-
-      Go to the Runbook URL to follow the troubleshooting steps."
     summary:
       /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This
       typically happens when the archived WAL logs are increasing and not being
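
Note on the compute-age rule in this diff: its expr divides the seconds since pod creation by 60, 60, and 24 to get an age in days. The sketch below reproduces that arithmetic in Python as a sanity check. The function names `pod_age_days` and `is_stale` are hypothetical, and the 1-day threshold is an assumption taken from the rule's summary text, since the expr line shown here contains only the age calculation.

```python
import time
from typing import Optional


def pod_age_days(created_ts: float, now: Optional[float] = None) -> float:
    """Return pod age in days, mirroring (time() - kube_pod_created)/60/60/24."""
    if now is None:
        now = time.time()  # same role as PromQL time(): current Unix time in seconds
    return (now - created_ts) / 60 / 60 / 24


def is_stale(created_ts: float, now: Optional[float] = None,
             max_age_days: float = 1.0) -> bool:
    """Assumed threshold check: True when the pod is older than max_age_days."""
    return pod_age_days(created_ts, now) > max_age_days


if __name__ == "__main__":
    created = 0.0
    print(pod_age_days(created, now=129600.0))  # 36 hours after creation
    print(is_stale(created, now=129600.0))
```

A pod created exactly 24 hours ago has an age of 1.0 and would not count as stale under a strict "greater than 1 day" comparison.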
