
Commit 0d1da0e

Add viya health alerts
1 parent c9656f4 commit 0d1da0e

2 files changed: +168 -3 lines changed


CHANGELOG.md

Lines changed: 5 additions & 3 deletions
@@ -15,6 +15,8 @@
   * [UPGRADE] Prometheus Pushgateway has been upgraded from 1.11.0 to 1.11.1.
   * [UPGRADE] OpenSearch Data Source Plugin to Grafana upgraded from 2.23.1 to 2.24.0
   * [UPGRADE] Admission Webhook upgraded from v1.5.1 to v1.5.2
+  * [CHANGE] Enable Grafana feature flag: prometheusSpecialCharsInLabelValues to improve handling of special characters in metric labels (addresses #699)
+
 * **Logging**
   * [FIX] Resolved issue causing deploy_esexporter.sh to fail when doing an upgrade-in-place and serviceMonitor CRD is not installed.

@@ -25,17 +27,17 @@
   needs to be installed. While this utility is *currently* only used in a few places, we expect its use to become
   much more extensive over time.
 * [FEATURE] The auto-generation of Ingress resources for the web applications has moved from *experimental*
-  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
+  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
   [Configure Ingress Access to Web Applications](https://documentation.sas.com/?cdcId=obsrvcdc&cdcVersion=v_003&docsetId=obsrvdply&docsetTarget=n0auhd4hutsf7xn169hfvriysz4e.htm#n0jiph3lcb5rmsn1g71be3cesmo8)
   topic within the Help Center documentation for further information.
 * [FEATURE] The auto-generation of storageClass references for PVC definitions has moved from *experimental*
-  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
+  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
   [Customize StorageClass](https://documentation.sas.com/?cdcId=obsrvcdc&cdcVersion=v_003&docsetId=obsrvdply&docsetTarget=n0auhd4hutsf7xn169hfvriysz4e.htm#p1lvxtk81r8jgun1d789fqaz3lq1)
   topic within the Help Center documentation for further information.
 * [FIX] Resolved an issue with the V4M Container which prevented the `oc` command from being installed properly.
 * [TASK] The V4M Dockerfile has been revised and simplified to speed up the build process and require less memory.
 * **Metrics**
-  * [FIX] Corrected bugs related to authentication/TLS configuration of Grafana sidecars on OpenShift which prevented auto-provisioning of
+  * [FIX] Corrected bugs related to authentication/TLS configuration of Grafana sidecars on OpenShift which prevented auto-provisioning of
     datasources and dashboards
 * **Logging**
   * [UPGRADE] Fluent Bit upgraded from 3.2.6 to 3.2.10 (includes security fixes)
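
For context on the [CHANGE] entry above: Grafana feature flags are normally switched on through the feature_toggles section of grafana.ini. The snippet below is a minimal sketch of how prometheusSpecialCharsInLabelValues might be enabled when Grafana is deployed through Helm chart values; the surrounding grafana / grafana.ini key path is an assumption about the deployment, not something taken from this commit.

# Hedged sketch: enabling the Grafana feature toggle via Helm values.
# Only the flag name comes from the CHANGELOG entry; the key path is assumed.
grafana:
  grafana.ini:
    feature_toggles:
      enable: prometheusSpecialCharsInLabelValues
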
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: viya-alerts
  namespace: monitoring
  labels:
    sas.com/monitoring-base: kube-viya-monitoring
spec:
  groups:
    - name: prod
      rules:
        - alert: cas-restart
          annotations:
            description:
              Check to see that the CAS pod has existed for only a short time. This
              implies that the CAS pod has restarted for whatever reason. Will need to
              further investigate the cause.
            summary:
              The current CAS (sas-cas-server-default-controller) pod < 15 minutes
              in existence. Most likely it is due to a restart of the CAS pod.
          expr: cas_grid_uptime_seconds_total
          for: 5m
          labels:
            severity: warning
        - alert: viya-readiness
          annotations:
            description:
              Checks for the Ready state of the sas-readiness pod. Will need to
              check the status of the Viya pods since the sas-readiness pod reflects
              the health of the Viya services.
            summary:
              sas-readiness pod is not in the Ready state. This means that one or
              more of the Viya services are not in a good state.
          expr: kube_pod_container_status_ready{container="sas-readiness"}
          for: 5m
          labels:
            severity: warning
        - alert: rabbitmq-readymessages
          annotations:
            description:
              Checks for an accumulation of Rabbitmq ready messages > 10,000. It
              could impact Model Studio pipelines. Follow the steps in the runbook URL
              to help troubleshoot. The runbook covers potential orphan queues and/or
              bottlenecking of queues due to the catalog service.
            summary:
              Rabbitmq ready messages > 10,000. This means there is a large backlog
              of messages due to high activity (which can be temporary) or something
              has gone wrong.
          expr: rabbitmq_queue_messages_ready
          for: 5m
          labels:
            severity: warning
        - alert: NFS-share
          annotations:
            description:
              Checks if the NFS share attached to CAS is > 85% full. Use the command
              "du -h -d 1" to find the location where large files are located in the
              NFS shares. Most likely it will be one of the home directories due to
              the runaway size of a casuser table or Viya backups.
            summary:
              NFS share > 85% full. Typically, it is due to users filling their
              own home directory or backups.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim="cas-default-data"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: cas-memory
          annotations:
            description:
              Checks the CAS memory usage. If it is > 300GB, it will alert. Currently,
              max. memory is 512GB. The expectation is that this alert will be an early
              warning sign to investigate large memory usage as typical usage is less
              than the threshold. Want to prevent an OOMkill of CAS.
            summary:
              CAS memory > 300GB. This can be due to a program or pipeline taking
              all the available memory.
          expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
          for: 5m
          labels:
            severity: warning
        - alert: catalog-dbconn
          annotations:
            description:
              "Checks the in-use catalog database connections > 21. The default
              db connection pool is 22. If it reaches the limit, the rabbitmq queues
              start to fill up with ready messages, causing issues with Model Studio
              pipelines.

              Click on the runbook URL for steps to remediate the issue."
            summary:
              The active catalog database connections > 21. If it reaches the
              max. db connections, it will impact the rabbitmq queues.
          expr: sas_db_pool_connections{container="sas-catalog-services", state="inUse"}
          for: 5m
          labels:
            severity: warning
        - alert: compute-age
          annotations:
            description:
              "It looks for compute pods > 1 day old. Most likely, it is an orphaned
              compute pod that is lingering. Consider killing it.

              There is an airflow job that regularly sweeps the VFL fleet to find
              these compute pods for deletion as well."
            summary:
              SAS compute-server pods > 1 day old. Compute pods in VFL do not need
              to be running longer than 1 day since there are no long running jobs.
          expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
          for: 5m
          labels:
            severity: warning
        - alert: crunchy-pgdata
          annotations:
            description:
              "Checks to see if the /pgdata filesystem is more than 50% full.

              Go to the Runbook URL to follow the troubleshooting steps."
            summary:
              /pgdata storage > 50% full. This typically happens when the WAL
              logs are increasing and not being cleared.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: crunchy-backrest-repo
          annotations:
            description:
              "Checks to see if the /pgbackrest/repo1 filesystem is more than 50%
              full.

              Go to the Runbook URL to follow the troubleshooting steps."
            summary:
              /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This
              typically happens when the archived WAL logs are increasing and not
              being expired and cleared.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: viya-pod-restarts
          annotations:
            description:
              Checks the restart count of the pod(s). Will need to check why
              the pod(s) have restarted so many times. One possible cause is an OOMkill.
              This means we will need to increase the memory limit.
            summary:
              The number of pod restarts > 20. The service pod(s) have restarted
              many times due to issues.
          expr: kube_pod_container_status_restarts_total{namespace="viya"}
          for: 5m
          labels:
            severity: warning
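
A note on the rules above: the threshold values called out in the summary annotations (< 15 minutes of uptime, > 10,000 ready messages, > 85% full, > 20 restarts, and so on) do not appear in the expressions themselves, and a Prometheus alerting rule fires for every sample its expression returns. The snippet below is a minimal sketch, not part of the commit, of how the comparisons implied by the annotations could be folded into a few of the expressions; the numeric thresholds are copied from the summary text and would need to be confirmed for the target environment.

# Hedged sketch only: comparisons implied by the summary annotations above.
- alert: cas-restart
  expr: cas_grid_uptime_seconds_total < 900   # under 15 minutes of uptime
- alert: viya-readiness
  expr: kube_pod_container_status_ready{container="sas-readiness"} == 0
- alert: rabbitmq-readymessages
  expr: rabbitmq_queue_messages_ready > 10000
- alert: viya-pod-restarts
  expr: kube_pod_container_status_restarts_total{namespace="viya"} > 20

The remaining rules already compute a percentage, gigabyte, connection, or day value, so the same pattern applies (for example > 85 for NFS-share, > 300 for cas-memory, > 21 for catalog-dbconn, > 1 for compute-age, > 50 for the two crunchy checks). Once finalized, the file can be loaded with kubectl apply -f <rule-file>.yaml, assuming the Prometheus Operator's ruleSelector matches the sas.com/monitoring-base: kube-viya-monitoring label.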
