
Commit 6054d2f

Merge pull request #749 from sassoftware/add-alerts
Add viya health alerts
2 parents c9656f4 + ce5b7bb commit 6054d2f

File tree

2 files changed (+154 lines, -3 lines)


CHANGELOG.md

Lines changed: 5 additions & 3 deletions
@@ -15,6 +15,8 @@
  * [UPGRADE] Prometheus Pushgateway has been upgraded from 1.11.0 to 1.11.1.
  * [UPGRADE] OpenSearch Data Source Plugin to Grafana upgraded from 2.23.1 to 2.24.0
  * [UPGRADE] Admission Webhook upgraded from v1.5.1 to v1.5.2
+ * [CHANGE] Enable Grafana feature flag: prometheusSpecialCharsInLabelValues to improve handling of special characters in metric labels (addresses #699)
+
* **Logging**
  * [FIX] Resolved issue causing deploy_esexporter.sh to fail when doing an upgrade-in-place and serviceMonitor CRD is not installed.

@@ -25,17 +27,17 @@
  needs to be installed. While this utility is *currently* only used in a few places, we expect its use to become
  much more extensive over time.
* [FEATURE] The auto-generation of Ingress resources for the web applications has moved from *experimental*
-  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
+  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
  [Configure Ingress Access to Web Applications](https://documentation.sas.com/?cdcId=obsrvcdc&cdcVersion=v_003&docsetId=obsrvdply&docsetTarget=n0auhd4hutsf7xn169hfvriysz4e.htm#n0jiph3lcb5rmsn1g71be3cesmo8)
  topic within the Help Center documentation for further information.
* [FEATURE] The auto-generation of storageClass references for PVC definitions has moved from *experimental*
-  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
+  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
  [Customize StorageClass](https://documentation.sas.com/?cdcId=obsrvcdc&cdcVersion=v_003&docsetId=obsrvdply&docsetTarget=n0auhd4hutsf7xn169hfvriysz4e.htm#p1lvxtk81r8jgun1d789fqaz3lq1)
  topic within the Help Center documentation for further information.
* [FIX] Resolved an issue with the V4M Container which prevented the `oc` command from being installed properly.
* [TASK] The V4M Dockerfile has been revised and simplified to speed up the build process and require less memory.
* **Metrics**
-  * [FIX] Corrected bugs related to authentication/TLS configuration of Grafana sidecars on OpenShift which prevented auto-provisioning of
+  * [FIX] Corrected bugs related to authentication/TLS configuration of Grafana sidecars on OpenShift which prevented auto-provisioning of
  datasources and dashboards
* **Logging**
  * [UPGRADE] Fluent Bit upgraded from 3.2.6 to 3.2.10 (includes security fixes)
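
The [CHANGE] entry added in the first hunk enables a Grafana feature toggle. For context, Grafana feature toggles of this kind are switched on through the `feature_toggles` section of `grafana.ini`; the sketch below shows one typical way to do that via Helm-style values for the Grafana chart. The key layout is an assumption for illustration only; the mechanism this project actually uses is not part of the diff shown here.

# Hypothetical Helm values sketch; only the toggle name
# (prometheusSpecialCharsInLabelValues) comes from the CHANGELOG entry above.
grafana:
  grafana.ini:
    feature_toggles:
      enable: prometheusSpecialCharsInLabelValues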
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: viya-alerts
  namespace: monitoring
  labels:
    sas.com/monitoring-base: kube-viya-monitoring
spec:
  groups:
    - name: prod
      rules:
        - alert: cas-restart
          annotations:
            description:
              Checks to see that the CAS pod has existed for only a short time.
              This implies that the CAS pod has restarted for some reason. Will
              need to further investigate the cause.
            summary:
              The current CAS (sas-cas-server-default-controller) pod has been
              in existence for < 15 minutes. Most likely it is due to a restart
              of the CAS pod.
          expr: cas_grid_uptime_seconds_total
          for: 5m
          labels:
            severity: warning
        - alert: viya-readiness
          annotations:
            description:
              Checks for the Ready state of the sas-readiness pod. Will need to
              check the status of the Viya pods since the sas-readiness pod
              reflects the health of the Viya services.
            summary:
              The sas-readiness pod is not in the Ready state. This means that
              one or more of the Viya services are not in a good state.
          expr: kube_pod_container_status_ready{container="sas-readiness"}
          for: 5m
          labels:
            severity: warning
        - alert: rabbitmq-readymessages
          annotations:
            description:
              Checks for an accumulation of RabbitMQ ready messages > 10,000. It
              could impact Model Studio pipelines.
            summary:
              RabbitMQ ready messages > 10,000. This means there is a large
              backlog of messages due to high activity (which can be temporary)
              or something has gone wrong.
          expr: rabbitmq_queue_messages_ready
          for: 5m
          labels:
            severity: warning
        - alert: NFS-share
          annotations:
            description:
              Checks if the NFS share attached to CAS is > 85% full. Use the
              command "du -h -d 1" to find the location of large files in the
              NFS shares. Most likely it will be one of the home directories due
              to the runaway size of a casuser table or Viya backups.
            summary:
              NFS share > 85% full. Typically, it is due to users filling their
              own home directory or backups.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim="cas-default-data"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: cas-memory
          annotations:
            description:
              Checks the CAS memory usage. If it is > 300GB, it will alert.
              Currently, max. memory is 512GB. The expectation is that this
              alert will be an early warning sign to investigate large memory
              usage, as typical usage is less than the threshold. Want to
              prevent an OOMKill of CAS.
            summary:
              CAS memory > 300GB. This can be due to a program or pipeline
              taking all the available memory.
          expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
          for: 5m
          labels:
            severity: warning
        - alert: catalog-dbconn
          annotations:
            description:
              Checks for in-use catalog database connections > 21. The default
              db connection pool is 22. If it reaches the limit, the RabbitMQ
              queues start to fill up with ready messages, causing issues with
              Model Studio pipelines.
            summary:
              The active catalog database connections > 21. If it reaches the
              max. db connections, it will impact the RabbitMQ queues.
          expr: sas_db_pool_connections{container="sas-catalog-services", state="inUse"}
          for: 5m
          labels:
            severity: warning
        - alert: compute-age
          annotations:
            description:
              Looks for compute pods > 1 day old. Most likely, it is an orphaned
              compute pod that is lingering. Consider killing it.
            summary: SAS compute-server pods > 1 day old.
          expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
          for: 5m
          labels:
            severity: warning
        - alert: crunchy-pgdata
          annotations:
            description: "Checks to see if the /pgdata filesystem is more than 50% full."
            summary:
              /pgdata storage > 50% full. This typically happens when the WAL
              logs are increasing and not being cleared.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: crunchy-backrest-repo
          annotations:
            description:
              Checks to see if the /pgbackrest/repo1 filesystem is more than 50%
              full.
            summary:
              /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This
              typically happens when the archived WAL logs are increasing and
              not being expired and cleared.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: viya-pod-restarts
          annotations:
            description:
              Checks the restart count of the pod(s). Will need to check why the
              pod(s) have restarted so many times. One possible cause is an
              OOMKill. This means we will need to increase the memory limit.
            summary:
              The number of pod restarts > 20. The service pod(s) have restarted
              many times due to issues.
          expr: kube_pod_container_status_restarts_total{namespace="viya"}
          for: 5m
          labels:
            severity: warning
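
Each rule's annotations state a numeric threshold (for example, the NFS-share description says the share is more than 85% full), while the committed expr fields return the raw metric or percentage. In Prometheus, an alerting rule becomes active whenever its expr returns one or more samples (and fires once they persist for the for: duration), so a documented threshold is normally expressed as a comparison inside the expr itself. A minimal sketch for the NFS-share case, using the 85% figure from its own description; the comparison is illustrative and not part of the committed file:

        # Sketch only: the "> 85" comparison is inferred from the alert's
        # description and is not in the rule as committed.
        - alert: NFS-share
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim="cas-default-data"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"})
            * 100 > 85
          for: 5m
          labels:
            severity: warning

Because the resource already carries metadata.namespace: monitoring and the sas.com/monitoring-base label, applying it with kubectl apply -f <file> should be enough for the Prometheus Operator to load it, assuming the Prometheus custom resource's ruleSelector matches that label.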
