
Commit 0d1da0e

Add viya health alerts
1 parent c9656f4 commit 0d1da0e

2 files changed: +168 -3 lines changed


CHANGELOG.md

Lines changed: 5 additions & 3 deletions
@@ -15,6 +15,8 @@
   * [UPGRADE] Prometheus Pushgateway has been upgraded from 1.11.0 to 1.11.1.
   * [UPGRADE] OpenSearch Data Source Plugin to Grafana upgraded from 2.23.1 to 2.24.0
   * [UPGRADE] Admission Webhook upgraded from v1.5.1 to v1.5.2
+  * [CHANGE] Enable Grafana feature flag: prometheusSpecialCharsInLabelValues to improve handling of special characters in metric labels (addresses #699)
+
 * **Logging**
   * [FIX] Resolved issue causing deploy_esexporter.sh to fail when doing an upgrade-in-place and serviceMonitor CRD is not installed.

@@ -25,17 +27,17 @@
   needs to be installed. While this utility is *currently* only used in a few places, we expect its use to become
   much more extensive over time.
 * [FEATURE] The auto-generation of Ingress resources for the web applications has moved from *experimental*
-  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
+  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
   [Configure Ingress Access to Web Applications](https://documentation.sas.com/?cdcId=obsrvcdc&cdcVersion=v_003&docsetId=obsrvdply&docsetTarget=n0auhd4hutsf7xn169hfvriysz4e.htm#n0jiph3lcb5rmsn1g71be3cesmo8)
   topic within the Help Center documentation for further information.
 * [FEATURE] The auto-generation of storageClass references for PVC definitions has moved from *experimental*
-  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
+  to *production* status. As noted earlier, this feature requires the `yq` utility. See the
   [Customize StorageClass](https://documentation.sas.com/?cdcId=obsrvcdc&cdcVersion=v_003&docsetId=obsrvdply&docsetTarget=n0auhd4hutsf7xn169hfvriysz4e.htm#p1lvxtk81r8jgun1d789fqaz3lq1)
   topic within the Help Center documentation for further information.
 * [FIX] Resolved an issue with the V4M Container which prevented the `oc` command from being installed properly.
 * [TASK] The V4M Dockerfile has been revised and simplified to speed up the build process and require less memory.
 * **Metrics**
-  * [FIX] Corrected bugs related to authentication/TLS configuration of Grafana sidecars on OpenShift which prevented auto-provisioning of
+  * [FIX] Corrected bugs related to authentication/TLS configuration of Grafana sidecars on OpenShift which prevented auto-provisioning of
     datasources and dashboards
 * **Logging**
   * [UPGRADE] Fluent Bit upgraded from 3.2.6 to 3.2.10 (includes security fixes)
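
For context on the [CHANGE] entry above: Grafana feature flags are normally switched on through the feature_toggles section of grafana.ini. The snippet below is a minimal sketch of how prometheusSpecialCharsInLabelValues might be enabled when Grafana is deployed through Helm chart values; the surrounding grafana / grafana.ini key path is an assumption about the deployment, not something taken from this commit.

# Hedged sketch: enabling the Grafana feature toggle via Helm values.
# Only the flag name comes from the CHANGELOG entry; the key path is assumed.
grafana:
  grafana.ini:
    feature_toggles:
      enable: prometheusSpecialCharsInLabelValues
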
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: viya-alerts
  namespace: monitoring
  labels:
    sas.com/monitoring-base: kube-viya-monitoring
spec:
  groups:
    - name: prod
      rules:
        - alert: cas-restart
          annotations:
            description:
              Check to see that the CAS pod has existed for only a short time. This
              implies that the CAS pod has restarted for whatever reason. Will need to
              further investigate the cause.
            summary:
              The current CAS (sas-cas-server-default-controller) pod < 15 minutes
              in existence. Most likely it is due to a restart of the CAS pod.
          expr: cas_grid_uptime_seconds_total
          for: 5m
          labels:
            severity: warning
        - alert: viya-readiness
          annotations:
            description:
              Checks for the Ready state of the sas-readiness pod. Will need to
              check the status of the Viya pods since the sas-readiness pod reflects
              the health of the Viya services.
            summary:
              sas-readiness pod is not in the Ready state. This means that one or
              more of the Viya services are not in a good state.
          expr: kube_pod_container_status_ready{container="sas-readiness"}
          for: 5m
          labels:
            severity: warning
        - alert: rabbitmq-readymessages
          annotations:
            description:
              Checks for an accumulation of Rabbitmq ready messages > 10,000. It
              could impact Model Studio pipelines. Follow the steps in the runbook URL
              to help troubleshoot. The runbook covers potential orphan queues and/or
              bottlenecking of queues due to the catalog service.
            summary:
              Rabbitmq ready messages > 10,000. This means there is a large backlog
              of messages due to high activity (which can be temporary) or something
              has gone wrong.
          expr: rabbitmq_queue_messages_ready
          for: 5m
          labels:
            severity: warning
        - alert: NFS-share
          annotations:
            description:
              Checks if the NFS share attached to CAS is > 85% full. Use the command
              "du -h -d 1" to find the location where large files are located in the
              NFS shares. Most likely it will be one of the home directories due to
              the runaway size of a casuser table or Viya backups.
            summary:
              NFS share > 85% full. Typically, it is due to users filling their
              own home directory or backups.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim="cas-default-data"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: cas-memory
          annotations:
            description:
              Checks the CAS memory usage. If it is > 300GB, it will alert. Currently,
              max. memory is 512GB. The expectation is that this alert will be an early
              warning sign to investigate large memory usage as typical usage is less
              than the threshold. Want to prevent an OOMkill of CAS.
            summary:
              CAS memory > 300GB. This can be due to a program or pipeline taking
              all the available memory.
          expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
          for: 5m
          labels:
            severity: warning
        - alert: catalog-dbconn
          annotations:
            description:
              "Checks the in-use catalog database connections > 21. The default
              db connection pool is 22. If it reaches the limit, the rabbitmq queues
              start to fill up with ready messages, causing issues with Model Studio
              pipelines.

              Click on the runbook URL for steps to remediate the issue."
            summary:
              The active catalog database connections > 21. If it reaches the
              max. db connections, it will impact the rabbitmq queues.
          expr: sas_db_pool_connections{container="sas-catalog-services", state="inUse"}
          for: 5m
          labels:
            severity: warning
        - alert: compute-age
          annotations:
            description:
              "It looks for compute pods > 1 day old. Most likely, it is an orphaned
              compute pod that is lingering. Consider killing it.

              There is an airflow job that regularly sweeps the VFL fleet to find
              these compute pods for deletion as well."
            summary:
              SAS compute-server pods > 1 day old. Compute pods in VFL do not need
              to be running longer than 1 day since there are no long running jobs.
          expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
          for: 5m
          labels:
            severity: warning
        - alert: crunchy-pgdata
          annotations:
            description:
              "Checks to see if the /pgdata filesystem is more than 50% full.

              Go to the Runbook URL to follow the troubleshooting steps."
            summary:
              /pgdata storage > 50% full. This typically happens when the WAL
              logs are increasing and not being cleared.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-00-.*"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: crunchy-backrest-repo
          annotations:
            description:
              "Checks to see if the /pgbackrest/repo1 filesystem is more than 50%
              full.

              Go to the Runbook URL to follow the troubleshooting steps."
            summary:
              /pgbackrest/repo1 storage > 50% full in the pgbackrest repo. This
              typically happens when the archived WAL logs are increasing and not
              being expired and cleared.
          expr:
            ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"}
            - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sas-crunchy-platform-postgres-repo1"})
            * 100
          for: 5m
          labels:
            severity: warning
        - alert: viya-pod-restarts
          annotations:
            description:
              Checks the restart count of the pod(s). Will need to check why
              the pod(s) have restarted so many times. One possible cause is an OOMkill.
              This means we will need to increase the memory limit.
            summary:
              The number of pod restarts > 20. The service pod(s) have restarted
              many times due to issues.
          expr: kube_pod_container_status_restarts_total{namespace="viya"}
          for: 5m
          labels:
            severity: warning
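
A note on the rules above: the threshold values called out in the summary annotations (< 15 minutes of uptime, > 10,000 ready messages, > 85% full, > 20 restarts, and so on) do not appear in the expressions themselves, and a Prometheus alerting rule fires for every sample its expression returns. The snippet below is a minimal sketch, not part of the commit, of how the comparisons implied by the annotations could be folded into a few of the expressions; the numeric thresholds are copied from the summary text and would need to be confirmed for the target environment.

# Hedged sketch only: comparisons implied by the summary annotations above.
- alert: cas-restart
  expr: cas_grid_uptime_seconds_total < 900   # under 15 minutes of uptime
- alert: viya-readiness
  expr: kube_pod_container_status_ready{container="sas-readiness"} == 0
- alert: rabbitmq-readymessages
  expr: rabbitmq_queue_messages_ready > 10000
- alert: viya-pod-restarts
  expr: kube_pod_container_status_restarts_total{namespace="viya"} > 20

The remaining rules already compute a percentage, gigabyte, connection, or day value, so the same pattern applies (for example > 85 for NFS-share, > 300 for cas-memory, > 21 for catalog-dbconn, > 1 for compute-age, > 50 for the two crunchy checks). Once finalized, the file can be loaded with kubectl apply -f <rule-file>.yaml, assuming the Prometheus Operator's ruleSelector matches the sas.com/monitoring-base: kube-viya-monitoring label.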
