
Commit 37d0bcd

begin cleanup of alerts
1 parent ccd5905 commit 37d0bcd

1 file changed (+11, −297 lines)

monitoring/alerting/rules/viya-alert-rules.yaml

Lines changed: 11 additions & 297 deletions
@@ -94,7 +94,7 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: cas-restart
+        title: CAS Restart Detected
         uid: fc41d560-9a18-4168-8a6a-615e60dc70de
       - annotations:
           description:
@@ -184,53 +184,8 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: cas-memory
+        title: CAS Memory Usage High
         uid: ca744a08-e4e9-49b7-85a1-79e9fe05d4c1
-      - annotations:
-          description:
-            Check to see that the CAS pod existed for a short time. This implies
-            that CAS pod has restarted for whatever the reason. Will need to further investigate
-            the cause.
-          summary:
-            The current CAS (sas-cas-server-default-controller) pod < 15 minutes
-            in existence. Mostly likely it is due to restart of the CAS pod.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr: cas_grid_uptime_seconds_total
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: CAS Restart Alert
-        uid: cas_restart_alert
-      - annotations:
-          description: Checks the CAS memory usage. If it is > 300GB, it will alert.
-          summary:
-            CAS memory > 300GB. This can be due to a program or pipeline taking
-            all the available memory.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr:
-                (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})
-                / 1073741824 > 300
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: CAS Memory Usage
-        uid: cas_memory_usage
       - annotations:
           description:
             CAS thread count is higher than 400. May indicate overloaded CAS
@@ -345,7 +300,7 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: viya-readiness
+        title: Viya Readiness Probe Failed
         uid: e45e6d74-e396-40ce-a061-2a294295e61b
       - annotations:
           description:
@@ -438,7 +393,7 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: rabbitmq-readymessages
+        title: RabbitMQ Ready Queue Backlog
         uid: efb36686-4e44-4de8-80c4-7dde9130da90
       - annotations:
           description:
@@ -527,7 +482,7 @@ groups:
         isPaused: true
         labels: {}
         noDataState: OK
-        title: compute-age
+        title: Stale Compute Pod Detected
         uid: ed69b8e4-ce60-44a0-8f51-83743df0e448
       - annotations:
           description:
@@ -618,91 +573,8 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: viya-pod-restarts
+        title: Viya Pod Restart Count High
         uid: e7ecb843-f1bd-48b7-8c8c-58571d1642ad
-      - annotations:
-          description:
-            Checks for the Ready state of sas-readiness pod. Will need to check
-            the status of the Viya pods since sas-readiness pod reflects the health of
-            the Viya services.
-          summary:
-            sas-readiness pod is not in Ready state. This means that one or more
-            of the Viya services are not in a good state.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr: kube_pod_container_status_ready{container="sas-readiness"}
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: Viya Readiness
-        uid: viya_readiness
-      - annotations:
-          description:
-            Checks for accumulation of Rabbitmq ready messages > 10,000. It
-            could impact Model Studio pipelines.
-          summary:
-            Rabbitmq ready messages > 10,000. This means there is a large backlog
-            of messages due to high activity or something has gone wrong.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr: rabbitmq_queue_messages_ready > 10000
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: RabbitMQ Ready Messages
-        uid: rabbitmq_ready_msgs
-      - annotations:
-          description: Looks for compute pods > 1 day.
-          summary: SAS compute-server pods > 1 day old.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr:
-                (time() - kube_pod_created{pod=~"sas-compute-server-.*"}) / 60 / 60
-                / 24 > 1
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: Compute Pod Age
-        uid: compute_pod_age
-      - annotations:
-          description: Checks if any Viya pods have restarted > 20 times.
-          summary: The number of pod restarts > 20. Investigate for OOM or instability.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr: kube_pod_container_status_restarts_total{namespace="viya"} > 20
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: Viya Pod Restarts
-        uid: viya_pod_restarts
       - annotations:
           description:
             RabbitMQ has a high number of unacknowledged messages. This may
@@ -721,7 +593,7 @@ groups:
         for: 5m
         labels:
           severity: warning
-        title: RabbitMQ Unacked Messages High
+        title: RabbitMQ Unacked Queue Backlog
         uid: rabbitmq_unacked_messages
       - annotations:
           description:
@@ -743,105 +615,13 @@ groups:
         for: 5m
         labels:
           severity: warning
-        title: Viya API Latency High
+        title: High Viya API Latency
         uid: viya_api_latency
   - folder: Other Alerts
     interval: 5m
     name: SAS Viya Alerts
     orgId: 1
     rules:
-      - annotations:
-          description:
-            Checks if the NFS share attached to CAS is > 85% full. Use command
-            "du -h -d 1" to to find the location where large files are located in the
-            NFS shares. Most likely it will be one of the home directories due to runaway
-            size of a casuser table or Viya backups.
-          summary:
-            NFS share > 85% full. Typically, it is due to users filling their own
-            home directory or backups.
-        condition: C
-        data:
-          - datasourceUid: prometheus
-            model:
-              editorMode: code
-              expr:
-                ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"}
-                - kubelet_volume_stats_available_bytes{persistentvolumeclaim="cas-default-data"})
-                / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"})
-                * 100
-              instant: true
-              intervalMs: 1000
-              legendFormat: __auto
-              maxDataPoints: 43200
-              range: false
-              refId: A
-            refId: A
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params: []
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - B
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: A
-              intervalMs: 1000
-              maxDataPoints: 43200
-              reducer: last
-              refId: B
-              type: reduce
-            refId: B
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params:
-                      - 85
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - C
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: B
-              intervalMs: 1000
-              maxDataPoints: 43200
-              refId: C
-              type: threshold
-            refId: C
-            relativeTimeRange:
-              from: 600
-              to: 0
-        execErrState: Error
-        for: 5m
-        isPaused: false
-        labels: {}
-        noDataState: NoData
-        title: NFS-share
-        uid: d52b3c24-acf4-4b5e-ae52-31ff8f167330
       - annotations:
           description:
             "Checks to see /pgbackrest/repo1 filesystem is more than 50% full.
@@ -956,32 +736,8 @@ groups:
         for: 5m
         labels:
           severity: warning
-        title: NFS Share Usage
+        title: NFS Share Usage High
         uid: nfs_share_usage
-      - annotations:
-          description: Checks if /pgbackrest/repo1 is more than 50% full.
-          summary:
-            /pgbackrest/repo1 storage > 50% full. Possibly due to unexpired WAL
-            logs.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr:
-                "((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-repo1\"\
-                }\n - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-repo1\"\
-                })\n / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-repo1\"\
-                }) * 100 > 50"
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: Crunchy Backrest Repo Usage
-        uid: pgbackrest_repo_usage
   - folder: Database Alerts
     interval: 5m
     name: SAS Viya Alerts
@@ -1078,7 +834,7 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: catalog-dbconn
+        title: Catalog DB Connections High
         uid: fc65fbaf-c196-4eb4-a130-f45cc46b775b
       - annotations:
           description: "Checks to see /pgdata filesystem is more than 50% full.
@@ -1172,50 +928,8 @@ groups:
         isPaused: false
         labels: {}
         noDataState: NoData
-        title: crunchy-pgdata
+        title: Crunchy PGData Usage High
         uid: fb411e28-b2e5-43d0-a413-e6dedbf154c4
-      - annotations:
-          description: Checks the in-use catalog database connections > 21.
-          summary: The active catalog database connections > 21. May impact RabbitMQ queues.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr:
-                sas_db_pool_connections{container="sas-catalog-services", state="inUse"}
-                > 21
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: Catalog DB Connections
-        uid: catalog_db_connections
-      - annotations:
-          description: Checks if /pgdata is more than 50% full.
-          summary: /pgdata storage > 50% full. Often due to WAL logs not being cleared.
-        condition: A
-        data:
-          - datasourceUid: prometheus
-            model:
-              expr:
-                "((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-00-.*\"\
-                }\n - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-00-.*\"\
-                })\n / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-00-.*\"\
-                }) * 100 > 50"
-              instant: true
-              refId: A
-            relativeTimeRange:
-              from: 300
-              to: 0
-        for: 5m
-        labels:
-          severity: warning
-        title: Crunchy PGData Usage
-        uid: pgdata_usage
       - annotations:
           description: PostgreSQL database connection usage is above 85% of max connections.
           summary: Database is nearing connection limit.
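
Note: every rule touched above follows the same Grafana file-provisioned alert-rule schema already used throughout viya-alert-rules.yaml. As a quick reference when consolidating further alerts, a minimal sketch of one rule entry is shown below; the uid, title, and expr are illustrative placeholders only, not values taken from this file.

groups:
  - folder: Other Alerts
    interval: 5m
    name: SAS Viya Alerts
    orgId: 1
    rules:
      - annotations:
          description: What the alert checks and how to follow up.
          summary: Short statement of the alerting condition.
        condition: A
        data:
          - datasourceUid: prometheus
            model:
              # placeholder PromQL; replace with the real check
              expr: up{namespace="viya"} == 0
              instant: true
              refId: A
            relativeTimeRange:
              from: 300
              to: 0
        execErrState: Error
        for: 5m
        isPaused: false
        labels:
          severity: warning
        noDataState: NoData
        title: Example Alert Title
        uid: example_alert_uid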
