@@ -94,7 +94,7 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: cas-restart
+ title: CAS Restart Detected
uid: fc41d560-9a18-4168-8a6a-615e60dc70de
- annotations:
description:
@@ -184,53 +184,8 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: cas-memory
+ title: CAS Memory Usage High
uid: ca744a08-e4e9-49b7-85a1-79e9fe05d4c1
- - annotations:
- description:
- Check to see that the CAS pod existed for a short time. This implies
- that CAS pod has restarted for whatever the reason. Will need to further investigate
- the cause.
- summary:
- The current CAS (sas-cas-server-default-controller) pod < 15 minutes
- in existence. Mostly likely it is due to restart of the CAS pod.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr: cas_grid_uptime_seconds_total
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: CAS Restart Alert
- uid: cas_restart_alert
- - annotations:
- description: Checks the CAS memory usage. If it is > 300GB, it will alert.
- summary:
- CAS memory > 300GB. This can be due to a program or pipeline taking
- all the available memory.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr:
- (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})
- / 1073741824 > 300
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: CAS Memory Usage
- uid: cas_memory_usage
- annotations:
description:
CAS thread count is higher than 400. May indicate overloaded CAS
@@ -345,7 +300,7 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: viya-readiness
+ title: Viya Readiness Probe Failed
uid: e45e6d74-e396-40ce-a061-2a294295e61b
- annotations:
description:
@@ -438,7 +393,7 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: rabbitmq-readymessages
+ title: RabbitMQ Ready Queue Backlog
uid: efb36686-4e44-4de8-80c4-7dde9130da90
- annotations:
description:
@@ -527,7 +482,7 @@ groups:
isPaused: true
labels: {}
noDataState: OK
- title: compute-age
+ title: Stale Compute Pod Detected
uid: ed69b8e4-ce60-44a0-8f51-83743df0e448
- annotations:
description:
@@ -618,91 +573,8 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: viya-pod-restarts
+ title: Viya Pod Restart Count High
uid: e7ecb843-f1bd-48b7-8c8c-58571d1642ad
- - annotations:
- description:
- Checks for the Ready state of sas-readiness pod. Will need to check
- the status of the Viya pods since sas-readiness pod reflects the health of
- the Viya services.
- summary:
- sas-readiness pod is not in Ready state. This means that one or more
- of the Viya services are not in a good state.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr: kube_pod_container_status_ready{container="sas-readiness"}
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: Viya Readiness
- uid: viya_readiness
- - annotations:
- description:
- Checks for accumulation of Rabbitmq ready messages > 10,000. It
- could impact Model Studio pipelines.
- summary:
- Rabbitmq ready messages > 10,000. This means there is a large backlog
- of messages due to high activity or something has gone wrong.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr: rabbitmq_queue_messages_ready > 10000
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: RabbitMQ Ready Messages
- uid: rabbitmq_ready_msgs
- - annotations:
- description: Looks for compute pods > 1 day.
- summary: SAS compute-server pods > 1 day old.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr:
- (time() - kube_pod_created{pod=~"sas-compute-server-.*"}) / 60 / 60
- / 24 > 1
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: Compute Pod Age
- uid: compute_pod_age
- - annotations:
- description: Checks if any Viya pods have restarted > 20 times.
- summary: The number of pod restarts > 20. Investigate for OOM or instability.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr: kube_pod_container_status_restarts_total{namespace="viya"} > 20
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: Viya Pod Restarts
- uid: viya_pod_restarts
- annotations:
description:
RabbitMQ has a high number of unacknowledged messages. This may
@@ -721,7 +593,7 @@ groups:
for: 5m
labels:
severity: warning
- title: RabbitMQ Unacked Messages High
+ title: RabbitMQ Unacked Queue Backlog
uid: rabbitmq_unacked_messages
- annotations:
description:
@@ -743,105 +615,13 @@ groups:
for: 5m
labels:
severity: warning
- title: Viya API Latency High
+ title: High Viya API Latency
uid: viya_api_latency
- folder: Other Alerts
interval: 5m
name: SAS Viya Alerts
orgId: 1
rules:
- - annotations:
- description:
- Checks if the NFS share attached to CAS is > 85% full. Use command
- "du -h -d 1" to to find the location where large files are located in the
- NFS shares. Most likely it will be one of the home directories due to runaway
- size of a casuser table or Viya backups.
- summary:
- NFS share > 85% full. Typically, it is due to users filling their own
- home directory or backups.
- condition: C
- data:
- - datasourceUid: prometheus
- model:
- editorMode: code
- expr:
- ((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"}
- - kubelet_volume_stats_available_bytes{persistentvolumeclaim="cas-default-data"})
- / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="cas-default-data"})
- * 100
- instant: true
- intervalMs: 1000
- legendFormat: __auto
- maxDataPoints: 43200
- range: false
- refId: A
- refId: A
- relativeTimeRange:
- from: 600
- to: 0
- - datasourceUid: __expr__
- model:
- conditions:
- - evaluator:
- params: []
- type: gt
- operator:
- type: and
- query:
- params:
- - B
- reducer:
- params: []
- type: last
- type: query
- datasource:
- type: __expr__
- uid: __expr__
- expression: A
- intervalMs: 1000
- maxDataPoints: 43200
- reducer: last
- refId: B
- type: reduce
- refId: B
- relativeTimeRange:
- from: 600
- to: 0
- - datasourceUid: __expr__
- model:
- conditions:
- - evaluator:
- params:
- - 85
- type: gt
- operator:
- type: and
- query:
- params:
- - C
- reducer:
- params: []
- type: last
- type: query
- datasource:
- type: __expr__
- uid: __expr__
- expression: B
- intervalMs: 1000
- maxDataPoints: 43200
- refId: C
- type: threshold
- refId: C
- relativeTimeRange:
- from: 600
- to: 0
- execErrState: Error
- for: 5m
- isPaused: false
- labels: {}
- noDataState: NoData
- title: NFS-share
- uid: d52b3c24-acf4-4b5e-ae52-31ff8f167330
- annotations:
description:
"Checks to see /pgbackrest/repo1 filesystem is more than 50% full.
@@ -956,32 +736,8 @@ groups:
for: 5m
labels:
severity: warning
- title: NFS Share Usage
+ title: NFS Share Usage High
uid: nfs_share_usage
- - annotations:
- description: Checks if /pgbackrest/repo1 is more than 50% full.
- summary:
- /pgbackrest/repo1 storage > 50% full. Possibly due to unexpired WAL
- logs.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr:
- "((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-repo1\"\
- }\n - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-repo1\"\
- })\n / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-repo1\"\
- }) * 100 > 50"
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: Crunchy Backrest Repo Usage
- uid: pgbackrest_repo_usage
- folder: Database Alerts
interval: 5m
name: SAS Viya Alerts
@@ -1078,7 +834,7 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: catalog-dbconn
+ title: Catalog DB Connections High
uid: fc65fbaf-c196-4eb4-a130-f45cc46b775b
- annotations:
description: "Checks to see /pgdata filesystem is more than 50% full.
@@ -1172,50 +928,8 @@ groups:
isPaused: false
labels: {}
noDataState: NoData
- title: crunchy-pgdata
+ title: Crunchy PGData Usage High
uid: fb411e28-b2e5-43d0-a413-e6dedbf154c4
- - annotations:
- description: Checks the in-use catalog database connections > 21.
- summary: The active catalog database connections > 21. May impact RabbitMQ queues.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr:
- sas_db_pool_connections{container="sas-catalog-services", state="inUse"}
- > 21
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: Catalog DB Connections
- uid: catalog_db_connections
- - annotations:
- description: Checks if /pgdata is more than 50% full.
- summary: /pgdata storage > 50% full. Often due to WAL logs not being cleared.
- condition: A
- data:
- - datasourceUid: prometheus
- model:
- expr:
- "((kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-00-.*\"\
- }\n - kubelet_volume_stats_available_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-00-.*\"\
- })\n / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~\"sas-crunchy-platform-postgres-00-.*\"\
- }) * 100 > 50"
- instant: true
- refId: A
- relativeTimeRange:
- from: 300
- to: 0
- for: 5m
- labels:
- severity: warning
- title: Crunchy PGData Usage
- uid: pgdata_usage
- annotations:
description: PostgreSQL database connection usage is above 85% of max connections.
summary: Database is nearing connection limit.
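Note: the diff above is shown with the provisioning file's YAML indentation flattened. As a minimal sketch of how one of these rules sits in Grafana's alert-rule file-provisioning layout, here is the removed duplicate CAS memory rule reassembled with indentation restored; the indentation and the group wrapper fields (orgId, name, folder, interval) are assumptions copied from a later hunk for illustration, not taken from this rule's own context in the file.

    groups:
      - orgId: 1                  # group wrapper assumed for illustration
        name: SAS Viya Alerts
        folder: Other Alerts
        interval: 5m
        rules:
          - uid: cas_memory_usage
            title: CAS Memory Usage
            condition: A
            data:
              - refId: A
                relativeTimeRange:
                  from: 300
                  to: 0
                datasourceUid: prometheus
                model:
                  # physical memory in use, converted from bytes with / 1073741824 (GiB), alert above 300
                  expr: >-
                    (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})
                    / 1073741824 > 300
                  instant: true
                  refId: A
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: >-
                CAS memory > 300GB. This can be due to a program or pipeline taking
                all the available memory.
              description: Checks the CAS memory usage. If it is > 300GB, it will alert.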