Commit f17ee7f

created alerts directory in samples
1 parent c63002b commit f17ee7f

15 files changed, 1060 additions, 0 deletions

samples/alerts/README.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Alert Rules Structure

This directory contains Grafana alert rules for monitoring SAS Viya environments. The alerts are organized into subdirectories by component/category:

- `cas/` - Alerts for CAS (Cloud Analytic Services)
- `database/` - Alerts for database services
- `platform/` - Alerts for Viya platform components
- `other/` - Miscellaneous alerts

## Alert Files Structure

Each alert is stored in its own YAML file with a descriptive name. This modular approach makes it easier to:

- Manage individual alerts
- Track changes in version control
- Enable/disable specific alerts
- Customize alerts for specific environments
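
As an illustration of enabling or disabling a single alert: several rule files in this commit carry an `isPaused` field, so a rule can plausibly be switched off by editing only its own file. The fragment below is a minimal sketch, not a tested configuration. It assumes the Grafana version in use honors `isPaused` for file-provisioned rules, and it borrows the title and `uid` of the CAS memory alert purely as an example.

```yaml
# Fragment of a single rule file; only the fields relevant to pausing are shown.
rules:
  - title: CAS Memory Usage High
    uid: ca744a08-e4e9-49b7-85a1-79e9fe05d4c1
    isPaused: true   # assumption: true stops evaluation without deleting the rule
```

Because each alert has its own file, a change like this shows up in version control as a one-line diff against that alert only.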

## Alert File Format

Each alert file follows this structure:

```yaml
apiVersion: 1
groups:
  - interval: 5m # How often the alert is evaluated
    folder: Category Name # The folder where the alert appears in Grafana
    name: SAS Viya Alerts # The alert group name
    orgId: 1
    rules:
      - title: Alert Title # The name of the alert
        annotations:
          description: Detailed explanation of the alert condition
          summary: Brief summary of the alert
        condition: C # The condition reference letter
        data:
        # The alert query and evaluation conditions
        execErrState: Error
        for: 5m # Duration before alert fires
        labels:
          severity: warning # Alert severity
        noDataState: NoData
        uid: unique-alert-id # Unique identifier for the alert
```
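
The `data` placeholder in the template is where the query and its evaluation steps go. Most of the alert files added in this commit use a three-step chain: a Prometheus query (`A`), a reduce expression (`B`), and a threshold expression (`C`) that `condition` points to. The sketch below is a trimmed version of that shape, adapted from the CAS memory alert. Fields that Grafana writes on export (`intervalMs`, `maxDataPoints`, the `conditions` block on the reduce step, and so on) are left out for readability. Whether a given Grafana version accepts the trimmed form on import is an assumption that has not been verified here.

```yaml
condition: C
data:
  - refId: A                      # step 1: query the data source
    datasourceUid: prometheus
    relativeTimeRange:
      from: 600                   # look back 600 seconds
      to: 0
    model:
      refId: A
      expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
      instant: true
  - refId: B                      # step 2: reduce A to its last value
    datasourceUid: __expr__
    model:
      refId: B
      type: reduce
      expression: A
      reducer: last
  - refId: C                      # step 3: compare B against the threshold
    datasourceUid: __expr__
    model:
      refId: C
      type: threshold
      expression: B
      conditions:
        - evaluator:
            type: gt
            params:
              - 300               # used memory after dividing bytes by 1073741824 (2^30)
```

The CAS thread count alert in the same commit takes a shorter route: it puts the comparison directly in the PromQL expression (`cas_thread_count > 400`) and sets `condition: A`, so no reduce or threshold step is needed.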

## Legacy Alert Files

The original monolithic alert files (cas_alerts.yaml, database_alerts.yaml, etc.) are still present for backward compatibility. These files will be deprecated in future releases, so we recommend using the individual alert files going forward.

## Customizing Alerts

To customize an alert:

1. Copy the alert file to your user directory
2. Modify the alert parameters as needed (thresholds, evaluation intervals, etc.)
3. Deploy the monitoring components to apply your custom alerts
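
For step 2, the threshold of a three-step alert sits in the `evaluator.params` list of the expression that `condition` refers to, and the wait time sits in `for`. The excerpt below sketches a customized copy of the catalog connections alert. The values `18` and `10m` are hypothetical and only show where the edits would go. The shipped file uses `21` and `5m`.

```yaml
# Excerpt of a customized copy; unchanged fields are omitted for brevity.
rules:
  - title: Catalog DB Connections High
    condition: C
    data:
      # ... steps A and B as in the original file ...
      - refId: C
        datasourceUid: __expr__
        model:
          refId: C
          type: threshold
          expression: B
          conditions:
            - evaluator:
                type: gt
                params:
                  - 18        # hypothetical: alert earlier than the shipped value of 21
    for: 10m                  # hypothetical: wait longer before the alert fires
```

If the threshold changes, the `description` and `summary` annotations in the same file probably need the same number updated so the alert text stays accurate.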

For more detailed information on Grafana alerting, see the [Grafana documentation](https://grafana.com/docs/grafana/latest/alerting/).
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
apiVersion: 1
groups:
  - interval: 5m
    folder: CAS Alerts
    name: SAS Viya Alerts
    orgId: 1
    rules:
      - title: CAS Memory Usage High
        annotations:
          description:
            Checks the CAS memory usage. If it is > 300GB, it will alert. Currently,
            max. memory is 512GB. The expectation is that this alert will be an early
            warning sign to investigate large memory usage as typical usage is less than
            the threshold. Want to prevent OOMkill of CAS.
          summary:
            CAS memory > 300GB. This can be due to a program or pipeline taking
            all the available memory.
        condition: C
        data:
          - datasourceUid: prometheus
            model:
              editorMode: code
              exemplar: false
              expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
              instant: true
              interval: ""
              intervalMs: 1000
              legendFormat: __auto
              maxDataPoints: 43200
              range: false
              refId: A
            refId: A
            relativeTimeRange:
              from: 600
              to: 0
          - datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: []
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              reducer: last
              refId: B
              type: reduce
            refId: B
            relativeTimeRange:
              from: 600
              to: 0
          - datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 300
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - C
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: B
              intervalMs: 1000
              maxDataPoints: 43200
              refId: C
              type: threshold
            refId: C
            relativeTimeRange:
              from: 600
              to: 0
        execErrState: Error
        for: 5m
        isPaused: false
        labels: {}
        noDataState: NoData
        uid: ca744a08-e4e9-49b7-85a1-79e9fe05d4c1
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
apiVersion: 1
groups:
  - interval: 5m
    folder: CAS Alerts
    name: SAS Viya Alerts
    orgId: 1
    rules:
      - title: CAS Restart Detected
        annotations:
          description:
            Checks whether the CAS pod has existed for only a short time. This implies
            that the CAS pod has restarted for whatever reason. The cause needs further
            investigation.
          summary:
            The current CAS (sas-cas-server-default-controller) pod has been in existence
            for < 15 minutes. Most likely this is due to a restart of the CAS pod.
        condition: C
        data:
          - datasourceUid: prometheus
            model:
              disableTextWrap: false
              editorMode: code
              expr: cas_grid_uptime_seconds_total
              fullMetaSearch: false
              includeNullMetadata: true
              instant: true
              intervalMs: 1000
              legendFormat: __auto
              maxDataPoints: 43200
              range: false
              refId: A
              useBackend: false
            refId: A
            relativeTimeRange:
              from: 600
              to: 0
          - datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: []
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              reducer: last
              refId: B
              type: reduce
            refId: B
            relativeTimeRange:
              from: 600
              to: 0
          - datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 900
                    type: lt
                  operator:
                    type: and
                  query:
                    params:
                      - C
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: B
              intervalMs: 1000
              maxDataPoints: 43200
              refId: C
              type: threshold
            refId: C
            relativeTimeRange:
              from: 600
              to: 0
        execErrState: Error
        for: 5m
        isPaused: false
        labels: {}
        noDataState: NoData
        uid: fc41d560-9a18-4168-8a6a-615e60dc70de
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
apiVersion: 1
groups:
  - interval: 5m
    folder: CAS Alerts
    name: SAS Viya Alerts
    orgId: 1
    rules:
      - title: CAS Thread Count High
        annotations:
          description:
            CAS thread count is higher than 400. May indicate overloaded CAS
            server.
          summary: CAS is using more than 400 threads.
        condition: A
        data:
          - datasourceUid: prometheus
            model:
              expr: cas_thread_count > 400
              instant: true
              refId: A
            relativeTimeRange:
              from: 300
              to: 0
        for: 5m
        labels:
          severity: warning
        uid: cas_thread_count
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
apiVersion: 1
groups:
  - interval: 5m
    name: SAS Viya Alerts
    folder: Database Alerts
    orgId: 1
    rules:
      - title: Catalog DB Connections High
        annotations:
          description:
            Checks whether the in-use catalog database connections are > 21. The default
            db connection pool is 22. If it reaches the limit, the rabbitmq queues start
            to fill up with ready messages, causing issues with Model Studio pipelines.
          summary:
            The active catalog database connections > 21. If it reaches the max.
            db connections, it will impact the rabbitmq queues.
        condition: C
        data:
          - datasourceUid: prometheus
            model:
              disableTextWrap: false
              editorMode: builder
              expr: sas_db_pool_connections{container="sas-catalog-services", state="inUse"}
              fullMetaSearch: false
              includeNullMetadata: true
              instant: true
              intervalMs: 1000
              legendFormat: __auto
              maxDataPoints: 43200
              range: false
              refId: A
              useBackend: false
            refId: A
            relativeTimeRange:
              from: 600
              to: 0
          - datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: []
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              reducer: last
              refId: B
              type: reduce
            refId: B
            relativeTimeRange:
              from: 600
              to: 0
          - datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 21
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - C
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: B
              intervalMs: 1000
              maxDataPoints: 43200
              refId: C
              type: threshold
            refId: C
            relativeTimeRange:
              from: 600
              to: 0
        execErrState: Error
        for: 5m
        isPaused: false
        labels: {}
        noDataState: NoData
        uid: fc65fbaf-c196-4eb4-a130-f45cc46b775b
