Commit 2a1cadd

Merge pull request #759 from sassoftware/grafana-alerts
Grafana alerts
2 parents ba41dc6 + 007ea75 commit 2a1cadd

10 files changed (+1008 -44 lines changed)

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -37,6 +37,8 @@ SAS Viya Workload node placement strategy.
 * [UPGRADE] OpenSearch Data Source Plugin to Grafana upgraded from 2.23.1 to 2.24.0
 * [UPGRADE] Admission Webhook upgraded from v1.5.1 to v1.5.2
 * [CHANGE] Enable Grafana feature flag: prometheusSpecialCharsInLabelValues to improve handling of special characters in metric labels (addresses #699)
+* [FEATURE] A set of SAS Viya-specific alerts is now deployed with Grafana. Administrators can configure notifiers (which send messages via e-mail, Slack, SMS, etc. based on these alerts) and additional alerts via the Grafana web application after deployment. Alternatively, notifiers and/or additional alerts can be defined before running the monitoring deployment script (`deploy_monitoring_cluster.sh`) by placing YAML files in `$USER_DIR/monitoring/alerting/`. Note: Because Grafana uses a single folder namespace, the folders used to organize these new alerts also appear when viewing dashboards, where they appear to be empty. When working with dashboards, these folders can be ignored.
+

 * **Logging**
 * [FIX] Resolved issue causing deploy_esexporter.sh to fail when doing an upgrade-in-place and serviceMonitor CRD is not installed.
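The [FEATURE] entry above describes placing YAML files in `$USER_DIR/monitoring/alerting/` before running `deploy_monitoring_cluster.sh`. As a rough pre-deployment check, the sketch below lists the files found in that directory. The helper name `list_alerting_files` and the choice to accept both `.yaml` and `.yml` suffixes are assumptions made for illustration; the changelog does not say how the deployment script discovers or filters these files.

```python
import os
from pathlib import Path


def list_alerting_files(user_dir):
    """Return sorted names of YAML files under <user_dir>/monitoring/alerting.

    Hypothetical helper: it only inspects the directory named in the
    changelog entry and does not reproduce the deployment script's logic.
    """
    alerting_dir = Path(user_dir) / "monitoring" / "alerting"
    if not alerting_dir.is_dir():
        # Nothing has been staged for alerting customization.
        return []
    return sorted(
        entry.name
        for entry in alerting_dir.iterdir()
        # Assumption: both common YAML suffixes are of interest.
        if entry.is_file() and entry.suffix in {".yaml", ".yml"}
    )


if __name__ == "__main__":
    user_dir = os.environ.get("USER_DIR")
    if user_dir is None:
        print("USER_DIR is not set")
    else:
        for name in list_alerting_files(user_dir):
            print(name)
```

Running the script with `USER_DIR` exported prints one file name per line, which gives a quick way to confirm what has been staged before deployment.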
Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
+apiVersion: 1
+groups:
+  - interval: 5m
+    folder: CAS Alerts
+    name: SAS Viya Alerts
+    orgId: 1
+    rules:
+      - title: CAS Restart Detected
+        annotations:
+          description:
+            Checks whether the CAS pod has existed for only a short time. This implies
+            that the CAS pod has restarted for some reason. The cause needs further
+            investigation.
+          summary:
+            The current CAS (sas-cas-server-default-controller) pod has existed for
+            less than 15 minutes. Most likely this is due to a restart of the CAS pod.
+        condition: C
+        data:
+          - datasourceUid: prometheus
+            model:
+              disableTextWrap: false
+              editorMode: code
+              expr: cas_grid_uptime_seconds_total
+              fullMetaSearch: false
+              includeNullMetadata: true
+              instant: true
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: false
+              refId: A
+              useBackend: false
+            refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: []
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - B
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: A
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: B
+              type: reduce
+            refId: B
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 900
+                    type: lt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - C
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: B
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: C
+              type: threshold
+            refId: C
+            relativeTimeRange:
+              from: 600
+              to: 0
+        execErrState: Error
+        for: 5m
+        isPaused: false
+        labels: {}
+        noDataState: NoData
+        uid: fc41d560-9a18-4168-8a6a-615e60dc70de
+      - title: CAS Memory Usage High
+        annotations:
+          description:
+            Checks CAS memory usage and alerts if it exceeds 300GB. Currently,
+            maximum memory is 512GB. This alert is intended as an early warning
+            to investigate large memory usage, since typical usage is below
+            the threshold. The goal is to prevent an OOMKill of CAS.
+          summary:
+            CAS memory > 300GB. This can be due to a program or pipeline taking
+            all the available memory.
+        condition: C
+        data:
+          - datasourceUid: prometheus
+            model:
+              editorMode: code
+              exemplar: false
+              expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
+              instant: true
+              interval: ""
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: false
+              refId: A
+            refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: []
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - B
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: A
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: B
+              type: reduce
+            refId: B
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 300
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - C
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: B
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: C
+              type: threshold
+            refId: C
+            relativeTimeRange:
+              from: 600
+              to: 0
+        execErrState: Error
+        for: 5m
+        isPaused: false
+        labels: {}
+        noDataState: NoData
+        uid: ca744a08-e4e9-49b7-85a1-79e9fe05d4c1
+      - title: CAS Thread Count High
+        annotations:
+          description:
+            CAS thread count is higher than 400. This may indicate an overloaded CAS
+            server.
+          summary: CAS is using more than 400 threads.
+        condition: A
+        data:
+          - datasourceUid: prometheus
+            model:
+              expr: cas_thread_count > 400
+              instant: true
+            refId: A
+            relativeTimeRange:
+              from: 300
+              to: 0
+        for: 5m
+        labels:
+          severity: warning
+        uid: cas_thread_count
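The three rules above each reduce to a single numeric comparison: uptime below 900 seconds, used physical memory above 300 (after dividing bytes by 1073741824), and thread count above 400. The following Python sketch restates those comparisons so the thresholds and the unit conversion are easy to check by hand. The function names are hypothetical and are not part of the commit; the sketch also ignores the `for: 5m` pending period and the `noDataState`/`execErrState` handling that Grafana applies on top of the raw condition.

```python
BYTES_PER_GIB = 1073741824  # divisor used in the memory rule's expr (2**30)

RESTART_UPTIME_SECONDS = 900   # threshold in "CAS Restart Detected" (lt)
MEMORY_USED_LIMIT = 300        # threshold in "CAS Memory Usage High" (gt)
THREAD_COUNT_LIMIT = 400       # threshold in "CAS Thread Count High" (>)


def used_memory_gib(size_bytes, free_bytes):
    """Mirror the rule's expr: (size - free) / 1073741824."""
    return (size_bytes - free_bytes) / BYTES_PER_GIB


def restart_detected(uptime_seconds):
    """True when the last uptime sample is below 900 seconds (15 minutes)."""
    return uptime_seconds < RESTART_UPTIME_SECONDS


def memory_high(size_bytes, free_bytes):
    """True when used physical memory exceeds the 300 threshold."""
    return used_memory_gib(size_bytes, free_bytes) > MEMORY_USED_LIMIT


def thread_count_high(thread_count):
    """True when the thread count is strictly greater than 400."""
    return thread_count > THREAD_COUNT_LIMIT
```

For example, a node reporting 512 GiB of physical memory with 200 GiB free has 312 GiB in use, so `memory_high` returns `True`. Note that the divisor converts bytes to GiB, while the rule's annotations describe the threshold in GB.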
