Commit 2226cd9

broke alerts folders into sep files
1 parent c7a3d5a commit 2226cd9

5 files changed, 867 additions and 861 deletions

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
+apiVersion: 1
+groups:
+  - interval: 5m
+    folder: CAS Alerts
+    name: SAS Viya Alerts
+    orgId: 1
+    rules:
+      - title: CAS Restart Detected
+        annotations:
+          description:
+            Checks whether the CAS pod has existed for only a short time. This implies
+            that the CAS pod has restarted for some reason. The cause needs further
+            investigation.
+          summary:
+            The current CAS (sas-cas-server-default-controller) pod has existed for
+            less than 15 minutes. Most likely this is due to a restart of the CAS pod.
+        condition: C
+        data:
+          - datasourceUid: prometheus
+            model:
+              disableTextWrap: false
+              editorMode: code
+              expr: cas_grid_uptime_seconds_total
+              fullMetaSearch: false
+              includeNullMetadata: true
+              instant: true
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: false
+              refId: A
+              useBackend: false
+            refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: []
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - B
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: A
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: B
+              type: reduce
+            refId: B
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 900
+                    type: lt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - C
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: B
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: C
+              type: threshold
+            refId: C
+            relativeTimeRange:
+              from: 600
+              to: 0
+        execErrState: Error
+        for: 5m
+        isPaused: false
+        labels: {}
+        noDataState: NoData
+        uid: fc41d560-9a18-4168-8a6a-615e60dc70de
+      - title: CAS Memory Usage High
+        annotations:
+          description:
+            Checks CAS memory usage and alerts if it is > 300GB. Currently, the
+            maximum memory is 512GB. This alert is intended as an early
+            warning to investigate large memory usage, since typical usage is below
+            the threshold. The goal is to prevent an OOM kill of CAS.
+          summary:
+            CAS memory > 300GB. This can be due to a program or pipeline taking
+            all the available memory.
+        condition: C
+        data:
+          - datasourceUid: prometheus
+            model:
+              editorMode: code
+              exemplar: false
+              expr: (cas_node_mem_size_bytes{type="physical"} - cas_node_mem_free_bytes{type="physical"})/1073741824
+              instant: true
+              interval: ""
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: false
+              refId: A
+            refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: []
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - B
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: A
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: B
+              type: reduce
+            refId: B
+            relativeTimeRange:
+              from: 600
+              to: 0
+          - datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 300
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - C
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: __expr__
+              expression: B
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: C
+              type: threshold
+            refId: C
+            relativeTimeRange:
+              from: 600
+              to: 0
+        execErrState: Error
+        for: 5m
+        isPaused: false
+        labels: {}
+        noDataState: NoData
+        uid: ca744a08-e4e9-49b7-85a1-79e9fe05d4c1
+      - title: CAS Thread Count High
+        annotations:
+          description:
+            CAS thread count is higher than 400. May indicate an overloaded CAS
+            server.
+          summary: CAS is using more than 400 threads.
+        condition: A
+        data:
+          - datasourceUid: prometheus
+            model:
+              expr: cas_thread_count > 400
+              instant: true
+              refId: A
+            relativeTimeRange:
+              from: 300
+              to: 0
+        for: 5m
+        labels:
+          severity: warning
+        uid: cas_thread_count
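
The three rules in this file each reduce to a single threshold comparison. As a rough illustration only (not part of this commit), the Python sketch below mirrors that logic on plain numbers. The function names and sample values are hypothetical; the thresholds (900 seconds, 300, and 400) and the 1073741824 divisor are copied from the rule definitions. It deliberately ignores Grafana's `for: 5m` pending period and the NoData/Error states.

```python
# Hypothetical helpers that mirror the threshold checks in the alert rules.
# They operate on plain numbers, not on live Prometheus or Grafana data.

GIB = 1073741824  # divisor used in the memory rule's expr (bytes per GiB)


def cas_restart_detected(uptime_seconds: float, threshold_seconds: float = 900) -> bool:
    """Mirrors the 'lt 900' threshold applied to cas_grid_uptime_seconds_total."""
    return uptime_seconds < threshold_seconds


def cas_memory_high(size_bytes: float, free_bytes: float, threshold: float = 300) -> bool:
    """Mirrors (size - free) / 1073741824 > 300 from the memory rule."""
    used = (size_bytes - free_bytes) / GIB
    return used > threshold


def cas_thread_count_high(thread_count: float, threshold: float = 400) -> bool:
    """Mirrors the 'cas_thread_count > 400' expression."""
    return thread_count > threshold


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(cas_restart_detected(120))
    print(cas_memory_high(512 * GIB, 100 * GIB))
    print(cas_thread_count_high(350))
```

Because the memory expression divides by 1073741824, the value compared against 300 is strictly in GiB, even though the annotations say GB.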
