Commit c7a3d5a

remove stale pop alert
1 parent 37d0bcd commit c7a3d5a

1 file changed: +0 -89 lines changed

monitoring/alerting/rules/viya-alert-rules.yaml

Lines changed: 0 additions & 89 deletions
@@ -395,95 +395,6 @@ groups:
         noDataState: NoData
         title: RabbitMQ Ready Queue Backlog
         uid: efb36686-4e44-4de8-80c4-7dde9130da90
-      - annotations:
-          description:
-            "It looks for compute pods > 1 day. Most likely, it is orphaned
-            compute pod that is lingering. Consider killing it.
-
-            There is an airflow job that sweeps the VFL fleet regularly to look for these
-            compute pods as well for deletion."
-          summary:
-            SAS compute-server pods > 1 day old. Compute pods in VFL do not need
-            to be running longer than 1 day since there are no long running jobs.
-        condition: C
-        data:
-          - datasourceUid: prometheus
-            model:
-              editorMode: code
-              expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
-              instant: true
-              intervalMs: 1000
-              legendFormat: __auto
-              maxDataPoints: 43200
-              range: false
-              refId: A
-            refId: A
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params: []
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - B
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: A
-              intervalMs: 1000
-              maxDataPoints: 43200
-              reducer: last
-              refId: B
-              type: reduce
-            refId: B
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params:
-                      - 1
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - C
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: B
-              intervalMs: 1000
-              maxDataPoints: 43200
-              refId: C
-              type: threshold
-            refId: C
-            relativeTimeRange:
-              from: 600
-              to: 0
-        execErrState: Error
-        for: 5m
-        isPaused: true
-        labels: {}
-        noDataState: OK
-        title: Stale Compute Pod Detected
-        uid: ed69b8e4-ce60-44a0-8f51-83743df0e448
       - annotations:
           description:
             Checks the restart count of the pod(s). Will need to check why
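
For reviewers who want to see what the removed rule evaluated, here is a minimal Python sketch of the same check. It mirrors the deleted query `(time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24` and the `gt 1` threshold from the rule. The function names and the input mapping of pod name to creation timestamp are hypothetical, introduced only for illustration; this is not the Airflow sweep job mentioned in the rule description.

```python
import re
import time

# Same pattern as the removed rule. Prometheus regex matchers are fully
# anchored, so fullmatch() is the closest Python equivalent.
COMPUTE_POD_PATTERN = re.compile(r"sas-compute-server-.*")


def pod_age_days(created_ts, now=None):
    """Age in days, using the same arithmetic as the removed expr."""
    now = time.time() if now is None else now
    return (now - created_ts) / 60 / 60 / 24


def stale_compute_pods(pods, now=None, threshold_days=1.0):
    """Return the names of compute pods strictly older than the threshold.

    `pods` is a hypothetical mapping of pod name -> creation time (Unix
    seconds), standing in for the kube_pod_created series.
    """
    return sorted(
        name
        for name, created_ts in pods.items()
        if COMPUTE_POD_PATTERN.fullmatch(name)
        and pod_age_days(created_ts, now) > threshold_days
    )
```

The sketch covers only the query and threshold. The removed rule also carried `for: 5m` and `isPaused: true`, so in the committed state it was not firing alerts at all.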
