monitoring/alerting/rules
1 file changed, +0 -89
@@ -395,95 +395,6 @@ groups:
         noDataState: NoData
         title: RabbitMQ Ready Queue Backlog
         uid: efb36686-4e44-4de8-80c4-7dde9130da90
-      - annotations:
-          description:
-            "It looks for compute pods > 1 day. Most likely, it is orphaned
-            compute pod that is lingering. Consider killing it.
-
-            There is an airflow job that sweeps the VFL fleet regularly to look for these
-            compute pods as well for deletion."
-          summary:
-            SAS compute-server pods > 1 day old. Compute pods in VFL do not need
-            to be running longer than 1 day since there are no long running jobs.
-        condition: C
-        data:
-          - datasourceUid: prometheus
-            model:
-              editorMode: code
-              expr: (time() - kube_pod_created{pod=~"sas-compute-server-.*"})/60/60/24
-              instant: true
-              intervalMs: 1000
-              legendFormat: __auto
-              maxDataPoints: 43200
-              range: false
-              refId: A
-            refId: A
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params: []
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - B
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: A
-              intervalMs: 1000
-              maxDataPoints: 43200
-              reducer: last
-              refId: B
-              type: reduce
-            refId: B
-            relativeTimeRange:
-              from: 600
-              to: 0
-          - datasourceUid: __expr__
-            model:
-              conditions:
-                - evaluator:
-                    params:
-                      - 1
-                    type: gt
-                  operator:
-                    type: and
-                  query:
-                    params:
-                      - C
-                  reducer:
-                    params: []
-                    type: last
-                  type: query
-              datasource:
-                type: __expr__
-                uid: __expr__
-              expression: B
-              intervalMs: 1000
-              maxDataPoints: 43200
-              refId: C
-              type: threshold
-            refId: C
-            relativeTimeRange:
-              from: 600
-              to: 0
-        execErrState: Error
-        for: 5m
-        isPaused: true
-        labels: {}
-        noDataState: OK
-        title: Stale Compute Pod Detected
-        uid: ed69b8e4-ce60-44a0-8f51-83743df0e448
       - annotations:
           description:
             Checks the restart count of the pod(s). Will need to check why
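For anyone reviewing what the removed "Stale Compute Pod Detected" rule actually checked, its three stages (query A computes pod age in days, B reduces to the last value, C fires when that value is greater than 1) can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function names and the sample pod names are hypothetical, pod creation times are passed in directly rather than read from Prometheus, and the rule's `for: 5m` pending period and `isPaused: true` state are not modeled.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 60 * 60 * 24
AGE_THRESHOLD_DAYS = 1  # mirrors the `type: gt`, `params: [1]` evaluator in refId C


def pod_age_days(created_at: datetime, now: datetime) -> float:
    """Python equivalent of (time() - kube_pod_created) / 60 / 60 / 24."""
    return (now - created_at).total_seconds() / SECONDS_PER_DAY


def stale_compute_pods(pods: dict[str, datetime], now: datetime) -> list[str]:
    """Return names of compute-server pods strictly older than the threshold.

    `pods` maps pod name to creation time (hypothetical input shape).
    The prefix check stands in for the selector pod=~"sas-compute-server-.*".
    """
    return sorted(
        name
        for name, created_at in pods.items()
        if name.startswith("sas-compute-server-")
        and pod_age_days(created_at, now) > AGE_THRESHOLD_DAYS
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
    sample = {  # hypothetical pods, for illustration only
        "sas-compute-server-abc": now - timedelta(days=2),
        "sas-compute-server-def": now - timedelta(hours=3),
        "rabbitmq-0": now - timedelta(days=30),
    }
    print(stale_compute_pods(sample, now))  # ['sas-compute-server-abc']
```

Because the comparison is strictly greater than, a pod that is exactly one day old is not flagged, which matches the `gt` evaluator in the removed rule.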