Commit a56d1c1

Add CortexRolloutStuck alert
Signed-off-by: Marco Pracucci <[email protected]>
1 parent db05c86 commit a56d1c1

3 files changed: +72 -0 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@
 * [ENHANCEMENT] Add support for running Alertmanager in sharding mode. #394
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
+* [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 61 additions & 0 deletions
@@ -412,6 +412,67 @@
         },
       ],
     },
+    {
+      name: 'cortex-rollout-alerts',
+      rules: [
+        {
+          alert: 'CortexRolloutStuck',
+          expr: |||
+            (
+              max without (revision) (
+                kube_statefulset_status_current_revision
+                  unless
+                kube_statefulset_status_update_revision
+              )
+                *
+              (
+                kube_statefulset_replicas
+                  !=
+                kube_statefulset_status_replicas_updated
+              )
+            ) and (
+              changes(kube_statefulset_status_replicas_updated[15m])
+                ==
+              0
+            )
+            * on(%s) group_left max by(%s) (cortex_build_info)
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '15m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              The {{ $labels.statefulset }} rollout is stuck in %(alert_aggregation_variables)s.
+            ||| % $._config,
+          },
+        },
+        {
+          alert: 'CortexRolloutStuck',
+          expr: |||
+            (
+              kube_deployment_spec_replicas
+                !=
+              kube_deployment_status_replicas_updated
+            ) and (
+              changes(kube_deployment_status_replicas_updated[15m])
+                ==
+              0
+            )
+            * on(%s) group_left max by(%s) (cortex_build_info)
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '15m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              The {{ $labels.deployment }} rollout is stuck in %(alert_aggregation_variables)s.
+            ||| % $._config,
+          },
+        },
+      ],
+    },
     {
       name: 'cortex-provisioning',
       rules: [

cortex-mixin/docs/playbooks.md

Lines changed: 10 additions & 0 deletions
@@ -231,6 +231,16 @@ How to **investigate**:
 
 _If the alert `CortexIngesterTSDBHeadCompactionFailed` fired as well, then give priority to it because that could be the cause._
 
+### CortexRolloutStuck
+
+This alert fires when a Cortex service rollout is stuck, which means the number of updated replicas doesn't match the expected one and it looks like there's no progress in the rollout. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
+
+How to **investigate**:
+- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to get a list of running pods
+- Ensure there's no pod in a failing state (e.g. `Error`, `OOMKilled`, `CrashLoopBackOff`)
+- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
+- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
+
 #### Ingester hit the disk capacity
 
 If the ingester hit the disk capacity, any attempt to append samples will fail. You should:
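
The first three investigation steps for `CortexRolloutStuck` can be sketched as a small shell filter. This is an illustrative helper, not part of the commit: the function name `flag_unhealthy_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout (`NAME READY STATUS RESTARTS AGE`).

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): reads `kubectl get pods`
# output on stdin and prints every pod that is either not in the Running
# state or has fewer ready containers than total containers.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
flag_unhealthy_pods() {
  awk '
    $1 == "NAME" { next }          # skip the header row, if present
    NF >= 3 {
      split($2, ready, "/")        # READY column, e.g. "1/2"
      if ($3 != "Running" || ready[1] != ready[2])
        print $1, $2, $3           # pod name, ready count, status
    }
  '
}
```

It could be used as `kubectl -n <namespace> get pods -l name=<statefulset|deployment> | flag_unhealthy_pods`. Empty output suggests every pod is `Running` and fully ready, in which case the `describe` step is the next place to look.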
