Commit 306c081

Merge pull request grafana#405 from grafana/alert-on-stuck-rollout
Add CortexRolloutStuck alert
2 parents 189e0c7 + f11cc2c commit 306c081

3 files changed: 71 additions, 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@
 * [ENHANCEMENT] Add support for running Alertmanager in sharding mode. #394
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
+* [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 61 additions & 0 deletions
@@ -412,6 +412,67 @@
           },
         ],
       },
+      {
+        name: 'cortex-rollout-alerts',
+        rules: [
+          {
+            alert: 'CortexRolloutStuck',
+            expr: |||
+              (
+                max without (revision) (
+                  kube_statefulset_status_current_revision
+                    unless
+                  kube_statefulset_status_update_revision
+                )
+                  *
+                (
+                  kube_statefulset_replicas
+                    !=
+                  kube_statefulset_status_replicas_updated
+                )
+              ) and (
+                changes(kube_statefulset_status_replicas_updated[15m])
+                  ==
+                0
+              )
+              * on(%s) group_left max by(%s) (cortex_build_info)
+            ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+            'for': '15m',
+            labels: {
+              severity: 'warning',
+            },
+            annotations: {
+              message: |||
+                The {{ $labels.statefulset }} rollout is stuck in %(alert_aggregation_variables)s.
+              ||| % $._config,
+            },
+          },
+          {
+            alert: 'CortexRolloutStuck',
+            expr: |||
+              (
+                kube_deployment_spec_replicas
+                  !=
+                kube_deployment_status_replicas_updated
+              ) and (
+                changes(kube_deployment_status_replicas_updated[15m])
+                  ==
+                0
+              )
+              * on(%s) group_left max by(%s) (cortex_build_info)
+            ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+            'for': '15m',
+            labels: {
+              severity: 'warning',
+            },
+            annotations: {
+              message: |||
+                The {{ $labels.deployment }} rollout is stuck in %(alert_aggregation_variables)s.
+              ||| % $._config,
+            },
+          },
+        ],
+      },
       {
         name: 'cortex-provisioning',
         rules: [
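
As a rough illustration of the Deployment rule in this hunk, the shell sketch below models its core condition: the updated replica count differs from the desired count, and the count did not change across the samples in the window (the role played by `changes(...[15m]) == 0`). The function name `is_rollout_stuck` and its argument layout are hypothetical and not part of the commit. The sketch ignores the `cortex_build_info` join, the `'for': '15m'` clause, and the additional revision check in the StatefulSet rule.

```shell
# Hypothetical helper, not part of the commit.
# Usage: is_rollout_stuck DESIRED SAMPLE [SAMPLE...]
#   DESIRED  desired number of replicas
#   SAMPLE   updated-replica counts observed over the window, oldest first
# Returns 0 if the rollout looks stuck, 1 if not, 2 if no samples were given.
is_rollout_stuck() {
  desired=$1
  shift
  [ "$#" -gt 0 ] || return 2

  first=$1
  for sample in "$@"; do
    # Any change in the updated count means the rollout is progressing.
    [ "$sample" = "$first" ] || return 1
  done

  # No change in the window: stuck only if the rollout is also incomplete.
  [ "$first" != "$desired" ]
}
```

For example, `is_rollout_stuck 3 1 1 1` succeeds (stuck at 1 of 3 updated replicas), while `is_rollout_stuck 3 1 2 2` fails because the count moved during the window, and `is_rollout_stuck 3 3 3 3` fails because the rollout is complete.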

cortex-mixin/docs/playbooks.md

Lines changed: 9 additions & 0 deletions
@@ -724,6 +724,15 @@ When an alertmanager cannot read the state for a tenant from storage it gets log
 - The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
 - The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.
 
+### CortexRolloutStuck
+
+This alert fires when a Cortex service rollout is stuck, which means the number of updated replicas doesn't match the expected one and it looks like there's no progress in the rollout. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
+
+How to **investigate**:
+- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to get a list of running pods
+- Ensure there's no pod in a failing state (e.g. `Error`, `OOMKilled`, `CrashLoopBackOff`)
+- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
+- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
 
 ## Cortex routes by path
 
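
The first three investigation steps in the playbook entry can be partly automated. The sketch below is a minimal example, assuming the default `kubectl get pods` table layout (`NAME READY STATUS RESTARTS AGE`); the helper name `unhealthy_pods` is hypothetical and not part of the commit. It prints every pod whose ready-container count does not match its total, or whose status is anything other than `Running`.

```shell
# Hypothetical helper, not part of the commit.
# Reads the default `kubectl get pods` table on stdin and prints
# "NAME READY STATUS" for each pod that is not fully ready or not Running.
unhealthy_pods() {
  awk 'NR > 1 {
    # READY has the form "<ready>/<total>", e.g. "1/1" or "0/2".
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running") print $1, $2, $3
  }'
}
```

It could be used by piping the playbook's first command into it: `kubectl -n <namespace> get pods -l name=<statefulset|deployment> | unhealthy_pods`. Empty output suggests the pods themselves look healthy, in which case the `describe` step is the next place to look.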
