Commit 4d5df5c

Added CortexFailingToTalkToConsul alert
Signed-off-by: Marco Pracucci <[email protected]>
1 parent 306c081 commit 4d5df5c

3 files changed: 39 additions, 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
 * [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
+* [ENHANCEMENT] Added `CortexFailingToTalkToConsul` alert. #406
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 26 additions & 0 deletions
@@ -715,5 +715,31 @@
         },
       ],
     },
+    {
+      name: 'cortex-consul-alerts',
+      rules: [
+        {
+          alert: 'CortexFailingToTalkToConsul',
+          expr: |||
+            (
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count{status_code!~"2.+"}[1m]))
+              /
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count[1m]))
+            )
+            # We want to get alerted only in case there's a constant failure.
+            == 1
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '5m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to Consul store {{ $labels.kv_name }}.
+            ||| % $._config,
+          },
+        },
+      ],
+    },
   ],
 }
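
To make the semantics of the alert expression easier to review, here is a minimal Python sketch of how the ratio is evaluated. It is an illustration, not part of the commit. The function name `constantly_failing`, the input shape, and the operation names used in the test data are hypothetical, and the aggregation labels from `$._config.alert_aggregation_labels` are left out for brevity.

```python
import math
import re
from collections import defaultdict

# Prometheus regex matchers are fully anchored, so "2.+" is modelled with fullmatch.
SUCCESS = re.compile(r"2.+")


def constantly_failing(rates):
    """Return the (pod, status_code, kv_name) groups whose failure ratio is 1.

    rates maps (pod, status_code, kv_name, operation) to a per-second request
    rate, standing in for rate(cortex_consul_request_duration_seconds_count[1m]).
    """
    numerator = defaultdict(float)    # sum by(...) of non-2xx rates
    denominator = defaultdict(float)  # sum by(...) of all rates
    for (pod, status_code, kv_name, _operation), rate in rates.items():
        group = (pod, status_code, kv_name)
        denominator[group] += rate
        if not SUCCESS.fullmatch(status_code):
            numerator[group] += rate

    firing = []
    for group, failed in numerator.items():
        total = denominator[group]
        # PromQL yields NaN for 0/0, and NaN == 1 is false.
        ratio = failed / total if total else math.nan
        if ratio == 1:
            firing.append(group)
    return sorted(firing)
```

In this sketch, as in the expression, `status_code` is one of the grouping labels, so each non-2xx group is divided by the same set of series. The ratio is therefore 1 whenever that group has a non-zero rate, and the `for: 5m` clause is what requires the failures to persist.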

cortex-mixin/docs/playbooks.md

Lines changed: 12 additions & 0 deletions
@@ -734,6 +734,18 @@ How to **investigate**:
 - Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
 - Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
 
+### CortexFailingToTalkToConsul
+
+This alert fires if a Cortex instance is failing to run any operation on Consul.
+
+How it **works**:
+- Consul is typically used to store the hash ring state.
+- If an instance is failing to talk to Consul, the instance either can't update its heartbeat in the ring or is failing to receive ring updates.
+
+How to **investigate**:
+- Ensure Consul is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul.
+
 ## Cortex routes by path
 
 **Write path**:
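
For the "Ensure Consul is up and running" step of the new playbook entry, one possible check is to ask Consul for its current Raft leader over the HTTP API. The sketch below is not part of the commit. It assumes Consul is reachable at `http://consul:8500` (a hypothetical address) and that an empty leader string means no leader is elected; the helper names are illustrative.

```python
import json
import urllib.request


def parse_leader(body):
    """Return the leader address from a /v1/status/leader response body.

    The endpoint replies with a JSON string such as "10.1.10.12:8300".
    An empty string is treated here as "no leader" (assumption).
    """
    leader = json.loads(body)
    return leader or None


def consul_leader(base_url="http://consul:8500", timeout=5.0):
    """Query Consul for its Raft leader. base_url is a hypothetical default."""
    url = f"{base_url}/v1/status/leader"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_leader(resp.read())
```

If `consul_leader()` raises a connection error or returns `None`, Consul itself is the likely culprit. Otherwise, the logs of the affected Cortex instance are the next place to look.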
