Commit be5af20

Update alert to be generic to KV stores
Signed-off-by: Marco Pracucci <[email protected]>
1 parent c6e8d4e commit be5af20

File tree: 3 files changed, 29 additions and 32 deletions

- CHANGELOG.md
- cortex-mixin/alerts/alerts.libsonnet
- cortex-mixin/docs/playbooks.md
CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
 * [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
-* [ENHANCEMENT] Added `CortexFailingToTalkToConsul` alert. #406
+* [ENHANCEMENT] Added `CortexKVStoreFailure` alert. #406
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 21 additions & 26 deletions
@@ -235,6 +235,27 @@
             |||,
           },
         },
+        {
+          alert: 'CortexKVStoreFailure',
+          expr: |||
+            (
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
+              /
+              sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
+            )
+            # We want to get alerted only in case there's a constant failure.
+            == 1
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '5m',
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: |||
+              Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
+            ||| % $._config,
+          },
+        },
         {
           alert: 'CortexMemoryMapAreasTooHigh',
           expr: |||
@@ -715,31 +736,5 @@
         },
       ],
     },
-    {
-      name: 'cortex-consul-alerts',
-      rules: [
-        {
-          alert: 'CortexFailingToTalkToConsul',
-          expr: |||
-            (
-              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count{status_code!~"2.+"}[1m]))
-              /
-              sum by(%s, pod, status_code, kv_name) (rate(cortex_consul_request_duration_seconds_count[1m]))
-            )
-            # We want to get alerted only in case there's a constant failure.
-            == 1
-          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-          'for': '5m',
-          labels: {
-            severity: 'warning',
-          },
-          annotations: {
-            message: |||
-              Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to Consul store {{ $labels.kv_name }}.
-            ||| % $._config,
-          },
-        },
-      ],
-    },
   ],
 }
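For reference, this is roughly what the new `CortexKVStoreFailure` expression renders to once Jsonnet substitutes the `%s` placeholders, assuming `$._config.alert_aggregation_labels` is set to `cluster, namespace` (the actual labels depend on your mixin configuration):

```promql
# Sketch of the rendered alert expression, assuming
# alert_aggregation_labels = "cluster, namespace".
(
  # Rate of KV store requests that returned a non-2xx status code,
  # per pod and KV client (kv_name).
  sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
  /
  # Rate of all KV store requests, with the same grouping.
  sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
)
# We want to get alerted only in case there's a constant failure.
== 1
```

The ratio must stay at 1 (every matching request failing) for the full `for: 5m` window before the alert fires, so transient KV errors don't trigger it.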

cortex-mixin/docs/playbooks.md

Lines changed: 7 additions & 5 deletions
@@ -734,17 +734,19 @@ How to **investigate**:
 - Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
 - Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
 
-### CortexFailingToTalkToConsul
+### CortexKVStoreFailure
 
-This alert fires if a Cortex instance is failing to run any operation on Consul.
+This alert fires if a Cortex instance is failing to run any operation on a KV store (eg. Consul or etcd).
 
 How it **works**:
 - Consul is typically used to store the hash ring state.
-- If an instance is failing to talk to Consul, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
+- If an instance is failing operations on the **hash ring**, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or is failing to receive updates.
 
 How to **investigate**:
-- Ensure Consul is up and running.
-- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul.
+- Ensure Consul/etcd is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/etcd.
 
 ## Cortex routes by path
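If the log check isn't conclusive, a query along these lines (a sketch using the same metric and labels as the alert, with the aggregation labels again assumed to be `cluster` and `namespace`) shows which pods and KV clients are returning errors, and with which status codes:

```promql
# Failed KV store requests per pod, KV client (kv_name) and status code.
# Adjust the aggregation labels to match your mixin config.
sum by(cluster, namespace, pod, kv_name, status_code) (
  rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[5m])
)
```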
