Commit 567320d

Merge pull request grafana#406 from grafana/alert-on-consul-failures
Added CortexFailingToTalkToConsul alert
2 parents 306c081 + be5af20 commit 567320d

3 files changed: 36 additions, 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@
 * [ENHANCEMENT] Allow to customize PromQL engine settings via `queryEngineConfig`. #399
 * [ENHANCEMENT] Add recording rules to improve responsiveness of Alertmanager dashboard. #387
 * [ENHANCEMENT] Add `CortexRolloutStuck` alert. #405
+* [ENHANCEMENT] Added `CortexKVStoreFailure` alert. #406
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 21 additions & 0 deletions
@@ -235,6 +235,27 @@
               |||,
             },
           },
+          {
+            alert: 'CortexKVStoreFailure',
+            expr: |||
+              (
+                sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
+                /
+                sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
+              )
+              # We want to get alerted only in case there's a constant failure.
+              == 1
+            ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+            'for': '5m',
+            labels: {
+              severity: 'warning',
+            },
+            annotations: {
+              message: |||
+                Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s is failing to talk to the KV store {{ $labels.kv_name }}.
+              ||| % $._config,
+            },
+          },
           {
             alert: 'CortexMemoryMapAreasTooHigh',
             expr: |||
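
The `|||` text blocks in the new rule are filled in with Jsonnet's `%` operator, which follows Python's string-formatting rules: an array fills positional `%s` slots, and an object fills named `%(key)s` slots. The Python sketch below mirrors that substitution for the `expr` and `message` fields. The two config values are assumptions chosen for illustration; the real ones come from the mixin's `$._config`.

```python
# Hypothetical config values; the real ones are defined in the mixin's $._config.
config = {
    "alert_aggregation_labels": "cluster, namespace",
    "alert_aggregation_variables": "{{ $labels.cluster }}/{{ $labels.namespace }}",
}

# Same text as the `expr` block in the diff, with two positional %s slots.
EXPR_TEMPLATE = """\
(
  sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
  /
  sum by(%s, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
)
# We want to get alerted only in case there's a constant failure.
== 1
"""

# Same text as the `message` block, with one named %(...)s slot.
MESSAGE_TEMPLATE = (
    "Cortex {{ $labels.pod }} in %(alert_aggregation_variables)s "
    "is failing to talk to the KV store {{ $labels.kv_name }}.\n"
)


def render_expr(cfg):
    """Positional substitution, like `||| % [labels, labels]` in Jsonnet."""
    labels = cfg["alert_aggregation_labels"]
    return EXPR_TEMPLATE % (labels, labels)


def render_message(cfg):
    """Named substitution, like `||| % $._config` in Jsonnet."""
    return MESSAGE_TEMPLATE % cfg


if __name__ == "__main__":
    print(render_expr(config))
    print(render_message(config))
```

The `{{ $labels.pod }}` and `{{ $labels.kv_name }}` placeholders contain no `%`, so they pass through this step untouched and are only expanded later by Prometheus when the alert fires.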

cortex-mixin/docs/playbooks.md

Lines changed: 14 additions & 0 deletions
@@ -734,6 +734,20 @@ How to **investigate**:
 - Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, eg. `1/1` or `2/2`)
 - Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
 
+### CortexKVStoreFailure
+
+This alert fires if a Cortex instance is failing to run any operation on a KV store (eg. consul or etcd).
+
+How it **works**:
+- Consul is typically used to store the hash ring state.
+- Etcd is typically used by the HA tracker (distributor) to deduplicate samples.
+- If an instance is failing operations on the **hash ring**, either the instance can't update the heartbeat in the ring or is failing to receive ring updates.
+- If an instance is failing operations on the **HA tracker** backend, either the instance can't update the authoritative replica or is failing to receive updates.
+
+How to **investigate**:
+- Ensure Consul/Etcd is up and running.
+- Investigate the logs of the affected instance to find the specific error occurring when talking to Consul/Etcd.
+
 ## Cortex routes by path
 
 **Write path**:
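
A note on when the `CortexKVStoreFailure` expression in this commit yields a result, reading the PromQL in isolation. Both sides of the division group by `status_code`, so under PromQL's default one-to-one label matching each non-2xx series is divided by the total for that same status code. The ratio is therefore 1 whenever that failing series has a non-zero rate, and the `for: 5m` clause is what requires the failures to persist. The sketch below is a toy Python model of that matching, not Prometheus itself. The pod names and rates are invented, and the aggregation labels are left out for brevity.

```python
import math
import re

# Hypothetical per-second request rates keyed by (pod, status_code, kv_name).
rates = {
    ("ingester-0", "200", "consul"): 99.0,
    ("ingester-0", "500", "consul"): 1.0,
    ("distributor-0", "200", "etcd"): 50.0,
}


def sum_by_all_labels(series, keep):
    """Mimic `sum by(pod, status_code, kv_name)` over the series passing `keep`."""
    out = {}
    for labels, value in series.items():
        if keep(labels):
            out[labels] = out.get(labels, 0.0) + value
    return out


def is_non_2xx(labels):
    # PromQL regex matchers are fully anchored, hence fullmatch.
    return re.fullmatch(r"2.+", labels[1]) is None


def evaluate(series):
    """Return the label sets for which (failing / total) == 1."""
    failing = sum_by_all_labels(series, is_non_2xx)
    total = sum_by_all_labels(series, lambda labels: True)
    result = {}
    for labels, numerator in failing.items():
        denominator = total.get(labels)
        if denominator is None:
            continue  # no identical label set on the right-hand side
        # 0/0 is NaN in PromQL, and NaN never equals 1.
        ratio = numerator / denominator if denominator != 0 else math.nan
        if ratio == 1:
            result[labels] = ratio
    return result


if __name__ == "__main__":
    print(evaluate(rates))
```

In this model `ingester-0` is returned for `status_code="500"` even though only 1 in 100 of its Consul requests fail, because the denominator only counts requests with that same status code. A pod with only 2xx responses, or with a failing series whose rate is zero, produces no result.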
