What happened?
When the storage of the TSDB is full, no alerts related to it are fired and the Watchdog alert continues to fire. This gives the impression that Prometheus is fine even though it is no longer able to evaluate alert expressions and therefore no longer fires any alerts.
In our environment, we worked around this by changing the Watchdog alert expression from vector(1) to present_over_time(prometheus_tsdb_head_max_time[1m]) != 0.
I already submitted this as a suggestion in PR 2467, but I recognize it may not be the cleanest approach to fixing this issue.
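For reference, here is a minimal sketch of how the modified rule looks in our PrometheusRule; everything except the expr is assumed to stay as in the stock kube-prometheus Watchdog rule quoted under Manifests below:

- alert: Watchdog
  annotations:
    # description/runbook_url unchanged from the stock rule (see Manifests below)
    summary: An alert that should always be firing to certify that Alertmanager is working properly.
  # only returns a result while Prometheus' own TSDB metric still receives samples,
  # so the Watchdog stops firing as soon as ingestion stops (e.g. disk full)
  expr: present_over_time(prometheus_tsdb_head_max_time[1m]) != 0
  labels:
    severity: none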
The alerts that, AFAIK, should fire are PrometheusMissingRuleEvaluations, PrometheusRuleFailures and PrometheusNotIngestingSamples. But because the metrics these alerts rely on are no longer scraped, their expression evaluation fails and the alerts do not fire.
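To illustrate, one way to confirm this (assuming a port-forward to the prometheus-k8s service on localhost:9090 and the stock kube-prometheus labels) is to run the expression of one of these alerts against the query API; once ingestion has stopped for longer than the range window, the result set is empty, so the alert can never reach its threshold:

kubectl -n monitoring port-forward svc/prometheus-k8s 9090 &
# returns "result":[] once there have been no samples for ~5m,
# i.e. the alert expression has nothing to compare against its threshold
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m])'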
Did you expect to see something different?
Yes. I expected either the Watchdog alert to stop firing (because the alerting chain is disrupted), one of the alerts mentioned above to fire (as, going by their descriptions, they make the most sense), or any other critical alert to fire to signal the situation.
How to reproduce it (as minimally and precisely as possible):
Let Prometheus scrape so much data that its storage fills up, or fill up the TSDB storage manually:
# open a shell in the Prometheus container
kubectl exec -ti prometheus-prometheus-prometheus-0 -- sh
# write zeros until the persistent volume is full
dd if=/dev/zero of=/prometheus/fillfile bs=1M
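Once the volume is full, the broken state can be observed as follows (a sketch; the df path and the localhost:9090 port-forward are assumptions based on the stock kube-prometheus setup):

# the /prometheus volume should show 100% usage
kubectl exec -ti prometheus-prometheus-prometheus-0 -- df -h /prometheus
# with a port-forward to localhost:9090, only Watchdog shows up as firing;
# none of the TSDB/rule-related alerts do
curl -s http://localhost:9090/api/v1/alerts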
Environment
- Prometheus Operator version:
v0.76.1
- Kubernetes version information:
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.13", GitCommit:"7ba444e261616cb572b2c9e3aa6ee8876140f46a", GitTreeState:"clean", BuildDate:"2024-01-17T13:45:13Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15+vmware.1", GitCommit:"caaf37c79da07093b65edd62edb1d35b89f4e5c7", GitTreeState:"clean", BuildDate:"2024-03-27T05:25:15Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind:
It's a VMware Tanzu Kubernetes Grid cluster.
- Manifests:
kubePrometheus-prometheusRule.yaml:
...
- alert: Watchdog
annotations:
description: |
This is an alert meant to ensure that the entire alerting pipeline is functional.
This alert is always firing, therefore it should always be firing in Alertmanager
and always fire against a receiver. There are integrations with various notification
mechanisms that send a notification when this alert is not firing. For example the
"DeadMansSnitch" integration in PagerDuty.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
summary: An alert that should always be firing to certify that Alertmanager is working properly.
expr: vector(1)
labels:
severity: none
...
prometheus-prometheusRule.yaml:
...
- alert: PrometheusMissingRuleEvaluations
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations
summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
expr: |
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: warning
...
- alert: PrometheusRuleFailures
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures
summary: Prometheus is failing rule evaluations.
expr: |
increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: critical
...
- alert: PrometheusNotIngestingSamples
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples
summary: Prometheus is not ingesting samples.
expr: |
(
sum without(type) (rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m])) <= 0
and
(
sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-k8s",namespace="monitoring"}) > 0
or
sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-k8s",namespace="monitoring"}) > 0
)
)
for: 10m
labels:
severity: warning
...
- Prometheus Operator Logs:
None
- Prometheus Logs:
As expected, the Prometheus pod logs that the storage is full:
ts=2024-10-14T07:51:30.338Z caller=scrape.go:1225 level=error component="scrape manager" scrape_pool=podMonitor/istio-system/istio-sidecars/0 target=http://11.32.17.13:15090/stats/prometheus msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00004293: no space left on device"