What happened?
When the storage of the TSDB is full, no alerts related to it are fired and the Watchdog alert continues to fire. This gives the impression that Prometheus is fine even though it is no longer able to evaluate alert expressions and therefore no longer fires any alerts.
In our environment, we worked around this by changing the Watchdog alert expression from vector(1) to present_over_time(prometheus_tsdb_head_max_time[1m]) != 0.
I already submitted this as a suggestion in PR 2467, but I recognize it may not be the cleanest approach to fixing this issue.
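For reference, here is a minimal sketch of how the modified rule looks in our PrometheusRule; everything except the expr is assumed to stay as in the stock kube-prometheus Watchdog rule quoted under Manifests below:

- alert: Watchdog
  annotations:
    # description/runbook_url unchanged from the stock rule (see Manifests below)
    summary: An alert that should always be firing to certify that Alertmanager is working properly.
  # only returns a result while Prometheus' own TSDB metric still receives samples,
  # so the Watchdog stops firing as soon as ingestion stops (e.g. disk full)
  expr: present_over_time(prometheus_tsdb_head_max_time[1m]) != 0
  labels:
    severity: none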
The alerts that, AFAIK, should fire are PrometheusMissingRuleEvaluations, PrometheusRuleFailures and PrometheusNotIngestingSamples. But because the metrics these alerts rely on are no longer scraped, their expression evaluation fails and the alerts do not fire.
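To illustrate, one way to confirm this (assuming a port-forward to the prometheus-k8s service on localhost:9090 and the stock kube-prometheus labels) is to run the expression of one of these alerts against the query API; once ingestion has stopped for longer than the range window, the result set is empty, so the alert can never reach its threshold:

kubectl -n monitoring port-forward svc/prometheus-k8s 9090 &
# returns "result":[] once there have been no samples for ~5m,
# i.e. the alert expression has nothing to compare against its threshold
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m])'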
Did you expect to see something different?
Yes. I expected either the Watchdog alert to stop firing (because the alerting chain is disrupted), one of the alerts mentioned above to fire (as, going by their descriptions, they make the most sense), or any other critical alert to fire to signal the situation.
How to reproduce it (as minimally and precisely as possible):
Let Prometheus scrape so much data that its storage fills up, or fill up the TSDB storage manually:
# open a shell in the Prometheus container
kubectl exec -ti prometheus-prometheus-prometheus-0 -- sh
# write zeros until the persistent volume is full
dd if=/dev/zero of=/prometheus/fillfile bs=1M
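Once the volume is full, the broken state can be observed as follows (a sketch; the df path and the localhost:9090 port-forward are assumptions based on the stock kube-prometheus setup):

# the /prometheus volume should show 100% usage
kubectl exec -ti prometheus-prometheus-prometheus-0 -- df -h /prometheus
# with a port-forward to localhost:9090, only Watchdog shows up as firing;
# none of the TSDB/rule-related alerts do
curl -s http://localhost:9090/api/v1/alerts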
Environment
- Prometheus Operator version:
v0.76.1
- Kubernetes version information:
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.13", GitCommit:"7ba444e261616cb572b2c9e3aa6ee8876140f46a", GitTreeState:"clean", BuildDate:"2024-01-17T13:45:13Z", GoVersion:"go1.20.13", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15+vmware.1", GitCommit:"caaf37c79da07093b65edd62edb1d35b89f4e5c7", GitTreeState:"clean", BuildDate:"2024-03-27T05:25:15Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind:
It's a VMware Tanzu Kubernetes Grid cluster.
- Manifests:
kubePrometheus-prometheusRule.yaml:
...
- alert: Watchdog
annotations:
description: |
This is an alert meant to ensure that the entire alerting pipeline is functional.
This alert is always firing, therefore it should always be firing in Alertmanager
and always fire against a receiver. There are integrations with various notification
mechanisms that send a notification when this alert is not firing. For example the
"DeadMansSnitch" integration in PagerDuty.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
summary: An alert that should always be firing to certify that Alertmanager is working properly.
expr: vector(1)
labels:
severity: none
...
prometheus-prometheusRule.yaml:
...
- alert: PrometheusMissingRuleEvaluations
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations
summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
expr: |
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: warning
...
- alert: PrometheusRuleFailures
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures
summary: Prometheus is failing rule evaluations.
expr: |
increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: critical
...
- alert: PrometheusNotIngestingSamples
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples
summary: Prometheus is not ingesting samples.
expr: |
(
sum without(type) (rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m])) <= 0
and
(
sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-k8s",namespace="monitoring"}) > 0
or
sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-k8s",namespace="monitoring"}) > 0
)
)
for: 10m
labels:
severity: warning
...
- Prometheus Operator Logs:
None
- Prometheus Logs:
As expected, the Prometheus pod logs that the storage is full:
ts=2024-10-14T07:51:30.338Z caller=scrape.go:1225 level=error component="scrape manager" scrape_pool=podMonitor/istio-system/istio-sidecars/0 target=http://11.32.17.13:15090/stats/prometheus msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00004293: no space left on device"