From 46cb96d231c10e31f9a46bfa67858572c1f5d9de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?S=C3=B6ren=20K=C3=B6nig?= <537955+skoenig@users.noreply.github.com>
Date: Wed, 21 May 2025 17:17:15 +0200
Subject: [PATCH] Add runbook for PrometheusSDRefreshFailure

Add the missing runbook for PrometheusSDRefreshFailure.
Some upstream repositories are linking to this page:
- https://github.com/prometheus-community/helm-charts/blob/b5d46f5b7488215fcef6ba6c0c1001fb6ca741dd/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/prometheus.yaml#L70
- https://github.com/prometheus-operator/kube-prometheus/blob/2c1dffebb7419f092b8eea40983f86e74fe41860/manifests/prometheus-prometheusRule.yaml#L33
---
 .../prometheus/PrometheusSDRefreshFailure.md | 27 +++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 content/runbooks/prometheus/PrometheusSDRefreshFailure.md

diff --git a/content/runbooks/prometheus/PrometheusSDRefreshFailure.md b/content/runbooks/prometheus/PrometheusSDRefreshFailure.md
new file mode 100644
index 0000000..a560137
--- /dev/null
+++ b/content/runbooks/prometheus/PrometheusSDRefreshFailure.md
@@ -0,0 +1,27 @@
+# PrometheusSDRefreshFailure
+
+## Meaning
+
+Prometheus is failing to refresh its service discovery (SD) targets. This can indicate connectivity issues with SD providers (such as Kubernetes, Consul, or DNS), misconfiguration, or internal errors within Prometheus.
+
+## Impact
+
+With missing or stale service discovery information, new targets are not scraped and old targets are not removed from the configuration, resulting in invalid metrics and alerts.
+
+## Diagnosis
+
+The root causes of this issue can be diverse, but here are some starting points:
+- Check the Prometheus logs for specific SD-related error messages.
+- Ensure your Prometheus SD configuration is valid.
+- Verify network connectivity between Prometheus and the SD endpoints.
+- Based on the findings from the previous steps, check for recent changes to the Prometheus configuration, the SD endpoints, network policies, or authentication credentials.
+
+## Mitigation
+
+Fix network issues that prevent Prometheus from reaching SD endpoints.
+
+Check whether DNS resolution is working correctly in the cluster.
+
+Check that credentials and permissions are correct.
+
+As a last resort, remove the configuration for failing SD endpoints.