Description
What happened?
I got a Prometheus rule evaluation failure for the KubeletTooManyPods alert:
found duplicate series for the match group {namespace="monitoring", pod="alertmanager-kube-prometheus-stack-alertmanager-0"} on the right hand-side of the operation: [{namespace="monitoring", node="node-01.mydomain.de", pod="alertmanager-kube-prometheus-stack-alertmanager-0"}, {namespace="monitoring", node="node-02.mydomain.de", pod="alertmanager-kube-prometheus-stack-alertmanager-0"}];many-to-many matching not allowed: matching labels must be unique on one side
I would expect the alert to work with the default labels. I think the grouping has to be rechecked.
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/1011/files#r1905954328 states that there is an existing test for the change, but it does not seem to cover my case.
The test seems to expect the pod identifier (its name) to change, but for a StatefulSet pod the name stays the same.
One solution would be to also match on the pod's uid. I have provided a snippet that works for me.
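To illustrate why the join fails, here is a minimal Python sketch of the uniqueness check that `group_left` applies to the right-hand side. It is not how Prometheus implements matching, just the rule it enforces. The helper names and the `uid-old` / `uid-new` values are made up for illustration. The node and pod names are taken from the error message above.

```python
from collections import defaultdict

def match_groups(series, on):
    """Group label sets by the values of the labels listed in `on`."""
    groups = defaultdict(list)
    for labels in series:
        # A missing label (here: cluster) is treated as an empty value.
        key = tuple(labels.get(name, "") for name in on)
        groups[key].append(labels)
    return groups

def duplicate_groups(series, on):
    """Return the match groups that contain more than one series.

    With group_left, the right-hand side must have at most one series
    per match group, so any entry returned here would fail the join.
    """
    return {k: v for k, v in match_groups(series, on).items() if len(v) > 1}

POD = "alertmanager-kube-prometheus-stack-alertmanager-0"

# Hypothetical kube_pod_info label sets: same pod name on two nodes,
# with assumed (made-up) uid values that differ between the two.
rhs = [
    {"namespace": "monitoring", "pod": POD,
     "node": "node-01.mydomain.de", "uid": "uid-old"},
    {"namespace": "monitoring", "pod": POD,
     "node": "node-02.mydomain.de", "uid": "uid-new"},
]

# Matching as in the current rule: both series share one match group.
print(duplicate_groups(rhs, ["cluster", "namespace", "pod"]))

# Matching with uid added, as in the suggested change: no duplicates.
print(duplicate_groups(rhs, ["cluster", "namespace", "pod", "uid"]))
```

The first call reports one match group holding two series, which corresponds to the "duplicate series for the match group" error. The second call returns an empty dict, because the uid separates the two series.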
Please provide any helpful snippets.
# The promrule as applied in the cluster
- alert: KubeletTooManyPods
  annotations:
    description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage
      }} of its Pod capacity on cluster {{ $labels.cluster }}.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods
    summary: Kubelet is running at capacity.
  expr: |-
    count by (cluster, node) (
      (kube_pod_status_phase{job="kube-state-metrics", phase="Running"} == 1)
      * on (cluster, namespace, pod) group_left (node)
      group by (cluster, namespace, pod, node) (
        kube_pod_info{job="kube-state-metrics"}
      )
    )
    /
    max by (cluster, node) (
      kube_node_status_capacity{job="kube-state-metrics", resource="pods"} != 1
    ) > 0.95
  for: 15m
  labels:
    severity: warning
# suggested change
count by (cluster, node) (
  (kube_pod_status_phase{job="kube-state-metrics", phase="Running"} == 1)
  * on (cluster, namespace, pod, uid) group_left (node)
  group by (cluster, namespace, pod, uid, node) (
    kube_pod_info{job="kube-state-metrics"}
  )
)
/
max by (cluster, node) (
  kube_node_status_capacity{job="kube-state-metrics", resource="pods"} != 1
)

What parts of the codebase are affected?
Alerts
I agree to the following terms:
- I agree to follow this project's Code of Conduct.
- I have filled out all the required information above to the best of my ability.
- I have searched the issues of this repository and believe that this is not a duplicate.
- I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.