[Bug]: found duplicate series in KubeletTooManyPods alert #1015

@jonasbadstuebner

Description

What happened?

I got a Prometheus rule evaluation failure for the KubeletTooManyPods alert:

found duplicate series for the match group {namespace="monitoring", pod="alertmanager-kube-prometheus-stack-alertmanager-0"} on the right hand-side of the operation: [{namespace="monitoring", node="node-01.mydomain.de", pod="alertmanager-kube-prometheus-stack-alertmanager-0"}, {namespace="monitoring", node="node-02.mydomain.de", pod="alertmanager-kube-prometheus-stack-alertmanager-0"}];many-to-many matching not allowed: matching labels must be unique on one side

I would expect the alert to work with the default labels; I think the grouping needs to be rechecked.

https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/1011/files#r1905954328 states there is an existing test for the change, but my problem does not seem to be covered by it.
The test appears to expect the pod name to change, but it doesn't for a StatefulSet: a StatefulSet pod keeps its name when rescheduled, so kube_pod_info can briefly report two series with the same (namespace, pod) labels on different nodes.
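To illustrate the failure mode, here is a hypothetical Python model (the series labels are illustrative, based on the error above) of the uniqueness check Prometheus applies to the right-hand side of a group_left join:

```python
# Sketch of Prometheus's right-hand-side uniqueness check for a
# `* on (...) group_left (...)` join: series are bucketed by the on(...)
# labels, and any bucket containing more than one series triggers the
# "found duplicate series" / many-to-many error.

def match_groups(series, on_labels):
    groups = {}
    for s in series:
        key = tuple(s.get(label) for label in on_labels)
        groups.setdefault(key, []).append(s)
    return groups

# A StatefulSet pod keeps its name across a reschedule, so kube_pod_info
# can briefly expose the same pod on two nodes (with different uids).
pod_info = [
    {"namespace": "monitoring", "pod": "alertmanager-0", "node": "node-01", "uid": "aaa"},
    {"namespace": "monitoring", "pod": "alertmanager-0", "node": "node-02", "uid": "bbb"},
]

by_pod = match_groups(pod_info, ["namespace", "pod"])
by_uid = match_groups(pod_info, ["namespace", "pod", "uid"])

print(max(len(g) for g in by_pod.values()))  # 2 -> many-to-many, evaluation fails
print(max(len(g) for g in by_uid.values()))  # 1 -> join succeeds
```

Adding uid to both the on(...) clause and the inner grouping restores uniqueness on the right-hand side, which is what the suggested change below does.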

A solution here would be to use the uid of the pod; I have provided a snippet below that works for me.

Please provide any helpful snippets.

# The promrule as applied in the cluster
- alert: KubeletTooManyPods
  annotations:
    description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage
      }} of its Pod capacity on cluster {{ $labels.cluster }}.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods
    summary: Kubelet is running at capacity.
  expr: |-
    count by (cluster, node) (
      (kube_pod_status_phase{job="kube-state-metrics", phase="Running"} == 1)
      * on (cluster, namespace, pod) group_left (node)
      group by (cluster, namespace, pod, node) (
        kube_pod_info{job="kube-state-metrics"}
      )
    )
    /
    max by (cluster, node) (
      kube_node_status_capacity{job="kube-state-metrics", resource="pods"} != 1
    ) > 0.95
  for: 15m
  labels:
    severity: warning
# Suggested change: additionally join and group on uid
count by (cluster, node) (
  (kube_pod_status_phase{job="kube-state-metrics", phase="Running"} == 1)
  * on (cluster, namespace, pod, uid) group_left (node)
  group by (cluster, namespace, pod, uid, node) (
    kube_pod_info{job="kube-state-metrics"}
  )
)
/
max by (cluster, node) (
  kube_node_status_capacity{job="kube-state-metrics", resource="pods"} != 1
)
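For reference, once the join succeeds the alert's arithmetic reduces to running pods per node divided by that node's pod capacity. A small sketch with hypothetical numbers:

```python
# Hypothetical per-node numbers; the alert fires when
# running_pods / pod_capacity > 0.95 (sustained for 15m).
running_pods = {"node-01": 107, "node-02": 80}
capacity = {"node-01": 110, "node-02": 110}

utilization = {node: running_pods[node] / capacity[node] for node in running_pods}
firing = [node for node, u in utilization.items() if u > 0.95]

print(firing)  # ['node-01'], since 107/110 is about 0.973
```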

What parts of the codebase are affected?

Alerts

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.
