
metrics-server addon-resizer (nanny) experiences CPU throttling in EKS cluster #1701

@rajeevpnair

Description

What happened: metrics-server addon-resizer (nanny) experiences CPU throttling in EKS cluster

What you expected to happen: The metrics-server nanny container auto-scales the metrics-server container smoothly based on the number of nodes in the cluster.

Anything else we need to know?: We are running metrics-server in our EKS cluster with addonResizer enabled to auto-scale its resource requests/limits based on cluster size.
However, the nanny container (addon-resizer) is experiencing CPU throttling even when configured with 1000m CPU, which is much higher than the official Helm chart default (~40m).
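
To confirm this is real CFS throttling rather than a monitoring artifact, the cAdvisor counters exposed by the kubelet can be checked; a rough sketch below, where the node name comes from the pod spec further down and the grep pattern is only illustrative:

# Pull cAdvisor metrics from the node hosting the pod and look at the
# CFS throttling counters for the nanny container:
kubectl get --raw /api/v1/nodes/ip-10-128-3-251.ec2.internal/proxy/metrics/cadvisor \
  | grep 'container_cpu_cfs_throttled.*metrics-server-nanny'

# If Prometheus scrapes cAdvisor, the equivalent query is roughly:
#   rate(container_cpu_cfs_throttled_periods_total{container="metrics-server-nanny"}[5m])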

I also opened an issue against addon-resizer (kubernetes/autoscaler#8409), but there has been no response.
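
As a workaround attempt, the nanny's own resources can be overridden through the chart's values; the addonResizer.resources path below is my reading of the metrics-server chart's values layout, so verify it with helm show values before applying:

# Raise the nanny's CPU allocation via Helm (values path is an assumption):
helm upgrade metrics-server metrics-server/metrics-server -n kube-system \
  --reuse-values \
  --set addonResizer.resources.requests.cpu=500m \
  --set addonResizer.resources.limits.cpu=1000m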

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS

  • Container Network Setup (flannel, calico, etc.): aws CNI

  • Kubernetes version (kubectl version): v1.32.5-eks-5d4a308

  • Metrics Server manifest

spoiler for Metrics Server manifest:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2025-08-18T14:20:39Z"
  generateName: metrics-server-7d94f9cbcf-
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/name: metrics-server
    pod-template-hash: 7d94f9cbcf
  name: metrics-server-7d94f9cbcf-rfxzs
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: metrics-server-7d94f9cbcf
    uid: ba9e5689-7159-4d35-88f9-25a6e1717a74
  resourceVersion: "6814675"
  uid: d8f7c9d1-1e3c-4390-b5cf-4d5f94da11b7
spec:
  containers:
  - args:
    - --secure-port=8444
    - --cert-dir=/tmp
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
    image: some-registry/registry.k8s.io/metrics-server/metrics-server:v0.7.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8444
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: metrics-server
    ports:
    - containerPort: 8444
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8444
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 210m
        memory: 186Mi
      requests:
        cpu: 210m
        memory: 186Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
  - command:
    - /pod_nanny
    - --config-dir=/etc/config
    - --deployment=metrics-server
    - --container=metrics-server
    - --threshold=5
    - --poll-period=300000
    - --estimator=exponential
    - --minClusterSize=80
    - --use-metrics=true
    env:
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: MY_POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: some-registry/registry.k8s.io/autoscaling/addon-resizer:1.8.21
    imagePullPolicy: IfNotPresent
    name: metrics-server-nanny
    resources:
      limits:
        cpu: 50m
        memory: 100Mi
      requests:
        cpu: 30m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/config
      name: nanny-config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-128-3-251.ec2.internal
  nodeSelector:
    workload: on-demand
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: metrics-server
  serviceAccountName: metrics-server
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: workload
    operator: Equal
    value: on-demand
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp
  - configMap:
      defaultMode: 420
      name: metrics-server-nanny-config
    name: nanny-config-volume
  - name: kube-api-access-qhn52
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:30Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:24Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:51Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:51Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:24Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b868d46c7b7b5acb8f242389fea57464054fe6b22160f7c7193b2564fda3d9cf
    image: some-registry/registry.k8s.io/metrics-server/metrics-server:v0.7.2
    imageID: some-registry/registry.k8s.io/metrics-server/metrics-server@sha256:ffcb2bf004d6aa0a17d90e0247cf94f2865c8901dcab4427034c341951c239f9
    lastState: {}
    name: metrics-server
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-08-18T14:21:27Z"
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
      recursiveReadOnly: Disabled
  - containerID: containerd://8a62b45b82930edcd4ea84c036cf957c92d7323fa4a4e0c8faf2854cf341aab8
    image: some-registry/registry.k8s.io/autoscaling/addon-resizer:1.8.21
    imageID: some-registry/registry.k8s.io/autoscaling/addon-resizer@sha256:583fd0c434a9be781acd9348d23c268f01ec453528ebef0b4a2140ea703f43f8
    lastState: {}
    name: metrics-server-nanny
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-08-18T14:21:29Z"
    volumeMounts:
    - mountPath: /etc/config
      name: nanny-config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
      recursiveReadOnly: Disabled
  hostIP: 10.128.3.251
  hostIPs:
  - ip: 10.128.3.251
  phase: Running
  podIP: 100.71.1.176
  podIPs:
  - ip: 100.71.1.176
  qosClass: Burstable
  startTime: "2025-08-18T14:21:24Z"
  • Kubelet config:
spoiler for Kubelet config:
  • Metrics server logs:
spoiler for Metrics Server logs:

:kube-system:devel [/tenant/cluster/]$ kubectl logs metrics-server-7d94f9cbcf-rfxzs -c metrics-server-nanny
I0818 14:21:29.919910 1 pod_nanny.go:83] Invoked by [/pod_nanny --config-dir=/etc/config --deployment=metrics-server --container=metrics-server --threshold=5 --poll-period=300000 --estimator=exponential --minClusterSize=80 --use-metrics=true]
I0818 14:21:30.020891 1 pod_nanny.go:84] Version: 1.8.21
I0818 14:21:30.021029 1 pod_nanny.go:100] Watching namespace: kube-system, pod: metrics-server-7d94f9cbcf-rfxzs, container: metrics-server.
I0818 14:21:30.021054 1 pod_nanny.go:101] storage: MISSING, extra_storage: 0Gi
I0818 14:21:30.021999 1 pod_nanny.go:135] cpu: 130m, extra_cpu: 1m, memory: 26Mi, extra_memory: 2Mi
I0818 14:21:30.022116 1 pod_nanny.go:269] Resources: [{Base:{i:{value:130 scale:-3} d:{Dec:} s:130m Format:DecimalSI} ExtraPerResource:{i:{value:1 scale:-3} d:{Dec:} s:1m Format:DecimalSI} Name:cpu} {Base:{i:{value:27262976 scale:0} d:{Dec:} s:26Mi Format:BinarySI} ExtraPerResource:{i:{value:2097152 scale:0} d:{Dec:} s:2Mi Format:BinarySI} Name:memory}]
:kube-system:devel [/tenant/cluster/]$
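
For context on the values in this log: the nanny sizes the metrics-server container as base + per-node extra (here 130m CPU + 1m per node, 26Mi + 2Mi per node), and with --estimator=exponential the node count is, as I understand the addon-resizer README, first rounded up to a power-of-two multiple of --minClusterSize. A shell sketch of that arithmetic (the rounding rule is my assumption, not taken from the resizer's code):

# Estimate what the nanny should request for metrics-server, using the
# base/extra values from the log above and --minClusterSize=80:
NODES=$(kubectl get nodes --no-headers | wc -l)
SIZE=80
while [ "$SIZE" -lt "$NODES" ]; do SIZE=$((SIZE * 2)); done
echo "expected cpu:    $((130 + 1 * SIZE))m"
echo "expected memory: $((26 + 2 * SIZE))Mi"

With a cluster at or below 80 nodes this gives 210m / 186Mi, which matches the resources on the metrics-server container in the manifest above.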

:kube-system:devel [/tenant/cluster/]$ kubectl logs metrics-server-7d94f9cbcf-rfxzs
Defaulted container "metrics-server" out of: metrics-server, metrics-server-nanny
I0818 14:21:30.423915 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0818 14:21:34.030213 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0818 14:21:34.140387 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0818 14:21:34.140422 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0818 14:21:34.140443 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0818 14:21:34.140458 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0818 14:21:34.140476 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0818 14:21:34.140481 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0818 14:21:34.231658 1 secure_serving.go:213] Serving securely on [::]:8444
I0818 14:21:34.231784 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0818 14:21:34.231921 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0818 14:21:34.241258 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0818 14:21:34.241263 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0818 14:21:34.241482 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E0818 14:21:44.212059 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:21:44.220341 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:21:44.220473 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:21:44.220531 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.128.13.83:10250/metrics/resource\": dial tcp 10.128.13.83:10250: i/o timeout" node="fargate-ip-10-128-13-83.ec2.internal"
E0818 14:21:59.142558 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:21:59.152063 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:21:59.165544 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:21:59.192032 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:14.136276 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:22:14.145578 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:14.149884 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:22:14.176141 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:22:29.142511 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:22:29.145735 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:22:29.179042 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:29.190307 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get \"https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
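
All of the failing scrapes above target Fargate nodes, so it is worth checking whether port 10250 on those nodes is reachable from pods at all; a throwaway probe (the curl image, and the node IP taken from the log, are just for illustration):

kubectl run kubelet-probe --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk --max-time 10 -o /dev/null -w '%{http_code}\n' \
  https://10.128.13.83:10250/metrics/resource

A 401/403 typically means the kubelet answered (reachability is fine, only auth failed), while a timeout points at security groups or Fargate networking rather than metrics-server itself.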

  • Status of Metrics API:
spoiler for Status of Metrics API:

:kube-system:devel [/tenant/cluster/]$ kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.7.2
              dp-healthcheck-monitoring=critical
              dp-healthcheck-monitoring-priority=critical
              helm.sh/chart=metrics-server-3.12.2
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2025-08-07T04:33:28Z
  Resource Version:    6815125
  UID:                 277083fc-b95a-4c34-a9c8-d2672d971392
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       8444
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2025-08-08T09:29:37Z
    Message:               failing or missing response from https://100.71.1.176:8444/apis/metrics.k8s.io/v1beta1: Get "https://100.71.1.176:8444/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:
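
The FailedDiscoveryCheck can be re-tested directly from the same cluster context:

# Current availability condition of the APIService:
kubectl get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'

# Exercise the same discovery endpoint the aggregator probes:
kubectl get --raw /apis/metrics.k8s.io/v1beta1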

/kind bug


    Labels

    kind/bug, lifecycle/stale, needs-triage
