Description
What happened: the metrics-server addon-resizer (nanny) container experiences CPU throttling in an EKS cluster.
What you expected to happen: the nanny container should smoothly auto-scale the metrics-server container's resources based on the number of nodes in the cluster.
Anything else we need to know?: We run metrics-server in our EKS cluster with addonResizer enabled to auto-scale resource requests/limits with cluster size. However, the nanny container (addon-resizer) is CPU-throttled even when configured with a 1000m CPU limit, far above the official Helm chart default (~40m).
I also opened an issue in the addon-resizer repository, kubernetes/autoscaler#8409, but have had no response.
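To quantify the throttling described above, one way is to read the container's cgroup v2 `cpu.stat` (e.g. `kubectl exec <pod> -c metrics-server-nanny -- cat /sys/fs/cgroup/cpu.stat`) and compare throttled CFS periods against total periods. A minimal sketch (the sample values below are illustrative, not taken from this cluster):

```python
def throttled_ratio(cpu_stat: str) -> float:
    """Fraction of CFS scheduling periods in which the cgroup was throttled,
    parsed from the cgroup v2 cpu.stat 'key value' line format."""
    stats = dict(line.split() for line in cpu_stat.strip().splitlines())
    nr_periods = int(stats.get("nr_periods", 0))
    nr_throttled = int(stats.get("nr_throttled", 0))
    return nr_throttled / nr_periods if nr_periods else 0.0

# Illustrative cpu.stat snapshot; real values come from the running container.
sample = """usage_usec 8123456
user_usec 6000000
system_usec 2123456
nr_periods 1000
nr_throttled 437
throttled_usec 9876543
"""
print(f"{throttled_ratio(sample):.1%} of CFS periods throttled")
```

A consistently high ratio under a 1000m limit would confirm the throttling is real rather than a metrics artifact.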
Environment:
- Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS
- Container Network Setup (flannel, calico, etc.): AWS CNI
- Kubernetes version (kubectl version): v1.32.5-eks-5d4a308
- Metrics Server manifest
spoiler for Metrics Server manifest:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2025-08-18T14:20:39Z"
  generateName: metrics-server-7d94f9cbcf-
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/name: metrics-server
    pod-template-hash: 7d94f9cbcf
  name: metrics-server-7d94f9cbcf-rfxzs
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: metrics-server-7d94f9cbcf
    uid: ba9e5689-7159-4d35-88f9-25a6e1717a74
  resourceVersion: "6814675"
  uid: d8f7c9d1-1e3c-4390-b5cf-4d5f94da11b7
spec:
  containers:
  - args:
    - --secure-port=8444
    - --cert-dir=/tmp
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
    image: some-registry/registry.k8s.io/metrics-server/metrics-server:v0.7.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8444
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: metrics-server
    ports:
    - containerPort: 8444
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8444
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 210m
        memory: 186Mi
      requests:
        cpu: 210m
        memory: 186Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
  - command:
    - /pod_nanny
    - --config-dir=/etc/config
    - --deployment=metrics-server
    - --container=metrics-server
    - --threshold=5
    - --poll-period=300000
    - --estimator=exponential
    - --minClusterSize=80
    - --use-metrics=true
    env:
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: MY_POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: some-registry/registry.k8s.io/autoscaling/addon-resizer:1.8.21
    imagePullPolicy: IfNotPresent
    name: metrics-server-nanny
    resources:
      limits:
        cpu: 50m
        memory: 100Mi
      requests:
        cpu: 30m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/config
      name: nanny-config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-128-3-251.ec2.internal
  nodeSelector:
    workload: on-demand
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: metrics-server
  serviceAccountName: metrics-server
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: workload
    operator: Equal
    value: on-demand
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp
  - configMap:
      defaultMode: 420
      name: metrics-server-nanny-config
    name: nanny-config-volume
  - name: kube-api-access-qhn52
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:30Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:24Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:51Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:51Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:24Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b868d46c7b7b5acb8f242389fea57464054fe6b22160f7c7193b2564fda3d9cf
    image: some-registry/registry.k8s.io/metrics-server/metrics-server:v0.7.2
    imageID: some-registry/registry.k8s.io/metrics-server/metrics-server@sha256:ffcb2bf004d6aa0a17d90e0247cf94f2865c8901dcab4427034c341951c239f9
    lastState: {}
    name: metrics-server
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-08-18T14:21:27Z"
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
      recursiveReadOnly: Disabled
  - containerID: containerd://8a62b45b82930edcd4ea84c036cf957c92d7323fa4a4e0c8faf2854cf341aab8
    image: some-registry/registry.k8s.io/autoscaling/addon-resizer:1.8.21
    imageID: some-registry/registry.k8s.io/autoscaling/addon-resizer@sha256:583fd0c434a9be781acd9348d23c268f01ec453528ebef0b4a2140ea703f43f8
    lastState: {}
    name: metrics-server-nanny
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-08-18T14:21:29Z"
    volumeMounts:
    - mountPath: /etc/config
      name: nanny-config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
      recursiveReadOnly: Disabled
  hostIP: 10.128.3.251
  hostIPs:
  - ip: 10.128.3.251
  phase: Running
  podIP: 100.71.1.176
  podIPs:
  - ip: 100.71.1.176
  qosClass: Burstable
  startTime: "2025-08-18T14:21:24Z"
- Kubelet config:
spoiler for Kubelet config:
- Metrics server logs:
spoiler for Metrics Server logs:
:kube-system:devel [/tenant/cluster/]$ kubectl logs metrics-server-7d94f9cbcf-rfxzs -c metrics-server-nanny
I0818 14:21:29.919910 1 pod_nanny.go:83] Invoked by [/pod_nanny --config-dir=/etc/config --deployment=metrics-server --container=metrics-server --threshold=5 --poll-period=300000 --estimator=exponential --minClusterSize=80 --use-metrics=true]
I0818 14:21:30.020891 1 pod_nanny.go:84] Version: 1.8.21
I0818 14:21:30.021029 1 pod_nanny.go:100] Watching namespace: kube-system, pod: metrics-server-7d94f9cbcf-rfxzs, container: metrics-server.
I0818 14:21:30.021054 1 pod_nanny.go:101] storage: MISSING, extra_storage: 0Gi
I0818 14:21:30.021999 1 pod_nanny.go:135] cpu: 130m, extra_cpu: 1m, memory: 26Mi, extra_memory: 2Mi
I0818 14:21:30.022116 1 pod_nanny.go:269] Resources: [{Base:{i:{value:130 scale:-3} d:{Dec:} s:130m Format:DecimalSI} ExtraPerResource:{i:{value:1 scale:-3} d:{Dec:} s:1m Format:DecimalSI} Name:cpu} {Base:{i:{value:27262976 scale:0} d:{Dec:} s:26Mi Format:BinarySI} ExtraPerResource:{i:{value:2097152 scale:0} d:{Dec:} s:2Mi Format:BinarySI} Name:memory}]
:kube-system:devel [/tenant/cluster/]$
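The recommendation the nanny logs above (cpu 130m + 1m/node, memory 26Mi + 2Mi/node) lines up with the metrics-server container's resources in the manifest (210m / 186Mi). A minimal sketch of the linear per-node formula, assuming the estimator has clamped the effective cluster size to --minClusterSize=80 (the `nodes=40` value below is hypothetical, chosen only to show the clamp; this is not addon-resizer's actual code):

```python
def recommend(base, extra_per_node, nodes, min_cluster_size=80):
    """Resource recommendation: base + extra_per_node * effective cluster size,
    where the size is never allowed below min_cluster_size."""
    size = max(nodes, min_cluster_size)
    return base + extra_per_node * size

# Values from the nanny log: cpu 130m + 1m/node, memory 26Mi + 2Mi/node.
cpu_m = recommend(130, 1, nodes=40)   # millicores
mem_mi = recommend(26, 2, nodes=40)   # MiB
print(cpu_m, mem_mi)
```

With any node count at or below 80, this yields exactly the pod's 210m CPU and 186Mi memory, so the nanny is applying the clamped minimum cluster size.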
:kube-system:devel [/tenant/cluster/]$ kubectl logs metrics-server-7d94f9cbcf-rfxzs
Defaulted container "metrics-server" out of: metrics-server, metrics-server-nanny
I0818 14:21:30.423915 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0818 14:21:34.030213 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0818 14:21:34.140387 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0818 14:21:34.140422 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0818 14:21:34.140443 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0818 14:21:34.140458 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0818 14:21:34.140476 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0818 14:21:34.140481 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0818 14:21:34.231658 1 secure_serving.go:213] Serving securely on [::]:8444
I0818 14:21:34.231784 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0818 14:21:34.231921 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0818 14:21:34.241258 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0818 14:21:34.241263 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0818 14:21:34.241482 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E0818 14:21:44.212059 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:21:44.220341 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:21:44.220473 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:21:44.220531 1 scraper.go:149] "Failed to scrape node" err="Get "https://10.128.13.83:10250/metrics/resource\": dial tcp 10.128.13.83:10250: i/o timeout" node="fargate-ip-10-128-13-83.ec2.internal"
E0818 14:21:59.142558 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:21:59.152063 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:21:59.165544 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:21:59.192032 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:14.136276 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:22:14.145578 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:14.149884 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:22:14.176141 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:22:29.142511 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:22:29.145735 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:22:29.179042 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:29.190307 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
- Status of Metrics API:
spoiler for Status of Metrics API:
:kube-system:devel [/tenant/cluster/]$ kubectl describe apiservice v1beta1.metrics.k8s.io
Name: v1beta1.metrics.k8s.io
Namespace:
Labels: app.kubernetes.io/instance=metrics-server
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=metrics-server
app.kubernetes.io/version=0.7.2
dp-healthcheck-monitoring=critical
dp-healthcheck-monitoring-priority=critical
helm.sh/chart=metrics-server-3.12.2
Annotations: meta.helm.sh/release-name: metrics-server
meta.helm.sh/release-namespace: kube-system
API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2025-08-07T04:33:28Z
Resource Version: 6815125
UID: 277083fc-b95a-4c34-a9c8-d2672d971392
Spec:
Group: metrics.k8s.io
Group Priority Minimum: 100
Insecure Skip TLS Verify: true
Service:
Name: metrics-server
Namespace: kube-system
Port: 8444
Version: v1beta1
Version Priority: 100
Status:
Conditions:
Last Transition Time: 2025-08-08T09:29:37Z
Message: failing or missing response from https://100.71.1.176:8444/apis/metrics.k8s.io/v1beta1: Get "https://100.71.1.176:8444/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Reason: FailedDiscoveryCheck
Status: False
Type: Available
Events:
/kind bug