Description
What happened: the metrics-server addon-resizer (nanny) container experiences CPU throttling in an EKS cluster.
What you expected to happen: the nanny container should smoothly auto-scale the metrics-server container's resources based on the number of nodes in the cluster.
Anything else we need to know?: We run metrics-server in our EKS cluster with addonResizer enabled to auto-scale resource requests/limits with cluster size. However, the nanny container (addon-resizer) is CPU-throttled even when configured with a 1000m CPU limit, far above the official Helm chart default (~40m).
I also opened an issue in the addon-resizer repository, kubernetes/autoscaler#8409, but have had no response.
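To quantify the throttling described above, one way is to read the container's cgroup v2 `cpu.stat` (e.g. `kubectl exec <pod> -c metrics-server-nanny -- cat /sys/fs/cgroup/cpu.stat`) and compare throttled CFS periods against total periods. A minimal sketch (the sample values below are illustrative, not taken from this cluster):

```python
def throttled_ratio(cpu_stat: str) -> float:
    """Fraction of CFS scheduling periods in which the cgroup was throttled,
    parsed from the cgroup v2 cpu.stat 'key value' line format."""
    stats = dict(line.split() for line in cpu_stat.strip().splitlines())
    nr_periods = int(stats.get("nr_periods", 0))
    nr_throttled = int(stats.get("nr_throttled", 0))
    return nr_throttled / nr_periods if nr_periods else 0.0

# Illustrative cpu.stat snapshot; real values come from the running container.
sample = """usage_usec 8123456
user_usec 6000000
system_usec 2123456
nr_periods 1000
nr_throttled 437
throttled_usec 9876543
"""
print(f"{throttled_ratio(sample):.1%} of CFS periods throttled")
```

A consistently high ratio under a 1000m limit would confirm the throttling is real rather than a metrics artifact.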
Environment:
- Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS
- Container Network Setup (flannel, calico, etc.): AWS CNI
- Kubernetes version (kubectl version): v1.32.5-eks-5d4a308
- Metrics Server manifest
spoiler for Metrics Server manifest:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2025-08-18T14:20:39Z"
  generateName: metrics-server-7d94f9cbcf-
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/name: metrics-server
    pod-template-hash: 7d94f9cbcf
  name: metrics-server-7d94f9cbcf-rfxzs
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: metrics-server-7d94f9cbcf
    uid: ba9e5689-7159-4d35-88f9-25a6e1717a74
  resourceVersion: "6814675"
  uid: d8f7c9d1-1e3c-4390-b5cf-4d5f94da11b7
spec:
  containers:
  - args:
    - --secure-port=8444
    - --cert-dir=/tmp
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --kubelet-use-node-status-port
    - --metric-resolution=15s
    image: some-registry/registry.k8s.io/metrics-server/metrics-server:v0.7.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8444
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: metrics-server
    ports:
    - containerPort: 8444
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8444
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 210m
        memory: 186Mi
      requests:
        cpu: 210m
        memory: 186Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
  - command:
    - /pod_nanny
    - --config-dir=/etc/config
    - --deployment=metrics-server
    - --container=metrics-server
    - --threshold=5
    - --poll-period=300000
    - --estimator=exponential
    - --minClusterSize=80
    - --use-metrics=true
    env:
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: MY_POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: some-registry/registry.k8s.io/autoscaling/addon-resizer:1.8.21
    imagePullPolicy: IfNotPresent
    name: metrics-server-nanny
    resources:
      limits:
        cpu: 50m
        memory: 100Mi
      requests:
        cpu: 30m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/config
      name: nanny-config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-128-3-251.ec2.internal
  nodeSelector:
    workload: on-demand
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: metrics-server
  serviceAccountName: metrics-server
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: workload
    operator: Equal
    value: on-demand
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp
  - configMap:
      defaultMode: 420
      name: metrics-server-nanny-config
    name: nanny-config-volume
  - name: kube-api-access-qhn52
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:30Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:24Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:51Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:51Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-08-18T14:21:24Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b868d46c7b7b5acb8f242389fea57464054fe6b22160f7c7193b2564fda3d9cf
    image: some-registry/registry.k8s.io/metrics-server/metrics-server:v0.7.2
    imageID: some-registry/registry.k8s.io/metrics-server/metrics-server@sha256:ffcb2bf004d6aa0a17d90e0247cf94f2865c8901dcab4427034c341951c239f9
    lastState: {}
    name: metrics-server
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-08-18T14:21:27Z"
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
      recursiveReadOnly: Disabled
  - containerID: containerd://8a62b45b82930edcd4ea84c036cf957c92d7323fa4a4e0c8faf2854cf341aab8
    image: some-registry/registry.k8s.io/autoscaling/addon-resizer:1.8.21
    imageID: some-registry/registry.k8s.io/autoscaling/addon-resizer@sha256:583fd0c434a9be781acd9348d23c268f01ec453528ebef0b4a2140ea703f43f8
    lastState: {}
    name: metrics-server-nanny
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-08-18T14:21:29Z"
    volumeMounts:
    - mountPath: /etc/config
      name: nanny-config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qhn52
      readOnly: true
      recursiveReadOnly: Disabled
  hostIP: 10.128.3.251
  hostIPs:
  - ip: 10.128.3.251
  phase: Running
  podIP: 100.71.1.176
  podIPs:
  - ip: 100.71.1.176
  qosClass: Burstable
  startTime: "2025-08-18T14:21:24Z"
- Kubelet config:
spoiler for Kubelet config:
- Metrics server logs:
spoiler for Metrics Server logs:
:kube-system:devel [/tenant/cluster/]$ kubectl logs metrics-server-7d94f9cbcf-rfxzs -c metrics-server-nanny
I0818 14:21:29.919910 1 pod_nanny.go:83] Invoked by [/pod_nanny --config-dir=/etc/config --deployment=metrics-server --container=metrics-server --threshold=5 --poll-period=300000 --estimator=exponential --minClusterSize=80 --use-metrics=true]
I0818 14:21:30.020891 1 pod_nanny.go:84] Version: 1.8.21
I0818 14:21:30.021029 1 pod_nanny.go:100] Watching namespace: kube-system, pod: metrics-server-7d94f9cbcf-rfxzs, container: metrics-server.
I0818 14:21:30.021054 1 pod_nanny.go:101] storage: MISSING, extra_storage: 0Gi
I0818 14:21:30.021999 1 pod_nanny.go:135] cpu: 130m, extra_cpu: 1m, memory: 26Mi, extra_memory: 2Mi
I0818 14:21:30.022116 1 pod_nanny.go:269] Resources: [{Base:{i:{value:130 scale:-3} d:{Dec:} s:130m Format:DecimalSI} ExtraPerResource:{i:{value:1 scale:-3} d:{Dec:} s:1m Format:DecimalSI} Name:cpu} {Base:{i:{value:27262976 scale:0} d:{Dec:} s:26Mi Format:BinarySI} ExtraPerResource:{i:{value:2097152 scale:0} d:{Dec:} s:2Mi Format:BinarySI} Name:memory}]
:kube-system:devel [/tenant/cluster/]$
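The recommendation the nanny logs above (cpu 130m + 1m/node, memory 26Mi + 2Mi/node) lines up with the metrics-server container's resources in the manifest (210m / 186Mi). A minimal sketch of the linear per-node formula, assuming the estimator has clamped the effective cluster size to --minClusterSize=80 (the `nodes=40` value below is hypothetical, chosen only to show the clamp; this is not addon-resizer's actual code):

```python
def recommend(base, extra_per_node, nodes, min_cluster_size=80):
    """Resource recommendation: base + extra_per_node * effective cluster size,
    where the size is never allowed below min_cluster_size."""
    size = max(nodes, min_cluster_size)
    return base + extra_per_node * size

# Values from the nanny log: cpu 130m + 1m/node, memory 26Mi + 2Mi/node.
cpu_m = recommend(130, 1, nodes=40)   # millicores
mem_mi = recommend(26, 2, nodes=40)   # MiB
print(cpu_m, mem_mi)
```

With any node count at or below 80, this yields exactly the pod's 210m CPU and 186Mi memory, so the nanny is applying the clamped minimum cluster size.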
:kube-system:devel [/tenant/cluster/]$ kubectl logs metrics-server-7d94f9cbcf-rfxzs
Defaulted container "metrics-server" out of: metrics-server, metrics-server-nanny
I0818 14:21:30.423915 1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0818 14:21:34.030213 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0818 14:21:34.140387 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0818 14:21:34.140422 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0818 14:21:34.140443 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0818 14:21:34.140458 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0818 14:21:34.140476 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0818 14:21:34.140481 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0818 14:21:34.231658 1 secure_serving.go:213] Serving securely on [::]:8444
I0818 14:21:34.231784 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0818 14:21:34.231921 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0818 14:21:34.241258 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0818 14:21:34.241263 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0818 14:21:34.241482 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E0818 14:21:44.212059 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:21:44.220341 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:21:44.220473 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:21:44.220531 1 scraper.go:149] "Failed to scrape node" err="Get "https://10.128.13.83:10250/metrics/resource\": dial tcp 10.128.13.83:10250: i/o timeout" node="fargate-ip-10-128-13-83.ec2.internal"
E0818 14:21:59.142558 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:21:59.152063 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:21:59.165544 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:21:59.192032 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:14.136276 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:22:14.145578 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:14.149884 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
E0818 14:22:14.176141 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:22:29.142511 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.83:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-83.ec2.internal" timeout="10s"
E0818 14:22:29.145735 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.12.181:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-12-181.ec2.internal" timeout="10s"
E0818 14:22:29.179042 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.14.218:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-14-218.ec2.internal" timeout="10s"
E0818 14:22:29.190307 1 scraper.go:147] "Failed to scrape node, timeout to access kubelet" err="Get "https://10.128.13.98:10250/metrics/resource\": context deadline exceeded" node="fargate-ip-10-128-13-98.ec2.internal" timeout="10s"
- Status of Metrics API:
spoiler for Status of Metrics API:
:kube-system:devel [/tenant/cluster/]$ kubectl describe apiservice v1beta1.metrics.k8s.io
Name: v1beta1.metrics.k8s.io
Namespace:
Labels: app.kubernetes.io/instance=metrics-server
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=metrics-server
app.kubernetes.io/version=0.7.2
dp-healthcheck-monitoring=critical
dp-healthcheck-monitoring-priority=critical
helm.sh/chart=metrics-server-3.12.2
Annotations: meta.helm.sh/release-name: metrics-server
meta.helm.sh/release-namespace: kube-system
API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2025-08-07T04:33:28Z
Resource Version: 6815125
UID: 277083fc-b95a-4c34-a9c8-d2672d971392
Spec:
Group: metrics.k8s.io
Group Priority Minimum: 100
Insecure Skip TLS Verify: true
Service:
Name: metrics-server
Namespace: kube-system
Port: 8444
Version: v1beta1
Version Priority: 100
Status:
Conditions:
Last Transition Time: 2025-08-08T09:29:37Z
Message: failing or missing response from https://100.71.1.176:8444/apis/metrics.k8s.io/v1beta1: Get "https://100.71.1.176:8444/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Reason: FailedDiscoveryCheck
Status: False
Type: Available
Events:
/kind bug