EKS e2e tests permanently failing

/kind bug

**What steps did you take and what happened:**

CI is failing with errors like this: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5211/pull-cluster-api-provider-aws-e2e-eks/1856371404925046784

It appears that the EKS control plane with addons test never succeeds.

[Further investigation](https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5215#discussion_r1854572614) showed that the control plane was constantly blocked on CoreDNS updating.

**What did you expect to happen:**

Tests pass

**Anything else you would like to add:**

I have tried running the test locally and have found that the CoreDNS pods never get scheduled.

A sample `kubectl describe` output:

```
Name:                 coredns-787cb67946-g9qjz
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 <none>
Labels:               eks.amazonaws.com/component=coredns
                      k8s-app=kube-dns
                      pod-template-hash=787cb67946
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-787cb67946
Containers:
  coredns:
    Image:       602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/coredns:v1.11.1-eksbuild.8
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kf9wg (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  kube-api-access-kf9wg:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  CriticalAddonsOnly op=Exists
                              node-role.kubernetes.io/control-plane:NoSchedule
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  2m38s (x64 over 12m)  default-scheduler  no nodes available to schedule pods
```

I'm trying to get access to the EKS nodes to validate the taints defined on them, but so far haven't been able to do so.
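For anyone reproducing this, a minimal sketch of the checks I have in mind, assuming `KUBECONFIG` points at the workload (EKS) cluster. The function names are my own and purely illustrative. As far as I know, the scheduler emits `no nodes available to schedule pods` when it sees no nodes at all, rather than nodes excluded by taints, so an empty node list would suggest the nodes never joined the cluster.

```shell
# Hypothetical helpers; assumes kubectl is installed and KUBECONFIG
# targets the workload cluster created by the e2e test.

# Print each registered node with the keys of its taints.
# An empty result means no nodes have registered with the API server.
list_node_taints() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Print FailedScheduling events in kube-system, oldest first.
coredns_scheduling_events() {
  kubectl -n kube-system get events \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp
}
```

Running `list_node_taints` first should show whether this is a taint problem or a node registration problem.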

Changing the Kubernetes version to 1.29 results in the same behavior.

**Environment:**

- Cluster-api-provider-aws version: `main`
- Kubernetes version: (use `kubectl version`): v1.30 
- OS (e.g. from `/etc/os-release`): 

EKS e2e tests permanently failing #5237
