CAPI pivot test case always failing in e2es #5252

@nrb

Description

/kind failing-test

What steps did you take and what happened:

Both pull request jobs and periodic jobs are regularly failing on the capa-e2e.[It] [unmanaged] [Cluster API Framework] Self Hosted Spec Should pivot the bootstrap cluster to a self-hosted cluster test case.

A sample periodic job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464

A sample pull request job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5250/pull-cluster-api-provider-aws-e2e/1867146874104844288

What did you expect to happen:

The test case would pass more often.

Anything else you would like to add:

Having dug into this a few times (see PRs #5249 and #5251), I've come to the conclusion that, for some reason, the container image for the CAPA manager that's built during the test run isn't present on the Kubeadm control plane node during a clusterctl move.

The samples below pull information from the periodic job at https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464

build log output

   [FAILED] Timed out after 1200.001s.
  Timed out waiting for all MachineDeployment self-hosted-rjpecj/self-hosted-lv1y15-md-0 Machines to be upgraded to kubernetes version v1.29.9
  The function passed to Eventually returned the following error:
      <*errors.fundamental | 0xc003693da0>: 
      old Machines remain
      {
          msg: "old Machines remain",
          stack: [0x25eeeaa, 0x4f0046, 0x4ef159, 0xa6931f, 0xa6a3ec, 0xa67a46, 0x25eeb93, 0x25f2ece, 0x26aaa6b, 0xa45593, 0xa5974d, 0x47b3a1],
      }
  In [It] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/machine_helpers.go:221 @ 12/11/24 22:12:08.155 

clusterctl move output

From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/logs/self-hosted-rjpecj/clusterctl-move.log

Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"

(retries continue until the job's terminated)
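
For anyone reproducing this before the job is torn down, the dead webhook backend can be confirmed directly; a minimal sketch, assuming the kubeconfig points at the cluster hosting the webhook (the namespace and Service name are taken from the error above):

  # Check whether the webhook Service has any ready backends.
  kubectl -n capa-system get endpoints capa-webhook-service -o wide

  # Check the state of the CAPA controller pods in the namespace.
  kubectl -n capa-system get pods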

Since this is failing to reach the webhooks, I looked at the CAPA control plane.

capa-manager Pod

This is the most obvious problem: the container image can't be pulled, sending the pod into ImagePullBackOff.

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/capa-system/Pod/capa-controller-manager-7f5964cb58-wmvb5.yaml

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:58Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: gcr.io/k8s-staging-cluster-api/capa-manager:e2e
    imageID: ""
    lastState: {}
    name: manager
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"
        reason: ImagePullBackOff
  hostIP: 10.0.136.158
  hostIPs:
  - ip: 10.0.136.158
  phase: Pending
  podIP: 192.168.74.199
  podIPs:
  - ip: 192.168.74.199
  qosClass: BestEffort
  startTime: "2024-12-11T21:52:55Z"
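
The same failure can be inspected live on the cluster; a minimal sketch (the pod name comes from the artifact above, so substitute whatever the current replica is called):

  # Show why the manager container can't start; the events should show the failed pulls.
  kubectl -n capa-system describe pod capa-controller-manager-7f5964cb58-wmvb5

  # Or list recent events in the namespace, sorted so the pull failures come last.
  kubectl -n capa-system get events --sort-by=.lastTimestamp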

Associated Node

The node associated with the pod does not list the gcr.io/k8s-staging-cluster-api/capa-manager:e2e image among its cached images (a one-liner to confirm this follows the list below).

From https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/Node/ip-10-0-136-158.us-west-2.compute.internal.yaml

 images:
  - names:
    - docker.io/calico/cni@sha256:e60b90d7861e872efa720ead575008bc6eca7bee41656735dcaa8210b688fcd9
    - docker.io/calico/cni:v3.24.1
    sizeBytes: 87382462
  - names:
    - docker.io/calico/node@sha256:43f6cee5ca002505ea142b3821a76d585aa0c8d22bc58b7e48589ca7deb48c13
    - docker.io/calico/node:v3.24.1
    sizeBytes: 80180860
  - names:
    - registry.k8s.io/etcd@sha256:29901446ff08461789b7cd8565fc5b538134e58f81ca1f50fd65d0371cf6571e
    - registry.k8s.io/etcd:3.5.11-0
    sizeBytes: 57232947
  - names:
    - registry.k8s.io/kube-apiserver@sha256:b88538e7fdf73583c8670540eec5b3620af75c9ec200434a5815ee7fba5021f3
    - registry.k8s.io/kube-apiserver:v1.29.9
    sizeBytes: 35210641
  - names:
    - registry.k8s.io/kube-controller-manager@sha256:f2f18973ccb6996687d10ba5bd1b8f303e3dd2fed80f831a44d2ac8191e5bb9b
    - registry.k8s.io/kube-controller-manager:v1.29.9
    sizeBytes: 33739229
  - names:
    - docker.io/calico/kube-controllers@sha256:4010b2739792ae5e77a750be909939c0a0a372e378f3c81020754efcf4a91efa
    - docker.io/calico/kube-controllers:v3.24.1
    sizeBytes: 31125927
  - names:
    - registry.k8s.io/provider-aws/aws-ebs-csi-driver@sha256:02c42645c7a672bbf313ed420e384507dbf0b04992624a3979b87aa4b3f9228e
    - registry.k8s.io/provider-aws/aws-ebs-csi-driver:v1.17.0
    sizeBytes: 30172691
  - names:
    - registry.k8s.io/kube-proxy@sha256:124040dbe6b5294352355f5d34c692ecbc940cdc57a8fd06d0f38f76b6138906
    - registry.k8s.io/kube-proxy:v1.29.9
    sizeBytes: 28600769
  - names:
    - registry.k8s.io/kube-proxy@sha256:559a093080f70ca863922f5e4bb90d6926d52653a91edb5b72c685ebb65f1858
    - registry.k8s.io/kube-proxy:v1.29.8
    sizeBytes: 28599399
  - names:
    - registry.k8s.io/sig-storage/csi-provisioner@sha256:e468dddcd275163a042ab297b2d8c2aca50d5e148d2d22f3b6ba119e2f31fa79
    - registry.k8s.io/sig-storage/csi-provisioner:v3.4.0
    sizeBytes: 27427836
  - names:
    - registry.k8s.io/sig-storage/csi-resizer@sha256:3a7bdf5d105783d05d0962fa06ca53032b01694556e633f27366201c2881e01d
    - registry.k8s.io/sig-storage/csi-resizer:v1.7.0
    sizeBytes: 25809460
  - names:
    - registry.k8s.io/sig-storage/csi-snapshotter@sha256:714aa06ccdd3781f1a76487e2dc7592ece9a12ae9e0b726e4f93d1639129b771
    - registry.k8s.io/sig-storage/csi-snapshotter:v6.2.1
    sizeBytes: 25537921
  - names:
    - registry.k8s.io/sig-storage/csi-attacher@sha256:34cf9b32736c6624fc9787fb149ea6e0fbeb45415707ac2f6440ac960f1116e6
    - registry.k8s.io/sig-storage/csi-attacher:v4.2.0
    sizeBytes: 25508181
  - names:
    - registry.k8s.io/kube-scheduler@sha256:9c164076eebaefdaebad46a5ccd550e9f38c63588c02d35163c6a09e164ab8a8
    - registry.k8s.io/kube-scheduler:v1.29.9
    sizeBytes: 18851030
  - names:
    - registry.k8s.io/coredns/coredns@sha256:1eeb4c7316bacb1d4c8ead65571cd92dd21e27359f0d4917f1a5822a73b75db1
    - registry.k8s.io/coredns/coredns:v1.11.1
    sizeBytes: 18182961
  - names:
    - gcr.io/k8s-staging-provider-aws/cloud-controller-manager@sha256:533d2d64c213719da59c5791835ba05e55ddaaeb2b220ecf7cc3d88823580fc7
    - gcr.io/k8s-staging-provider-aws/cloud-controller-manager:v1.20.0-alpha.0
    sizeBytes: 15350315
  - names:
    - registry.k8s.io/sig-storage/csi-node-driver-registrar@sha256:4a4cae5118c4404e35d66059346b7fa0835d7e6319ff45ed73f4bba335cf5183
    - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0
    sizeBytes: 10147874
  - names:
    - registry.k8s.io/sig-storage/livenessprobe@sha256:2b10b24dafdc3ba94a03fc94d9df9941ca9d6a9207b927f5dfd21d59fbe05ba0
    - registry.k8s.io/sig-storage/livenessprobe:v2.9.0
    sizeBytes: 9194114
  - names:
    - registry.k8s.io/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
    - registry.k8s.io/pause:3.9
    sizeBytes: 321520
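
Rather than scanning the node YAML by hand, the check mentioned above can be done with a one-liner; a sketch, using the node name from the artifact (not part of the test suite):

  kubectl get node ip-10-0-136-158.us-west-2.compute.internal \
    -o jsonpath='{range .status.images[*]}{.names}{"\n"}{end}' | grep capa-manager

An empty result here confirms the capa-manager:e2e image was never pulled onto the node.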

KubeadmConfig

The KubeadmConfig shows that containerd should pull the CAPA manager image from ECR and re-tag it before the node joins the cluster.

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464/artifacts/clusters/self-hosted-lv1y15/resources/self-hosted-rjpecj/KubeadmConfig/self-hosted-lv1y15-control-plane-qhfvf.yaml

  preKubeadmCommands:
  - mkdir -p /opt/cluster-api
  - ctr -n k8s.io images pull "public.ecr.aws/m3v9m3w5/capa/update:e2e"
  - ctr -n k8s.io images tag "public.ecr.aws/m3v9m3w5/capa/update:e2e" gcr.io/k8s-staging-cluster-api/capa-manager:e2e

The KubeadmControlPlane has the same entry.
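
Since these preKubeadmCommands are rendered into cloud-init user data on the node, a failed pull should leave a trace there; a minimal sketch for checking over SSH or SSM (the log path is the standard cloud-init location on Ubuntu, not anything CAPA-specific):

  # Confirm whether the pull and re-tag ever succeeded in containerd's k8s.io namespace.
  ctr -n k8s.io images ls | grep capa-manager

  # Look for the output of a failed pull in the cloud-init log.
  grep -A 2 'ctr -n k8s.io images pull' /var/log/cloud-init-output.log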

Creating the test image

Based on our end-to-end test definitions, the image is successfully built and uploaded to ECR, and all other tests seem to be able to pull it.

The ensureTestImageUploaded function logs in to ECR and uploads the image so that the nodes can then pull it. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/aws.go#L676

The Ginkgo suites require this function to succeed:

Expect(ensureTestImageUploaded(e2eCtx)).NotTo(HaveOccurred())
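
On the registry side, whether the :e2e tag actually made it to ECR can be checked with the AWS CLI; a sketch, assuming the capa/update repository name seen in the preKubeadmCommands (ECR Public API calls must target us-east-1, and this needs credentials for the account that owns the registry):

  aws ecr-public describe-images \
    --region us-east-1 \
    --repository-name capa/update \
    --query 'imageDetails[].imageTags'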

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release): Ubuntu on Kube CI

Metadata

Labels

kind/failing-test, priority/critical-urgent, triage/accepted
