/kind failing-test
**What steps did you take and what happened:**
Both pull request jobs and periodic jobs are regularly failing on the `capa-e2e.[It] [unmanaged] [Cluster API Framework] Self Hosted Spec Should pivot the bootstrap cluster to a self-hosted cluster` test case.
A sample periodic job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464
A sample pull request job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-aws/5250/pull-cluster-api-provider-aws-e2e/1867146874104844288
**What did you expect to happen:**
The test case to pass more consistently.
**Anything else you would like to add:**
Having dug into this a few times (see PRs #5249 and #5251), I've come to the conclusion that, for some reason, the container image for the CAPA manager that's built during the test run isn't present on the kubeadm control plane node during a `clusterctl move`.
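A quick way to confirm this on a live run (a diagnostic sketch, assuming SSH or SSM access to the affected node; not part of the test suite):

```sh
# On the self-hosted cluster's control plane node: both listings should
# include the capa-manager image if the bootstrap-time pull succeeded.
sudo ctr -n k8s.io images ls | grep capa-manager
sudo crictl images | grep capa-manager
```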
The samples below are pulled from the periodic job at https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-provider-aws-e2e/1866955108403646464
**build log output**

```
[FAILED] Timed out after 1200.001s.
Timed out waiting for all MachineDeployment self-hosted-rjpecj/self-hosted-lv1y15-md-0 Machines to be upgraded to kubernetes version v1.29.9
The function passed to Eventually returned the following error:
    <*errors.fundamental | 0xc003693da0>:
    old Machines remain
    {
        msg: "old Machines remain",
        stack: [0x25eeeaa, 0x4f0046, 0x4ef159, 0xa6931f, 0xa6a3ec, 0xa67a46, 0x25eeb93, 0x25f2ece, 0x26aaa6b, 0xa45593, 0xa5974d, 0x47b3a1],
    }
In [It] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/machine_helpers.go:221 @ 12/11/24 22:12:08.155
```
**clusterctl move output**

```
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
Retrying with backoff cause="error adding delete-for-move annotation from \"infrastructure.cluster.x-k8s.io/v1beta2, Kind=AWSMachine\" self-hosted-rjpecj/self-hosted-lv1y15-md-0-9xwxz-5hxvg: Internal error occurred: failed calling webhook \"mutation.awsmachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capa-webhook-service.capa-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta2-awsmachine?timeout=10s\": dial tcp 10.106.211.204:443: connect: connection refused"
Deleting AWSMachine="self-hosted-lv1y15-md-0-9xwxz-5hxvg" Namespace="self-hosted-rjpecj"
```

_(The retries continue until the job is terminated.)_
Since the failure is in reaching CAPA's webhooks, I looked at the CAPA control plane components.
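The connection-refused errors are consistent with the webhook Service having no ready backends. A minimal check, assuming a kubeconfig pointed at the self-hosted cluster:

```sh
# With the manager pod unable to start, the Endpoints object behind the
# webhook Service should be empty, matching the errors above.
kubectl -n capa-system get endpoints capa-webhook-service
kubectl -n capa-system get pods -o wide
```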
**capa-manager Pod**

This is the most obvious problem: the container image can't be pulled, sending the pod into ImagePullBackOff (a sketch for inspecting the pull error follows the status below).
```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:58Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-12-11T21:52:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: gcr.io/k8s-staging-cluster-api/capa-manager:e2e
    imageID: ""
    lastState: {}
    name: manager
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"
        reason: ImagePullBackOff
  hostIP: 10.0.136.158
  hostIPs:
  - ip: 10.0.136.158
  phase: Pending
  podIP: 192.168.74.199
  podIPs:
  - ip: 192.168.74.199
  qosClass: BestEffort
  startTime: "2024-12-11T21:52:55Z"
```
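The registry's actual error (manifest not found vs. an auth failure) is recorded in the pod's events; a diagnostic sketch, with the label selector assumed from the default CAPA manifests:

```sh
# The Events section at the bottom of the output shows each failed pull
# attempt together with the error returned by the registry.
kubectl -n capa-system describe pod -l control-plane=capa-controller-manager
```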
**Associated Node**

The node associated with the pod does not list the `gcr.io/k8s-staging-cluster-api/capa-manager:e2e` image as present (see the sketch after this list for querying that directly).
```yaml
images:
- names:
  - docker.io/calico/cni@sha256:e60b90d7861e872efa720ead575008bc6eca7bee41656735dcaa8210b688fcd9
  - docker.io/calico/cni:v3.24.1
  sizeBytes: 87382462
- names:
  - docker.io/calico/node@sha256:43f6cee5ca002505ea142b3821a76d585aa0c8d22bc58b7e48589ca7deb48c13
  - docker.io/calico/node:v3.24.1
  sizeBytes: 80180860
- names:
  - registry.k8s.io/etcd@sha256:29901446ff08461789b7cd8565fc5b538134e58f81ca1f50fd65d0371cf6571e
  - registry.k8s.io/etcd:3.5.11-0
  sizeBytes: 57232947
- names:
  - registry.k8s.io/kube-apiserver@sha256:b88538e7fdf73583c8670540eec5b3620af75c9ec200434a5815ee7fba5021f3
  - registry.k8s.io/kube-apiserver:v1.29.9
  sizeBytes: 35210641
- names:
  - registry.k8s.io/kube-controller-manager@sha256:f2f18973ccb6996687d10ba5bd1b8f303e3dd2fed80f831a44d2ac8191e5bb9b
  - registry.k8s.io/kube-controller-manager:v1.29.9
  sizeBytes: 33739229
- names:
  - docker.io/calico/kube-controllers@sha256:4010b2739792ae5e77a750be909939c0a0a372e378f3c81020754efcf4a91efa
  - docker.io/calico/kube-controllers:v3.24.1
  sizeBytes: 31125927
- names:
  - registry.k8s.io/provider-aws/aws-ebs-csi-driver@sha256:02c42645c7a672bbf313ed420e384507dbf0b04992624a3979b87aa4b3f9228e
  - registry.k8s.io/provider-aws/aws-ebs-csi-driver:v1.17.0
  sizeBytes: 30172691
- names:
  - registry.k8s.io/kube-proxy@sha256:124040dbe6b5294352355f5d34c692ecbc940cdc57a8fd06d0f38f76b6138906
  - registry.k8s.io/kube-proxy:v1.29.9
  sizeBytes: 28600769
- names:
  - registry.k8s.io/kube-proxy@sha256:559a093080f70ca863922f5e4bb90d6926d52653a91edb5b72c685ebb65f1858
  - registry.k8s.io/kube-proxy:v1.29.8
  sizeBytes: 28599399
- names:
  - registry.k8s.io/sig-storage/csi-provisioner@sha256:e468dddcd275163a042ab297b2d8c2aca50d5e148d2d22f3b6ba119e2f31fa79
  - registry.k8s.io/sig-storage/csi-provisioner:v3.4.0
  sizeBytes: 27427836
- names:
  - registry.k8s.io/sig-storage/csi-resizer@sha256:3a7bdf5d105783d05d0962fa06ca53032b01694556e633f27366201c2881e01d
  - registry.k8s.io/sig-storage/csi-resizer:v1.7.0
  sizeBytes: 25809460
- names:
  - registry.k8s.io/sig-storage/csi-snapshotter@sha256:714aa06ccdd3781f1a76487e2dc7592ece9a12ae9e0b726e4f93d1639129b771
  - registry.k8s.io/sig-storage/csi-snapshotter:v6.2.1
  sizeBytes: 25537921
- names:
  - registry.k8s.io/sig-storage/csi-attacher@sha256:34cf9b32736c6624fc9787fb149ea6e0fbeb45415707ac2f6440ac960f1116e6
  - registry.k8s.io/sig-storage/csi-attacher:v4.2.0
  sizeBytes: 25508181
- names:
  - registry.k8s.io/kube-scheduler@sha256:9c164076eebaefdaebad46a5ccd550e9f38c63588c02d35163c6a09e164ab8a8
  - registry.k8s.io/kube-scheduler:v1.29.9
  sizeBytes: 18851030
- names:
  - registry.k8s.io/coredns/coredns@sha256:1eeb4c7316bacb1d4c8ead65571cd92dd21e27359f0d4917f1a5822a73b75db1
  - registry.k8s.io/coredns/coredns:v1.11.1
  sizeBytes: 18182961
- names:
  - gcr.io/k8s-staging-provider-aws/cloud-controller-manager@sha256:533d2d64c213719da59c5791835ba05e55ddaaeb2b220ecf7cc3d88823580fc7
  - gcr.io/k8s-staging-provider-aws/cloud-controller-manager:v1.20.0-alpha.0
  sizeBytes: 15350315
- names:
  - registry.k8s.io/sig-storage/csi-node-driver-registrar@sha256:4a4cae5118c4404e35d66059346b7fa0835d7e6319ff45ed73f4bba335cf5183
  - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.7.0
  sizeBytes: 10147874
- names:
  - registry.k8s.io/sig-storage/livenessprobe@sha256:2b10b24dafdc3ba94a03fc94d9df9941ca9d6a9207b927f5dfd21d59fbe05ba0
  - registry.k8s.io/sig-storage/livenessprobe:v2.9.0
  sizeBytes: 9194114
- names:
  - registry.k8s.io/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
  - registry.k8s.io/pause:3.9
  sizeBytes: 321520
```
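The same check can be scripted against the Node object (a sketch; substitute the node name from the failing run):

```sh
# Print every image name cached on the node and filter for the CAPA
# image; empty output means the node never pulled (or retagged) it.
kubectl get node <node-name> \
  -o jsonpath='{range .status.images[*]}{.names}{"\n"}{end}' | grep capa-manager
```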
**KubeadmConfig**

The KubeadmConfig shows that containerd should pull the CAPA container image from ECR and retag it before the node joins the cluster:
```yaml
preKubeadmCommands:
- mkdir -p /opt/cluster-api
- ctr -n k8s.io images pull "public.ecr.aws/m3v9m3w5/capa/update:e2e"
- ctr -n k8s.io images tag "public.ecr.aws/m3v9m3w5/capa/update:e2e" gcr.io/k8s-staging-cluster-api/capa-manager:e2e
```

The KubeadmControlPlane has the same entry.
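Since kubeadm bootstrap data runs through cloud-init on the Ubuntu-based AMIs, a failed pull should leave a trace in the cloud-init logs. A diagnostic sketch, assuming SSH or SSM access to the node:

```sh
# Any ctr error (bad credentials, missing tag, network failure) should
# appear immediately after the pull command in the cloud-init output.
sudo grep -A2 'ctr -n k8s.io images pull' /var/log/cloud-init-output.log
sudo cloud-init status --long
```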
**Creating the test image**
Based on our end-to-end test definitions, the image is successfully created and uploaded to ECR, and all other tests seem to be able to find it.

The `ensureTestImageUploaded` function logs in to ECR and uploads the image so that the nodes can then pull it: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/shared/aws.go#L676
The Ginkgo suites require this function to pass:

```go
Expect(ensureTestImageUploaded(e2eCtx)).NotTo(HaveOccurred())
```
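To rule out the upload side, the tag can be checked in ECR Public directly (a sketch; the repository name is inferred from the image reference above, and the `ecr-public` API is only served from us-east-1):

```sh
# Confirm the e2e tag exists in the public repository...
aws ecr-public describe-images --region us-east-1 \
  --repository-name capa/update --image-ids imageTag=e2e
# ...or simply try pulling it from any machine with Docker installed.
docker pull public.ecr.aws/m3v9m3w5/capa/update:e2e
```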
**Environment:**

- Cluster-api-provider-aws version: main
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`): Ubuntu on Kube CI