Description
/kind bug
What steps did you take and what happened:
Creating an EKS cluster with a custom CNI (such as Cilium) is currently problematic because CAPA does not correctly remove the automatically preinstalled VPC CNI. This can be seen by the aws-node pods still running on the cluster.
CAPA has code to delete the AWS VPC CNI resources if AWSManagedControlPlane.spec.vpcCni.disable=true, but it only deletes them if they are not managed by Helm. I presume that is intentional, so that CAPA users can deploy the VPC CNI with Helm in their own way (an example manifest with this flag is shown after the quoted code below).
cluster-api-provider-aws/pkg/cloud/services/awsnode/cni.go
Lines 269 to 293 in 3618d1c
func (s *Service) deleteResource(ctx context.Context, remoteClient client.Client, key client.ObjectKey, obj client.Object) error {
    if err := remoteClient.Get(ctx, key, obj); err != nil {
        if !apierrors.IsNotFound(err) {
            return fmt.Errorf("deleting resource %s: %w", key, err)
        }
        s.scope.Debug(fmt.Sprintf("resource %s was not found, no action", key))
    } else {
        // resource found, delete if no label or not managed by helm
        if val, ok := obj.GetLabels()[konfig.ManagedbyLabelKey]; !ok || val != "Helm" {
            if err := remoteClient.Delete(ctx, obj, &client.DeleteOptions{}); err != nil {
                if !apierrors.IsNotFound(err) {
                    return fmt.Errorf("deleting %s: %w", key, err)
                }
                s.scope.Debug(fmt.Sprintf(
                    "resource %s was not found, not deleted", key))
            } else {
                s.scope.Debug(fmt.Sprintf("resource %s was deleted", key))
            }
        } else {
            s.scope.Debug(fmt.Sprintf("resource %s is managed by helm, not deleted", key))
        }
    }
    return nil
}
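For reference, the flag mentioned above is set on the control plane object. A minimal sketch of the relevant part of an AWSManagedControlPlane follows; only spec.vpcCni.disable is taken from the actual API, while the apiVersion and metadata shown here are assumptions for illustration (CAPA ~v2.x, v1beta2 API):

apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: my-cluster-control-plane   # hypothetical name
spec:
  # ... other fields omitted ...
  vpcCni:
    disable: true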
Unfortunately, it seems that AWS introduced a breaking change by switching its own automagic deployment method to Helm-templated manifests, including the relevant labels. This is what a newly created EKS cluster looks like (VPC CNI not disabled, cluster created by CAPA ~v2.3.0):
$ kubectl get ds -n kube-system aws-node -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2024-01-18T15:40:42Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.15.1
    helm.sh/chart: aws-vpc-cni-1.15.1
    k8s-app: aws-node
  name: aws-node
  namespace: kube-system
  # [...]
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: aws-vpc-cni
        app.kubernetes.io/name: aws-node
        k8s-app: aws-node
The deletion code must be fixed. Sadly, AWS does not add any extra label to denote that the deployment is AWS-managed, and this breaking change even applies to older Kubernetes versions such as 1.24.
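One possible direction for a fix (a sketch under assumptions, not a tested implementation): judging from the output above, the preinstalled DaemonSet carries the chart labels but does not appear to have Helm's ownership annotations (meta.helm.sh/release-name / meta.helm.sh/release-namespace), which Helm 3 adds to resources it actually installs. The check could therefore require the ownership annotation in addition to the managed-by label. The helper below is hypothetical and not part of the current CAPA code base; the constant names are made up for illustration:

package awsnode

import (
    "sigs.k8s.io/controller-runtime/pkg/client"
)

const (
    // Label and annotation keys as used by Helm 3; the constant names
    // here are illustrative, not taken from the CAPA code base.
    managedByLabelKey         = "app.kubernetes.io/managed-by"
    helmReleaseNameAnnotation = "meta.helm.sh/release-name"
)

// isHelmOwned reports whether obj was actually installed by a Helm
// release: it must carry both the managed-by=Helm label and Helm's
// release-name ownership annotation. Resources merely rendered from a
// Helm chart by AWS (like the preinstalled aws-node DaemonSet above)
// would not match and would therefore still be deleted.
func isHelmOwned(obj client.Object) bool {
    if val, ok := obj.GetLabels()[managedByLabelKey]; !ok || val != "Helm" {
        return false
    }
    _, hasRelease := obj.GetAnnotations()[helmReleaseNameAnnotation]
    return hasRelease
}

deleteResource could then call isHelmOwned(obj) instead of checking the label alone. Whether that heuristic is acceptable for all user-managed VPC CNI installations (e.g. charts applied via helm template | kubectl apply, which also lack the ownership annotations) is exactly what needs to be decided in this issue.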
Related: an E2E test is wanted to cover this feature (issue)
Environment:
- Cluster-api-provider-aws version: ~v2.3.0 (fork with some backports)
- Kubernetes version (use kubectl version): 1.24