EKS VPC CNI cannot be disabled because AWS now installs via Helm #4743

Description

@AndiDog

/kind bug

What steps did you take and what happened:

Creating an EKS cluster with a custom CNI (such as Cilium) is currently problematic because CAPA does not correctly remove the automatically preinstalled VPC CNI. This can be seen from the aws-node pods still running on the cluster.

CAPA has code to delete the AWS VPC CNI resources if AWSManagedControlPlane.spec.vpcCni.disable=true, but it only deletes them if they are not managed by Helm. I presume that is on purpose, so that CAPA users can deploy the VPC CNI with Helm in their own way.
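For reference, the flag lives on the AWSManagedControlPlane. A minimal sketch of how a user would request the VPC CNI to be removed (metadata names are placeholders, and the apiVersion assumes the v1beta2 API of CAPA v2.x):

apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: my-cluster-control-plane   # placeholder
  namespace: default
spec:
  vpcCni:
    disable: true

The deletion logic that honors this flag: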

func (s *Service) deleteResource(ctx context.Context, remoteClient client.Client, key client.ObjectKey, obj client.Object) error {
    if err := remoteClient.Get(ctx, key, obj); err != nil {
        if !apierrors.IsNotFound(err) {
            return fmt.Errorf("deleting resource %s: %w", key, err)
        }
        s.scope.Debug(fmt.Sprintf("resource %s was not found, no action", key))
    } else {
        // resource found, delete if no label or not managed by helm
        if val, ok := obj.GetLabels()[konfig.ManagedbyLabelKey]; !ok || val != "Helm" {
            if err := remoteClient.Delete(ctx, obj, &client.DeleteOptions{}); err != nil {
                if !apierrors.IsNotFound(err) {
                    return fmt.Errorf("deleting %s: %w", key, err)
                }
                s.scope.Debug(fmt.Sprintf(
                    "resource %s was not found, not deleted", key))
            } else {
                s.scope.Debug(fmt.Sprintf("resource %s was deleted", key))
            }
        } else {
            s.scope.Debug(fmt.Sprintf("resource %s is managed by helm, not deleted", key))
        }
    }
    return nil
}

Unfortunately, it seems that AWS introduced a breaking change by switching their own automagic deployment method to Helm, including the relevant labels. This is what a newly-created EKS cluster looks like (VPC CNI not disabled, cluster created by CAPA ~v2.3.0):

$ kubectl get ds -n kube-system aws-node -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2024-01-18T15:40:42Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.15.1
    helm.sh/chart: aws-vpc-cni-1.15.1
    k8s-app: aws-node
  name: aws-node
  namespace: kube-system
  # [...]
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: aws-vpc-cni
        app.kubernetes.io/name: aws-node
        k8s-app: aws-node

The deletion code must be fixed. Sadly, AWS does not set any extra label that would denote the deployment as AWS-managed. This breaking change even applies to older Kubernetes versions such as 1.24.
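Purely as an illustration of one possible direction (not necessarily the fix upstream will adopt), the Helm check could be skipped entirely when the user has explicitly set spec.vpcCni.disable=true, so the resources are removed no matter how they were installed. A rough sketch with an illustrative function name, leaving out the scope logging of the original:

package awsnode

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteResourceIgnoringHelmLabel deletes the given VPC CNI resource regardless of
// any app.kubernetes.io/managed-by=Helm label, treating spec.vpcCni.disable as
// authoritative. Illustrative sketch only, not the actual CAPA change.
func deleteResourceIgnoringHelmLabel(ctx context.Context, remoteClient client.Client, key client.ObjectKey, obj client.Object) error {
    if err := remoteClient.Get(ctx, key, obj); err != nil {
        if apierrors.IsNotFound(err) {
            // Already gone, nothing to delete.
            return nil
        }
        return fmt.Errorf("getting resource %s: %w", key, err)
    }
    // Delete even if the object carries Helm ownership labels, since EKS now
    // installs the VPC CNI via Helm and the user asked for it to be disabled.
    if err := remoteClient.Delete(ctx, obj, &client.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
        return fmt.Errorf("deleting %s: %w", key, err)
    }
    return nil
}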

Related: an E2E test is wanted to cover this feature (issue)

Environment:

  • Cluster-api-provider-aws version: ~v2.3.0 (fork with some backports)
  • Kubernetes version (kubectl version): 1.24

Labels

help wanted, kind/bug, lifecycle/stale, needs-triage, priority/important-soon
