Creating the vpc-cni addon after creating a new EKS cluster fails due to PolicyEndpoint CRD conflict. #200

@tommyf-smilecdr

Description

Problem Summary

Creating the vpc-cni addon after creating a new EKS cluster fails due to PolicyEndpoint CRD conflict.

Problem Details

If you create a new EKS cluster after November 13th 2025 without the vpc-cni addon, attempting to add it after the fact will result in the following error:

code: ConfigurationConflict
message: Conflicts found when trying to apply. Will not continue due to resolve conflicts mode. Conflicts: CustomResourceDefinition.apiextensions.k8s.io policyendpoints.networking.k8s.aws - .spec.versions

Steps to reproduce

  1. Create EKS Cluster without the vpc-cni addon. The EKS version does not seem to matter. I have only tried with 1.34 and 1.33.
    This can be done in a number of ways with the same result:
    • Using eksctl or the aws cli.
    • Using the AWS console with EKS Auto Mode (this does not install the vpc-cni addon).
    • Using the AWS console WITHOUT EKS Auto Mode, ensuring the vpc-cni addon is de-selected on the add-ons page.
    • Using the Terraform eks resource without adding any cluster_addon items.
  2. Create the vpc-cni addon. As above, this can be done in a number of ways with the same result. I will only demonstrate the aws cli method that follows the official docs here:
    • Create the vpc-cni addon.
      aws eks create-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.20.3-eksbuild.1
    • Either observe the addon status in the AWS Console or check using the aws cli...
      aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni
    • Observe that the addon has the CREATE_FAILED status with the following:
    {
        "addon": {
            "addonName": "vpc-cni",
            "clusterName": "my-cluster",
            "status": "CREATE_FAILED",
            "addonVersion": "v1.20.4-eksbuild.1",
            "health": {
                "issues": [
                    {
                        "code": "ConfigurationConflict",
                        "message": "Conflicts found when trying to apply. Will not continue due to resolve conflicts mode. Conflicts:\nCustomResourceDefinition.apiextensions.k8s.io policyendpoints.networking.k8s.aws - .spec.versions"
                    }
                ]
            },
            ...
        }
    }
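
    For scripting, the failure details can be pulled straight out of describe-addon using the aws cli's JMESPath --query option. This is a sketch assuming the same placeholder cluster name as above:

    ```shell
    # Pull just the health issue code and message from the failed addon.
    # "my-cluster" is the placeholder cluster name from the steps above.
    aws eks describe-addon \
      --cluster-name my-cluster \
      --addon-name vpc-cni \
      --query 'addon.health.issues[0].[code,message]' \
      --output text
    ```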
    

Affected Versions

  • In my existing workflow I was using the terraform method mentioned above.
  • I had deployed some time on November 12th/13th without issue (i.e. create the cluster without the addon, then create the addon separately).
  • Attempting the same deployment on November 14th failed as described above.
  • I discovered that a similar issue (PolicyEndpoint CRD deletion on addon upgrade can lead to indefinite network interruption in rare circumstances #199) was raised on Nov 13th; I wonder whether the two are related.
  • I see that the domainName field was added to the CRD in the as-yet-unreleased release-v1.1 branch early in the morning on November 14th 2025 ( 7a70271 )

Workaround

It is easy to work around this issue by setting the conflict resolution method to Override, but this is not an intuitive step when adding a NEW addon, and it is not mentioned in the official docs. It also skirts the core problem: the mismatched PolicyEndpoint CRDs in use.
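
For reference, this is what the workaround looks like with the aws cli; OVERWRITE is the flag value corresponding to the console's Override option, and the cluster name and addon version are the placeholders from the reproduction steps:

```shell
# Create the addon with conflict resolution set to overwrite, so the
# addon's copy of the PolicyEndpoint CRD replaces the in-cluster one.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.20.3-eksbuild.1 \
  --resolve-conflicts OVERWRITE
```

If the earlier attempt left the addon in CREATE_FAILED, it may need to be deleted (aws eks delete-addon) before retrying.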

Suspected Cause of Problem

Upon further investigation, it appears that a newly created EKS cluster ships the version of the CRD that is included in the upcoming 1.1 release of the network policy controller. This new version of the CRD adds a domainName field inside the egress and ingress rule schemas, but the API version remains v1alpha1.
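
Assuming kubectl access to an affected cluster, this can be verified: both the old and new schemas report v1alpha1, but only the new one contains domainName.

```shell
# The served API version is v1alpha1 regardless of which schema is installed.
kubectl get crd policyendpoints.networking.k8s.aws \
  -o jsonpath='{.spec.versions[*].name}'

# Any matches here indicate the unreleased v1.1 schema is installed.
kubectl get crd policyendpoints.networking.k8s.aws -o yaml \
  | grep -n 'domainName'
```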

I don't know the mechanisms used by the AWS control plane when creating an EKS cluster, but based on the evidence it seems likely that it uses the Helm chart in this repository to deploy the policy controller.

If so, it is worth noting that the v1.1 Helm chart release installs the v1.1 CRD but the existing v1.0.x of the CRD watcher component. As a result, if you manually remove the PolicyEndpoint CRD, when it is reconciled it is replaced by the v1.0.x CRD and NOT the v1.1 CRD that is installed with a new EKS cluster.

It seems to me that changes to a CRD's schema should be accompanied by a bump in the schema version to avoid such conflicts. I understand that this is an alpha schema version, meaning breaking changes may occur at any time. That would be fine were this potentially breaking schema not shipped to all production EKS clusters, which may rely on stable schemas.

Summary

  • Release 1.1 of the network policy controller should not go out until the CRD created at deploy time matches the CRD created by the watcher.
  • EKS clusters should not be created using this unreleased version of the network policy controller.
  • Production EKS clusters should not be created with core components that rely on CRDs with version v1alpha1 as this can lead to unpredictable behaviour as we see here.
    • If they do, then there needs to be guarantees that the v1alpha1 CRD never changes.
    • Structural changes to any CRD should result in a new version (e.g. v1alpha2)
