Skip to content

Conversation

vincentmrg
Copy link

@vincentmrg vincentmrg commented Sep 25, 2025

Description

Fix issue described here: #3525

Motivation and Context

Addons with before_compute = true should be applied before any compute resources.

The modules module.eks_managed_node_group and module.self_managed_node_group should include a depends_on on the aws_eks_addon.before_compute resources.

Breaking Changes

No breaking change

How Has This Been Tested?

  • I have updated at least one of the examples/* to demonstrate and validate my change(s)
  • I have tested and validated these changes using one or more of the provided examples/* projects
  • I have executed pre-commit run -a on my pull request

@vincentmrg vincentmrg changed the title fix: ensure addons with before_compute are created before compute resources fix: Ensure addons with before_compute are created before compute resources Sep 25, 2025
@bryantbiggs
Copy link
Member

This is incorrect and will cause disruptions for users

@vincentmrg
Copy link
Author

I tested it myself without finding any issues.
Perhaps I missed something. Is it possible to elaborate a little on how this proposal is incorrect and could cause disruptions for users?
The issue that led to this PR remains relevant; it would be nice to be able to move forward on the issue to help continue improving this module.

@bryantbiggs
Copy link
Member

I tested it myself without finding any issues.

here you said its difficult to reproduce, so how do you know this resolves anything?

Is it possible to elaborate a little on how this proposal is incorrect and could cause disruptions for users?

See hashicorp/terraform#30340

The issue that led to this PR remains relevant; it would be nice to be able to move forward on the issue to help continue improving this module.

Not saying the module is not without imperfections; there are scenarios where the Terraform language simply does not support the desired outcome. However, a module that is seeing over a million downloads a week I would expect to see more users reporting issues if there is truly an issue. I don't see that, I just see this one edge case that is not very well defined and eagerly jumping to a code solution that has known defects. and that is ultimately why the PR is not valid - it would cause more harm than good

@vincentmrg
Copy link
Author

I'm fully aware that this module is widely used and that caution is obviously required.
I opened this PR to illustrate a possible solution to the issue #3525, which contains more details about the problem encountered.

The problem occurs when nodegroups are deployed before the vpc-cni cluster_addon.
When this happens, nodes from nodegroups never go into Ready, remaining in this state indefinitely:

 # CNI should be installed before worker nodes to prevent the following error on nodes:
KubeletNotReady              container runtime network not ready: NetworkReady=false reason: NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

#3525 (comment) you said its difficult to reproduce, so how do you know this resolves anything?

This ensures that addons deployed with before_compute = true are deployed before any compute resource.

Below is a graph of the module before this commit

graph_before

And here the same graph after the commit

graph_after

We can clearly see from the graphs that aws_eks_addon.before_compute resources are not guaranteed to be deployed before compute resources (module.self_managed_node_group and module.eks_managed_node_group, where my proposal allows it.

@bryantbiggs
Copy link
Member

This ensures that addons deployed with before_compute = true are deployed before any compute resource.

Yes, but it also guarantees that any changes to addons cause *ALL compute modules to be re-computed and cause disruptions. This is why we have taken more of a "best effort" route that is not disruptive

terraform-aws-eks/main.tf

Lines 799 to 803 in de2aa10

depends_on = [
module.fargate_profile,
module.eks_managed_node_group,
module.self_managed_node_group,
]

If you have compute already deployed, it will not work - its not designed for this scenario. If you are creating a new cluster from scratch, it should work 99/100 times

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants