
Releases: cloudposse/terraform-aws-eks-node-group

v0.15.0

04 Dec 23:24
bd9d419


Add `capacity_type` to support node group spot pricing @ChrisMcKee (#45)

what

why

  • Who doesn't like spot pricing?
  • I don't like paying full price for machines
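
As an illustration (not from the PR itself), a minimal module invocation using the new input might look like the sketch below. The source address, cluster name, and subnet IDs are placeholders, and the valid values for capacity_type follow the EKS API (ON_DEMAND or SPOT):

```hcl
module "eks_node_group" {
  source  = "cloudposse/eks-node-group/aws" # registry address shown for illustration
  version = "0.15.0"

  cluster_name   = "example-cluster"           # placeholder
  subnet_ids     = ["subnet-0123456789abcdef"] # placeholder
  instance_types = ["t3.medium"]
  # ... other required inputs (desired/min/max sizes, etc.) omitted for brevity ...

  # New in this release: run the node group on Spot capacity.
  capacity_type = "SPOT"
}
```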

references

v0.14.0

16 Nov 19:11
b411a05


add permissions_boundary for IAM @hashanmp (#43)

what

Add a permissions_boundary policy to the IAM role created for the EKS node group, to restrict the role's IAM permissions account-wide

why

The IAM role can have a permissions_boundary applied when there is a company-wide boundary policy in place
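
For illustration, a sketch of passing the new input (other required inputs omitted; the ARN is a placeholder):

```hcl
module "eks_node_group" {
  # ... other required inputs omitted for brevity ...

  # Attach the company-wide boundary to the node group worker role.
  permissions_boundary = "arn:aws:iam::111111111111:policy/org-permissions-boundary" # placeholder ARN
}
```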

references

#42

v0.13.0

14 Oct 02:22
836ca1d


🚀 Enhancements

feat: allow ebs launch template encryption @dotCipher (#36)

what

  • Surface a boolean launch_template_disk_encryption variable
  • Use launch_template_disk_encryption to set the generated launch_template.ebs.encryption flag

why

  • Allow EBS encryption
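
For illustration, enabling the new flag might look like this sketch (other inputs omitted):

```hcl
module "eks_node_group" {
  # ... other required inputs omitted for brevity ...

  # Encrypt the EBS volume configured in the generated launch template.
  launch_template_disk_encryption = true
}
```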

references

v0.12.0: Remove autoscaler permissions from worker role

18 Sep 19:23
6d012b4


Potentially breaking changes

Terraform 0.13.3 or later required

This release requires Terraform 0.13.3 or later because it is affected by bugs that are fixed in 0.13.3.

It may still be affected by hashicorp/terraform#25631, but we hope we have worked around that for now.

Securing the Cluster Autoscaler

Previously, setting enable_cluster_autoscaler = true turned on tagging sufficient for the Kubernetes Cluster Autoscaler to discover and manage the node group, and also added a policy to the node group worker role that allowed the workers to perform the autoscaling function. Since pods by default use the EC2 instance role, which in EKS node groups is the node group worker role, this allowed the Kubernetes Cluster Autoscaler to work from any node, but also allowed any rogue pod to perform autoscaling actions.

With this release, enable_cluster_autoscaler is deprecated and its functions are replaced with 2 new variables:

  • cluster_autoscaler_enabled, when true, causes this module to perform the labeling and tagging needed for the Kubernetes Cluster Autoscaler to discover and manage the node group
  • worker_role_autoscale_iam_enabled, when true, causes this module to add the IAM policy to the worker IAM role to enable the workers (and by default, any pods running on the workers) to perform autoscaling operations

Going forward, we recommend not using enable_cluster_autoscaler (it will eventually be removed) and leaving worker_role_autoscale_iam_enabled at its default value of false. If you want to use the Kubernetes Cluster Autoscaler, set cluster_autoscaler_enabled = true and use EKS IAM roles for service accounts to give the Cluster Autoscaler service account IAM permissions to perform autoscaling operations. Our Terraform module terraform-aws-eks-iam-role is available to help with this.
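
A sketch of the recommended configuration, assuming the Cluster Autoscaler gets its IAM permissions through a service account role (for example via terraform-aws-eks-iam-role) rather than through the workers:

```hcl
module "eks_node_group" {
  # ... other required inputs omitted for brevity ...

  # Label and tag the node group so the Cluster Autoscaler can discover and manage it.
  cluster_autoscaler_enabled = true

  # Leave autoscaling permissions off the worker role (this is the default);
  # grant them to the Cluster Autoscaler's service account via IRSA instead.
  worker_role_autoscale_iam_enabled = false
}
```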

Known issues

There remains a bug in amazon-vpc-cni-k8s (a.k.a. amazon-k8s-cni:v1.6.3) where, after deleting a node group, some ENIs for that node group may be left behind. Any that are left behind will prevent any security group they are attached to (such as the security group created by this module to enable remote SSH access) from being deleted, and Terraform will relay an error message like

Error deleting security group: DependencyViolation: resource sg-067899abcdef01234 has a dependent object

There is a feature request that should resolve this issue for our use case. Meanwhile, the good news is that the trigger is deleting a security group, which does not happen often, and even when the security group is deleted we have been able to reduce the chance that the problem occurs. When it does happen, there are some workarounds:

Workarounds
  1. Since this is a known problem, there are some processes at Amazon that attempt to clean up these abandoned ENIs. We have seen them disappear after 1-2 hours, after which Terraform apply will succeed in deleting the security group.
  2. You can find and delete the dangling ENIs on your own. We have observed the dangling ENIs to have AWS tags of the form Name=node.k8s.amazonaws.com/instance_id,Value=<instance-id> where <instance-id> is the EC2 instance ID of the instance the ENI is supposed to be associated with. A cleanup script could find ENIs with state = AVAILABLE that are tagged as belonging to instances that are terminated or do not exist, and delete them.
  3. You can also delete the security group through the AWS Web Console, which will guide you to the other resources that need to be deleted in order for the security group to be free to delete. The security group created by this module to enable SSH access will have a name ending with -remoteAccess so you can easily identify it. If you delete it inappropriately, Terraform will re-create it on the next plan/apply cycle, so this is a relatively safe operation.
Fortunately, this should be a rare occurrence, and we hope it will be definitively fixed in the next few months.

Reminder from 0.11.0: create_before_destroy

Starting with 0.11.0 you have the option of enabling create_before_destroy behavior for the node groups. We recommend doing so, as destroying a node group before creating its replacement can result in a significant cluster outage, but it is not without its downsides. Read the description and discussion in PR #31 for more details.
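
A sketch of opting in (note that enabling, or later disabling, this on an existing node group will itself force the node group to be replaced):

```hcl
module "eks_node_group" {
  # ... other required inputs omitted for brevity ...

  # Create the replacement node group before destroying the old one
  # so the cluster keeps capacity during the swap.
  create_before_destroy = true
}
```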

Additional Release Notes

Remove autoscaler permissions from worker role @Nuru (#34)

what

  • Disable by default the permission for workers to perform autoscaling operations
  • Work around hashicorp/terraform#25631 by not keeping a reference to the remote access security group ID in random_pet "keepers"
  • Attempt to work around failure of AWS EKS and/or AWS Terraform provider to detach instances from a security group automatically when deleting the security group by forcing the node group to be deleted before the security group. Not entirely successful (see "Known issues")

why

  • General security principle of least privilege, plus Cloud Posse convention of boolean feature flags having names ending with _enabled.
  • Without the workaround for hashicorp/terraform#25631, terraform apply would fail with an error like
Error: Provider produced inconsistent final plan

When expanding the plan for
module.region_node_group["main"].module.node_group["us-west-2b"].module.eks_node_group.random_pet.cbd[0]
to include new values learned so far during apply, provider
"registry.terraform.io/hashicorp/random" produced an invalid new value for
.keepers["source_security_group_ids"]: was cty.StringVal(""), but now
cty.StringVal("sg-0465427f44089a888").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

v0.12.0-rc1

14 Sep 18:26
cab5114


Pre-release

Warning

This release is known to have issues when adding or removing SSH access while using features requiring a Launch Template. Issues include:

  • Unable to make changes due to a detected dependency cycle. This appears to be caused by more than one bug in Terraform.
  • Unable to destroy the security group created for SSH access when it is no longer needed, because it is still in use. It appears EKS has the same problem with deleting managed node groups. This module attempted a fix for this issue, but it could not be tested because of the above-mentioned dependency cycle issue.

breaking changes

Previously, setting enable_cluster_autoscaler = true turned on tagging sufficient for the Kubernetes Cluster Autoscaler to discover and manage the node group, and also added a policy to the node group worker role that allowed the workers to perform the autoscaling function. Since pods by default use the EC2 instance role, which in EKS node groups is the node group worker role, this allowed the Kubernetes Cluster Autoscaler to work from any node, but also allowed any rogue pod to perform autoscaling actions.

With this release, enable_cluster_autoscaler is deprecated and its functions are replaced with 2 new variables:

  • cluster_autoscaler_enabled, when true, causes this module to perform the labeling and tagging needed for the Kubernetes Cluster Autoscaler to discover and manage the node group
  • worker_role_autoscale_iam_enabled, when true, causes this module to add the IAM policy to the worker IAM role to enable the workers (and by default, any pods running on the workers) to perform autoscaling operations

Going forward, we recommend not using enable_cluster_autoscaler (it will eventually be removed) and leaving worker_role_autoscale_iam_enabled at its default value of false. If you want to use the Kubernetes Cluster Autoscaler, set cluster_autoscaler_enabled = true and use EKS IAM roles for service accounts to give the Cluster Autoscaler service account IAM permissions to perform autoscaling operations. Our Terraform module terraform-aws-eks-iam-role is available to help with this.

Refactor for clarity @Nuru (#33)

what

  • Refactor, separating out launch template and IAM parts
  • Rename some things for clarity and consistency
  • Refine random_pet keepers
  • Disable by default the permission for workers to perform autoscaling operations

why

  • main.tf was too complex
  • Cloud Posse standard is for feature selection booleans to be named with _enabled at the end
  • Change in any keepers will cause the node group to be replaced
  • Workers should not be performing autoscaling operations, those should be done only by a specific service account

references

v0.11.2

12 Sep 00:23
863d4ab


Pre-release

Warning

This release is known to have issues when adding or removing SSH access while using features requiring a Launch Template. Issues include:

  • Unable to make changes due to detected dependency cycle. This appears to be a bug in Terraform.
  • Unable to destroy security group created for SSH access when it is no longer needed, because it is still in use. It appears EKS has the same problem with deleting managed node groups.

🐛 Bug Fixes

Fix for remote access and lifecycle issues @woz5999 (#30)

what

  • Create remote access security group for launch template when necessary to enable SSH access
  • Resolve some lifecycle dependency issues caused by create-before-destroy

why

  • Solves the issue where remote_access is not valid for node groups that specify launch templates. If you specify an SSH key or source security group IDs with the current state of the module, it will throw an error and prevent node group creation.
  • Fixes some situations where Terraform would fail to apply a plan due to a dependency cycle

v0.11.1

08 Sep 01:55
a93cdea


🐛 Bug Fixes

More triggers for replacement, better handling of enabled=false @Nuru (#32)

what

  • Fix edge cases where certain inputs would cause errors when the module enabled = false
  • Make the node group name dependent on additional variables so that it changes whenever the node group needs to be replaced

why

  • Module should always work (and create no resources) when var.enabled = false as long as other variables are valid
  • Terraform will fail when create_before_destroy is set if it needs to replace the node group but the name does not change

v0.11.0

07 Sep 02:16
7a1248f


🚀 Enhancements

Optional Create before destroy. Add Launch Template and related features. @Nuru (#31)

what

Implement "create before destroy" for zero downtime node group updates. This is optional and off by default, because on first use it will cause any existing node groups created with this module to be destroyed and then replaced, causing the same kind of outage this feature will prevent after it is activated.

  • Because node groups must have unique names within a cluster, creating a new node group before destroying the old one requires node groups to have random names. This is implemented by adding a 1-word random pet name to the end of the static node group name. Turning this on (or turning it off after it has been on) will cause previously created node groups to be replaced because of the change in name.

Add features previously missing here but present in terraform-aws-eks-workers, to the extent supported by AWS, such as the following (a brief sketch follows the list):

  • Set nodes to launch with Kubernetes taints
  • Specify launch template (not all features supported by AWS, see "Launch template - Prohibited" in AWS documentation)
  • Specify AMI for nodes
  • Arbitrary bootstrap.sh options
  • Arbitrary kubelet options
  • "Before join" and "after join" scripting

why

  • Many kinds of node group changes require Terraform to replace the existing node group with a new one. The default Terraform behavior is to delete the old resource before creating the new one, since many resources (such as node group) require unique names, so you cannot create the new resource while the old one exists. However, this results in the node group being completely destroyed, and therefore offline for several minutes, which is usually an unacceptable outage. Now you can avoid this by setting create_before_destroy = true.
  • Useful features were previously unavailable; this brings the module closer to feature parity with terraform-aws-eks-workers.

caveats

When using create before destroy

We cannot automatically detect when the node_group will be destroyed and generate a new name for it. Instead, we have tried to cause a new name to be generated when anything changes that would cause the node group to be destroyed. This may not be perfect. If the name changes unnecessarily, it will trigger a node group replacement, which should be tolerable. If the name fails to change when it needs to, the Terraform apply will fail with an error about the resource already existing. Please let us know what change we missed so we can update the module. Meanwhile, you can get around this by manually "tainting" the random_pet as explained below.

For a short period of time you will be running 2 node groups.

  • There may still be service outages related to pods and EBS volumes transferring from the old node group to the new one, though this should generally behave like the cluster rapidly scaling up and rapidly scaling back down. If you have issues with autoscaling, such as running single replicas with minAvailable: 25% (which rounds up to minAvailable: 1), preventing the pod from being drained from a node, you may have issues with node groups being replaced.
  • Your AWS service quotas need to be large enough to run 2 sets of node groups at the same time. If you do not have enough quota for that, launching the new node group will fail. If the new node group launch fails, you will need to manually taint the random_pet resource because while Terraform tries to replace the tainted new node group, it will try to do so with the same name (and fail) unless you also taint random_pet. Assuming you invoked the module as module "eks_node_group", you would taint random_pet with
terraform taint 'module.eks_node_group.random_pet.cbd[0]'

Using new features

Many of the new features of this module rely on new AWS features, and it is unclear to what extent they actually work.

  • It appears that it is still not possible to tag the Auto Scaling Group or Launch Template with extra tags for the Kubernetes Cluster Autoscaler.
  • It appears that it is still not possible to propagate tags to elastic GPUs or spot instance requests.
  • There may be other issues similarly beyond our control.
  • There are many new features in this module and it has not been comprehensively tested, so be cautious and test your use cases on non-critical clusters before moving this into production.

Most of the new features require this module to create a Launch Template, and of course you can now supply your own launch template (referenced by name). There is some overlap between settings that can be made directly on an EKS managed node group and some that can be made in a launch template. This results in settings being allowed in one place and not in the other: these limitations and prohibitions are detailed in the AWS documentation. This module attempts to resolve these differences in many cases, but some limitations remain:

  • Remote access using SSH is not supported when using a launch template created by this module. Correctly configuring the launch template for remote access is tricky because it interferes with automatic configuration of access by the Kubernetes control plane. We do not need it and cannot test it at this time, so we do not support it, but if you need it, you can create your own launch template that has the desired configuration and leave the ec2_ssh_key setting null.
  • If you supply the Launch Template, this module requires that the Launch Template specify the AMI Image ID to use. This requirement could be relaxed in the future if we find demand for it.
  • In general, this module assumes you are using an Amazon Linux 2 AMI, and supports selecting the AMI by Kubernetes version or AMI release version. If you are using some other AMI that does not support Amazon's bootstrap.sh, most of the new features will not work. You will need to implement them yourself on your AMI. You can provide arbitrary (Base64 encoded) User Data to your AMI via userdata_override.
  • No support for spot instances specified by launch template (EKS limitation).
  • No support for shutdown behavior or "Stop - Hibernate" behavior in launch template (EKS limitation).
  • No support for IAM instance profile or Subnets via Launch Template (EKS limitation). You can still supply subnets via subnet_ids and the module will apply them via the node group configuration.

references

Many of the new features are made possible by EKS adding support for launch templates.

v0.10.0

30 Aug 14:05
ac814c6


Pre-release
Fixing issues w/ userdata @danjbh (#28)

what

After further testing, I discovered that the default userdata is being added to the end of the custom userdata we're supplying via our launch template. This causes bootstrap.sh to be called twice and creates a condition where a node group fails to provision correctly in some circumstances. And after digging further, aws_eks_node_group appears to be doing a bit of trickery w/ the launch templates under the hood, contrary to our initial expectation that the userdata we were supplying would act as an override.

We'll need to revisit this once there is more information/documentation available on the exact behavior and whether or not it's possible to completely override userdata when using aws_eks_node_group.

Anyhow, for now I propose that we just support before_cluster_joining_userdata and omit the rest of the userdata options. This will provide us with the proper tag propagation, as well as the ability to add some custom provisioning to the node as requested by the community.
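
A sketch of the one surviving option (the shell command is a placeholder):

```hcl
module "eks_node_group" {
  # ... other required inputs omitted for brevity ...

  # Shell commands to run on the node before it joins the cluster.
  before_cluster_joining_userdata = <<-EOT
    echo "custom provisioning before the node joins the cluster" # placeholder
  EOT
}
```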

why

  • The latest version of the module may not work at all for some folks, unfortunately

UPDATE

After further research, I found the following in the introduction blog post for this feature...

Note that user data is used by EKS to insert the EKS bootstrap invocation for your managed nodes. EKS will automatically merge this in for you, unless a custom AMI is specified. In that case, you’ll need to add that in.

So hypothetically, if we supply our AMI configuration option (which presents its own challenges), we should be able to override the userdata completely and supply our own kubelet arguments directly (e.g. taints). We can discuss this approach in another issue/PR, but I say for now we proceed with this PR and get the first couple bits of functionality working reliably. We'll regroup and proceed from there.

references

https://aws.amazon.com/blogs/containers/introducing-launch-template-and-custom-ami-support-in-amazon-eks-managed-node-groups/

v0.9.0

29 Aug 03:18
842e0a6


Pre-release
Adding support for launch templates & userdata parameters @danjbh (#27)

what

  • Adding default launch template configuration
  • Adding ability to provide your own launch template by overriding the launch template id & version
  • Adding dynamic config options for user_data
  • Bumping various upstream module versions & tests
  • Keeping instance_types as a list but adding TF 0.13 variable validation

why

In previous versions of the AWS provider (2.x), you could not define your own launch template for aws_eks_node_group. Additionally, the tags specified in the aws_eks_node_group definition were not being passed down to the EC2 instances created by the ASG, which made tasks like monitoring and cost tracking difficult.

The latest versions of the AWS provider (3.x) give us the ability to specify our own launch template directly from aws_eks_node_group, which allows us to set our own options (e.g. tag_specifications, user_data, etc.).
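
As a sketch of the override described above (the input names are illustrative and the launch template resource is assumed to exist elsewhere in your configuration):

```hcl
module "eks_node_group" {
  # ... other required inputs omitted for brevity ...

  # Point the node group at an existing launch template instead of the
  # module-generated default (input names illustrative).
  launch_template_id      = aws_launch_template.custom.id
  launch_template_version = aws_launch_template.custom.latest_version
}
```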

This also should satisfy the requests in #24

references