Description
Hi @M4t7e!
Hope everything is going well!
Today I encountered a potentially dangerous plan proposal when attempting to remove a node group that was positioned earlier in the worker_nodepools array (i.e., not the last one).
After investigating the module, I discovered that subnet IP ranges are allocated directly from the nodepool array indices. When the array shrinks because a nodegroup is removed from the middle, the module proposes shifting the remaining nodepools' ranges and node IPs to "fill the gaps."
Problematic code reference:
https://github.com/hcloud-k8s/terraform-hcloud-kubernetes/blob/main/network.tf#L103
There may be additional references throughout the codebase that exhibit similar behavior.
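To make the failure mode concrete, here is a minimal Python sketch of index-based allocation. It is not the module's actual code: the 10.10.64.0/22 base and the two control-plane /25s skipped at the front are assumptions inferred from the plan output below.

```python
import ipaddress

# Hypothetical mirror of index-based allocation: the i-th worker
# nodepool gets the i-th free /25. Base network and the two reserved
# control-plane slots are illustrative assumptions.
WORKER_SLOTS = [str(s) for s in
                ipaddress.ip_network("10.10.64.0/22").subnets(new_prefix=25)][2:]

def worker_subnets_by_index(pool_names):
    # Allocation keyed purely by position in the list.
    return {name: WORKER_SLOTS[i] for i, name in enumerate(pool_names)}

before = worker_subnets_by_index(
    ["worker-platform-egress-nbg1", "worker-platform-egress-fsn1", "worker-platform-fsn1"])
after = worker_subnets_by_index(
    ["worker-platform-egress-nbg1", "worker-platform-fsn1"])

print(before["worker-platform-fsn1"])  # 10.10.66.0/25
# Removing the middle pool shifts the survivor into the freed range:
print(after["worker-platform-fsn1"])   # 10.10.65.128/25
```

Because the surviving pool's subnet is derived only from its position, deleting an earlier element silently reassigns it the removed pool's range, which is exactly what the plan below proposes.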
This creates a critical issue: the plan tries to delete healthy nodegroup subnets and recreate the remaining nodes inside the removed nodegroup's subnet.
The potential chaos this could create is significant.
Example scenario:
worker_nodepools = [
  {
    name        = "worker-platform-egress-nbg1"
    type        = "cax21"
    location    = "nbg1"
    count       = 1
    labels      = {}
    annotations = {}
    taints      = []
  },
  # {
  #   name        = "worker-platform-egress-fsn1"
  #   type        = "cax21"
  #   location    = "fsn1"
  #   count       = 1
  #   labels      = {}
  #   annotations = {}
  #   taints      = []
  # },
  {
    name        = "worker-platform-fsn1"
    type        = "cax21"
    location    = "fsn1"
    count       = 1
    labels      = {}
    annotations = {}
    taints      = []
  },
]

TF Plan:
14:44:22.028 STDOUT tofu: # module.kubernetes.hcloud_network_subnet.worker["worker-platform-egress-fsn1"] will be destroyed
14:44:22.028 STDOUT tofu: # (because key ["worker-platform-egress-fsn1"] is not in for_each map)
14:44:22.028 STDOUT tofu: - resource "hcloud_network_subnet" "worker" {
14:44:22.028 STDOUT tofu: - gateway = "10.10.0.1" -> null
14:44:22.028 STDOUT tofu: - id = "REDACTED-10.10.65.128/25" -> null
14:44:22.028 STDOUT tofu: - ip_range = "10.10.65.128/25" -> null
14:44:22.028 STDOUT tofu: - network_id = REDACTED -> null
14:44:22.028 STDOUT tofu: - network_zone = "eu-central" -> null
14:44:22.028 STDOUT tofu: - type = "cloud" -> null
14:44:22.028 STDOUT tofu: }
14:44:22.028 STDOUT tofu: # module.kubernetes.hcloud_network_subnet.worker["worker-platform-fsn1"] must be replaced
14:44:22.028 STDOUT tofu: -/+ resource "hcloud_network_subnet" "worker" {
14:44:22.028 STDOUT tofu: ~ gateway = "10.10.0.1" -> (known after apply)
14:44:22.028 STDOUT tofu: ~ id = "REDACTED-10.10.66.0/25" -> (known after apply)
14:44:22.028 STDOUT tofu: ~ ip_range = "10.10.66.0/25" -> "10.10.65.128/25" # forces replacement
14:44:22.028 STDOUT tofu: # (3 unchanged attributes hidden)
14:44:22.029 STDOUT tofu: }
14:44:22.029 STDOUT tofu: # module.kubernetes.hcloud_placement_group.worker["eb-hcloud-ops-worker-platform-egress-fsn1-pg-1"] will be destroyed
14:44:22.029 STDOUT tofu: # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-pg-1"] is not in for_each map)
14:44:22.029 STDOUT tofu: - resource "hcloud_placement_group" "worker" {
14:44:22.029 STDOUT tofu: - id = "1100212" -> null
14:44:22.029 STDOUT tofu: - labels = {
14:44:22.029 STDOUT tofu: - "cluster" = "REDACTED"
14:44:22.029 STDOUT tofu: - "nodepool" = "worker-platform-egress-fsn1"
14:44:22.029 STDOUT tofu: - "role" = "worker"
14:44:22.029 STDOUT tofu: } -> null
14:44:22.029 STDOUT tofu: - name = "REDACTED-worker-platform-egress-fsn1-pg-1" -> null
14:44:22.029 STDOUT tofu: - servers = [
14:44:22.029 STDOUT tofu: - 106345027,
14:44:22.029 STDOUT tofu: ] -> null
14:44:22.029 STDOUT tofu: - type = "spread" -> null
14:44:22.029 STDOUT tofu: }
14:44:22.029 STDOUT tofu: # module.kubernetes.hcloud_server.worker["eb-hcloud-ops-worker-platform-egress-fsn1-1"] will be destroyed
14:44:22.029 STDOUT tofu: # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-1"] is not in for_each map)
14:44:22.029 STDOUT tofu: - resource "hcloud_server" "worker" {
14:44:22.029 STDOUT tofu: - allow_deprecated_images = false -> null
14:44:22.029 STDOUT tofu: - backups = false -> null
14:44:22.029 STDOUT tofu: - datacenter = "fsn1-dc14" -> null
14:44:22.029 STDOUT tofu: - delete_protection = false -> null
14:44:22.029 STDOUT tofu: - firewall_ids = [
14:44:22.029 STDOUT tofu: - 2299076,
14:44:22.029 STDOUT tofu: ] -> null
14:44:22.029 STDOUT tofu: - id = "REDACTED" -> null
14:44:22.029 STDOUT tofu: - ignore_remote_firewall_ids = false -> null
14:44:22.029 STDOUT tofu: - image = "REDACTED" -> null
14:44:22.029 STDOUT tofu: - ipv6_network = "<nil>" -> null
14:44:22.029 STDOUT tofu: - keep_disk = false -> null
14:44:22.029 STDOUT tofu: - labels = {
14:44:22.029 STDOUT tofu: - "cluster" = "REDACTED"
14:44:22.029 STDOUT tofu: - "nodepool" = "worker-platform-egress-fsn1"
14:44:22.029 STDOUT tofu: - "role" = "worker"
14:44:22.029 STDOUT tofu: } -> null
14:44:22.029 STDOUT tofu: - location = "fsn1" -> null
14:44:22.029 STDOUT tofu: - name = "REDACTED-worker-platform-egress-fsn1-1" -> null
14:44:22.029 STDOUT tofu: - placement_group_id = 1100212 -> null
14:44:22.029 STDOUT tofu: - primary_disk_size = 80 -> null
14:44:22.029 STDOUT tofu: - rebuild_protection = false -> null
14:44:22.029 STDOUT tofu: - server_type = "cax21" -> null
14:44:22.029 STDOUT tofu: - shutdown_before_deletion = true -> null
14:44:22.029 STDOUT tofu: - ssh_keys = [
14:44:22.029 STDOUT tofu: - "100479455",
14:44:22.029 STDOUT tofu: ] -> null
14:44:22.029 STDOUT tofu: - status = "running" -> null
14:44:22.029 STDOUT tofu: - network {
14:44:22.029 STDOUT tofu: - alias_ips = [] -> null
14:44:22.029 STDOUT tofu: - ip = "10.10.65.129" -> null
14:44:22.029 STDOUT tofu: - mac_address = "86:00:00:ac:6d:bd" -> null
14:44:22.029 STDOUT tofu: - network_id = 11263756 -> null
14:44:22.029 STDOUT tofu: }
14:44:22.029 STDOUT tofu: - public_net {
14:44:22.030 STDOUT tofu: - ipv4 = 0 -> null
14:44:22.030 STDOUT tofu: - ipv4_enabled = false -> null
14:44:22.030 STDOUT tofu: - ipv6 = 0 -> null
14:44:22.030 STDOUT tofu: - ipv6_enabled = false -> null
14:44:22.030 STDOUT tofu: }
14:44:22.030 STDOUT tofu: }
14:44:22.030 STDOUT tofu: # module.kubernetes.hcloud_server.worker["eb-hcloud-ops-worker-platform-fsn1-1"] will be updated in-place
14:44:22.030 STDOUT tofu: ~ resource "hcloud_server" "worker" {
14:44:22.030 STDOUT tofu: id = "REDACTED"
14:44:22.030 STDOUT tofu: name = "REDACTED"
14:44:22.030 STDOUT tofu: # (18 unchanged attributes hidden)
14:44:22.030 STDOUT tofu: - network {
14:44:22.030 STDOUT tofu: - alias_ips = [] -> null
14:44:22.030 STDOUT tofu: - ip = "10.10.66.1" -> null
14:44:22.030 STDOUT tofu: - mac_address = "86:00:00:a9:87:80" -> null
14:44:22.030 STDOUT tofu: - network_id = 11263756 -> null
14:44:22.030 STDOUT tofu: }
14:44:22.030 STDOUT tofu: + network {
14:44:22.030 STDOUT tofu: + alias_ips = []
14:44:22.030 STDOUT tofu: + ip = "10.10.65.129"
14:44:22.030 STDOUT tofu: + mac_address = (known after apply)
14:44:22.030 STDOUT tofu: + network_id = REDACTED
14:44:22.030 STDOUT tofu: }
14:44:22.030 STDOUT tofu: # (1 unchanged block hidden)
14:44:22.030 STDOUT tofu: }
14:44:22.030 STDOUT tofu: # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-egress-fsn1-1"] will be destroyed
14:44:22.030 STDOUT tofu: # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-1"] is not in for_each map)
14:44:22.030 STDOUT tofu: - resource "talos_machine_configuration_apply" "worker" {
14:44:22.030 STDOUT tofu: - apply_mode = "auto" -> null
14:44:22.030 STDOUT tofu: - client_configuration = {
14:44:22.030 STDOUT tofu: - ca_certificate = "REDACTED" -> null
14:44:22.030 STDOUT tofu: - client_certificate = "REDACTED" -> null
14:44:22.030 STDOUT tofu: - client_key = (sensitive value) -> null
14:44:22.030 STDOUT tofu: } -> null
14:44:22.030 STDOUT tofu: - endpoint = "10.10.65.129" -> null
14:44:22.030 STDOUT tofu: - id = "machine_configuration_apply" -> null
14:44:22.030 STDOUT tofu: - machine_configuration = (sensitive value) -> null
14:44:22.030 STDOUT tofu: - machine_configuration_input = (sensitive value) -> null
14:44:22.030 STDOUT tofu: - node = "10.10.65.129" -> null
14:44:22.030 STDOUT tofu: - on_destroy = {
14:44:22.030 STDOUT tofu: - graceful = true -> null
14:44:22.030 STDOUT tofu: - reboot = false -> null
14:44:22.030 STDOUT tofu: - reset = true -> null
14:44:22.030 STDOUT tofu: } -> null
14:44:22.030 STDOUT tofu: }
14:44:22.031 STDOUT tofu: # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-egress-nbg1-1"] will be updated in-place
14:44:22.031 STDOUT tofu: ~ resource "talos_machine_configuration_apply" "worker" {
14:44:22.031 STDOUT tofu: id = "machine_configuration_apply"
14:44:22.031 STDOUT tofu: ~ machine_configuration = (sensitive value)
14:44:22.031 STDOUT tofu: ~ machine_configuration_input = (sensitive value)
14:44:22.031 STDOUT tofu: # (5 unchanged attributes hidden)
14:44:22.031 STDOUT tofu: }
14:44:22.031 STDOUT tofu: # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-fsn1-1"] will be updated in-place
14:44:22.031 STDOUT tofu: ~ resource "talos_machine_configuration_apply" "worker" {
14:44:22.031 STDOUT tofu: ~ endpoint = "10.10.66.1" -> "10.10.65.129"
14:44:22.031 STDOUT tofu: id = "machine_configuration_apply"
14:44:22.031 STDOUT tofu: ~ machine_configuration = (sensitive value)
14:44:22.031 STDOUT tofu: ~ machine_configuration_input = (sensitive value)
14:44:22.031 STDOUT tofu: ~ node = "10.10.66.1" -> "10.10.65.129"
14:44:22.031 STDOUT tofu: # (3 unchanged attributes hidden)
14:44:22.031 STDOUT tofu: }
14:44:22.031 STDOUT tofu: # module.kubernetes.terraform_data.talos_health_data will be updated in-place
14:44:22.031 STDOUT tofu: ~ resource "terraform_data" "talos_health_data" {
14:44:22.031 STDOUT tofu: id = "REDACTED"
14:44:22.031 STDOUT tofu: ~ input = {
14:44:22.031 STDOUT tofu: ~ worker_nodes = [
14:44:22.031 STDOUT tofu: - "10.10.65.129",
14:44:22.031 STDOUT tofu: "10.10.65.1",
14:44:22.031 STDOUT tofu: - "10.10.66.1",
14:44:22.031 STDOUT tofu: + "10.10.65.129",
14:44:22.031 STDOUT tofu: ]
14:44:22.031 STDOUT tofu: # (4 unchanged attributes hidden)
14:44:22.031 STDOUT tofu: }
14:44:22.031 STDOUT tofu: ~ output = {
14:44:22.031 STDOUT tofu: - control_plane_nodes = [
14:44:22.031 STDOUT tofu: - "10.10.64.11",
14:44:22.031 STDOUT tofu: - "10.10.64.21",
14:44:22.031 STDOUT tofu: - "10.10.64.1",
14:44:22.031 STDOUT tofu: ]
14:44:22.031 STDOUT tofu: - current_ip = []
14:44:22.031 STDOUT tofu: - endpoints = [
14:44:22.031 STDOUT tofu: - "10.10.64.11",
14:44:22.031 STDOUT tofu: - "10.10.64.21",
14:44:22.031 STDOUT tofu: - "10.10.64.1",
14:44:22.031 STDOUT tofu: ]
14:44:22.031 STDOUT tofu: - kube_api_url = "https://REDACTED:6443"
14:44:22.031 STDOUT tofu: - worker_nodes = [
14:44:22.031 STDOUT tofu: - "10.10.65.129",
14:44:22.031 STDOUT tofu: - "10.10.65.1",
14:44:22.031 STDOUT tofu: - "10.10.66.1",
14:44:22.031 STDOUT tofu: ]
14:44:22.031 STDOUT tofu: } -> (known after apply)
14:44:22.031 STDOUT tofu: }
14:44:22.031 STDOUT tofu: Plan: 1 to add, 4 to change, 5 to destroy.

- It will remove the worker-platform-egress-fsn1 nodegroup, which is expected.
- As the indices shift, it wants to assign the deleted nodepool's subnet range (10.10.65.128/25) to the healthy worker-platform-fsn1 nodepool (changing it from 10.10.66.0/25).
- This forces subnet replacement, node IP changes (10.10.66.1 → 10.10.65.129), and Talos machine config updates.
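One possible direction, sketched in Python rather than HCL (hedged: the function name `assign_stable` and the slot table are hypothetical, and a real Terraform implementation would need the previous assignments persisted, e.g. as a name-keyed map instead of list indices): keep any range a pool already owns and hand new pools only the lowest free slot, so removing a pool never shifts the survivors.

```python
import ipaddress

# Same illustrative slot table as the index-based sketch (assumed base
# network and reserved control-plane offset).
WORKER_SLOTS = [str(s) for s in
                ipaddress.ip_network("10.10.64.0/22").subnets(new_prefix=25)][2:]

def assign_stable(pool_names, previous):
    """Keep any subnet a pool already owns; give new pools the lowest
    unused slot. Removing a pool frees its slot without shifting others."""
    assignments = {n: previous[n] for n in pool_names if n in previous}
    used = set(assignments.values())
    free = (s for s in WORKER_SLOTS if s not in used)
    for n in pool_names:
        if n not in assignments:
            assignments[n] = next(free)
    return assignments

prev = {"worker-platform-egress-nbg1": "10.10.65.0/25",
        "worker-platform-egress-fsn1": "10.10.65.128/25",
        "worker-platform-fsn1":        "10.10.66.0/25"}

# Dropping the middle pool leaves the survivor's range untouched:
after = assign_stable(["worker-platform-egress-nbg1", "worker-platform-fsn1"], prev)
print(after["worker-platform-fsn1"])  # 10.10.66.0/25
```

A newly added pool would then reuse the freed 10.10.65.128/25 slot without disturbing existing subnets.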
@M4t7e what are your thoughts on this?
Which approach do you think would work best while maintaining backward compatibility?
Regards!