
Dangerous behavior when removing a worker nodegroup. #163


Description

@mlinares1998

Hi! @M4t7e πŸ‘‹
Hope everything is going well! πŸ˜„
Today I encountered a potentially dangerous plan when attempting to remove one of my node groups that sits earlier in the worker_nodepools array (i.e. not the last one).

After investigating the module, I discovered that subnet IP ranges are allocated directly from the nodepool's position in the array. When the array shrinks because a nodegroup is removed from the middle, the module proposes shifting the remaining nodepools' ranges and node IPs down to "fill the gap."

Problematic code reference:
https://github.com/hcloud-k8s/terraform-hcloud-kubernetes/blob/main/network.tf#L103
There may be additional references throughout the codebase that exhibit similar behavior.
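
To illustrate the failure mode, here is a minimal sketch of the pattern (not the module's actual code: the 10.10.0.0/16 base network and the netnum offset of 130 are inferred from the ranges in the plan below, and the resource/local names are illustrative). The for_each keys are the stable pool names, but each ip_range is derived from the pool's position in the list:

# Minimal sketch of index-coupled subnet allocation (hypothetical, inferred from the plan)
locals {
  network_ipv4_cidr = "10.10.0.0/16"
  # Key each pool by name, but remember its current position in the list
  worker_pool_index = { for i, pool in var.worker_nodepools : pool.name => i }
}

resource "hcloud_network_subnet" "worker" {
  for_each = local.worker_pool_index

  network_id   = hcloud_network.this.id # illustrative reference
  type         = "cloud"
  network_zone = "eu-central"
  # The map key is stable, but the range moves whenever an earlier list entry
  # is removed: indices 0..2 yield 10.10.65.0/25, 10.10.65.128/25, 10.10.66.0/25.
  ip_range = cidrsubnet(local.network_ipv4_cidr, 9, 130 + each.value)
}

Removing the middle pool re-keys nothing, but it renumbers every pool after it, which is exactly the -/+ replacement the plan shows.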

This creates a critical issue: the plan tries to destroy the subnets of healthy nodegroups and recreate their nodes inside the removed nodegroup's former IP range.
The potential chaos this could create is significant 😅

Example scenario:

worker_nodepools = [
    {
      name     = "worker-platform-egress-nbg1"
      type     = "cax21"
      location = "nbg1"
      count    = 1
      labels = {}
      annotations = {}
      taints      = []
    },
    # {
    #   name     = "worker-platform-egress-fsn1"
    #   type     = "cax21"
    #   location = "fsn1"
    #   count    = 1
    #   labels = {}
    #   annotations = {}
    #   taints      = []
    # },
    {
      name     = "worker-platform-fsn1"
      type     = "cax21"
      location = "fsn1"
      count    = 1
      labels = {}
      annotations = {}
      taints      = []
    },
  ]
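
For concreteness, this is how the index arithmetic shifts under the sketch above (the 130 offset is my assumption; the cidrsubnet results themselves are exact):

# tofu console
> cidrsubnet("10.10.0.0/16", 9, 130 + 2)
"10.10.66.0/25"    # worker-platform-fsn1 while it sits at index 2
> cidrsubnet("10.10.0.0/16", 9, 130 + 1)
"10.10.65.128/25"  # the same pool after the middle entry is removed (now index 1)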

TF Plan:

14:44:22.028 STDOUT tofu:   # module.kubernetes.hcloud_network_subnet.worker["worker-platform-egress-fsn1"] will be destroyed
14:44:22.028 STDOUT tofu:   # (because key ["worker-platform-egress-fsn1"] is not in for_each map)
14:44:22.028 STDOUT tofu:   - resource "hcloud_network_subnet" "worker" {
14:44:22.028 STDOUT tofu:       - gateway      = "10.10.0.1" -> null
14:44:22.028 STDOUT tofu:       - id           = "REDACTED-10.10.65.128/25" -> null
14:44:22.028 STDOUT tofu:       - ip_range     = "10.10.65.128/25" -> null
14:44:22.028 STDOUT tofu:       - network_id   = REDACTED -> null
14:44:22.028 STDOUT tofu:       - network_zone = "eu-central" -> null
14:44:22.028 STDOUT tofu:       - type         = "cloud" -> null
14:44:22.028 STDOUT tofu:     }
14:44:22.028 STDOUT tofu:   # module.kubernetes.hcloud_network_subnet.worker["worker-platform-fsn1"] must be replaced
14:44:22.028 STDOUT tofu: -/+ resource "hcloud_network_subnet" "worker" {
14:44:22.028 STDOUT tofu:       ~ gateway      = "10.10.0.1" -> (known after apply)
14:44:22.028 STDOUT tofu:       ~ id           = "REDACTED-10.10.66.0/25" -> (known after apply)
14:44:22.028 STDOUT tofu:       ~ ip_range     = "10.10.66.0/25" -> "10.10.65.128/25" # forces replacement
14:44:22.028 STDOUT tofu:         # (3 unchanged attributes hidden)
14:44:22.029 STDOUT tofu:     }
14:44:22.029 STDOUT tofu:   # module.kubernetes.hcloud_placement_group.worker["eb-hcloud-ops-worker-platform-egress-fsn1-pg-1"] will be destroyed
14:44:22.029 STDOUT tofu:   # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-pg-1"] is not in for_each map)
14:44:22.029 STDOUT tofu:   - resource "hcloud_placement_group" "worker" {
14:44:22.029 STDOUT tofu:       - id      = "1100212" -> null
14:44:22.029 STDOUT tofu:       - labels  = {
14:44:22.029 STDOUT tofu:           - "cluster"  = "REDACTED"
14:44:22.029 STDOUT tofu:           - "nodepool" = "worker-platform-egress-fsn1"
14:44:22.029 STDOUT tofu:           - "role"     = "worker"
14:44:22.029 STDOUT tofu:         } -> null
14:44:22.029 STDOUT tofu:       - name    = "REDACTED-worker-platform-egress-fsn1-pg-1" -> null
14:44:22.029 STDOUT tofu:       - servers = [
14:44:22.029 STDOUT tofu:           - 106345027,
14:44:22.029 STDOUT tofu:         ] -> null
14:44:22.029 STDOUT tofu:       - type    = "spread" -> null
14:44:22.029 STDOUT tofu:     }
14:44:22.029 STDOUT tofu:   # module.kubernetes.hcloud_server.worker["eb-hcloud-ops-worker-platform-egress-fsn1-1"] will be destroyed
14:44:22.029 STDOUT tofu:   # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-1"] is not in for_each map)
14:44:22.029 STDOUT tofu:   - resource "hcloud_server" "worker" {
14:44:22.029 STDOUT tofu:       - allow_deprecated_images    = false -> null
14:44:22.029 STDOUT tofu:       - backups                    = false -> null
14:44:22.029 STDOUT tofu:       - datacenter                 = "fsn1-dc14" -> null
14:44:22.029 STDOUT tofu:       - delete_protection          = false -> null
14:44:22.029 STDOUT tofu:       - firewall_ids               = [
14:44:22.029 STDOUT tofu:           - 2299076,
14:44:22.029 STDOUT tofu:         ] -> null
14:44:22.029 STDOUT tofu:       - id                         = "REDACTED" -> null
14:44:22.029 STDOUT tofu:       - ignore_remote_firewall_ids = false -> null
14:44:22.029 STDOUT tofu:       - image                      = "REDACTED" -> null
14:44:22.029 STDOUT tofu:       - ipv6_network               = "<nil>" -> null
14:44:22.029 STDOUT tofu:       - keep_disk                  = false -> null
14:44:22.029 STDOUT tofu:       - labels                     = {
14:44:22.029 STDOUT tofu:           - "cluster"                 = "REDACTED"
14:44:22.029 STDOUT tofu:           - "nodepool"                = "worker-platform-egress-fsn1"
14:44:22.029 STDOUT tofu:           - "role"                    = "worker"
14:44:22.029 STDOUT tofu:         } -> null
14:44:22.029 STDOUT tofu:       - location                   = "fsn1" -> null
14:44:22.029 STDOUT tofu:       - name                       = "REDACTED-worker-platform-egress-fsn1-1" -> null
14:44:22.029 STDOUT tofu:       - placement_group_id         = 1100212 -> null
14:44:22.029 STDOUT tofu:       - primary_disk_size          = 80 -> null
14:44:22.029 STDOUT tofu:       - rebuild_protection         = false -> null
14:44:22.029 STDOUT tofu:       - server_type                = "cax21" -> null
14:44:22.029 STDOUT tofu:       - shutdown_before_deletion   = true -> null
14:44:22.029 STDOUT tofu:       - ssh_keys                   = [
14:44:22.029 STDOUT tofu:           - "100479455",
14:44:22.029 STDOUT tofu:         ] -> null
14:44:22.029 STDOUT tofu:       - status                     = "running" -> null
14:44:22.029 STDOUT tofu:       - network {
14:44:22.029 STDOUT tofu:           - alias_ips   = [] -> null
14:44:22.029 STDOUT tofu:           - ip          = "10.10.65.129" -> null
14:44:22.029 STDOUT tofu:           - mac_address = "86:00:00:ac:6d:bd" -> null
14:44:22.029 STDOUT tofu:           - network_id  = 11263756 -> null
14:44:22.029 STDOUT tofu:         }
14:44:22.029 STDOUT tofu:       - public_net {
14:44:22.030 STDOUT tofu:           - ipv4         = 0 -> null
14:44:22.030 STDOUT tofu:           - ipv4_enabled = false -> null
14:44:22.030 STDOUT tofu:           - ipv6         = 0 -> null
14:44:22.030 STDOUT tofu:           - ipv6_enabled = false -> null
14:44:22.030 STDOUT tofu:         }
14:44:22.030 STDOUT tofu:     }
14:44:22.030 STDOUT tofu:   # module.kubernetes.hcloud_server.worker["eb-hcloud-ops-worker-platform-fsn1-1"] will be updated in-place
14:44:22.030 STDOUT tofu:   ~ resource "hcloud_server" "worker" {
14:44:22.030 STDOUT tofu:         id                         = "REDACTED"
14:44:22.030 STDOUT tofu:         name                       = "REDACTED"
14:44:22.030 STDOUT tofu:         # (18 unchanged attributes hidden)
14:44:22.030 STDOUT tofu:       - network {
14:44:22.030 STDOUT tofu:           - alias_ips   = [] -> null
14:44:22.030 STDOUT tofu:           - ip          = "10.10.66.1" -> null
14:44:22.030 STDOUT tofu:           - mac_address = "86:00:00:a9:87:80" -> null
14:44:22.030 STDOUT tofu:           - network_id  = 11263756 -> null
14:44:22.030 STDOUT tofu:         }
14:44:22.030 STDOUT tofu:       + network {
14:44:22.030 STDOUT tofu:           + alias_ips   = []
14:44:22.030 STDOUT tofu:           + ip          = "10.10.65.129"
14:44:22.030 STDOUT tofu:           + mac_address = (known after apply)
14:44:22.030 STDOUT tofu:           + network_id  = REDACTED
14:44:22.030 STDOUT tofu:         }
14:44:22.030 STDOUT tofu:         # (1 unchanged block hidden)
14:44:22.030 STDOUT tofu:     }
14:44:22.030 STDOUT tofu:   # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-egress-fsn1-1"] will be destroyed
14:44:22.030 STDOUT tofu:   # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-1"] is not in for_each map)
14:44:22.030 STDOUT tofu:   - resource "talos_machine_configuration_apply" "worker" {
14:44:22.030 STDOUT tofu:       - apply_mode                  = "auto" -> null
14:44:22.030 STDOUT tofu:       - client_configuration        = {
14:44:22.030 STDOUT tofu:           - ca_certificate     = "REDACTED" -> null
14:44:22.030 STDOUT tofu:           - client_certificate = "REDACTED" -> null
14:44:22.030 STDOUT tofu:           - client_key         = (sensitive value) -> null
14:44:22.030 STDOUT tofu:         } -> null
14:44:22.030 STDOUT tofu:       - endpoint                    = "10.10.65.129" -> null
14:44:22.030 STDOUT tofu:       - id                          = "machine_configuration_apply" -> null
14:44:22.030 STDOUT tofu:       - machine_configuration       = (sensitive value) -> null
14:44:22.030 STDOUT tofu:       - machine_configuration_input = (sensitive value) -> null
14:44:22.030 STDOUT tofu:       - node                        = "10.10.65.129" -> null
14:44:22.030 STDOUT tofu:       - on_destroy                  = {
14:44:22.030 STDOUT tofu:           - graceful = true -> null
14:44:22.030 STDOUT tofu:           - reboot   = false -> null
14:44:22.030 STDOUT tofu:           - reset    = true -> null
14:44:22.030 STDOUT tofu:         } -> null
14:44:22.030 STDOUT tofu:     }
14:44:22.031 STDOUT tofu:   # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-egress-nbg1-1"] will be updated in-place
14:44:22.031 STDOUT tofu:   ~ resource "talos_machine_configuration_apply" "worker" {
14:44:22.031 STDOUT tofu:         id                          = "machine_configuration_apply"
14:44:22.031 STDOUT tofu:       ~ machine_configuration       = (sensitive value)
14:44:22.031 STDOUT tofu:       ~ machine_configuration_input = (sensitive value)
14:44:22.031 STDOUT tofu:         # (5 unchanged attributes hidden)
14:44:22.031 STDOUT tofu:     }
14:44:22.031 STDOUT tofu:   # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-fsn1-1"] will be updated in-place
14:44:22.031 STDOUT tofu:   ~ resource "talos_machine_configuration_apply" "worker" {
14:44:22.031 STDOUT tofu:       ~ endpoint                    = "10.10.66.1" -> "10.10.65.129"
14:44:22.031 STDOUT tofu:         id                          = "machine_configuration_apply"
14:44:22.031 STDOUT tofu:       ~ machine_configuration       = (sensitive value)
14:44:22.031 STDOUT tofu:       ~ machine_configuration_input = (sensitive value)
14:44:22.031 STDOUT tofu:       ~ node                        = "10.10.66.1" -> "10.10.65.129"
14:44:22.031 STDOUT tofu:         # (3 unchanged attributes hidden)
14:44:22.031 STDOUT tofu:     }
14:44:22.031 STDOUT tofu:   # module.kubernetes.terraform_data.talos_health_data will be updated in-place
14:44:22.031 STDOUT tofu:   ~ resource "terraform_data" "talos_health_data" {
14:44:22.031 STDOUT tofu:         id     = "REDACTED"
14:44:22.031 STDOUT tofu:       ~ input  = {
14:44:22.031 STDOUT tofu:           ~ worker_nodes        = [
14:44:22.031 STDOUT tofu:               - "10.10.65.129",
14:44:22.031 STDOUT tofu:                 "10.10.65.1",
14:44:22.031 STDOUT tofu:               - "10.10.66.1",
14:44:22.031 STDOUT tofu:               + "10.10.65.129",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:             # (4 unchanged attributes hidden)
14:44:22.031 STDOUT tofu:         }
14:44:22.031 STDOUT tofu:       ~ output = {
14:44:22.031 STDOUT tofu:           - control_plane_nodes = [
14:44:22.031 STDOUT tofu:               - "10.10.64.11",
14:44:22.031 STDOUT tofu:               - "10.10.64.21",
14:44:22.031 STDOUT tofu:               - "10.10.64.1",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:           - current_ip          = []
14:44:22.031 STDOUT tofu:           - endpoints           = [
14:44:22.031 STDOUT tofu:               - "10.10.64.11",
14:44:22.031 STDOUT tofu:               - "10.10.64.21",
14:44:22.031 STDOUT tofu:               - "10.10.64.1",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:           - kube_api_url        = "https://REDACTED:6443"
14:44:22.031 STDOUT tofu:           - worker_nodes        = [
14:44:22.031 STDOUT tofu:               - "10.10.65.129",
14:44:22.031 STDOUT tofu:               - "10.10.65.1",
14:44:22.031 STDOUT tofu:               - "10.10.66.1",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:         } -> (known after apply)
14:44:22.031 STDOUT tofu:     }
14:44:22.031 STDOUT tofu: Plan: 1 to add, 4 to change, 5 to destroy.
Breaking down the plan:
  • It removes the worker-platform-egress-fsn1 nodegroup, which is expected.
  • Because the remaining indices shift down, it wants to hand the deleted nodepool's subnet range (10.10.65.128/25) to the healthy worker-platform-fsn1 nodepool (currently 10.10.66.0/25).
  • That forces a subnet replacement, node IP changes (10.10.66.1 → 10.10.65.129), and Talos machine config updates.
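
For discussion, one possible direction (purely illustrative; a subnet_index attribute does not exist in the module today) would be an optional per-pool pin that falls back to the list position, so existing configs keep their current ranges:

# Hypothetical: let a pool pin its subnet slot explicitly
variable "worker_nodepools" {
  type = list(object({
    name         = string
    subnet_index = optional(number) # new and optional => backward compatible
    # ... remaining attributes unchanged
  }))
}

locals {
  # Fall back to the list position when no explicit index is given
  worker_pool_netnum = {
    for i, pool in var.worker_nodepools :
    pool.name => coalesce(pool.subnet_index, i)
  }
}

With something like that, worker-platform-fsn1 could set subnet_index = 2 before the removal and keep 10.10.66.0/25 untouched.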

@M4t7e what are your thoughts on this?
Which approach do you think would work best while maintaining backward compatibility?

Regards!
