
Dangerous behavior when removing a worker nodegroup. #163


Description

@mlinares1998

Hi! @M4t7e πŸ‘‹
Hope everything is going well! πŸ˜„
Today I encountered a potentially dangerous plan when attempting to remove one of my node groups that sits earlier in the worker_nodepools array (i.e. not the last one).

After investigating the module, I discovered that subnet IP ranges are allocated directly from the nodepool's position in the array. When the array shrinks because a nodegroup is removed from the middle, the module proposes shifting the remaining nodepools' ranges and node IPs down to "fill the gap."

Problematic code reference:
https://github.com/hcloud-k8s/terraform-hcloud-kubernetes/blob/main/network.tf#L103
There may be additional references throughout the codebase that exhibit similar behavior.
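
To illustrate the failure mode, here is a minimal sketch of the pattern (not the module's actual code: the 10.10.0.0/16 base network and the netnum offset of 130 are inferred from the ranges in the plan below, and the resource/local names are illustrative). The for_each keys are the stable pool names, but each ip_range is derived from the pool's position in the list:

# Minimal sketch of index-coupled subnet allocation (hypothetical, inferred from the plan)
locals {
  network_ipv4_cidr = "10.10.0.0/16"
  # Key each pool by name, but remember its current position in the list
  worker_pool_index = { for i, pool in var.worker_nodepools : pool.name => i }
}

resource "hcloud_network_subnet" "worker" {
  for_each = local.worker_pool_index

  network_id   = hcloud_network.this.id # illustrative reference
  type         = "cloud"
  network_zone = "eu-central"
  # The map key is stable, but the range moves whenever an earlier list entry
  # is removed: indices 0..2 yield 10.10.65.0/25, 10.10.65.128/25, 10.10.66.0/25.
  ip_range = cidrsubnet(local.network_ipv4_cidr, 9, 130 + each.value)
}

Removing the middle pool re-keys nothing, but it renumbers every pool after it, which is exactly the -/+ replacement the plan shows.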

This creates a critical issue: the plan tries to destroy the subnets of healthy nodegroups and recreate their nodes inside the removed nodegroup's former IP range.
The potential chaos this could create is significant 😅

Example scenario:

worker_nodepools = [
    {
      name     = "worker-platform-egress-nbg1"
      type     = "cax21"
      location = "nbg1"
      count    = 1
      labels = {}
      annotations = {}
      taints      = []
    },
    # {
    #   name     = "worker-platform-egress-fsn1"
    #   type     = "cax21"
    #   location = "fsn1"
    #   count    = 1
    #   labels = {}
    #   annotations = {}
    #   taints      = []
    # },
    {
      name     = "worker-platform-fsn1"
      type     = "cax21"
      location = "fsn1"
      count    = 1
      labels = {}
      annotations = {}
      taints      = []
    },
  ]
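
For concreteness, this is how the index arithmetic shifts under the sketch above (the 130 offset is my assumption; the cidrsubnet results themselves are exact):

# tofu console
> cidrsubnet("10.10.0.0/16", 9, 130 + 2)
"10.10.66.0/25"    # worker-platform-fsn1 while it sits at index 2
> cidrsubnet("10.10.0.0/16", 9, 130 + 1)
"10.10.65.128/25"  # the same pool after the middle entry is removed (now index 1)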

TF Plan:

14:44:22.028 STDOUT tofu:   # module.kubernetes.hcloud_network_subnet.worker["worker-platform-egress-fsn1"] will be destroyed
14:44:22.028 STDOUT tofu:   # (because key ["worker-platform-egress-fsn1"] is not in for_each map)
14:44:22.028 STDOUT tofu:   - resource "hcloud_network_subnet" "worker" {
14:44:22.028 STDOUT tofu:       - gateway      = "10.10.0.1" -> null
14:44:22.028 STDOUT tofu:       - id           = "REDACTED-10.10.65.128/25" -> null
14:44:22.028 STDOUT tofu:       - ip_range     = "10.10.65.128/25" -> null
14:44:22.028 STDOUT tofu:       - network_id   = REDACTED -> null
14:44:22.028 STDOUT tofu:       - network_zone = "eu-central" -> null
14:44:22.028 STDOUT tofu:       - type         = "cloud" -> null
14:44:22.028 STDOUT tofu:     }
14:44:22.028 STDOUT tofu:   # module.kubernetes.hcloud_network_subnet.worker["worker-platform-fsn1"] must be replaced
14:44:22.028 STDOUT tofu: -/+ resource "hcloud_network_subnet" "worker" {
14:44:22.028 STDOUT tofu:       ~ gateway      = "10.10.0.1" -> (known after apply)
14:44:22.028 STDOUT tofu:       ~ id           = "REDACTED-10.10.66.0/25" -> (known after apply)
14:44:22.028 STDOUT tofu:       ~ ip_range     = "10.10.66.0/25" -> "10.10.65.128/25" # forces replacement
14:44:22.028 STDOUT tofu:         # (3 unchanged attributes hidden)
14:44:22.029 STDOUT tofu:     }
14:44:22.029 STDOUT tofu:   # module.kubernetes.hcloud_placement_group.worker["eb-hcloud-ops-worker-platform-egress-fsn1-pg-1"] will be destroyed
14:44:22.029 STDOUT tofu:   # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-pg-1"] is not in for_each map)
14:44:22.029 STDOUT tofu:   - resource "hcloud_placement_group" "worker" {
14:44:22.029 STDOUT tofu:       - id      = "1100212" -> null
14:44:22.029 STDOUT tofu:       - labels  = {
14:44:22.029 STDOUT tofu:           - "cluster"  = "REDACTED"
14:44:22.029 STDOUT tofu:           - "nodepool" = "worker-platform-egress-fsn1"
14:44:22.029 STDOUT tofu:           - "role"     = "worker"
14:44:22.029 STDOUT tofu:         } -> null
14:44:22.029 STDOUT tofu:       - name    = "REDACTED-worker-platform-egress-fsn1-pg-1" -> null
14:44:22.029 STDOUT tofu:       - servers = [
14:44:22.029 STDOUT tofu:           - 106345027,
14:44:22.029 STDOUT tofu:         ] -> null
14:44:22.029 STDOUT tofu:       - type    = "spread" -> null
14:44:22.029 STDOUT tofu:     }
14:44:22.029 STDOUT tofu:   # module.kubernetes.hcloud_server.worker["eb-hcloud-ops-worker-platform-egress-fsn1-1"] will be destroyed
14:44:22.029 STDOUT tofu:   # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-1"] is not in for_each map)
14:44:22.029 STDOUT tofu:   - resource "hcloud_server" "worker" {
14:44:22.029 STDOUT tofu:       - allow_deprecated_images    = false -> null
14:44:22.029 STDOUT tofu:       - backups                    = false -> null
14:44:22.029 STDOUT tofu:       - datacenter                 = "fsn1-dc14" -> null
14:44:22.029 STDOUT tofu:       - delete_protection          = false -> null
14:44:22.029 STDOUT tofu:       - firewall_ids               = [
14:44:22.029 STDOUT tofu:           - 2299076,
14:44:22.029 STDOUT tofu:         ] -> null
14:44:22.029 STDOUT tofu:       - id                         = "REDACTED" -> null
14:44:22.029 STDOUT tofu:       - ignore_remote_firewall_ids = false -> null
14:44:22.029 STDOUT tofu:       - image                      = "REDACTED" -> null
14:44:22.029 STDOUT tofu:       - ipv6_network               = "<nil>" -> null
14:44:22.029 STDOUT tofu:       - keep_disk                  = false -> null
14:44:22.029 STDOUT tofu:       - labels                     = {
14:44:22.029 STDOUT tofu:           - "cluster"                 = "REDACTED"
14:44:22.029 STDOUT tofu:           - "nodepool"                = "worker-platform-egress-fsn1"
14:44:22.029 STDOUT tofu:           - "role"                    = "worker"
14:44:22.029 STDOUT tofu:         } -> null
14:44:22.029 STDOUT tofu:       - location                   = "fsn1" -> null
14:44:22.029 STDOUT tofu:       - name                       = "REDACTED-worker-platform-egress-fsn1-1" -> null
14:44:22.029 STDOUT tofu:       - placement_group_id         = 1100212 -> null
14:44:22.029 STDOUT tofu:       - primary_disk_size          = 80 -> null
14:44:22.029 STDOUT tofu:       - rebuild_protection         = false -> null
14:44:22.029 STDOUT tofu:       - server_type                = "cax21" -> null
14:44:22.029 STDOUT tofu:       - shutdown_before_deletion   = true -> null
14:44:22.029 STDOUT tofu:       - ssh_keys                   = [
14:44:22.029 STDOUT tofu:           - "100479455",
14:44:22.029 STDOUT tofu:         ] -> null
14:44:22.029 STDOUT tofu:       - status                     = "running" -> null
14:44:22.029 STDOUT tofu:       - network {
14:44:22.029 STDOUT tofu:           - alias_ips   = [] -> null
14:44:22.029 STDOUT tofu:           - ip          = "10.10.65.129" -> null
14:44:22.029 STDOUT tofu:           - mac_address = "86:00:00:ac:6d:bd" -> null
14:44:22.029 STDOUT tofu:           - network_id  = 11263756 -> null
14:44:22.029 STDOUT tofu:         }
14:44:22.029 STDOUT tofu:       - public_net {
14:44:22.030 STDOUT tofu:           - ipv4         = 0 -> null
14:44:22.030 STDOUT tofu:           - ipv4_enabled = false -> null
14:44:22.030 STDOUT tofu:           - ipv6         = 0 -> null
14:44:22.030 STDOUT tofu:           - ipv6_enabled = false -> null
14:44:22.030 STDOUT tofu:         }
14:44:22.030 STDOUT tofu:     }
14:44:22.030 STDOUT tofu:   # module.kubernetes.hcloud_server.worker["eb-hcloud-ops-worker-platform-fsn1-1"] will be updated in-place
14:44:22.030 STDOUT tofu:   ~ resource "hcloud_server" "worker" {
14:44:22.030 STDOUT tofu:         id                         = "REDACTED"
14:44:22.030 STDOUT tofu:         name                       = "REDACTED"
14:44:22.030 STDOUT tofu:         # (18 unchanged attributes hidden)
14:44:22.030 STDOUT tofu:       - network {
14:44:22.030 STDOUT tofu:           - alias_ips   = [] -> null
14:44:22.030 STDOUT tofu:           - ip          = "10.10.66.1" -> null
14:44:22.030 STDOUT tofu:           - mac_address = "86:00:00:a9:87:80" -> null
14:44:22.030 STDOUT tofu:           - network_id  = 11263756 -> null
14:44:22.030 STDOUT tofu:         }
14:44:22.030 STDOUT tofu:       + network {
14:44:22.030 STDOUT tofu:           + alias_ips   = []
14:44:22.030 STDOUT tofu:           + ip          = "10.10.65.129"
14:44:22.030 STDOUT tofu:           + mac_address = (known after apply)
14:44:22.030 STDOUT tofu:           + network_id  = REDACTED
14:44:22.030 STDOUT tofu:         }
14:44:22.030 STDOUT tofu:         # (1 unchanged block hidden)
14:44:22.030 STDOUT tofu:     }
14:44:22.030 STDOUT tofu:   # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-egress-fsn1-1"] will be destroyed
14:44:22.030 STDOUT tofu:   # (because key ["eb-hcloud-ops-worker-platform-egress-fsn1-1"] is not in for_each map)
14:44:22.030 STDOUT tofu:   - resource "talos_machine_configuration_apply" "worker" {
14:44:22.030 STDOUT tofu:       - apply_mode                  = "auto" -> null
14:44:22.030 STDOUT tofu:       - client_configuration        = {
14:44:22.030 STDOUT tofu:           - ca_certificate     = "REDACTED" -> null
14:44:22.030 STDOUT tofu:           - client_certificate = "REDACTED" -> null
14:44:22.030 STDOUT tofu:           - client_key         = (sensitive value) -> null
14:44:22.030 STDOUT tofu:         } -> null
14:44:22.030 STDOUT tofu:       - endpoint                    = "10.10.65.129" -> null
14:44:22.030 STDOUT tofu:       - id                          = "machine_configuration_apply" -> null
14:44:22.030 STDOUT tofu:       - machine_configuration       = (sensitive value) -> null
14:44:22.030 STDOUT tofu:       - machine_configuration_input = (sensitive value) -> null
14:44:22.030 STDOUT tofu:       - node                        = "10.10.65.129" -> null
14:44:22.030 STDOUT tofu:       - on_destroy                  = {
14:44:22.030 STDOUT tofu:           - graceful = true -> null
14:44:22.030 STDOUT tofu:           - reboot   = false -> null
14:44:22.030 STDOUT tofu:           - reset    = true -> null
14:44:22.030 STDOUT tofu:         } -> null
14:44:22.030 STDOUT tofu:     }
14:44:22.031 STDOUT tofu:   # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-egress-nbg1-1"] will be updated in-place
14:44:22.031 STDOUT tofu:   ~ resource "talos_machine_configuration_apply" "worker" {
14:44:22.031 STDOUT tofu:         id                          = "machine_configuration_apply"
14:44:22.031 STDOUT tofu:       ~ machine_configuration       = (sensitive value)
14:44:22.031 STDOUT tofu:       ~ machine_configuration_input = (sensitive value)
14:44:22.031 STDOUT tofu:         # (5 unchanged attributes hidden)
14:44:22.031 STDOUT tofu:     }
14:44:22.031 STDOUT tofu:   # module.kubernetes.talos_machine_configuration_apply.worker["eb-hcloud-ops-worker-platform-fsn1-1"] will be updated in-place
14:44:22.031 STDOUT tofu:   ~ resource "talos_machine_configuration_apply" "worker" {
14:44:22.031 STDOUT tofu:       ~ endpoint                    = "10.10.66.1" -> "10.10.65.129"
14:44:22.031 STDOUT tofu:         id                          = "machine_configuration_apply"
14:44:22.031 STDOUT tofu:       ~ machine_configuration       = (sensitive value)
14:44:22.031 STDOUT tofu:       ~ machine_configuration_input = (sensitive value)
14:44:22.031 STDOUT tofu:       ~ node                        = "10.10.66.1" -> "10.10.65.129"
14:44:22.031 STDOUT tofu:         # (3 unchanged attributes hidden)
14:44:22.031 STDOUT tofu:     }
14:44:22.031 STDOUT tofu:   # module.kubernetes.terraform_data.talos_health_data will be updated in-place
14:44:22.031 STDOUT tofu:   ~ resource "terraform_data" "talos_health_data" {
14:44:22.031 STDOUT tofu:         id     = "REDACTED"
14:44:22.031 STDOUT tofu:       ~ input  = {
14:44:22.031 STDOUT tofu:           ~ worker_nodes        = [
14:44:22.031 STDOUT tofu:               - "10.10.65.129",
14:44:22.031 STDOUT tofu:                 "10.10.65.1",
14:44:22.031 STDOUT tofu:               - "10.10.66.1",
14:44:22.031 STDOUT tofu:               + "10.10.65.129",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:             # (4 unchanged attributes hidden)
14:44:22.031 STDOUT tofu:         }
14:44:22.031 STDOUT tofu:       ~ output = {
14:44:22.031 STDOUT tofu:           - control_plane_nodes = [
14:44:22.031 STDOUT tofu:               - "10.10.64.11",
14:44:22.031 STDOUT tofu:               - "10.10.64.21",
14:44:22.031 STDOUT tofu:               - "10.10.64.1",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:           - current_ip          = []
14:44:22.031 STDOUT tofu:           - endpoints           = [
14:44:22.031 STDOUT tofu:               - "10.10.64.11",
14:44:22.031 STDOUT tofu:               - "10.10.64.21",
14:44:22.031 STDOUT tofu:               - "10.10.64.1",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:           - kube_api_url        = "https://REDACTED:6443"
14:44:22.031 STDOUT tofu:           - worker_nodes        = [
14:44:22.031 STDOUT tofu:               - "10.10.65.129",
14:44:22.031 STDOUT tofu:               - "10.10.65.1",
14:44:22.031 STDOUT tofu:               - "10.10.66.1",
14:44:22.031 STDOUT tofu:             ]
14:44:22.031 STDOUT tofu:         } -> (known after apply)
14:44:22.031 STDOUT tofu:     }
14:44:22.031 STDOUT tofu: Plan: 1 to add, 4 to change, 5 to destroy.
Breaking down the plan:
  • It removes the worker-platform-egress-fsn1 nodegroup, which is expected.
  • Because the remaining indices shift down, it wants to hand the deleted nodepool's subnet range (10.10.65.128/25) to the healthy worker-platform-fsn1 nodepool (currently 10.10.66.0/25).
  • That forces a subnet replacement, node IP changes (10.10.66.1 → 10.10.65.129), and Talos machine config updates.
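
For discussion, one possible direction (purely illustrative; a subnet_index attribute does not exist in the module today) would be an optional per-pool pin that falls back to the list position, so existing configs keep their current ranges:

# Hypothetical: let a pool pin its subnet slot explicitly
variable "worker_nodepools" {
  type = list(object({
    name         = string
    subnet_index = optional(number) # new and optional => backward compatible
    # ... remaining attributes unchanged
  }))
}

locals {
  # Fall back to the list position when no explicit index is given
  worker_pool_netnum = {
    for i, pool in var.worker_nodepools :
    pool.name => coalesce(pool.subnet_index, i)
  }
}

With something like that, worker-platform-fsn1 could set subnet_index = 2 before the removal and keep 10.10.66.0/25 untouched.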

@M4t7e what are your thoughts on this?
Which approach do you think would work best while maintaining backward compatibility?

Regards!
