Network errors after removing nodepool #1836
Description

Hi all, I've been trying to run an HA Kubernetes cluster from scratch for a short while. First off, holy **** this project is amazing, but I have one problem that I am unsure why it happens: when I remove a node pool, I get network issues on all services, not just on the ones in the removed nodepool. What I did:
Last events of the cluster:
Kube.tf file

locals {
hcloud_token = "<secret>"
}
module "kube-hetzner" {
providers = {
hcloud = hcloud
}
hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
source = "kube-hetzner/kube-hetzner/hcloud"
ssh_public_key = file("~/.ssh/id_ed25519.pub")
ssh_private_key = file("~/.ssh/id_ed25519")
network_region = "eu-central"
control_plane_nodepools = [
{
name = "control-plane-fsn1",
server_type = "cx22",
location = "fsn1",
labels = [],
taints = [],
count = 1
},
{
name = "control-plane-nbg1",
server_type = "cx22",
location = "nbg1",
labels = [],
taints = [],
count = 1
},
{
name = "control-plane-hel1",
server_type = "cx22",
location = "hel1",
labels = [],
taints = [],
count = 1
}
]
agent_nodepools = [
{
name = "agent-small",
server_type = "cx22",
location = "fsn1",
labels = [],
taints = [],
nodes = {
"1" : {
location = "nbg1"
},
"2" : {
location = "fsn1"
},
"3" : {
location = "hel1"
}
}
},
]
load_balancer_type = "lb11"
load_balancer_location = "fsn1"
base_domain = "<secret>"
autoscaler_nodepools = [
{
name = "autoscaled-small"
server_type = "cx22"
location = "fsn1"
min_nodes = 0
max_nodes = 1
labels = {
"node.kubernetes.io/role" : "peak-workloads"
}
taints = [
{
key = "node.kubernetes.io/role"
value = "peak-workloads"
effect = "NoExecute"
}
]
}
]
autoscaler_taints = [
"node.kubernetes.io/role=specific-workloads:NoExecute",
]
enable_delete_protection = {
floating_ip = true
load_balancer = true
volume = true
}
etcd_s3_backup = {
etcd-s3-endpoint = "<secret>"
etcd-s3-access-key = "<secret>"
etcd-s3-secret-key = "<secret>"
etcd-s3-bucket = "<secret>"
etcd-s3-region = "empty"
}
enable_csi_driver_smb = true
enable_longhorn = true
longhorn_replica_count = 2
hetzner_ccm_use_helm = true
traefik_additional_options = ["--log.level=INFO", "--accesslog=true"]
traefik_redirect_to_https = false
traefik_additional_trusted_ips = [
"173.245.48.0/20",
"103.21.244.0/22",
"103.22.200.0/22",
"103.31.4.0/22",
"141.101.64.0/18",
"108.162.192.0/18",
"190.93.240.0/20",
"188.114.96.0/20",
"197.234.240.0/22",
"198.41.128.0/17",
"162.158.0.0/15",
"104.16.0.0/13",
"104.24.0.0/14",
"172.64.0.0/13",
"131.0.72.0/22",
"2400:cb00::/32",
"2606:4700::/32",
"2803:f800::/32",
"2405:b500::/32",
"2405:8100::/32",
"2a06:98c0::/29",
"2c0f:f248::/32"
]
system_upgrade_use_drain = true
cluster_name = "<secret>"
dns_servers = [
"1.1.1.1",
"8.8.8.8",
"2606:4700:4700::1111",
]
create_kubeconfig = false
}
provider "hcloud" {
token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}
output "kubeconfig" {
value = module.kube-hetzner.kubeconfig
sensitive = true
}
variable "hcloud_token" {
sensitive = true
default = ""
}

Screenshots
No response

Platform
Linux
Hi @MarvinJWendt,

First off, thank you for such a detailed and well-documented issue report! And I'm glad you're enjoying the project 🙂

You actually did everything perfectly from a Kubernetes perspective - draining the nodes, waiting for pods to reschedule, verifying zero downtime. Your procedure was textbook correct! The issue you encountered is due to a specific limitation in how this Terraform module manages nodepools.

What happened:
The module allocates subnets and IPs in a specific order, and removing a nodepool from the middle of the list (rather than from the end) disrupts this allocation scheme. This causes the networking issues you observed, even affecting pods on other nodes.

The solution is simpler than it might seem:

{
  name = "test",
  server_type = "cx22",
  location = "fsn1",
  labels = [],
  taints = [],
  count = 0 # ← Just change this
}

This way, the nodepool stays in the configuration (preserving the subnet allocation order) but has no actual nodes running. You can leave it at 0 indefinitely, or spin it back up whenever needed!

The key rules: only ever remove nodepools from the end of the list, and retire any other nodepool by setting its count to 0 instead of deleting its block.
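To make that concrete, here is a minimal sketch of how the agent_nodepools list from your kube.tf could look with a retired pool kept in place. The pool name "test" and its position ahead of "agent-small" are only assumptions for illustration, matching the snippet above:

agent_nodepools = [
  {
    name        = "test",        # assumed name of the retired pool
    server_type = "cx22",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 0              # zero servers, but the pool keeps its slot in the list
  },
  {
    name        = "agent-small", # existing pool from the kube.tf above, unchanged
    server_type = "cx22",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    nodes = {
      "1" : { location = "nbg1" },
      "2" : { location = "fsn1" },
      "3" : { location = "hel1" }
    }
  },
]

Because every remaining pool keeps its position in the list, the module's subnet and IP ordering stays the same, and networking on the other nodes is unaffected.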
I know this limitation isn't intuitive, especially when you're following all the right Kubernetes practices. It's one of those Terraform-specific quirks that comes from how the underlying infrastructure is managed. You can find more info about this in both the readme and the kube.tf.example file. Thanks again for the excellent report - it will definitely help others who encounter the same confusion!