Network errors after removing nodepool #1836
Description

Hi all, I've been trying to run an HA Kubernetes cluster from scratch for a short while. First off, holy **** this project is amazing, but I have one problem that I am unsure why it happens: when I remove a node pool, I get network issues on all services, not just on the ones in the removed nodepool. What I did:
Last events of the cluster:
Kube.tf file

locals {
hcloud_token = "<secret>"
}
module "kube-hetzner" {
providers = {
hcloud = hcloud
}
hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
source = "kube-hetzner/kube-hetzner/hcloud"
ssh_public_key = file("~/.ssh/id_ed25519.pub")
ssh_private_key = file("~/.ssh/id_ed25519")
network_region = "eu-central"
control_plane_nodepools = [
{
name = "control-plane-fsn1",
server_type = "cx22",
location = "fsn1",
labels = [],
taints = [],
count = 1
},
{
name = "control-plane-nbg1",
server_type = "cx22",
location = "nbg1",
labels = [],
taints = [],
count = 1
},
{
name = "control-plane-hel1",
server_type = "cx22",
location = "hel1",
labels = [],
taints = [],
count = 1
}
]
agent_nodepools = [
{
name = "agent-small",
server_type = "cx22",
location = "fsn1",
labels = [],
taints = [],
nodes = {
"1" : {
location = "nbg1"
},
"2" : {
location = "fsn1"
},
"3" : {
location = "hel1"
}
}
},
]
load_balancer_type = "lb11"
load_balancer_location = "fsn1"
base_domain = "<secret>"
autoscaler_nodepools = [
{
name = "autoscaled-small"
server_type = "cx22"
location = "fsn1"
min_nodes = 0
max_nodes = 1
labels = {
"node.kubernetes.io/role" : "peak-workloads"
}
taints = [
{
key = "node.kubernetes.io/role"
value = "peak-workloads"
effect = "NoExecute"
}
]
}
]
autoscaler_taints = [
"node.kubernetes.io/role=specific-workloads:NoExecute",
]
enable_delete_protection = {
floating_ip = true
load_balancer = true
volume = true
}
etcd_s3_backup = {
etcd-s3-endpoint = "<secret>"
etcd-s3-access-key = "<secret>"
etcd-s3-secret-key = "<secret>"
etcd-s3-bucket = "<secret>"
etcd-s3-region = "empty"
}
enable_csi_driver_smb = true
enable_longhorn = true
longhorn_replica_count = 2
hetzner_ccm_use_helm = true
traefik_additional_options = ["--log.level=INFO", "--accesslog=true"]
traefik_redirect_to_https = false
traefik_additional_trusted_ips = [
"173.245.48.0/20",
"103.21.244.0/22",
"103.22.200.0/22",
"103.31.4.0/22",
"141.101.64.0/18",
"108.162.192.0/18",
"190.93.240.0/20",
"188.114.96.0/20",
"197.234.240.0/22",
"198.41.128.0/17",
"162.158.0.0/15",
"104.16.0.0/13",
"104.24.0.0/14",
"172.64.0.0/13",
"131.0.72.0/22",
"2400:cb00::/32",
"2606:4700::/32",
"2803:f800::/32",
"2405:b500::/32",
"2405:8100::/32",
"2a06:98c0::/29",
"2c0f:f248::/32"
]
system_upgrade_use_drain = true
cluster_name = "<secret>"
dns_servers = [
"1.1.1.1",
"8.8.8.8",
"2606:4700:4700::1111",
]
create_kubeconfig = false
}
provider "hcloud" {
token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}
output "kubeconfig" {
value = module.kube-hetzner.kubeconfig
sensitive = true
}
variable "hcloud_token" {
sensitive = true
default = ""
}

Screenshots
No response

Platform
Linux
Hi @MarvinJWendt,

First off, thank you for such a detailed and well-documented issue report! And I'm glad you're enjoying the project 🙂

You actually did everything perfectly from a Kubernetes perspective - draining the nodes, waiting for pods to reschedule, verifying zero downtime. Your procedure was textbook correct! The issue you encountered is due to a specific limitation in how this Terraform module manages nodepools.

What happened:
The module allocates subnets and IPs in a specific order, and removing a nodepool from the middle of the list (rather than from the end) disrupts this allocation scheme. This causes the networking issues you observed, even affecting pods on other nodes.

The solution is simpler than it might seem:

{
  name = "test",
  server_type = "cx22",
  location = "fsn1",
  labels = [],
  taints = [],
  count = 0 # ← Just change this
}

This way, the nodepool stays in the configuration (preserving the subnet allocation order) but has no actual nodes running. You can leave it at 0 indefinitely, or spin it back up whenever needed!

The key rules: only ever remove nodepools from the end of the list, and retire any other nodepool by setting its count to 0 instead of deleting its block.
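To make that concrete, here is a minimal sketch of how the agent_nodepools list from your kube.tf could look with a retired pool kept in place. The pool name "test" and its position ahead of "agent-small" are only assumptions for illustration, matching the snippet above:

agent_nodepools = [
  {
    name        = "test",        # assumed name of the retired pool
    server_type = "cx22",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 0              # zero servers, but the pool keeps its slot in the list
  },
  {
    name        = "agent-small", # existing pool from the kube.tf above, unchanged
    server_type = "cx22",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    nodes = {
      "1" : { location = "nbg1" },
      "2" : { location = "fsn1" },
      "3" : { location = "hel1" }
    }
  },
]

Because every remaining pool keeps its position in the list, the module's subnet and IP ordering stays the same, and networking on the other nodes is unaffected.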
I know this limitation isn't intuitive, especially when you're following all the right Kubernetes practices. It's one of those Terraform-specific quirks that comes from how the underlying infrastructure is managed. You can find more info about this in both the readme and the kube.tf.example file. Thanks again for the excellent report - it will definitely help others who encounter the same confusion!