Single Node missing private IP #1829

Spivur · 2025-07-02T14:37:35Z

Description

We are currently trying to deploy a simple 4-node cluster. Since July 30th, however, one of the nodes consistently fails to receive a private IP address.

We deploy the cluster using the GitLab OpenTofu component through a GitLab CI/CD pipeline. We have also tested the deployment locally on three different devices. On two devices (WSL, Mac) the cluster deploys successfully without issues, but on one device (Mac) we observe the same problem as in the pipelines.

We have already verified that the issue is not related to the latest version update. Additionally, we have deployed four regular nodes attached to a private network, and this works reliably on all devices and the pipeline.

There is no error message; the deployment simply times out after a while.

We have also tried the deployment without the cilium config.

Kube.tf file

terraform {
  required_version = "~> 1.9.1"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.50.1"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.36.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.17.0"
    }
  }
}

variable "hcloud_token" {
  description = "Hetzner-Cloud API token"
  type        = string
  sensitive   = true
}

variable "ssh_public_key" {
  description = "Hetzner-Cloud SSH public key"
  type        = string
  sensitive   = true
}

variable "ssh_private_key" {
  description = "Hetzner-Cloud SSH private key"
  type        = string
  sensitive   = true
}

resource "local_file" "kubeconfig" {
  content  = module.kube_hetzner.kubeconfig
  filename = "${path.module}/kubeconfig.yaml"
}

output "kubeconfig" {
  description = "The kubeconfig file for Kubernetes"
  value       = module.kube_hetzner.kubeconfig
  sensitive   = true
}

module "kube_hetzner" {
  providers = {
    hcloud = hcloud
  }
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.17.1"

  hcloud_token = var.hcloud_token
  cluster_name = "cluster"

  ssh_public_key  = base64decode(var.ssh_public_key)
  ssh_private_key = base64decode(var.ssh_private_key)

  network_region = "eu-central"

  enable_wireguard              = true
  enable_metrics_server         = true
  ingress_controller            = "nginx"
  use_cluster_name_in_node_name = true

  control_plane_nodepools = [
    {
      name            = "control"
      server_type     = "cax11"
      location        = "fsn1"
      labels          = []
      taints          = []
      count           = 1
      placement_group = "default"
    }
  ]

  agent_nodepools = [
    {
      name            = "agent"
      server_type     = "cax11"
      location        = "fsn1"
      labels          = []
      taints          = []
      count           = 3
      placement_group = "default"
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  initial_k3s_channel = "v1.29"

  automatically_upgrade_os = false

  cni_plugin            = "cilium"
  cluster_ipv4_cidr     = "10.42.0.0/16"
  cilium_hubble_enabled = true
  // ===========================
  // Cilium Configuration
  // ===========================
  cilium_values         = <<-EOT
  ipam:
    mode: kubernetes
  k8s:
    requireIPv4PodCIDR: true
  kubeProxyReplacement: true
  routingMode: native
  ipv4NativeRoutingCIDR: "10.0.0.0/8"
  endpointRoutes:
    enabled: true
  loadBalancer:
    acceleration: native
  bpf:
    masquerade: true
  MTU: 1450
  hubble:
    relay:
      enabled: true
    ui:
      enabled: true
  EOT
}

Screenshots

No response

Platform

Linux, Mac

mysticaltech · 2025-07-27T15:59:32Z

Maintainer

Hi @Spivur,

I've investigated this issue, and its intermittent nature suggests it might be caused by external factors rather than by a bug in the module code. The following debugging steps should help identify the root cause:

Debugging Steps

1. Check Terraform Parallelism

The issue might be caused by too many concurrent API calls. Try reducing parallelism (with OpenTofu, run tofu in place of terraform):

terraform apply -parallelism=1
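
In a CI pipeline where the apply command itself is not easy to edit, the same flag can be supplied through the environment. A minimal sketch, assuming the job passes environment variables through to the CLI; TF_CLI_ARGS_apply is read by both Terraform and OpenTofu:

```shell
# Sketch: append -parallelism=1 to every `apply` without changing the command line.
# Assumption: the CI job exports this variable before the apply step runs.
export TF_CLI_ARGS_apply="-parallelism=1"

# Print the extra arguments so they show up in the job log.
echo "apply args: ${TF_CLI_ARGS_apply}"
```

Unset the variable again once the test is done, otherwise every later apply stays serialized.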

2. Enable Debug Logging

Set these environment variables to get more detailed logs:

export TF_LOG=DEBUG
export HCLOUD_DEBUG=true
terraform apply 2>&1 | tee terraform-debug.log
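
Debug logs get large quickly, so it can help to filter them first. The function below is only a sketch: the name scan_debug_log is hypothetical, and the search patterns are assumptions about what a failed network attachment, an API rate limit, or a timeout might look like in the log.

```shell
# Sketch: print the log lines most likely to explain a missing private IP.
# The patterns are guesses, not an exhaustive list; adjust them to what your log contains.
scan_debug_log() {
  log_file="${1:-terraform-debug.log}"
  # -n prefixes each match with its line number, -i ignores case.
  grep -nEi 'attach_to_network|rate_limit_exceeded|too many requests|context deadline exceeded' "$log_file"
}

# Usage, after a failed run has produced the log:
#   scan_debug_log terraform-debug.log
```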

3. Check Hetzner Cloud Console

Log in to the Hetzner Cloud console during deployment
Check if all 4 servers are created
Check if the problematic node has a private network attached
Look for any error messages in the console

4. Terraform State Investigation

After a failed deployment, check which resources were created:

terraform state list | grep -E "(server|network)"
# state show needs a full resource address taken from the list above, not a module path
terraform state show '<RESOURCE_ADDRESS>'
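
For a quick check of whether all four servers made it into state, the listing can be counted. This is a sketch only: the helper name count_servers_in_state is hypothetical, and it assumes every server appears in state as a resource of type hcloud_server.

```shell
# Sketch: read `state list` output on stdin and print the number of hcloud_server resources.
# The trailing dot in the pattern keeps hcloud_server_network entries out of the count.
count_servers_in_state() {
  # grep -c exits non-zero when the count is 0, so `|| true` keeps the exit status clean.
  grep -c 'hcloud_server\.' || true
}

# Usage (expects 4 for one control plane node plus three agents):
#   terraform state list | count_servers_in_state
```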

5. Manual Network Attachment Test

If a node is missing its private IP, try manually attaching it:

# Get the server ID of the problematic node
hcloud server list

# Get the network ID
hcloud network list

# Try to attach manually
hcloud server attach-to-network <SERVER_ID> --network <NETWORK_ID> --ip <IP_ADDRESS>
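
To find the affected node without opening each server, the server list can be filtered. This is a rough sketch: the helper name servers_without_private_ip is hypothetical, and it assumes your hcloud CLI version offers a private_net column and prints "-" (or nothing) there for a server with no network attached.

```shell
# Sketch: read "name private_net" rows on stdin and print the names with no private network.
# Assumption: an unattached server shows "-" or an empty value in the second column.
servers_without_private_ip() {
  awk 'NF >= 1 && ($2 == "" || $2 == "-") { print $1 }'
}

# Usage (requires the hcloud CLI with an active context):
#   hcloud server list -o noheader -o columns=name,private_net | servers_without_private_ip
```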

6. Check Resource Limits

Verify you're not hitting any Hetzner account limits:

Server limits
Network limits
API rate limits
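
The API rate limit can be read from the response headers of any Hetzner Cloud API call. A minimal sketch, assuming the token is available as HCLOUD_TOKEN in the shell; the helper name rate_limit_remaining is hypothetical:

```shell
# Sketch: read HTTP response headers on stdin and print the RateLimit-Remaining value.
# Header names are matched case-insensitively because HTTP/2 responses use lowercase names.
rate_limit_remaining() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "ratelimit-remaining" { print $2 }'
}

# Usage (dumps only the headers of a cheap list call):
#   curl -sS -o /dev/null -D - -H "Authorization: Bearer $HCLOUD_TOKEN" \
#     https://api.hetzner.cloud/v1/servers | rate_limit_remaining
```

A value close to zero during a deployment would point towards throttling rather than a module bug.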

Potential Workarounds

1. Sequential Creation

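Split the agent nodepool into three single-node nodepools, each with its own placement group. Nodepool names must be unique within the module, so each pool needs a distinct name. This alone does not add explicit Terraform dependencies; combine it with -parallelism=1 to force strictly sequential creation:

agent_nodepools = [
  {
    name            = "agent-1"
    server_type     = "cax11"
    location        = "fsn1"
    labels          = []
    taints          = []
    count           = 1
    placement_group = "agent-1"
  },
  {
    name            = "agent-2"
    server_type     = "cax11"
    location        = "fsn1"
    labels          = []
    taints          = []
    count           = 1
    placement_group = "agent-2"
  },
  {
    name            = "agent-3"
    server_type     = "cax11"
    location        = "fsn1"
    labels          = []
    taints          = []
    count           = 1
    placement_group = "agent-3"
  }
]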

2. Remove Placement Groups

Try deploying without placement groups first:

# Comment out or remove placement_group = "default"

3. Different Network Configuration

Try using a different network CIDR or region to rule out network-specific issues.

Information Needed

Could you please provide:

The debug logs from a failed deployment
Screenshots or details from the Hetzner Cloud console showing the problematic node
The exact error message (if any) from the timeout
Your Terraform or OpenTofu version (terraform version or tofu version)
Whether this happens with the same node index each time (e.g., always the 3rd agent node)

This information will help identify whether it's a module issue or an infrastructure/API problem.
