-
Description

I started this discussion here. I have an HA cluster and I need to update the image.
How to play

Create a cluster with …, then add …, then drain …, then run … You will see the error above.

Kube.tf file

# export TF_VAR_HCLOUD_TOKEN=${HCLOUD_TOKEN}
variable "HCLOUD_TOKEN" {
  type        = string
  description = "Token for Hetzner Cloud"
}

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.2.3"
  providers = {
    hcloud = hcloud
  }
  depends_on = [
    tls_private_key.admin
  ]

  hcloud_token    = var.HCLOUD_TOKEN
  ssh_public_key  = tls_private_key.admin.public_key_openssh
  ssh_private_key = tls_private_key.admin.private_key_openssh
  network_region  = "eu-central"

  use_control_plane_lb = true

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "control-plane-fsn2",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg2",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel2",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-normal",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [
        "nodepool=normal",
      ],
      taints      = [],
      count       = 1
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  ingress_controller      = "none"
  enable_klipper_metal_lb = false
  enable_cert_manager     = false

  cluster_name              = "test-kube-hetzner"
  create_kubeconfig         = true
  create_kustomization      = true
  restrict_outbound_traffic = false

  kured_options = {
    "reboot-days" : "su,mo,tu,we,th,fr,sa"
    "start-time" : "00:00:00"
    "end-time" : "23:59:59"
    "time-zone" : "Europe/Istanbul"
  }
}

resource "tls_private_key" "admin" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P384"
}

resource "local_file" "ssh_private_key" {
  content         = tls_private_key.admin.private_key_openssh
  filename        = "${path.module}/out_files/admin_key"
  file_permission = "0600" # must be an octal string; a bare 600 sets the wrong mode
}

resource "local_file" "ssh_public_key" {
  content         = tls_private_key.admin.public_key_openssh
  filename        = "${path.module}/out_files/admin_key.pub"
  file_permission = "0600"
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig_file
  sensitive = true
}

Screenshots: No response

Platform: Linux
Replies: 8 comments 10 replies
-
@jidckii I see, yes, you are right: the first nodepool is always needed. You cannot do it the way you are trying; that would work for agents only. As for control planes, since you already have 3 control plane nodes, you are automatically in HA. 1) Drain the first control-plane-fsn1 node, then 2) run terraform apply, which will rebuild it with the new image. 3) Do the same with the remaining two. Basically, if 0-0 is down, 0-1 takes over, so one of these two needs to be online. That is why you can never set the first control plane nodepool to a count of 0: it is either 1, or 3, or more (always odd counts).
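The drain-then-apply cycle described above can be sketched as a small shell loop. This is only a sketch, not the module's own tooling: the node names are hypothetical placeholders (real nodes carry a random suffix), and `run` is an echo wrapper so the sequence can be reviewed as a dry run before executing anything.

```shell
#!/bin/sh
# Dry-run sketch of rolling control-plane replacement: drain one node,
# let terraform rebuild it, confirm it is back, then move on to the next.
run() { echo "+ $*"; }  # swap for: run() { "$@"; } to execute for real

for node in control-plane-fsn1-abcd control-plane-nbg1-efgh control-plane-hel1-ijkl; do
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run terraform apply       # rebuilds the drained node with the new image
  run kubectl get nodes     # wait for the replacement to be Ready first
done
```

Only one control plane node should be down at any given time, so the etcd quorum is preserved throughout the rollout.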
-
I would like to leave a comment here, since we had a similar scenario lately. I wasn't paying enough attention to the part about control plane counts (totally my bad), but what happened was pretty weird. I had the same remote-exec provisioner error mentioned above and wasn't able to solve it by recreating the node, so I jumped into a shell on the node and noticed that there was no /var/post_install folder. Instead there was a /var/post_install file with the following content:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kured
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: kured
  template:
    metadata:
      labels:
        name: kured
    spec:
      serviceAccountName: kured
      containers:
        - name: kured
          command:
            - /usr/bin/kured
            - --period=5m
            - --post-reboot-node-labels=kured=done
            - --pre-reboot-node-labels=kured=rebooting
            - --reboot-command=/usr/bin/systemctl reboot

After deleting that file, running mkdir -p /var/post_install, and then running terraform apply, the provisioning worked again. I am not sure how this happened, but I thought I would share it here in case anyone else runs into it.
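The repair described above can be re-enacted in a throwaway sandbox. This sketch only mimics the on-node fix locally; on the real node the commands run as root (typically over SSH), and `$node_root` is a hypothetical stand-in for the node's filesystem root.

```shell
#!/bin/sh
# Sandbox re-enactment of the fix: a stray *file* at /var/post_install
# blocks provisioning; it has to be replaced by a directory.
node_root=$(mktemp -d)                                   # stand-in for / on the node
echo "stray kured manifest" > "$node_root/post_install"  # the broken state

# The fix from the comment above:
rm "$node_root/post_install"
mkdir -p "$node_root/post_install"

[ -d "$node_root/post_install" ] && echo "post_install is a directory again"
```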
-
@fatelgit Thanks for sharing, super weird! 🤯
-
@mysticaltech Hello! I want to reopen this issue. Previously I did not actually test your solution, I just took your word for it. If I destroy the second or third control plane, there is no problem when recreating the resource: the new instance joins the cluster. But if I destroy and recreate the first control plane instance, it does not join the cluster again. The /etc/rancher/k3s/k3s.yaml file then contains new certificates. In other words, the effect is the same as creating a brand-new cluster.
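One way to confirm that a rebuilt first control plane really initialised a new cluster is to compare the certificate-authority-data between the old and the new kubeconfig. Below is a self-contained illustration using fake, hypothetical kubeconfig snippets; in practice you would run the awk extraction against the real files.

```shell
#!/bin/sh
# Compare the cluster CA embedded in two kubeconfigs; different values mean
# the node bootstrapped a new cluster instead of joining the existing one.
workdir=$(mktemp -d)
cat > "$workdir/old-kubeconfig" <<'EOF'
    certificate-authority-data: QUFBQUFB
EOF
cat > "$workdir/new-kubeconfig" <<'EOF'
    certificate-authority-data: QkJCQkJC
EOF

old_ca=$(awk '/certificate-authority-data/ {print $2}' "$workdir/old-kubeconfig")
new_ca=$(awk '/certificate-authority-data/ {print $2}' "$workdir/new-kubeconfig")

if [ "$old_ca" = "$new_ca" ]; then
  echo "same cluster CA: the node re-joined the existing cluster"
else
  echo "different cluster CA: the node initialised a new cluster"
fi
```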
-
So right now the first control plane is used as the store and source of certificates for attaching new instances. I consider this dependency extremely dangerous, because if something happens to that instance, the user has to deal with the problem of adding new instances manually. Unfortunately Hetzner does not offer an S3 solution, otherwise one could use that. Do you have any ideas on how to get around the node re-creation restrictions? Maybe we could copy the token and kubeconfig locally, or use a third-party S3 provider?
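On the third-party S3 idea: k3s itself (with embedded etcd) can ship etcd snapshots to any S3-compatible endpoint. A sketch of the relevant server configuration follows; the endpoint, bucket, and credentials are placeholders, and whether the kube-hetzner module exposes these options directly would need checking, so this is shown as raw k3s config rather than module input.

```yaml
# /etc/rancher/k3s/config.yaml on the servers (placeholder values)
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-s3: true
etcd-s3-endpoint: "s3.example.com"   # any S3-compatible provider
etcd-s3-bucket: "k3s-snapshots"
etcd-s3-access-key: "ACCESS_KEY"
etcd-s3-secret-key: "SECRET_KEY"
```

A snapshot restored with `k3s server --cluster-reset` would remove the dependency on any single node surviving, which addresses the "first control plane as certificate store" worry above.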
-
@jidckii No, we designed the system to be able to live without the first control plane; nothing special is stored on it, it is just used for initialisation. Remember, what can never be taken to 0 is the count of the first control plane nodepool: that is always count >= 1. Now if you are in HA (three or more control plane nodes), you can take out the first control plane node, in your case control-plane-fsn1-0-0, and still work with the cluster by connecting to control-plane-fsn1-0-1. THE CLUSTER DOES NOT GO DOWN. You repair or replace 0-0, and then you can start adding new nodes again. Either way, as long as 0-1 is up, you should normally be able to add new nodes; there is an if-else in the code that generates the configs. Long story short, the cluster stays running; it is of course just better to replace 0-0 if it is down. If it is not behaving like this, then it is a newly introduced bug. In my last tests a few months ago it was still ok. If it is not ok for you, please share detailed error messages and an exact, repeatable procedure.
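The "always odd counts" rule for control planes comes from etcd quorum arithmetic: a cluster of n servers needs a majority (n/2 + 1) to stay writable and therefore tolerates (n-1)/2 simultaneous failures. A quick illustration:

```shell
#!/bin/sh
# Quorum arithmetic behind "always odd counts": n servers need a majority
# (n/2 + 1) to stay writable, and tolerate (n-1)/2 simultaneous failures.
summary=$(for n in 1 2 3 4 5; do
  echo "$n server(s): quorum $(( n / 2 + 1 )), tolerates $(( (n - 1) / 2 )) failure(s)"
done)
printf '%s\n' "$summary"
```

Note that 4 servers tolerate no more failures than 3 do, which is why even counts add cost without adding fault tolerance.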
-
Well, here is my configuration:
Full state:
Next I drain and remove the node control-plane-fsn1 via kubectl.
Next:
Next, make again:
Next, the kubeconfig at the root of the project directory was updated, and the newly created instance initiated a new cluster.
Also, before destroying the 0-0 instance, I tried to create one more instance in the first control plane pool.
Next:
Next:
Next, make again:
But the effect is exactly the same: the instance creates a new cluster.
As an off-topic note: it is a pity that the index suffix within the pool is not appended to the instance names, and a random string is used instead. It is not immediately clear which instance needs to be drained.
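On the naming point: the random suffix can at least be stripped mechanically to recover which nodepool a node belongs to. A small sketch; the node names and the assumption that the suffix is the last dash-separated token are hypothetical, so check against the names `kubectl get nodes` actually reports.

```shell
#!/bin/sh
# Strip the trailing random suffix from kube-hetzner style node names to
# recover the nodepool name (example names are hypothetical).
pool_of() { echo "${1%-*}"; }  # drop the last dash-separated token

pool_of control-plane-fsn1-xkq2   # prints control-plane-fsn1
pool_of agent-normal-7hd9         # prints agent-normal
```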
-
I have a similar problem with the newest version (2.5.3), but in a single-controller setup. Steps I have done:
The result is that the old control plane was successfully removed and the cluster is working without any problems (so far), but i keep getting this error with every
Will be resolved here: #913