-
Description

I started this discussion here. I have an HA cluster and I need to update the image.
How to play

Create a cluster with …, then add …, then drain …, then run … You will see the error above.

Kube.tf file

# export TF_VAR_HCLOUD_TOKEN=${HCLOUD_TOKEN}
variable "HCLOUD_TOKEN" {
  type        = string
  description = "Token for Hetzner Cloud"
}

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.2.3"
  providers = {
    hcloud = hcloud
  }
  depends_on = [
    tls_private_key.admin
  ]

  hcloud_token    = var.HCLOUD_TOKEN
  ssh_public_key  = tls_private_key.admin.public_key_openssh
  ssh_private_key = tls_private_key.admin.private_key_openssh
  network_region  = "eu-central"

  use_control_plane_lb = true

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 0
    },
    {
      name        = "control-plane-fsn2",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg2",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel2",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-normal",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [
        "nodepool=normal",
      ],
      taints      = [],
      count       = 1
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  ingress_controller      = "none"
  enable_klipper_metal_lb = false
  enable_cert_manager     = false

  cluster_name              = "test-kube-hetzner"
  create_kubeconfig         = true
  create_kustomization      = true
  restrict_outbound_traffic = false

  kured_options = {
    "reboot-days" : "su,mo,tu,we,th,fr,sa"
    "start-time" : "00:00:00"
    "end-time" : "23:59:59"
    "time-zone" : "Europe/Istanbul"
  }
}

resource "tls_private_key" "admin" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P384"
}

resource "local_file" "ssh_private_key" {
  content         = tls_private_key.admin.private_key_openssh
  filename        = "${path.module}/out_files/admin_key"
  file_permission = "0600" # must be an octal string; a bare 600 sets the wrong mode
}

resource "local_file" "ssh_public_key" {
  content         = tls_private_key.admin.public_key_openssh
  filename        = "${path.module}/out_files/admin_key.pub"
  file_permission = "0600"
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig_file
  sensitive = true
}

Screenshots: No response

Platform: Linux
Replies: 8 comments 10 replies
-
@jidckii I see, yes, you are right: the first nodepool is always needed. You cannot do it the way you are trying; that would work for agents only. As for control planes, since you already have 3 control plane nodes, you are automatically in HA. 1) Drain the first control-plane-fsn1 node, then 2) run terraform apply, which will rebuild it with the new image. 3) Do the same with the remaining two. Basically, if 0-0 is down, 0-1 takes over, so one of these two needs to be online. That is why you can never set the first control plane nodepool to a count of 0: it is either 1, or 3, or more (always odd counts).
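The drain-then-apply cycle described above can be sketched as a small shell loop. This is only a sketch, not the module's own tooling: the node names are hypothetical placeholders (real nodes carry a random suffix), and `run` is an echo wrapper so the sequence can be reviewed as a dry run before executing anything.

```shell
#!/bin/sh
# Dry-run sketch of rolling control-plane replacement: drain one node,
# let terraform rebuild it, confirm it is back, then move on to the next.
run() { echo "+ $*"; }  # swap for: run() { "$@"; } to execute for real

for node in control-plane-fsn1-abcd control-plane-nbg1-efgh control-plane-hel1-ijkl; do
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run terraform apply       # rebuilds the drained node with the new image
  run kubectl get nodes     # wait for the replacement to be Ready first
done
```

Only one control plane node should be down at any given time, so the etcd quorum is preserved throughout the rollout.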
-
I would like to leave a comment here, since we had a similar scenario lately. I wasn't paying enough attention to the part about control plane counts (totally my bad), but what happened was pretty weird. I had the same remote-exec provisioner error mentioned above and wasn't able to solve it by recreating the node, so I jumped into a shell on the node and noticed that there was no /var/post_install folder. Instead there was a /var/post_install file with the following content:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kured
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: kured
  template:
    metadata:
      labels:
        name: kured
    spec:
      serviceAccountName: kured
      containers:
        - name: kured
          command:
            - /usr/bin/kured
            - --period=5m
            - --post-reboot-node-labels=kured=done
            - --pre-reboot-node-labels=kured=rebooting
            - --reboot-command=/usr/bin/systemctl reboot

After deleting that file, running mkdir -p /var/post_install, and then running terraform apply, the provisioning worked again. I am not sure how this happened, but I thought I would share it here in case anyone else runs into it.
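The repair described above can be re-enacted in a throwaway sandbox. This sketch only mimics the on-node fix locally; on the real node the commands run as root (typically over SSH), and `$node_root` is a hypothetical stand-in for the node's filesystem root.

```shell
#!/bin/sh
# Sandbox re-enactment of the fix: a stray *file* at /var/post_install
# blocks provisioning; it has to be replaced by a directory.
node_root=$(mktemp -d)                                   # stand-in for / on the node
echo "stray kured manifest" > "$node_root/post_install"  # the broken state

# The fix from the comment above:
rm "$node_root/post_install"
mkdir -p "$node_root/post_install"

[ -d "$node_root/post_install" ] && echo "post_install is a directory again"
```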
-
@fatelgit Thanks for sharing, super weird! 🤯
-
@mysticaltech Hello! I want to reopen this issue. Previously I did not actually test your solution, I just took your word for it. If I destroy the second or third control plane, there is no problem when recreating the resource: the new instance joins the cluster. But if I destroy and recreate the first control plane instance, it does not join the cluster again. The /etc/rancher/k3s/k3s.yaml file then contains new certificates. In other words, the effect is the same as creating a brand-new cluster.
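One way to confirm that a rebuilt first control plane really initialised a new cluster is to compare the certificate-authority-data between the old and the new kubeconfig. Below is a self-contained illustration using fake, hypothetical kubeconfig snippets; in practice you would run the awk extraction against the real files.

```shell
#!/bin/sh
# Compare the cluster CA embedded in two kubeconfigs; different values mean
# the node bootstrapped a new cluster instead of joining the existing one.
workdir=$(mktemp -d)
cat > "$workdir/old-kubeconfig" <<'EOF'
    certificate-authority-data: QUFBQUFB
EOF
cat > "$workdir/new-kubeconfig" <<'EOF'
    certificate-authority-data: QkJCQkJC
EOF

old_ca=$(awk '/certificate-authority-data/ {print $2}' "$workdir/old-kubeconfig")
new_ca=$(awk '/certificate-authority-data/ {print $2}' "$workdir/new-kubeconfig")

if [ "$old_ca" = "$new_ca" ]; then
  echo "same cluster CA: the node re-joined the existing cluster"
else
  echo "different cluster CA: the node initialised a new cluster"
fi
```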
-
So right now the first control plane is used as the store and source of certificates for attaching new instances. I consider this dependency extremely dangerous, because if something happens to that instance, the user has to deal with the problem of adding new instances manually. Unfortunately Hetzner does not offer an S3 solution, otherwise one could use that. Do you have any ideas on how to get around the node re-creation restrictions? Maybe we could copy the token and kubeconfig locally, or use a third-party S3 provider?
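On the third-party S3 idea: k3s itself (with embedded etcd) can ship etcd snapshots to any S3-compatible endpoint. A sketch of the relevant server configuration follows; the endpoint, bucket, and credentials are placeholders, and whether the kube-hetzner module exposes these options directly would need checking, so this is shown as raw k3s config rather than module input.

```yaml
# /etc/rancher/k3s/config.yaml on the servers (placeholder values)
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-s3: true
etcd-s3-endpoint: "s3.example.com"   # any S3-compatible provider
etcd-s3-bucket: "k3s-snapshots"
etcd-s3-access-key: "ACCESS_KEY"
etcd-s3-secret-key: "SECRET_KEY"
```

A snapshot restored with `k3s server --cluster-reset` would remove the dependency on any single node surviving, which addresses the "first control plane as certificate store" worry above.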
-
@jidckii No, we designed the system to be able to live without the first control plane; nothing special is stored on it, it is just used for initialisation. Remember, what can never be taken to 0 is the count of the first control plane nodepool: that is always count >= 1. Now if you are in HA (three or more control plane nodes), you can take out the first control plane node, in your case control-plane-fsn1-0-0, and still work with the cluster by connecting to control-plane-fsn1-0-1. THE CLUSTER DOES NOT GO DOWN. You repair or replace 0-0, and then you can start adding new nodes again. Either way, as long as 0-1 is up, you should normally be able to add new nodes; there is an if-else in the code that generates the configs. Long story short, the cluster stays running; it is of course just better to replace 0-0 if it is down. If it is not behaving like this, then it is a newly introduced bug. In my last tests a few months ago it was still ok. If it is not ok for you, please share detailed error messages and an exact, repeatable procedure.
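The "always odd counts" rule for control planes comes from etcd quorum arithmetic: a cluster of n servers needs a majority (n/2 + 1) to stay writable and therefore tolerates (n-1)/2 simultaneous failures. A quick illustration:

```shell
#!/bin/sh
# Quorum arithmetic behind "always odd counts": n servers need a majority
# (n/2 + 1) to stay writable, and tolerate (n-1)/2 simultaneous failures.
summary=$(for n in 1 2 3 4 5; do
  echo "$n server(s): quorum $(( n / 2 + 1 )), tolerates $(( (n - 1) / 2 )) failure(s)"
done)
printf '%s\n' "$summary"
```

Note that 4 servers tolerate no more failures than 3 do, which is why even counts add cost without adding fault tolerance.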
-
Well, here is my configuration:
Full state:
Next I drain and remove the node control-plane-fsn1 via kubectl.
Next:
Next, make again:
Next, the kubeconfig at the root of the project directory was updated, and the newly created instance initiated a new cluster.
Also, before destroying the 0-0 instance, I tried to create one more instance in the first control plane pool.
Next:
Next:
Next, make again:
But the effect is exactly the same: the instance creates a new cluster.
As an off-topic note: it is a pity that the index suffix within the pool is not appended to the instance names, and a random string is used instead. It is not immediately clear which instance needs to be drained.
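On the naming point: the random suffix can at least be stripped mechanically to recover which nodepool a node belongs to. A small sketch; the node names and the assumption that the suffix is the last dash-separated token are hypothetical, so check against the names `kubectl get nodes` actually reports.

```shell
#!/bin/sh
# Strip the trailing random suffix from kube-hetzner style node names to
# recover the nodepool name (example names are hypothetical).
pool_of() { echo "${1%-*}"; }  # drop the last dash-separated token

pool_of control-plane-fsn1-xkq2   # prints control-plane-fsn1
pool_of agent-normal-7hd9         # prints agent-normal
```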
-
I have a similar problem with the newest version (2.5.3), but in a single-controller setup. Steps I have done:
The result is that the old control plane was successfully removed and the cluster is working without any problems (so far), but i keep getting this error with every
Will be resolved here: #913