MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely #1096
-
Hi folks, we have been running two clusters with kube-hetzner (version 2.7.2) for several months without issues, but now an automatic MicroOS upgrade failed on both clusters. This resulted in one NotReady node per cluster; no kubelet is running there. Because of this error no further upgrades were triggered on the other nodes, so they all still run fine, but with the old MicroOS version. Both affected nodes rebooted with a new kernel 6.6.1-1-default (before it was 6.5.9-1-default). systemctl tells me that the following services failed:

- cloud-init.service
- health-checker.service
- transactional-update.service

The log output of these services looks quite similar on both nodes, so I post the output from just one node here. The log output of transactional-update.service is rather long, so I paste only the part at the end which looks important to me. If more log output is required I am happy to provide it.

Disk space looks good (16G free on /, 134G free on /var). The /.snapshots/ folder is empty on the failed nodes. I have no experience with MicroOS and don't know what to do now. Any help is appreciated! Thanks!

Regards, Jörn
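For reference, a minimal sketch of how the state above was gathered, assuming only the standard systemd, snapper and coreutils tooling that ships with MicroOS (nothing kube-hetzner-specific):

```sh
# List all systemd units that are in a failed state
systemctl --failed

# Show the logs of the failed services for the current boot
journalctl -b -u cloud-init.service
journalctl -b -u health-checker.service
journalctl -b -u transactional-update.service

# Check free disk space on the root and /var filesystems
df -h / /var

# Inspect the btrfs snapshots (empty output here is the suspicious part)
snapper list
ls /.snapshots/
```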
-
Today we did more investigation and I have some information to add:

If you reboot a node in this state again, it is completely broken afterwards! The kernel starts and the network comes up, but the node drops into emergency mode without the ability to log in as root. There is nothing you can do anymore except deleting and re-provisioning the node.

We rebooted another node, which had not rebooted yet because Kured was still holding it back -> same problem. It goes into the same state as the first node on its first reboot. The only difference is that transactional-update had not run there yet. We disabled transactional-update completely on all nodes to prevent further harm (and that turned out to be the right call - see below; a sketch of how we did it is at the end of this comment). So this is a generic problem and not specific to the node in question. It also happened at the same time on two completely different clusters!

Our diagnosis of what happened:
One root cause/symptom is that … This looks like a serious bug to me in the whole transactional-update machinery. It must not run if the health-checker detected a problem and/or …

What makes me wonder is that we are apparently the only ones suffering from this. Did you all have MicroOS upgrades switched off? I can't see anything special in our configuration that could have led to this. To me it looks like the current MicroOS update mechanism is actually broken. I checked their Bugzilla and didn't find anything. I filed a bug report:
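For completeness, this is roughly how we switched the automatic upgrades off on every node. The node list and passwordless SSH as root are assumptions in this sketch; transactional-update.timer is the standard unit name on MicroOS:

```sh
# NODE_IPS is a placeholder for the addresses of all nodes in the cluster
NODE_IPS="10.0.1.1 10.0.1.2 10.0.1.3"

for ip in $NODE_IPS; do
  # Stop and permanently disable the timer that triggers transactional-update
  ssh "root@$ip" "systemctl --now disable transactional-update.timer"
done
```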
-
Another follow-up: I could rescue a node which had not yet run transactional-update after the first reboot:

```sh
# show all snapshots
snapper list

# pick the number from the list which is suffixed with - (the current one) and issue
snapper rollback <nr>

# perform a reboot into this new default snapshot
reboot
```

Next I immediately disabled transactional-update again to prevent further harm and buy time for more investigation:

```sh
systemctl --now disable transactional-update.timer
```

This procedure does not work on the node where transactional-update did run and removed all snapshots. It doesn't matter which snapshot I select from the GRUB menu: emergency mode -> no login possible -> stuck. I will manually delete and then re-create this node with Terraform.
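As a sanity check after the rollback and reboot, a few read-only commands (plain snapper, btrfs and uname; nothing here changes the system) help confirm the node really booted into the intended snapshot:

```sh
# snapper marks the currently active and the default snapshot in its list output
snapper list

# Show which btrfs subvolume is set as the default for the next boot
btrfs subvolume get-default /

# Confirm which kernel version the node actually booted
uname -r
```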
-
This is the recipe we used to repair the broken nodes which had not yet deleted all snapshots from disk. The others we deleted in the Cloud Console and provisioned again with Terraform. We had to go back to the snapshot from before November 12th (which was from May) - all others were broken!

```sh
# set node specs
$ NODE=node-name
$ IP=ip-address

# check / wait until all pods are in ready state
$ k get pods -A

# drain the node in question
$ k drain $NODE --ignore-daemonsets --delete-emptydir-data

# perform a snapper rollback on this node
$ ssh root@$IP
$ snapper list
$ snapper rollback 23
$ reboot

# perform transactional-update to be up-to-date again
$ ssh root@$IP
$ systemctl start transactional-update.service
$ reboot

# activate node again
$ k uncordon $NODE
```

I want to note that due to the read-only-root design of MicroOS neither Kubernetes nor the workloads, containers, data etc. are affected by the snapshot rollback. That's what I like about it. What I still dislike is that an unattended automatic upgrade destroyed two nodes completely by wiping all snapshots from disk. Luckily our production workloads were not affected, because we always had enough healthy nodes to run them.
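A rough sketch of the same recipe condensed into a script, in case more nodes need the treatment. Everything beyond the recipe above is an assumption: the script name, using kubectl directly instead of the k alias, non-interactive SSH as root, passing the snapshot number explicitly, and the crude waiting between steps:

```sh
#!/usr/bin/env sh
# Usage: ./repair-node.sh <node-name> <node-ip> <snapshot-nr>
set -eu

NODE="$1"
IP="$2"
SNAPSHOT="$3"

# Move workloads away from the broken node first
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Roll back to the known-good snapshot and reboot into it
# (the ssh connection dies on reboot, hence the || true)
ssh "root@$IP" "snapper rollback $SNAPSHOT && reboot" || true

# Give the node time to come back, then bring it up to date again
sleep 180
ssh "root@$IP" "systemctl start transactional-update.service && reboot" || true

# Wait for the node to report Ready before putting it back into service
sleep 180
kubectl wait --for=condition=Ready "node/$NODE" --timeout=15m
kubectl uncordon "$NODE"
```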