MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely #1096
-
Hi folks, we have been running two clusters with kube-hetzner (version 2.7.2) for several months without issues, but now an automatic MicroOS upgrade failed on both clusters. This resulted in one NotReady node per cluster; no kubelet is running there. Because of this error no further upgrades were triggered on the other nodes, so they all still run fine, but with the old MicroOS version. Both affected nodes rebooted with a new kernel 6.6.1-1-default (before it was 6.5.9-1-default). systemctl tells me that the following services failed:

- cloud-init.service
- health-checker.service
- transactional-update.service

The log output of these services looks quite similar on both nodes, so I post the output from just one node here. The log output of transactional-update.service is rather long, so I paste only the part at the end which looks important to me. If more log output is required I am happy to provide it.

Disk space looks good (16G free on /, 134G free on /var). The /.snapshots/ folder is empty on the failed nodes. I have no experience with MicroOS and don't know what to do now. Any help is appreciated! Thanks!

Regards, Jörn
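For reference, a minimal sketch of how the state above was gathered, assuming only the standard systemd, snapper and coreutils tooling that ships with MicroOS (nothing kube-hetzner-specific):

```sh
# List all systemd units that are in a failed state
systemctl --failed

# Show the logs of the failed services for the current boot
journalctl -b -u cloud-init.service
journalctl -b -u health-checker.service
journalctl -b -u transactional-update.service

# Check free disk space on the root and /var filesystems
df -h / /var

# Inspect the btrfs snapshots (empty output here is the suspicious part)
snapper list
ls /.snapshots/
```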
-
Today we did more investigation and I have some information to add:

If you reboot a node in this state again, it is completely broken afterwards! The kernel starts and the network comes up, but the node drops into emergency mode without the ability to log in as root. There is nothing you can do anymore except deleting and re-provisioning the node.

We rebooted another node, which had not rebooted yet because Kured was still holding it back -> same problem. It goes into the same state as the first node on its first reboot. The only difference is that transactional-update had not run there yet. We disabled transactional-update completely on all nodes to prevent further harm (and that turned out to be the right call - see below; a sketch of how we did it is at the end of this comment). So this is a generic problem and not specific to the node in question. It also happened at the same time on two completely different clusters!

Our diagnosis of what happened:
One root cause/symptom is that … This looks like a serious bug to me in the whole transactional-update machinery. It must not run if the health-checker detected a problem and/or …

What makes me wonder is that we are apparently the only ones suffering from this. Did you all have MicroOS upgrades switched off? I can't see anything special in our configuration that could have led to this. To me it looks like the current MicroOS update mechanism is actually broken. I checked their Bugzilla and didn't find anything. I filed a bug report:
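For completeness, this is roughly how we switched the automatic upgrades off on every node. The node list and passwordless SSH as root are assumptions in this sketch; transactional-update.timer is the standard unit name on MicroOS:

```sh
# NODE_IPS is a placeholder for the addresses of all nodes in the cluster
NODE_IPS="10.0.1.1 10.0.1.2 10.0.1.3"

for ip in $NODE_IPS; do
  # Stop and permanently disable the timer that triggers transactional-update
  ssh "root@$ip" "systemctl --now disable transactional-update.timer"
done
```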
-
Another follow-up: I could rescue a node which had not yet run transactional-update after the first reboot:

```sh
# show all snapshots
snapper list

# pick the number from the list which is suffixed with - (the current one) and issue
snapper rollback <nr>

# perform a reboot into this new default snapshot
reboot
```

Next I immediately disabled transactional-update again to prevent further harm and buy time for more investigation:

```sh
systemctl --now disable transactional-update.timer
```

This procedure does not work on the node where transactional-update did run and removed all snapshots. It doesn't matter which snapshot I select from the GRUB menu: emergency mode -> no login possible -> stuck. I will manually delete and then re-create this node with Terraform.
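As a sanity check after the rollback and reboot, a few read-only commands (plain snapper, btrfs and uname; nothing here changes the system) help confirm the node really booted into the intended snapshot:

```sh
# snapper marks the currently active and the default snapshot in its list output
snapper list

# Show which btrfs subvolume is set as the default for the next boot
btrfs subvolume get-default /

# Confirm which kernel version the node actually booted
uname -r
```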
-
This is the recipe we used to repair the broken nodes which had not yet deleted all snapshots from disk. The others we deleted in the Cloud Console and provisioned again with Terraform. We had to go back to the snapshot from before November 12th (which was from May) - all others were broken!

```sh
# set node specs
$ NODE=node-name
$ IP=ip-address

# check / wait until all pods are in ready state
$ k get pods -A

# drain the node in question
$ k drain $NODE --ignore-daemonsets --delete-emptydir-data

# perform a snapper rollback on this node
$ ssh root@$IP
$ snapper list
$ snapper rollback 23
$ reboot

# perform transactional-update to be up-to-date again
$ ssh root@$IP
$ systemctl start transactional-update.service
$ reboot

# activate node again
$ k uncordon $NODE
```

I want to note that due to the read-only-root design of MicroOS neither Kubernetes nor the workloads, containers, data etc. are affected by the snapshot rollback. That's what I like about it. What I still dislike is that an unattended automatic upgrade destroyed two nodes completely by wiping all snapshots from disk. Luckily our production workloads were not affected, because we always had enough healthy nodes to run them.
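A rough sketch of the same recipe condensed into a script, in case more nodes need the treatment. Everything beyond the recipe above is an assumption: the script name, using kubectl directly instead of the k alias, non-interactive SSH as root, passing the snapshot number explicitly, and the crude waiting between steps:

```sh
#!/usr/bin/env sh
# Usage: ./repair-node.sh <node-name> <node-ip> <snapshot-nr>
set -eu

NODE="$1"
IP="$2"
SNAPSHOT="$3"

# Move workloads away from the broken node first
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Roll back to the known-good snapshot and reboot into it
# (the ssh connection dies on reboot, hence the || true)
ssh "root@$IP" "snapper rollback $SNAPSHOT && reboot" || true

# Give the node time to come back, then bring it up to date again
sleep 180
ssh "root@$IP" "systemctl start transactional-update.service && reboot" || true

# Wait for the node to report Ready before putting it back into service
sleep 180
kubectl wait --for=condition=Ready "node/$NODE" --timeout=15m
kubectl uncordon "$NODE"
```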