-During a runtime upgrade, before a BareMetalMachine is shut down for reimaging, the MachineLifecycleController cordons and attempts to drain the VirtualMachine resources that underpin the KubernetesCluster Nodes scheduled to that BareMetalMachine. Once the BareMetalMachine has been brought back up, the VirtualMachines that ran on the host are expected to reboot, reschedule to the BareMetalMachine, and then be uncordoned and become `Ready`. A race condition may occur in which the MachineLifecycleController fails to find the virt-launcher pods responsible for scheduling VirtualMachines to the appropriate BareMetalMachines, because those pods have not yet been scheduled themselves. Our hypothesis is that the OS image pull job associated with a virt-launcher pod has not yet completed when the MachineLifecycleController looks for that pod while executing the uncordon action. This problem should appear only during uncordon actions, infrequently on small clusters but frequently on large clusters, because multiple concurrent image pulls lengthen scheduling times.
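
The race described above can be illustrated with a small polling sketch. This is not the MachineLifecycleController's implementation. It only shows, under stated assumptions, how a single lookup can miss virt-launcher pods that are still waiting on an image pull, and how polling with a deadline would tolerate the delay. The function name `wait_for_launcher_pods`, the `list_scheduled` callback, and the timeout values are all hypothetical.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_launcher_pods(
    list_scheduled: Callable[[], Iterable[str]],
    expected_vms: Set[str],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Poll until every expected VM has a scheduled launcher pod.

    list_scheduled is a hypothetical callback returning the names of VMs
    whose launcher pod is currently scheduled. Returns the set of VMs
    still missing a launcher pod when polling stops; an empty set means
    all were found before the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        # A one-shot check is where the race bites: pods blocked on a
        # slow image pull are simply absent from this listing.
        missing = set(expected_vms) - set(list_scheduled())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval_s)
```

A caller would proceed with the uncordon only when the returned set is empty, and otherwise report which VirtualMachines were still waiting. On a large cluster with many concurrent image pulls, the deadline would presumably need to be longer than on a small one.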