Skip to content

Conversation

@chrischdi
Copy link
Member

What this PR does / why we need it:

The VerifyMachinesReady is used in places where we think a Cluster is a certain state.

Instead of checking if the Machine's Ready conditions are true, the function currently waits up to five minutes for all Machine's Ready conditions to get true.

The logic currently is as follows:

func VerifyMachinesReady(ctx context.Context, input VerifyMachinesReadyInput) {
machineList := &clusterv1.MachineList{}
// Wait for all machines to have Ready condition set to true.
Eventually(func(g Gomega) {
g.Expect(input.Lister.List(ctx, machineList, client.InNamespace(input.Namespace),
client.MatchingLabels{
clusterv1.ClusterNameLabel: input.Name,
})).To(Succeed())
g.Expect(machineList.Items).ToNot(BeEmpty(), "No machines found for cluster %s", input.Name)
for _, machine := range machineList.Items {
readyConditionFound := false
for _, condition := range machine.Status.Conditions {
if condition.Type == clusterv1.ReadyCondition {
readyConditionFound = true
g.Expect(condition.Status).To(Equal(metav1.ConditionTrue), "The Ready condition on Machine %q should be set to true; message: %s", machine.Name, condition.Message)
g.Expect(condition.Message).To(BeEmpty(), "The Ready condition on Machine %q should have an empty message", machine.Name)
break
}
}
g.Expect(readyConditionFound).To(BeTrue(), "Machine %q should have a Ready condition", machine.Name)
}
}, 5*time.Minute, 10*time.Second).Should(Succeed(), "Failed to verify Machines Ready condition for Cluster %s", klog.KRef(input.Namespace, input.Name))
}

This changes the behavior to only retry the list api calls.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

/area e2e-testing

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/e2e-testing Issues or PRs related to e2e testing labels Nov 20, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 20, 2025
}
}, 5*time.Minute, 10*time.Second).Should(Succeed(), "Failed to list Machines to check the Ready condition for Cluster %s", klog.KRef(input.Namespace, input.Name))

Expect(machineList.Items).ToNot(BeEmpty(), "No machines found for cluster %s", input.Name)
Copy link
Member

@sbueringer sbueringer Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is desirable in that way considering the Prow CI infrastructure.

I would expect a certain amount of flakiness purely because of the infra we run on.

We can give this a try early next cycle though and then keep an eye on k8s-triage

Alternative to this would be to just reduce the timeout to e.g. 30s/1m. It would allow some sort of flakiness but not entire rollout sequences in progress.

Side note: In general we probably want to do the same for VerifyClusterCondition / VerifyClusterAvailable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/e2e-testing Issues or PRs related to e2e testing cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants