OCPCLOUD-2992: machineset migration e2e #287

sunzhaohua2 · 2025-04-29T03:44:46Z

No description provided.

openshift-ci-robot · 2025-04-29T03:44:50Z

@sunzhaohua2: This pull request references OCPCLOUD-2555 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-04-29T03:46:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sunzhaohua2 · 2025-04-29T11:34:14Z

/hold

sunzhaohua2 · 2025-05-06T08:01:12Z

/hold cancel

sunzhaohua2 · 2025-05-06T08:01:23Z

/retest

sunzhaohua2 · 2025-05-22T08:36:56Z

@damdo can you help to take a look when you got time, thanks!

theobarberbany · 2025-06-05T13:26:22Z

Heya @sunzhaohua2 :)

What we have here is useful, but if we look at the CI run we see it's passing in 0-1s.

I think a lot of this is already covered by unit testing, so we might want to focus our efforts a little differently.

IMO, here for e2es we want to test the behaviour of the MAPI controllers (machine, machineset, etc), CAPI controllers (machine, machineset, etc), and our migration controllers all interacting together - which we can't do with unit tests :) This is because in our unit tests (e.g for the MachineSync controllers) we're not running any of the CAPI or MAPI controllers.

So the tests we write here want to check things like deletion logic, showing machines templates and so on end up running / created without status errors (they're healthy - not that they merely exist and are accepted by the api server) and so on.

OCPBUGS-56897 this is a good example of a bug that could nicely be caught by an e2e test :) We would want to test the case where the MAPI machine set goes away, and we assert the CAPI machines do not change (e.g the UUID remains the same consistently, for example).

I think we want to work out a base set of e2e tests that give us broad coverage, and start with adding only those, and then increment :) I'll try and give this some thought!

one example may be something like:

mapi auth , create machineset, observe capi machine + machineset mirror come up healthy, switch auth see other controllers take over, scale up / down see machines work and delete

sunzhaohua2 · 2025-06-06T07:59:42Z

Heya @sunzhaohua2 :)

What we have here is useful, but if we look at the CI run we see it's passing in 0-1s.

I think a lot of this is already covered by unit testing, so we might want to focus our efforts a little differently.

IMO, here for e2es we want to test the behaviour of the MAPI controllers (machine, machineset, etc), CAPI controllers (machine, machineset, etc), and our migration controllers all interacting together - which we can't do with unit tests :) This is because in our unit tests (e.g for the MachineSync controllers) we're not running any of the CAPI or MAPI controllers.

So the tests we write here want to check things like deletion logic, showing machines templates and so on end up running / created without status errors (they're healthy - not that they merely exist and are accepted by the api server) and so on.

OCPBUGS-56897 this is a good example of a bug that could nicely be caught by an e2e test :) We would want to test the case where the MAPI machine set goes away, and we assert the CAPI machines do not change (e.g the UUID remains the same consistently, for example).

I think we want to work out a base set of e2e tests that give us broad coverage, and start with adding only those, and then increment :) I'll try and give this some thought!

one example may be something like:

mapi auth , create machineset, observe capi machine + machineset mirror come up healthy, switch auth see other controllers take over, scale up / down see machines work and delete

Thanks for your suggestion, good advice! I saw Joel said there will be an e2e doc, I will refer to the doc to make adjustments.

JoelSpeed

Having skimmed through the test cases, I think even though a lot of these are duplicating unit testing, there is still value in having these within the E2Es as well. They are quick and don't cost us much, and, if they were to regress somehow (e.g. through something else changing in the cluster), would mean that periodics could be configured to pick up these changes, which we would otherwise only find when running a PR against the cluster.

The bigger e2e cases that Theo mentioned will provide the majority of our E2E value I think, but this is also worthwhile continuing to work on IMO

JoelSpeed · 2025-06-06T12:15:23Z

e2e/framework/util.go

 	return apiUrl.Hostname(), int32(port), nil
 }
+
+// IsTechPreviewNoUpgrade checks if a cluster is a TechPreviewNoUpgrade cluster


You should check for the presence of a particular feature gate being enabled in the feature gate status rather than checking tech preview directly.

If we can use the openshift tests extension (I think we got that?), then naming tests based on the feature gate name will be sufficient and the test extension will skip any tests that aren't enabled for us

I realise we don't have a great feature gate for this feature right now, I'll take an action item to create that so that we can start moving on with this

I updated to check feature gate.
Yes, we got openshift tests extension, raised a pr before, will check that again.

sunzhaohua2 · 2025-06-10T02:21:15Z

/retest

sunzhaohua2 · 2025-06-10T06:37:26Z

@JoelSpeed @theobarberbany thanks for helping review, for the dup with unit I updated with By instead of It
I automated the cases basically following below, take machineapi as an example, now only added Create MAPI MachineSet with specAPI: MAPI

capi machineset with the same name does not exist

Create first, check that all states are correct
update
- scale up/down on both sides
- modify spec/template fields on both sides
- add label/annotation on both sides
- delete machine
change to clusterAPI
- scale up/down on both sides
- modify spec/template fields on both sides
- add label/annotation on both sides
- delete machine
change to machineAPI
- create a new mapi machineset
- delete machinesets on both sides

create a capi machineset with the same name, then create a mapi machineset, which should be rejected

In the E2E plan Dam wrote Create MAPI MachineSet With specAPI: CAPI (and no existing CAPI MSet with that name) This is possible to do at the moment, but shouldn’t, this means we don't need to add automation for creating MAPI MachineSet With specAPI: CAPI? If so, we need to auto all specAPI: CAPI cases after changing machineAPI to clusterAPI?

JoelSpeed · 2025-06-10T11:46:22Z

In the E2E plan Dam wrote Create MAPI MachineSet With specAPI: CAPI (and no existing CAPI MSet with that name) This is possible to do at the moment, but shouldn’t, this means we don't need to add automation for creating MAPI MachineSet With specAPI: CAPI? If so, we need to auto all specAPI: CAPI cases after changing machineAPI to clusterAPI?

I'm not sure if that's true.

I'd agree that creating a MAPI MachineSet with specAPI MAPI shouldn't be allowed when there is already a CAPI MachineSet. But in the case outlined here, I would expect the MachineSet to be created, and then a CAPI MachineSet to be created as a mirror of the MAPI MachineSet.

The mirror CAPI MachineSet would do the work, and the MAPI MachineSet would always be paused.

In a lot of the actions/scenarios, we care specifically about creating MachineSets, and less about the updates.
Once both MAPI and CAPI MachineSets exist, the behaviours are easier to reason about, writes to non-authoritative resources should be rejected in the most part, and changes to the authoritative resource should be reflected to the non-authoritative resource

JoelSpeed · 2025-06-10T11:47:03Z

I suspect that creating a suite of tests that specifically check the creation mechanics and what happens, and then a separate suite of tests that check what happens once you already have the existing MachineSets, would make the most sense

miyadav · 2025-06-18T05:40:47Z

@sunzhaohua2 , can we update the title to use - OCPCLOUD-2992 , Please review if that would be correct .

openshift-ci-robot · 2025-06-18T07:46:25Z

@sunzhaohua2: This pull request references OCPCLOUD-2992 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

sunzhaohua2 · 2025-06-19T10:29:50Z

Hi @JoelSpeed @theobarberbany can you help to review this?

openshift-ci · 2025-06-19T11:09:27Z

@sunzhaohua2: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-azure-ovn-techpreview	`10653bf`	link	false	`/test e2e-azure-ovn-techpreview`
ci/prow/unit	`10653bf`	link	true	`/test unit`
ci/prow/security	`10653bf`	link	false	`/test security`
ci/prow/vendor	`10653bf`	link	true	`/test vendor`
ci/prow/e2e-aws-capi-techpreview	`10653bf`	link	true	`/test e2e-aws-capi-techpreview`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

JoelSpeed

When writing the CPMS E2Es, we used https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/envtest/komega to wrap a lot of the fetching of resources, in particular the things like checking the spec/status of a certain field become

Eventually(komega.Object(machine)).Should(HaveField("Status.AuthoritativeAPI", Equal("ClusterAPI"))

Updating objects is also fairly simple, and uses retry logic to make sure we are not susceptible to flakes

Eventually(komega.Update(machine, func() {
  machine.Spec.AuthoritativeAPI = "ClusterAPI"
}).Should(Succeed())
}

I wonder if we could avoid a lot of boiler plate if we adopted the same library here as well?

This PR is also huge! I wonder if we could break it down into stages, start by merging the framework and util bits and then different sections of the tests suites. Might make it a bit easier to review/iterate on.

Also, where I've made suggestions, please see if those might apply in other places. A lot of transforms here could be replaced simply with HaveField. A lot of the fetching of objects could use komega combined with HaveField to extract the fields we care about

JoelSpeed · 2025-06-20T11:00:44Z

e2e/framework/machineset.go

+func DeleteMachines(cl client.Client, machines ...*clusterv1.Machine) error {
+	return wait.PollUntilContextTimeout(ctx, RetryShort, time.Minute, true, func(ctx context.Context) (bool, error) {
+		for _, machine := range machines {
+			if err := cl.Delete(ctx, machine); err != nil {


What if the object isn't found? Would this loop forever?

JoelSpeed · 2025-06-20T11:08:22Z