
Conversation

@sunzhaohua2
Contributor

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 29, 2025
@openshift-ci-robot

openshift-ci-robot commented Apr 29, 2025

@sunzhaohua2: This pull request references OCPCLOUD-2555 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.


In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from JoelSpeed and sub-mod April 29, 2025 03:46
@openshift-ci
Contributor

openshift-ci bot commented Apr 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sunzhaohua2
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2025
@sunzhaohua2
Contributor Author

/hold cancel

@sunzhaohua2
Contributor Author

/retest

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2025
@sunzhaohua2
Contributor Author

@damdo can you help take a look when you get time? Thanks!

@theobarberbany
Contributor

theobarberbany commented Jun 5, 2025

Heya @sunzhaohua2 :)

What we have here is useful, but if we look at the CI run we see it's passing in 0-1s.

I think a lot of this is already covered by unit testing, so we might want to focus our efforts a little differently.

IMO, here for e2es we want to test the behaviour of the MAPI controllers (machine, machineset, etc), CAPI controllers (machine, machineset, etc), and our migration controllers all interacting together - which we can't do with unit tests :) This is because in our unit tests (e.g for the MachineSync controllers) we're not running any of the CAPI or MAPI controllers.

So the tests we write here want to check things like deletion logic, and that machines, templates and so on end up running / created without status errors (they're healthy, not merely existing and accepted by the API server).

OCPBUGS-56897 is a good example of a bug that could nicely be caught by an e2e test :) We would want to test the case where the MAPI machine set goes away, and assert that the CAPI machines do not change (e.g. the UUID remains the same).

I think we want to work out a base set of e2e tests that give us broad coverage, and start with adding only those, and then increment :) I'll try and give this some thought!

one example may be something like:

MAPI auth: create a machineset, observe the CAPI machine + machineset mirror come up healthy, switch auth and see the other controllers take over, scale up / down and see machines work, then delete.
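A rough Ginkgo outline of that flow, reusing helper names that appear elsewhere in this PR (createMAPIMachineSetWithAuthoritativeAPI, verifyMAPIHasCAPIMirror, framework.ScaleMachineSet); the exact signatures, wait behaviour and trailing assertions here are my assumptions rather than the PR's actual code:

var _ = Describe("MachineSet migration happy path", Ordered, func() {
    const name = "migration-e2e" // placeholder name

    It("creates a MAPI-authoritative MachineSet", func() {
        _, err := createMAPIMachineSetWithAuthoritativeAPI(ctx, cl, 1, name,
            machinev1beta1.MachineAuthorityMachineAPI, machinev1beta1.MachineAuthorityMachineAPI)
        Expect(err).ToNot(HaveOccurred(), "failed to create MAPI MachineSet %s", name)
    })

    It("brings up a healthy, paused CAPI mirror", func() {
        verifyMAPIHasCAPIMirror(cl, name)
        verifyCAPIPausedCondition(cl, name, machinev1beta1.MachineAuthorityMachineAPI)
    })

    It("switches authority to ClusterAPI and the CAPI controllers take over", func() {
        mapiMachineSet, err := mapiframework.GetMachineSet(ctx, cl, name)
        Expect(err).ToNot(HaveOccurred())
        Eventually(komega.Update(mapiMachineSet, func() {
            mapiMachineSet.Spec.AuthoritativeAPI = machinev1beta1.MachineAuthorityClusterAPI
        })).Should(Succeed())
        verifyMachineSetAuthoritative(ctx, cl, name, machinev1beta1.MachineAuthorityClusterAPI)
    })

    It("scales up and down and deletes cleanly", func() {
        framework.ScaleMachineSet(name, 2)
        framework.ScaleMachineSet(name, 1)
        // machine-health and deletion assertions elided in this sketch
    })
})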

@sunzhaohua2
Contributor Author

(quoting @theobarberbany's comment above in full)

Thanks for your suggestion, good advice! I saw Joel mention there will be an e2e doc; I will refer to it to make adjustments.

Contributor

@JoelSpeed left a comment

Having skimmed through the test cases, I think even though a lot of these are duplicating unit testing, there is still value in having these within the E2Es as well. They are quick and don't cost us much, and, if they were to regress somehow (e.g. through something else changing in the cluster), would mean that periodics could be configured to pick up these changes, which we would otherwise only find when running a PR against the cluster.

The bigger e2e cases that Theo mentioned will provide the majority of our E2E value I think, but this is also worthwhile continuing to work on IMO

return apiUrl.Hostname(), int32(port), nil
}

// IsTechPreviewNoUpgrade checks if a cluster is a TechPreviewNoUpgrade cluster
Contributor

You should check for the presence of a particular feature gate being enabled in the feature gate status rather than checking tech preview directly.

If we can use the openshift tests extension (I think we got that?), then naming tests based on the feature gate name will be sufficient and the test extension will skip any tests that aren't enabled for us

I realise we don't have a great feature gate for this feature right now, I'll take an action item to create that so that we can start moving on with this
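For illustration, a feature-gate check along those lines could look like the sketch below; the gate name MachineAPIMigration is a placeholder until the real gate exists, and the assumed imports are context, fmt, configv1 (github.com/openshift/api/config/v1) and client (sigs.k8s.io/controller-runtime/pkg/client):

func isFeatureGateEnabled(ctx context.Context, cl client.Client, name configv1.FeatureGateName) (bool, error) {
    fg := &configv1.FeatureGate{}
    if err := cl.Get(ctx, client.ObjectKey{Name: "cluster"}, fg); err != nil {
        return false, fmt.Errorf("failed to get FeatureGate/cluster: %w", err)
    }

    // Status.FeatureGates is reported per payload version; any entry listing the
    // gate as enabled is enough to decide whether to run or skip the suite.
    for _, details := range fg.Status.FeatureGates {
        for _, enabled := range details.Enabled {
            if enabled.Name == name {
                return true, nil
            }
        }
    }

    return false, nil
}

A suite could then skip itself when the (placeholder) gate is not enabled, e.g. if enabled, _ := isFeatureGateEnabled(ctx, cl, "MachineAPIMigration"); !enabled { Skip("feature gate not enabled") }.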

Contributor Author

I updated it to check the feature gate.
Yes, we have the openshift tests extension; I raised a PR for it before and will check that again.

@sunzhaohua2 force-pushed the migrate branch 3 times, most recently from 86cea52 to 3b5809a on June 9, 2025 at 14:17
@sunzhaohua2
Contributor Author

/retest

@sunzhaohua2
Contributor Author

@JoelSpeed @theobarberbany thanks for helping review. For the duplication with the unit tests, I updated the tests to use By instead of It.
I automated the cases basically following the outline below; taking machineAPI as an example, for now I only added Create MAPI MachineSet with specAPI: MAPI.

capi machineset with the same name does not exist

  • Create first, check that all states are correct
  • update
    • scale up/down on both sides
    • modify spec/template fields on both sides
    • add label/annotation on both sides
    • delete machine
  • change to clusterAPI
    • scale up/down on both sides
    • modify spec/template fields on both sides
    • add label/annotation on both sides
    • delete machine
  • change to machineAPI
    • create a new mapi machineset
    • delete machinesets on both sides

create a capi machineset with the same name, then create a mapi machineset, which should be rejected

In the E2E plan Dam wrote: Create MAPI MachineSet With specAPI: CAPI (and no existing CAPI MSet with that name) - this is possible to do at the moment, but shouldn't be. Does this mean we don't need to add automation for creating a MAPI MachineSet with specAPI: CAPI? If so, do we need to automate all the specAPI: CAPI cases after changing machineAPI to clusterAPI?

@JoelSpeed
Contributor

In the E2E plan Dam wrote: Create MAPI MachineSet With specAPI: CAPI (and no existing CAPI MSet with that name) - this is possible to do at the moment, but shouldn't be. Does this mean we don't need to add automation for creating a MAPI MachineSet with specAPI: CAPI? If so, do we need to automate all the specAPI: CAPI cases after changing machineAPI to clusterAPI?

I'm not sure if that's true.

I'd agree that creating a MAPI MachineSet with specAPI MAPI shouldn't be allowed when there is already a CAPI MachineSet. But in the case outlined here, I would expect the MachineSet to be created, and then a CAPI MachineSet to be created as a mirror of the MAPI MachineSet.

The mirror CAPI MachineSet would do the work, and the MAPI MachineSet would always be paused.

In a lot of the actions/scenarios, we care specifically about creating MachineSets, and less about the updates.
Once both MAPI and CAPI MachineSets exist, the behaviours are easier to reason about: writes to non-authoritative resources should be rejected for the most part, and changes to the authoritative resource should be reflected onto the non-authoritative resource.

@JoelSpeed
Contributor

I suspect that creating a suite of tests that specifically check the creation mechanics and what happens, and then a separate suite of tests that check what happens once you already have the existing MachineSets, would make the most sense

@sunzhaohua2 changed the title from "OCPCLOUD-2555: e2e testing automation for machineapi authority" to "OCPCLOUD-2555: machineset migration e2e" on Jun 17, 2025
@miyadav
Member

miyadav commented Jun 18, 2025

@sunzhaohua2, can we update the title to use OCPCLOUD-2992? Please review if that would be correct.

@sunzhaohua2 changed the title from "OCPCLOUD-2555: machineset migration e2e" to "OCPCLOUD-2992: machineset migration e2e" on Jun 18, 2025
@openshift-ci-robot

openshift-ci-robot commented Jun 18, 2025

@sunzhaohua2: This pull request references OCPCLOUD-2992 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sunzhaohua2 force-pushed the migrate branch 3 times, most recently from 4bd3719 to 5065f59 on June 19, 2025 at 05:46
@sunzhaohua2
Contributor Author

Hi @JoelSpeed @theobarberbany, can you help review this?

@openshift-ci
Contributor

openshift-ci bot commented Jun 19, 2025

@sunzhaohua2: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-ovn-techpreview 10653bf link false /test e2e-azure-ovn-techpreview
ci/prow/unit 10653bf link true /test unit
ci/prow/security 10653bf link false /test security
ci/prow/vendor 10653bf link true /test vendor
ci/prow/e2e-aws-capi-techpreview 10653bf link true /test e2e-aws-capi-techpreview

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@JoelSpeed left a comment

When writing the CPMS E2Es, we used https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/envtest/komega to wrap a lot of the fetching of resources, in particular the things like checking the spec/status of a certain field become

Eventually(komega.Object(machine)).Should(HaveField("Status.AuthoritativeAPI", Equal("ClusterAPI")))

Updating objects is also fairly simple, and uses retry logic to make sure we are not susceptible to flakes

Eventually(komega.Update(machine, func() {
  machine.Spec.AuthoritativeAPI = "ClusterAPI"
})).Should(Succeed())

I wonder if we could avoid a lot of boiler plate if we adopted the same library here as well?

This PR is also huge! I wonder if we could break it down into stages, start by merging the framework and util bits and then different sections of the test suites. That might make it a bit easier to review/iterate on.

Also, where I've made suggestions, please see if those might apply in other places. A lot of transforms here could be replaced simply with HaveField. A lot of the fetching of objects could use komega combined with HaveField to extract the fields we care about
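For example (illustrative only, reusing the wait constants that already appear in this PR), the replica check further down in this review could collapse to:

Eventually(komega.Object(mapiMachineSet), framework.WaitShort, framework.RetryShort).
    Should(HaveField("Spec.Replicas", HaveValue(Equal(int32(2)))), "replicas should change to 2")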

func DeleteMachines(cl client.Client, machines ...*clusterv1.Machine) error {
return wait.PollUntilContextTimeout(ctx, RetryShort, time.Minute, true, func(ctx context.Context) (bool, error) {
for _, machine := range machines {
if err := cl.Delete(ctx, machine); err != nil {
Contributor

What if the object isn't found? Would this loop forever?
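One way to avoid that (a sketch, assuming apierrors is k8s.io/apimachinery/pkg/api/errors) is to treat NotFound as already deleted inside the poll:

if err := cl.Delete(ctx, machine); err != nil && !apierrors.IsNotFound(err) {
    return false, nil // retry on transient errors; a NotFound machine counts as deleted
}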

_, mapiDefaultProviderSpec := getDefaultAWSMAPIProviderSpec(cl)
createAWSClient(mapiDefaultProviderSpec.Placement.Region)
awsMachineTemplate = newAWSMachineTemplate(mapiDefaultProviderSpec)
if err := cl.Create(ctx, awsMachineTemplate); err != nil && !apierrors.IsAlreadyExists(err) {
Contributor

Why would it already exist?

Comment on lines +107 to +113
It("should MAPI MachineSet is authoritative and create the CAPI MachineSet mirror", func() {
verifyMachineSetAuthoritative(ctx, cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
verifySynchronizedCondition(ctx, cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
verifyMAPIPausedCondition(ctx, cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
verifyMAPIHasCAPIMirror(cl, machineSetNameMAPI)
verifyCAPIPausedCondition(cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
})
Contributor

Each of these could be separate It statements, as that would give us a clearer signal on which of the expectations is failing.
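For illustration, using the helpers from the excerpt above in an Ordered container:

Context("a MAPI-authoritative MachineSet", Ordered, func() {
    It("is reported as MachineAPI authoritative", func() {
        verifyMachineSetAuthoritative(ctx, cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
    })

    It("has the Synchronized condition set", func() {
        verifySynchronizedCondition(ctx, cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
    })

    It("is paused on the MAPI side", func() {
        verifyMAPIPausedCondition(ctx, cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
    })

    It("has a paused CAPI mirror", func() {
        verifyMAPIHasCAPIMirror(cl, machineSetNameMAPI)
        verifyCAPIPausedCondition(cl, machineSetNameMAPI, machinev1beta1.MachineAuthorityMachineAPI)
    })
})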

})
})
/*
Context("With specAPI: CAPI and EXISTING CAPI MSet with that name", func() {
Contributor

Should the CAPI MS be created in a BeforeAll inside this block?
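Something like the following, noting that BeforeAll only works inside an Ordered container, and that createCAPIMachineSet is a placeholder for whatever helper ends up creating the CAPI MachineSet:

Context("With specAPI: CAPI and EXISTING CAPI MSet with that name", Ordered, func() {
    BeforeAll(func() {
        // Placeholder helper: create the CAPI MachineSet once, so every It in this
        // block starts from "a CAPI MachineSet with that name already exists".
        capiMachineSet = createCAPIMachineSet(ctx, cl, capiMachineSetNameCAPI)
    })
    // ... Its exercising MAPI MachineSet creation against the existing mirror
})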

Comment on lines +144 to +170
//By("Creating a same name MAPI MachineSet")
mapiMachineSetCAPI, err := createMAPIMachineSetWithAuthoritativeAPI(ctx, cl, 0, capiMachineSetNameCAPI, machinev1beta1.MachineAuthorityMachineAPI, machinev1beta1.MachineAuthorityMachineAPI)
Expect(err).ToNot(HaveOccurred(), "failed to create mapiMachineSet %s", mapiMachineSetCAPI)
By("Verify the MAPI MachineSet is Paused")
verifyMAPIPausedCondition(ctx, cl, capiMachineSetNameCAPI, machinev1beta1.MachineAuthorityClusterAPI)
By("Verify the MAPI Spec be updated to reflect existing CAPI mirror")
mapiMachineSetCAPI, err = mapiframework.GetMachineSet(ctx, cl, machineSetNameCAPI)
Expect(err).ToNot(HaveOccurred(), "failed to get mapiMachineSet %s", machineSetNameCAPI)
providerSpec := mapiMachineSetCAPI.Spec.Template.Spec.ProviderSpec
Expect(providerSpec.Value).NotTo(BeNil())
Expect(providerSpec.Value.Raw).NotTo(BeEmpty())
var awsConfig machinev1beta1.AWSMachineProviderConfig
err = json.Unmarshal(providerSpec.Value.Raw, &awsConfig)
Expect(err).NotTo(HaveOccurred(), "Failed to unmarshal ProviderSpec.Value")
Expect(awsConfig.InstanceType).To(
SatisfyAny(
BeEmpty(),
Equal("m5.large"),
),
"Unexpected instanceType: %s",
awsConfig.InstanceType,
)
Contributor

The different expectations here could be broken into separate ordered It expectations

return *mapiMachineSet.Spec.Replicas
}, framework.WaitShort, framework.RetryShort).Should(Equal(int32(2)), "replicas should change to 2")

By("Verify there is a non-authoritative, paused MAPI Machine mirror for the new CAPI Machine")
Contributor

We need to make sure the original machine is MAPI authoritative at some point in the setup
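One way to assert that in the setup (a sketch reusing the komega pattern suggested above; mapiMachine stands for whichever original machine the test picks):

Eventually(komega.Object(mapiMachine), framework.WaitShort, framework.RetryShort).
    Should(HaveField("Status.AuthoritativeAPI", Equal(machinev1beta1.MachineAuthorityMachineAPI)),
        "original machine should be MAPI authoritative before the test continues")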

Comment on lines +827 to +829
mapiMachines, err := mapiframework.GetMachinesFromMachineSet(ctx, cl, mapiMachineSet)
Expect(err).ToNot(HaveOccurred(), "Failed to get mapiMachineSet %s", machineSetNameMAPI)
Expect(mapiMachines).NotTo(BeEmpty(), "Machines should remain")
Contributor

The MAPI MachineSet should not exist anymore? So I think instead we check that MAPI machines still exist in the MAPI namespace with the same names as the CAPI machines, WDYT?
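A sketch of that check (capiMachines standing for the CAPI Machines fetched earlier in the test; the openshift-machine-api namespace is my assumption here):

for _, capiMachine := range capiMachines {
    mapiMachine := &machinev1beta1.Machine{}
    key := client.ObjectKey{Namespace: "openshift-machine-api", Name: capiMachine.Name}
    Expect(cl.Get(ctx, key, mapiMachine)).To(Succeed(),
        "expected a MAPI Machine mirror named %s to remain after the MachineSet is gone", capiMachine.Name)
}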

Comment on lines +872 to +886
It("should be rejected when scaling CAPI mirror", func() {
By("Scaling up CAPI MachineSet to 1")
framework.ScaleMachineSet(machineSetNameMAPI, 1)
capiMachineSet, err := framework.GetMachineSet(cl, machineSetNameMAPI)
Expect(err).ToNot(HaveOccurred(), "Failed to get capiMachineSet %s", capiMachineSet)

Eventually(func() int32 {
capiMachineSet, err := framework.GetMachineSet(cl, machineSetNameMAPI)
Expect(err).ToNot(HaveOccurred(), "Failed to get capiMachineSet %s", machineSetNameMAPI)
return *capiMachineSet.Spec.Replicas
}, framework.WaitShort, framework.RetryShort).Should(Equal(int32(0)), "replicas should eventually revert to 0")
mapiMachineSet, err := mapiframework.GetMachineSet(ctx, cl, machineSetNameMAPI)
Expect(err).ToNot(HaveOccurred(), "Failed to get mapiMachineSet %s", machineSetNameMAPI)
Expect(*mapiMachineSet.Spec.Replicas).To(Equal(int32(0)), "replicas should remain 0")
})
Contributor

This should get rejected by a VAP eventually

Expect(*mapiMachineSet.Spec.Replicas).To(Equal(int32(0)), "replicas should remain 0")
})

It("should be rejected when updating CAPI mirror spec", func() {
Contributor

Should be rejected by VAP eventually

Comment on lines +1010 to +1011
// sleep for 30s to make sure mirror machineset be created
time.Sleep(30 * time.Second)
Contributor

Could we use an Eventually to check for it to be created?
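For example (sketch only; the openshift-cluster-api namespace and the wait constants are assumptions), komega.Get gives an Eventually-friendly existence check:

mirror := &clusterv1.MachineSet{
    ObjectMeta: metav1.ObjectMeta{Name: machineSetNameMAPI, Namespace: "openshift-cluster-api"},
}
Eventually(komega.Get(mirror), framework.WaitShort, framework.RetryShort).Should(Succeed(),
    "expected the CAPI mirror MachineSet to be created")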

@sunzhaohua2
Contributor Author

I wonder if we could break it down into stages, start by merging the framework and util bits and then different sections of the test suites. That might make it a bit easier to review/iterate on.

Thank you for all the suggestions. I raised a new PR for the utils, #323, please help take a look. I will modify the test suites and submit them one by one later.
