
Refactor controllers#51

Merged
hrak merged 45 commits into develop from refactor_controllers
Feb 25, 2025

Conversation

@hrak (Member) commented Feb 7, 2025

Issue #, if available:

Description of changes:

This PR refactors the controllers entirely to not use the base reconciler pattern but instead use a more common approach used in CAPI infrastructure providers. It also adds a scope package and client factory for easier testing and updates CAPI to 1.8.9 and K8s deps to 1.30.9. The machinestatechecker controller is completely removed, as it was no longer needed. Other than that, the goal for now was to refactor without changing any of the features.
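The "common approach used in CAPI infrastructure providers" referred to here is usually the scope pattern: a reconciler builds a scope bundling the objects it works on plus a patch helper, and defers a Close() that persists status changes. A minimal, self-contained sketch of that shape (all names here, such as MachineScope and PatchObject, are illustrative stand-ins, not the actual CAPC API):

```go
package main

import "fmt"

// CloudStackMachine stands in for the real infrastructure machine object.
type CloudStackMachine struct {
	Name  string
	Ready bool
}

// MachineScope bundles the machine under reconciliation; the real scope
// also carries the k8s client, logger, and related CAPI objects.
type MachineScope struct {
	Machine *CloudStackMachine
	patched bool
}

// PatchObject would normally diff the object and patch it via the client.
func (s *MachineScope) PatchObject() error {
	s.patched = true
	return nil
}

// Close is deferred by the reconciler so changes are always persisted,
// even when a reconcile phase returns early with an error.
func (s *MachineScope) Close() error {
	return s.PatchObject()
}

// reconcileNormal is one phase of the reconciler's state machine.
func reconcileNormal(scope *MachineScope) error {
	scope.Machine.Ready = true
	return nil
}

func main() {
	scope := &MachineScope{Machine: &CloudStackMachine{Name: "worker-0"}}
	defer scope.Close()
	if err := reconcileNormal(scope); err != nil {
		fmt.Println("reconcile error:", err)
	}
	fmt.Println(scope.Machine.Ready) // true
}
```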

The only API change is the deprecation of the non-CAPI-standard 'status' and 'reason' fields, and the addition of their replacements FailureReason and FailureMessage in CloudStackMachineStatus.
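As a rough illustration of that status change (field types simplified to plain string pointers; the real CAPI convention uses richer error types):

```go
package main

import "fmt"

// CloudStackMachineStatus sketches the API change described above: the
// non-CAPI-standard fields stay for compatibility but are deprecated,
// and the CAPI-conventional FailureReason/FailureMessage are added.
type CloudStackMachineStatus struct {
	// Deprecated: use FailureReason instead.
	Status *string
	// Deprecated: use FailureMessage instead.
	Reason *string

	FailureReason  *string
	FailureMessage *string
}

// setFailure records a terminal failure on the status.
func setFailure(s *CloudStackMachineStatus, reason, msg string) {
	s.FailureReason = &reason
	s.FailureMessage = &msg
}

func main() {
	var st CloudStackMachineStatus
	setFailure(&st, "UpdateError", "instance not found")
	fmt.Println(*st.FailureReason, *st.FailureMessage)
}
```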

Some basic tests have been added to the controllers using envtest. The e2e suite is not updated and probably doesn't work at this point.

This PR targets the develop branch.

Testing performed:

  • tests passing
  • spun up many clusters in several configurations (affinity yes/no, 1 vs 3 control plane nodes, etc.)

not tested yet: replace pre-refactored CAPC (0.7.0) with this in an existing mgmt cluster.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@fbouliane left a comment

Small comments here and there, no big logic issues.
Given the size of the PR, I couldn't go as low-level as I like when giving feedback.
Everything is negotiable, just in the spirit of giving feedback.

}

// getFailureDomainByName gets the CloudStack Failure Domain by name.
func getFailureDomainByName(ctx context.Context, k8sClient client.Client, name, namespace, clusterName string) (*infrav1.CloudStackFailureDomain, error) {
@fbouliane Feb 10, 2025

Improvement:
It feels like these "scope" objects have dual responsibilities:
1. manage the cache of clients,
2. "wrap" the client methods with accessors.

Maybe a smell of another class coming out.

@hrak (Member, Author)

This is a bit of a WIP. My intention is to split the cloud pkg into several packages, one for the Client interface and client creation functions, and one (or more) packages that implement the client interface.

Comment on lines +194 to +208
endpointSecretStrings := map[string]string{}
for k, v := range endpointSecret.Data {
endpointSecretStrings[k] = string(v)
}
bytes, err := yaml.Marshal(endpointSecretStrings)
if err != nil {
return cloudConfig, err
}

if err := yaml.Unmarshal(bytes, &cloudConfig); err != nil {
return cloudConfig, err
}

if err := cloudConfig.Validate(); err != nil {
return cloudConfig, errors.Wrapf(err, "invalid cloud config")
@fbouliane Feb 10, 2025

Improvement:
Not sure why this class knows

  • where the secrets are located
  • how to marshal and unmarshal YAML
  • how to error if the config isn't valid

It feels like that belongs to a configManager of some sort.

@hrak (Member, Author)

The cloudconfig part is currently part of the client bits in the cloud pkg. My intention is to separate those in a client pkg.

Comment on lines +195 to +203
isonet := &infrav1.CloudStackIsolatedNetwork{}
if s.CloudStackIsolatedNetwork == nil || s.CloudStackIsolatedNetwork.Name == "" {
err := s.client.Get(ctx, client.ObjectKey{Name: s.IsolatedNetworkName(), Namespace: s.Namespace()}, isonet)
if err != nil {
return nil, errors.Wrapf(err, "failed to get isolated network with name %s", s.IsolatedNetworkName())
}
}

return isonet, nil


Improvement: I feel this kind of validation should be pushed a domain lower, or into a controller / client.

Comment on lines +99 to +110
g.Eventually(func() error {
ph, err := patch.NewHelper(dummies.CSCluster, testEnv.Client)
g.Expect(err).ToNot(HaveOccurred())
dummies.CSCluster.OwnerReferences = append(dummies.CSCluster.OwnerReferences, metav1.OwnerReference{
Kind: "Cluster",
APIVersion: clusterv1.GroupVersion.String(),
Name: dummies.CAPICluster.Name,
UID: types.UID("cluster-uid"),
})

return ph.Patch(ctx, dummies.CSCluster, patch.WithStatusObservedGeneration{})
}, timeout).Should(Succeed())


Honest question: not sure why these tests are made in a completely "isolated" / blackbox way.
It looks like triggering a single reconciliation would allow removing the looping and timeouts, which is not what we're trying to test here.

And it looks like sometimes we do trigger the reconciliation manually.

Not sure how fast these are either; it might not be an issue at all, but these kinds of tests are usually flaky.

@hrak (Member, Author)

Actually currently the reconcile is called manually every step. This is mostly so we can test the logic during the various phases of reconciliation (this is most visible in the machine controller). We could consider adding tests that are more integration tests, where all controllers are running and manual calls to reconcile are not needed.

In this particular case a patch for owner references is not really needed, so I pushed a change for that.


Improvement: Looks like the tested controller is missing a bit of coverage.

@hrak (Member, Author)

Yes, the tests are not complete yet. I converted what was there in the original controllers to the new ones, but the old situation wasn't 100% either :)

@hrak force-pushed the refactor_controllers branch from 9e0737f to 461abbb on February 11, 2025 09:54
@kasperhendriks left a comment

I haven't looked at every line, but it looks good from what I can see. I agree with some of Felix's comments, but I also think those can be addressed in future PRs.

hrak added 23 commits February 25, 2025 11:41
removed the factory from the cloud pkg, moved to scope pkg as client scope
adjust cloud pkg for factory change, refactor instance operations
add machine controller and basic tests
…tions

and remove old GetOrCreateVMInstance implementation
…ne controller

In the case of control plane nodes (or when specified in the Machine template),
the machine controller will set the failure domain in the Machine. If this is
not the case, we assign one of the failure domains set in CloudStackCluster.
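The assignment rule in that commit message can be sketched as a small selection function. This is a hedged illustration only: pickFailureDomain and the stable-hash spread are hypothetical stand-ins for the real controller logic; the rule itself (an explicitly set failure domain wins, otherwise one of the CloudStackCluster's failure domains is assigned) is from the text above.

```go
package main

import "fmt"

// pickFailureDomain keeps a failure domain already set on the Machine
// (e.g. by the control plane or the Machine template); otherwise it
// assigns one of the failure domains defined on CloudStackCluster.
func pickFailureDomain(machineFD string, clusterFDs []string, machineName string) string {
	if machineFD != "" {
		return machineFD // explicitly set: keep it
	}
	if len(clusterFDs) == 0 {
		return ""
	}
	// Spread machines over the cluster's failure domains; a simple
	// stable hash of the machine name stands in for the real logic.
	h := 0
	for _, c := range machineName {
		h = (h*31 + int(c)) % len(clusterFDs)
	}
	return clusterFDs[h]
}

func main() {
	fds := []string{"fd-a", "fd-b"}
	fmt.Println(pickFailureDomain("fd-a", fds, "cp-0")) // explicitly set wins
	fmt.Println(pickFailureDomain("", fds, "worker-1")) // assigned from cluster
}
```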
hrak added 21 commits February 25, 2025 11:47
worker nodes should now no longer get stuck in provisioning state, by working around the fact that CloudStack's stopped state has two meanings

- new instance: stopped -> starting -> running
- running -> stopped
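The workaround above hinges on disambiguating CloudStack's single "Stopped" state: it means "provisioning" for a new instance that has not started yet, and "stopped" for an instance that was running. A minimal sketch of one way to track that (instanceTracker and interpret are illustrative names, not the actual CAPC implementation):

```go
package main

import "fmt"

// instanceTracker remembers whether an instance has ever been seen
// Running, which is what disambiguates the two meanings of "Stopped".
type instanceTracker struct {
	everRunning bool
}

// interpret maps a reported CloudStack state to its meaning for this
// instance's lifecycle.
func (t *instanceTracker) interpret(state string) string {
	switch state {
	case "Running":
		t.everRunning = true
		return "running"
	case "Stopped":
		if t.everRunning {
			return "stopped" // was running, then stopped
		}
		return "provisioning" // new instance: Stopped -> Starting -> Running
	default:
		return "transitioning"
	}
}

func main() {
	tr := &instanceTracker{}
	fmt.Println(tr.interpret("Stopped")) // provisioning
	fmt.Println(tr.interpret("Running")) // running
	fmt.Println(tr.interpret("Stopped")) // stopped
}
```

Without this distinction, a freshly created worker reporting "Stopped" looks identical to a stopped node, which is how nodes could get stuck in the provisioning state.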
@hrak force-pushed the refactor_controllers branch from 461abbb to 0029b89 on February 25, 2025 10:50
@hrak (Member, Author) commented Feb 25, 2025

Addressed some issues and rebased on develop after K8s 1.30.10 / CAPI 1.8.10 changes.

@hrak hrak requested a review from fbouliane February 25, 2025 11:55
@hrak force-pushed the refactor_controllers branch from 1407e91 to 80d3eb5 on February 25, 2025 13:28
@hrak hrak merged commit 6434412 into develop Feb 25, 2025
3 checks passed
@hrak hrak deleted the refactor_controllers branch February 25, 2025 21:05