
Refactor controllers#51

Merged
hrak merged 45 commits into develop from refactor_controllers
Feb 25, 2025

Conversation

@hrak (Member) commented Feb 7, 2025

Issue #, if available:

Description of changes:

This PR refactors the controllers entirely to not use the base reconciler pattern but instead use a more common approach used in CAPI infrastructure providers. It also adds a scope package and client factory for easier testing and updates CAPI to 1.8.9 and K8s deps to 1.30.9. The machinestatechecker controller is completely removed, as it was no longer needed. Other than that, the goal for now was to refactor without changing any of the features.
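The "common approach used in CAPI infrastructure providers" referred to here is usually the scope pattern: a reconciler builds a scope bundling the objects it works on plus a patch helper, and defers a Close() that persists status changes. A minimal, self-contained sketch of that shape (all names here, such as MachineScope and PatchObject, are illustrative stand-ins, not the actual CAPC API):

```go
package main

import "fmt"

// CloudStackMachine stands in for the real infrastructure machine object.
type CloudStackMachine struct {
	Name  string
	Ready bool
}

// MachineScope bundles the machine under reconciliation; the real scope
// also carries the k8s client, logger, and related CAPI objects.
type MachineScope struct {
	Machine *CloudStackMachine
	patched bool
}

// PatchObject would normally diff the object and patch it via the client.
func (s *MachineScope) PatchObject() error {
	s.patched = true
	return nil
}

// Close is deferred by the reconciler so changes are always persisted,
// even when a reconcile phase returns early with an error.
func (s *MachineScope) Close() error {
	return s.PatchObject()
}

// reconcileNormal is one phase of the reconciler's state machine.
func reconcileNormal(scope *MachineScope) error {
	scope.Machine.Ready = true
	return nil
}

func main() {
	scope := &MachineScope{Machine: &CloudStackMachine{Name: "worker-0"}}
	defer scope.Close()
	if err := reconcileNormal(scope); err != nil {
		fmt.Println("reconcile error:", err)
	}
	fmt.Println(scope.Machine.Ready) // true
}
```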

The only API change is the deprecation of the non-CAPI-standard 'status' and 'reason' fields, and the addition of their replacements FailureReason and FailureMessage in CloudStackMachineStatus.
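As a rough illustration of that status change (field types simplified to plain string pointers; the real CAPI convention uses richer error types):

```go
package main

import "fmt"

// CloudStackMachineStatus sketches the API change described above: the
// non-CAPI-standard fields stay for compatibility but are deprecated,
// and the CAPI-conventional FailureReason/FailureMessage are added.
type CloudStackMachineStatus struct {
	// Deprecated: use FailureReason instead.
	Status *string
	// Deprecated: use FailureMessage instead.
	Reason *string

	FailureReason  *string
	FailureMessage *string
}

// setFailure records a terminal failure on the status.
func setFailure(s *CloudStackMachineStatus, reason, msg string) {
	s.FailureReason = &reason
	s.FailureMessage = &msg
}

func main() {
	var st CloudStackMachineStatus
	setFailure(&st, "UpdateError", "instance not found")
	fmt.Println(*st.FailureReason, *st.FailureMessage)
}
```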

Some basic tests have been added to the controllers using envtest. The e2e suite is not updated and probably doesn't work at this point.

This PR targets the develop branch.

Testing performed:

  • tests passing
  • spun up many clusters in several configurations (affinity yes/no, 1 vs 3 control plane nodes, etc.)

not tested yet: replace pre-refactored CAPC (0.7.0) with this in an existing mgmt cluster.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@fbouliane left a comment

Small comments here and there, no big logic issues.
Given the size of the PR, I couldn't go as low-level as I like when giving feedback.
Everything is negotiable, just in the spirit of giving feedback.

}

// getFailureDomainByName gets the CloudStack Failure Domain by name.
func getFailureDomainByName(ctx context.Context, k8sClient client.Client, name, namespace, clusterName string) (*infrav1.CloudStackFailureDomain, error) {
@fbouliane Feb 10, 2025

Improvement:
It feels like these "scope" objects have dual responsibilities:
1. manage the cache of clients,
2. "wrap" the client methods with accessors.

Maybe a smell of another class coming out.

@hrak (Member, Author)

This is a bit of a WIP. My intention is to split the cloud pkg into several packages, one for the Client interface and client creation functions, and one (or more) packages that implement the client interface.

Comment on lines +194 to +208
endpointSecretStrings := map[string]string{}
for k, v := range endpointSecret.Data {
endpointSecretStrings[k] = string(v)
}
bytes, err := yaml.Marshal(endpointSecretStrings)
if err != nil {
return cloudConfig, err
}

if err := yaml.Unmarshal(bytes, &cloudConfig); err != nil {
return cloudConfig, err
}

if err := cloudConfig.Validate(); err != nil {
return cloudConfig, errors.Wrapf(err, "invalid cloud config")
@fbouliane Feb 10, 2025

Improvement:
Not sure why this class knows

  • where the secrets are located
  • how to marshal and unmarshal YAML
  • how to error if the config isn't valid

It feels like that belongs to a configManager of some sort.

@hrak (Member, Author)

The cloudconfig part is currently part of the client bits in the cloud pkg. My intention is to separate those in a client pkg.

Comment on lines +195 to +203
isonet := &infrav1.CloudStackIsolatedNetwork{}
if s.CloudStackIsolatedNetwork == nil || s.CloudStackIsolatedNetwork.Name == "" {
err := s.client.Get(ctx, client.ObjectKey{Name: s.IsolatedNetworkName(), Namespace: s.Namespace()}, isonet)
if err != nil {
return nil, errors.Wrapf(err, "failed to get isolated network with name %s", s.IsolatedNetworkName())
}
}

return isonet, nil


Improvement: I feel this kind of validation should be pushed a domain lower, or into a controller / client.

Comment on lines +99 to +110
g.Eventually(func() error {
ph, err := patch.NewHelper(dummies.CSCluster, testEnv.Client)
g.Expect(err).ToNot(HaveOccurred())
dummies.CSCluster.OwnerReferences = append(dummies.CSCluster.OwnerReferences, metav1.OwnerReference{
Kind: "Cluster",
APIVersion: clusterv1.GroupVersion.String(),
Name: dummies.CAPICluster.Name,
UID: types.UID("cluster-uid"),
})

return ph.Patch(ctx, dummies.CSCluster, patch.WithStatusObservedGeneration{})
}, timeout).Should(Succeed())


Honest question: not sure why these tests are made in a completely "isolated" / blackbox way.
It looks like triggering a single reconciliation would allow removing the looping and timeouts, which is not what we're trying to test here.

And it looks like sometimes we do trigger the reconciliation manually.

Not sure how fast these are either; it might not be an issue at all, but these kinds of tests are usually flaky.

@hrak (Member, Author)

Actually currently the reconcile is called manually every step. This is mostly so we can test the logic during the various phases of reconciliation (this is most visible in the machine controller). We could consider adding tests that are more integration tests, where all controllers are running and manual calls to reconcile are not needed.

In this particular case a patch for owner references is not really needed, so I pushed a change for that.


Improvement: Looks like the tested controller is missing a bit of coverage.

@hrak (Member, Author)

Yes, the tests are not complete yet. I converted what was there in the original controllers to the new ones, but the old situation wasn't 100% either :)

@hrak force-pushed the refactor_controllers branch from 9e0737f to 461abbb on February 11, 2025 09:54
@kasperhendriks left a comment

I haven't looked at every line, but it looks good from what I can see. I agree with some of Felix's comments, but I also think those can be addressed in future PRs.

hrak added 23 commits February 25, 2025 11:41
removed the factory from the cloud pkg, moved to scope pkg as client scope
adjust cloud pkg for factory change, refactor instance operations
add machine controller and basic tests
…tions

and remove old GetOrCreateVMInstance implementation
…ne controller

In the case of control plane nodes (or when specified in the Machine template),
the machine controller will set the failure domain in the Machine. If this is
not the case, we assign one of the failure domains set in CloudStackCluster.
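The assignment rule in that commit message can be sketched as a small selection function. This is a hedged illustration only: pickFailureDomain and the stable-hash spread are hypothetical stand-ins for the real controller logic; the rule itself (an explicitly set failure domain wins, otherwise one of the CloudStackCluster's failure domains is assigned) is from the text above.

```go
package main

import "fmt"

// pickFailureDomain keeps a failure domain already set on the Machine
// (e.g. by the control plane or the Machine template); otherwise it
// assigns one of the failure domains defined on CloudStackCluster.
func pickFailureDomain(machineFD string, clusterFDs []string, machineName string) string {
	if machineFD != "" {
		return machineFD // explicitly set: keep it
	}
	if len(clusterFDs) == 0 {
		return ""
	}
	// Spread machines over the cluster's failure domains; a simple
	// stable hash of the machine name stands in for the real logic.
	h := 0
	for _, c := range machineName {
		h = (h*31 + int(c)) % len(clusterFDs)
	}
	return clusterFDs[h]
}

func main() {
	fds := []string{"fd-a", "fd-b"}
	fmt.Println(pickFailureDomain("fd-a", fds, "cp-0")) // explicitly set wins
	fmt.Println(pickFailureDomain("", fds, "worker-1")) // assigned from cluster
}
```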
hrak added 21 commits February 25, 2025 11:47
worker nodes should now no longer get stuck in provisioning state, by working around the fact that CloudStack's stopped state has two meanings

- new instance: stopped -> starting -> running
- running -> stopped
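The workaround above hinges on disambiguating CloudStack's single "Stopped" state: it means "provisioning" for a new instance that has not started yet, and "stopped" for an instance that was running. A minimal sketch of one way to track that (instanceTracker and interpret are illustrative names, not the actual CAPC implementation):

```go
package main

import "fmt"

// instanceTracker remembers whether an instance has ever been seen
// Running, which is what disambiguates the two meanings of "Stopped".
type instanceTracker struct {
	everRunning bool
}

// interpret maps a reported CloudStack state to its meaning for this
// instance's lifecycle.
func (t *instanceTracker) interpret(state string) string {
	switch state {
	case "Running":
		t.everRunning = true
		return "running"
	case "Stopped":
		if t.everRunning {
			return "stopped" // was running, then stopped
		}
		return "provisioning" // new instance: Stopped -> Starting -> Running
	default:
		return "transitioning"
	}
}

func main() {
	tr := &instanceTracker{}
	fmt.Println(tr.interpret("Stopped")) // provisioning
	fmt.Println(tr.interpret("Running")) // running
	fmt.Println(tr.interpret("Stopped")) // stopped
}
```

Without this distinction, a freshly created worker reporting "Stopped" looks identical to a stopped node, which is how nodes could get stuck in the provisioning state.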
@hrak force-pushed the refactor_controllers branch from 461abbb to 0029b89 on February 25, 2025 10:50
@hrak (Member, Author) commented Feb 25, 2025

Addressed some issues and rebased on develop after K8s 1.30.10 / CAPI 1.8.10 changes.

@hrak hrak requested a review from fbouliane February 25, 2025 11:55
@hrak force-pushed the refactor_controllers branch from 1407e91 to 80d3eb5 on February 25, 2025 13:28
@hrak hrak merged commit 6434412 into develop Feb 25, 2025
3 checks passed
@hrak hrak deleted the refactor_controllers branch February 25, 2025 21:05