Skip to content

Comments

Basic Lifecycle Management Interfaces + v2 Concrete Implementations#568

Merged
andrewstucki merged 8 commits intomainfrom
as/unified-resources-interfaces
Apr 2, 2025
Merged

Basic Lifecycle Management Interfaces + v2 Concrete Implementations#568
andrewstucki merged 8 commits intomainfrom
as/unified-resources-interfaces

Conversation

@andrewstucki
Copy link
Contributor

This adds the basic interfaces for StatefulSet-based lifecycle management for initially the v2 operator. Currently it is not leveraged in a controller, as that needs to be subsequently broken out from the original PoC.

Included in all of this is a fairly large amount of unit testing that puts the entire package at ~90% coverage. Aside from incorporating some of the feedback from #526, it also moves the subpackage from resources to lifecycle.

Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

In my local environment I found the following test failing.

=== RUN   TestClientWatchResources
=== RUN   TestClientWatchResources/base
    client_test.go:382: 
        	Error Trace:	/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:382
        	            				/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:223
        	Error:      	Not equal: 
        	            	expected: "*resources.MockCluster"
        	            	actual  : "*lifecycle.MockCluster"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-*resources.MockCluster
        	            	+*lifecycle.MockCluster
        	Test:       	TestClientWatchResources/base
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for non leader election runnables
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for leader election runnables
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for caches
W0326 13:15:12.978273   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *lifecycle.MockCluster ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0326 13:15:12.978347   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *v1.CustomResourceDefinition ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for webhooks
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for HTTP servers
2025-03-26T13:15:12+01:00	INFO	Wait completed, proceeding to shutdown the manager
--- FAIL: TestClientWatchResources/base (5.40s)

Expected :*resources.MockCluster
Actual   :*lifecycle.MockCluster
<Click to see difference>


=== RUN   TestClientWatchResources/cluster-scoped-resources
    client_test.go:382: 
        	Error Trace:	/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:382
        	            				/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:223
        	Error:      	Not equal: 
        	            	expected: "*resources.MockCluster"
        	            	actual  : "*lifecycle.MockCluster"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-*resources.MockCluster
        	            	+*lifecycle.MockCluster
        	Test:       	TestClientWatchResources/cluster-scoped-resources
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for non leader election runnables
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for leader election runnables
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for caches
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for webhooks
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for HTTP servers
2025-03-26T13:15:19+01:00	INFO	Wait completed, proceeding to shutdown the manager
--- FAIL: TestClientWatchResources/cluster-scoped-resources (6.22s)

Expected :*resources.MockCluster
Actual   :*lifecycle.MockCluster
<Click to see difference>


=== RUN   TestClientWatchResources/namespace-and-cluster-scoped-resources
    client_test.go:382: 
        	Error Trace:	/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:382
        	            				/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:223
        	Error:      	Not equal: 
        	            	expected: "*resources.MockCluster"
        	            	actual  : "*lifecycle.MockCluster"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-*resources.MockCluster
        	            	+*lifecycle.MockCluster
        	Test:       	TestClientWatchResources/namespace-and-cluster-scoped-resources
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for non leader election runnables
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for leader election runnables
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for caches
W0326 13:15:24.952195   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *lifecycle.MockCluster ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0326 13:15:24.952203   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *v1.CustomResourceDefinition ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for webhooks
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for HTTP servers
2025-03-26T13:15:24+01:00	INFO	Wait completed, proceeding to shutdown the manager
--- FAIL: TestClientWatchResources/namespace-and-cluster-scoped-resources (5.75s)

Expected :*resources.MockCluster
Actual   :*lifecycle.MockCluster
<Click to see difference>


--- FAIL: TestClientWatchResources (17.37s)

Comment on lines +34 to +36
// OutOfDateReplicas is the number of replicas that don't currently
// match their node pool definitions. If OutOfDateReplicas is not 0
// it should mean that the operator will soon roll this many pods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically rolling update should be blocked/paused if Redpanda can not enter into maintenance mode. In Redpanda helm base deployment this is not the case, as lifecycle hook script will wait for Redpanda to enter maintenance mode in the half of the terminationGracePeriod of Redpanda Pod definition. In operator v1 it is enforced.

@andrewstucki
Copy link
Contributor Author

@RafalKorepta thanks -- totatlly forgot to change the string after I renamed the package 😅 -- I responded to a few comments and will try and add some more doc comments soonish.

Copy link
Contributor

@chrisseto chrisseto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking and nothing that can't be deferred.

if err := r.patchOwnedResource(ctx, owner, resource); err != nil {
errs = append(errs, err)
}
delete(toDelete, gvkObject{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit: can this be hoisted above the patchOwnedResource line? Every time I see it something in me starts screaming that there's an edge in the error case 😓

})
}

for _, resource := range toDelete {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add in logs for deletions? It shouldn't be noisy but if it is that would very likely indicate a problem, no?

for _, resource := range toDelete {
if err := r.client.Delete(ctx, resource); err != nil {
if !k8sapierrors.IsNotFound(err) {
errs = append(errs, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's annotate these errors with some information about what resource we're trying to delete.

// WatchResources configures resource watching for the given cluster, including StatefulSets and other resources.
func (r *ResourceClient[T, U]) WatchResources(builder Builder, cluster U) error {
// set that this is for the cluster
builder.For(cluster)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does builder mutate the instance you call the methods on? Seems like this should be builder = builder.For(cluster)

return resources, nil
}

// patchOwnedResource applies a patch to a resource owned by the cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantically, an apply is different from a patch. Could we name this applyOwnedResource to indicate that it's an Apply (I guess PUT like?) operation and distinct from a standard PATCH.


// V2SimpleResourceRenderer represents an simple resource renderer for v2 clusters.
type V2SimpleResourceRenderer struct {
kubeConfig clientcmdapi.Config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be a rest.Config soon, if it doesn't already. You may be racing with Rafal's update PR. Not too difficult to fix in a conflict thankfully.

}

// Update updates the given Redpanda v2 cluster with the given cluster status.
func (m *V2ClusterStatusUpdater) Update(cluster *redpandav1alpha2.Redpanda, status ClusterStatus) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could totally be a closure :P

}

// SyncAll synchronizes the simple resources associated with the given cluster,
// cleaning up any resources that should no longer exist.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include the phrase "garbage collect" or "GC" in here or even in a comment in the struct itself. We and Kubernetes refer to the process as garbage collection, adding the key words here will make it a lot easier to find this for those that aren't intimately familiar with this section of the code base.


// since resources are cluster-scoped we need to call a Watch on them with some
// custom mappings
builder.Watches(resourceType, handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, o client.Object) []reconcile.Request {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky, I would have reached for filter predicates at first but I see why that doesn't work.

@github-actions
Copy link

github-actions bot commented Apr 2, 2025

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

management for initially the v2 operator. Currently it is not
leveraged in a controller, as that needs to be subsequently broken
out from the original PoC.
@andrewstucki andrewstucki force-pushed the as/unified-resources-interfaces branch from e1a13ff to 82f6d67 Compare April 2, 2025 15:01
@andrewstucki andrewstucki merged commit 0ec1f09 into main Apr 2, 2025
12 checks passed
@andrewstucki andrewstucki deleted the as/unified-resources-interfaces branch April 2, 2025 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants