Basic Lifecycle Management Interfaces + v2 Concrete Implementations by andrewstucki · Pull Request #568 · redpanda-data/redpanda-operator

andrewstucki · 2025-03-26T04:06:44Z

This adds the basic interfaces for StatefulSet-based lifecycle management, initially for the v2 operator. It is not yet used by a controller, as that work still needs to be broken out from the original PoC.

Included in all of this is a fairly large amount of unit testing that puts the entire package at ~90% coverage. Aside from incorporating some of the feedback from #526, it also moves the subpackage from resources to lifecycle.

RafalKorepta

Looks good to me.

In my local environment I found the following test failing.

=== RUN   TestClientWatchResources
=== RUN   TestClientWatchResources/base
    client_test.go:382: 
        	Error Trace:	/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:382
        	            				/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:223
        	Error:      	Not equal: 
        	            	expected: "*resources.MockCluster"
        	            	actual  : "*lifecycle.MockCluster"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-*resources.MockCluster
        	            	+*lifecycle.MockCluster
        	Test:       	TestClientWatchResources/base
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for non leader election runnables
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for leader election runnables
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for caches
W0326 13:15:12.978273   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *lifecycle.MockCluster ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0326 13:15:12.978347   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *v1.CustomResourceDefinition ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for webhooks
2025-03-26T13:15:12+01:00	INFO	Stopping and waiting for HTTP servers
2025-03-26T13:15:12+01:00	INFO	Wait completed, proceeding to shutdown the manager
--- FAIL: TestClientWatchResources/base (5.40s)

Expected :*resources.MockCluster
Actual   :*lifecycle.MockCluster


=== RUN   TestClientWatchResources/cluster-scoped-resources
    client_test.go:382: 
        	Error Trace:	/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:382
        	            				/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:223
        	Error:      	Not equal: 
        	            	expected: "*resources.MockCluster"
        	            	actual  : "*lifecycle.MockCluster"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-*resources.MockCluster
        	            	+*lifecycle.MockCluster
        	Test:       	TestClientWatchResources/cluster-scoped-resources
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for non leader election runnables
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for leader election runnables
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for caches
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for webhooks
2025-03-26T13:15:19+01:00	INFO	Stopping and waiting for HTTP servers
2025-03-26T13:15:19+01:00	INFO	Wait completed, proceeding to shutdown the manager
--- FAIL: TestClientWatchResources/cluster-scoped-resources (6.22s)

Expected :*resources.MockCluster
Actual   :*lifecycle.MockCluster


=== RUN   TestClientWatchResources/namespace-and-cluster-scoped-resources
    client_test.go:382: 
        	Error Trace:	/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:382
        	            				/Users/rafalkorepta/workspace/redpanda/redpanda-operator/operator/internal/lifecycle/client_test.go:223
        	Error:      	Not equal: 
        	            	expected: "*resources.MockCluster"
        	            	actual  : "*lifecycle.MockCluster"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-*resources.MockCluster
        	            	+*lifecycle.MockCluster
        	Test:       	TestClientWatchResources/namespace-and-cluster-scoped-resources
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for non leader election runnables
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for leader election runnables
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for caches
W0326 13:15:24.952195   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *lifecycle.MockCluster ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0326 13:15:24.952203   90427 reflector.go:470] pkg/mod/k8s.io/client-go@v0.30.3/tools/cache/reflector.go:232: watch of *v1.CustomResourceDefinition ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for webhooks
2025-03-26T13:15:24+01:00	INFO	Stopping and waiting for HTTP servers
2025-03-26T13:15:24+01:00	INFO	Wait completed, proceeding to shutdown the manager
--- FAIL: TestClientWatchResources/namespace-and-cluster-scoped-resources (5.75s)

Expected :*resources.MockCluster
Actual   :*lifecycle.MockCluster


--- FAIL: TestClientWatchResources (17.37s)

operator/internal/lifecycle/client.go

RafalKorepta · 2025-03-26T11:27:39Z

operator/internal/lifecycle/interfaces.go

+	// OutOfDateReplicas is the number of replicas that don't currently
+	// match their node pool definitions. If OutOfDateReplicas is not 0
+	// it should mean that the operator will soon roll this many pods.


Historically, a rolling update should be blocked or paused if Redpanda cannot enter maintenance mode. The Helm-based Redpanda deployment does not enforce this: its lifecycle hook script only waits for Redpanda to enter maintenance mode for up to half of the terminationGracePeriod in the Redpanda Pod definition. Operator v1 does enforce it.
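
To make the "wait for half of the grace period" behaviour concrete, here is a rough Go sketch. It is only an illustration: the Helm chart uses a shell lifecycle hook, and the function name, polling interval, and `inMaintenance` callback below are assumptions, not code from the chart or the operator.

```go
package main

import (
	"fmt"
	"time"
)

// waitForMaintenance polls inMaintenance every interval and gives up once
// half of gracePeriod has elapsed. It reports whether maintenance mode was
// reached in time. (Hypothetical helper, for illustration only.)
func waitForMaintenance(gracePeriod, interval time.Duration, inMaintenance func() bool) bool {
	deadline := time.NewTimer(gracePeriod / 2)
	defer deadline.Stop()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if inMaintenance() {
			return true
		}
		select {
		case <-deadline.C:
			// Out of budget: the caller proceeds without maintenance mode.
			return false
		case <-ticker.C:
		}
	}
}

func main() {
	calls := 0
	ok := waitForMaintenance(200*time.Millisecond, 5*time.Millisecond, func() bool {
		calls++
		return calls >= 3 // pretend the broker is drained on the third poll
	})
	fmt.Println("entered maintenance mode:", ok)
}
```

A best-effort wait like this never blocks the roll; enforcing maintenance mode, as operator v1 does, would mean refusing to continue when the function returns false.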

operator/internal/lifecycle/client_test.go

andrewstucki · 2025-03-26T12:45:45Z

@RafalKorepta thanks -- totally forgot to change the string after I renamed the package 😅 -- I responded to a few comments and will try and add some more doc comments soonish.
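
For context, the failure comes from comparing a hard-coded type string against a value that embeds the package name. One way to make such an assertion survive a package rename (not necessarily what the fix commit did) is to derive the expected string from the type itself. `MockCluster` and `typeName` below are stand-ins for illustration.

```go
package main

import "fmt"

// MockCluster is a stand-in type for illustration only.
type MockCluster struct{}

// typeName returns the dynamic type of v as fmt's %T verb renders it,
// which includes the package name (here the package is "main").
func typeName(v any) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	// A literal such as "*resources.MockCluster" goes stale when the
	// package is renamed; deriving the expectation from the type does not.
	expected := typeName(&MockCluster{})
	actual := typeName(new(MockCluster))
	fmt.Println(expected == actual, actual)
}
```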

chrisseto

Nothing blocking and nothing that can't be deferred.

chrisseto · 2025-03-26T20:08:59Z

operator/internal/lifecycle/client.go

+		if err := r.patchOwnedResource(ctx, owner, resource); err != nil {
+			errs = append(errs, err)
+		}
+		delete(toDelete, gvkObject{


Super nit: can this be hoisted above the patchOwnedResource line? Every time I see it something in me starts screaming that there's an edge in the error case 😓
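
A minimal sketch of the suggested ordering, using simplified stand-in types (`resourceKey` plays the role of `gvkObject`, and `syncAll` and `apply` are hypothetical, not the PR's code). Unmarking the resource before applying it makes it obvious that a failed apply can never lead to the resource being garbage collected.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceKey is a hypothetical stand-in for the PR's gvkObject map key.
type resourceKey struct {
	Kind, Namespace, Name string
}

// errConflict is a simulated apply failure used by the demo below.
var errConflict = errors.New("conflict")

// syncAll applies every desired resource and returns the keys left over
// for garbage collection, along with any apply errors.
func syncAll(existing, desired []resourceKey, apply func(resourceKey) error) ([]resourceKey, error) {
	toDelete := make(map[resourceKey]struct{}, len(existing))
	for _, key := range existing {
		toDelete[key] = struct{}{}
	}

	var errs []error
	for _, key := range desired {
		// Unmark first: a desired resource must never be deleted,
		// whether or not applying it succeeds.
		delete(toDelete, key)
		if err := apply(key); err != nil {
			errs = append(errs, fmt.Errorf("applying %s %s/%s: %w", key.Kind, key.Namespace, key.Name, err))
		}
	}

	// Walk existing again so the result order is deterministic.
	var stale []resourceKey
	for _, key := range existing {
		if _, ok := toDelete[key]; ok {
			stale = append(stale, key)
		}
	}
	return stale, errors.Join(errs...)
}

func main() {
	keep := resourceKey{Kind: "ConfigMap", Namespace: "default", Name: "keep"}
	old := resourceKey{Kind: "ConfigMap", Namespace: "default", Name: "old"}

	stale, err := syncAll([]resourceKey{keep, old}, []resourceKey{keep},
		func(resourceKey) error { return errConflict })
	fmt.Println("stale:", stale)
	fmt.Println("error:", err)
}
```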

chrisseto · 2025-03-26T20:10:09Z

operator/internal/lifecycle/client.go

+		})
+	}
+
+	for _, resource := range toDelete {


Can we add in logs for deletions? It shouldn't be noisy but if it is that would very likely indicate a problem, no?

chrisseto · 2025-03-26T20:10:48Z

operator/internal/lifecycle/client.go

+	for _, resource := range toDelete {
+		if err := r.client.Delete(ctx, resource); err != nil {
+			if !k8sapierrors.IsNotFound(err) {
+				errs = append(errs, err)


let's annotate these errors with some information about what resource we're trying to delete.
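
Something along these lines would cover both the deletion logging and the error annotation. This is a sketch with stand-in types: `resourceRef`, `errNotFound` (in place of the `k8sapierrors.IsNotFound` check), `errForbidden`, and the `logf` callback are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for errors returned by the API server.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// resourceRef identifies a resource well enough to be useful in logs.
type resourceRef struct {
	Kind, Namespace, Name string
}

func (r resourceRef) String() string {
	if r.Namespace == "" { // cluster-scoped resource
		return r.Kind + " " + r.Name
	}
	return fmt.Sprintf("%s %s/%s", r.Kind, r.Namespace, r.Name)
}

// deleteStale logs and deletes each stale resource. Not-found errors are
// ignored; every other error is wrapped with the resource it belongs to.
func deleteStale(stale []resourceRef, del func(resourceRef) error, logf func(string, ...any)) error {
	var errs []error
	for _, ref := range stale {
		logf("garbage collecting %s", ref)
		if err := del(ref); err != nil {
			if errors.Is(err, errNotFound) {
				continue // already gone, nothing to report
			}
			errs = append(errs, fmt.Errorf("deleting %s: %w", ref, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	stale := []resourceRef{
		{Kind: "ConfigMap", Namespace: "default", Name: "old-config"},
		{Kind: "Secret", Namespace: "default", Name: "already-gone"},
		{Kind: "ClusterRole", Name: "old-role"},
	}
	del := func(ref resourceRef) error {
		switch ref.Name {
		case "already-gone":
			return errNotFound
		case "old-role":
			return errForbidden
		}
		return nil
	}
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }
	fmt.Println("error:", deleteStale(stale, del, logf))
}
```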

chrisseto · 2025-03-26T20:12:04Z

operator/internal/lifecycle/client.go

+// WatchResources configures resource watching for the given cluster, including StatefulSets and other resources.
+func (r *ResourceClient[T, U]) WatchResources(builder Builder, cluster U) error {
+	// set that this is for the cluster
+	builder.For(cluster)


Does builder mutate the instance you call the methods on? Seems like this should be builder = builder.For(cluster)
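
Whether the reassignment matters depends on the receiver type. If `Builder` behaves like controller-runtime's builder, whose methods use pointer receivers and return the same pointer, discarding the result is safe; with a value-receiver builder it would silently drop the change. The two toy builders below are hypothetical and only contrast the two styles.

```go
package main

import "fmt"

// ptrBuilder mutates the receiver and returns it for chaining.
type ptrBuilder struct{ target string }

func (b *ptrBuilder) For(target string) *ptrBuilder {
	b.target = target
	return b
}

// valBuilder works on a copy, so callers must keep the returned value.
type valBuilder struct{ target string }

func (b valBuilder) For(target string) valBuilder {
	b.target = target
	return b
}

func main() {
	p := &ptrBuilder{}
	p.For("cluster") // result discarded, but p itself was mutated

	v := valBuilder{}
	v.For("cluster") // result discarded, v is unchanged
	kept := v.For("cluster")

	fmt.Printf("%q %q %q\n", p.target, v.target, kept.target)
}
```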

chrisseto · 2025-03-26T20:13:33Z

operator/internal/lifecycle/client.go

+	return resources, nil
+}
+
+// patchOwnedResource applies a patch to a resource owned by the cluster.


Semantically, an apply is different from a patch. Could we name this applyOwnedResource to indicate that it's an Apply (I guess PUT like?) operation and distinct from a standard PATCH.

operator/internal/lifecycle/v2_ownership.go

chrisseto · 2025-03-26T20:34:35Z

operator/internal/lifecycle/v2_simple_resources.go

+
+// V2SimpleResourceRenderer represents an simple resource renderer for v2 clusters.
+type V2SimpleResourceRenderer struct {
+	kubeConfig clientcmdapi.Config


This will need to be a rest.Config soon, if it doesn't already. You may be racing with Rafal's update PR. Not too difficult to fix in a conflict thankfully.

chrisseto · 2025-03-26T20:36:35Z

operator/internal/lifecycle/v2_status.go

+}
+
+// Update updates the given Redpanda v2 cluster with the given cluster status.
+func (m *V2ClusterStatusUpdater) Update(cluster *redpandav1alpha2.Redpanda, status ClusterStatus) bool {


This could totally be a closure :P
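
For reference, the closure variant could look roughly like this: a function type that satisfies the interface, in the style of `http.HandlerFunc`. The types are simplified stand-ins, and treating the `bool` result as "something changed" is an assumption.

```go
package main

import "fmt"

// Cluster and ClusterStatus are simplified stand-ins for the real types.
type Cluster struct{ ObservedReplicas int }
type ClusterStatus struct{ Replicas int }

// StatusUpdater mirrors the shape of the Update method under review.
type StatusUpdater interface {
	Update(cluster *Cluster, status ClusterStatus) bool
}

// StatusUpdaterFunc adapts a plain function or closure to StatusUpdater.
type StatusUpdaterFunc func(cluster *Cluster, status ClusterStatus) bool

// Update calls f, so any matching closure satisfies StatusUpdater.
func (f StatusUpdaterFunc) Update(cluster *Cluster, status ClusterStatus) bool {
	return f(cluster, status)
}

func main() {
	var updater StatusUpdater = StatusUpdaterFunc(func(c *Cluster, s ClusterStatus) bool {
		if c.ObservedReplicas == s.Replicas {
			return false // nothing to update
		}
		c.ObservedReplicas = s.Replicas
		return true
	})

	c := &Cluster{}
	fmt.Println(updater.Update(c, ClusterStatus{Replicas: 3}))
	fmt.Println(updater.Update(c, ClusterStatus{Replicas: 3}))
}
```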

chrisseto · 2025-03-26T20:39:21Z

operator/internal/lifecycle/client.go

+}
+
+// SyncAll synchronizes the simple resources associated with the given cluster,
+// cleaning up any resources that should no longer exist.


Can you include the phrase "garbage collect" or "GC" in here or even in a comment in the struct itself. We and Kubernetes refer to the process as garbage collection, adding the key words here will make it a lot easier to find this for those that aren't intimately familiar with this section of the code base.

chrisseto · 2025-03-26T20:41:49Z

operator/internal/lifecycle/client.go

+
+		// since resources are cluster-scoped we need to call a Watch on them with some
+		// custom mappings
+		builder.Watches(resourceType, handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, o client.Object) []reconcile.Request {


Tricky, I would have reached for filter predicates at first but I see why that doesn't work.
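
The reason a predicate is not enough: it can only accept or drop an event, not change which object gets reconciled. A cluster-scoped resource also cannot carry an owner reference to a namespaced owner, so the owning cluster has to be recovered some other way. A stripped-down sketch of such a mapping follows; the label keys and the `object` and `request` types are hypothetical, not what the PR uses.

```go
package main

import "fmt"

// Hypothetical label keys recording which cluster owns a resource.
const (
	ownerNamespaceLabel = "example.com/owner-namespace"
	ownerNameLabel      = "example.com/owner-name"
)

// object and request are simplified stand-ins for client.Object and
// reconcile.Request.
type object struct {
	Name   string
	Labels map[string]string
}

type request struct {
	Namespace, Name string
}

// mapToOwner turns an event on a cluster-scoped resource into a reconcile
// request for the namespaced cluster that owns it. Objects without both
// ownership labels are ignored.
func mapToOwner(o object) []request {
	ns, name := o.Labels[ownerNamespaceLabel], o.Labels[ownerNameLabel]
	if ns == "" || name == "" {
		return nil
	}
	return []request{{Namespace: ns, Name: name}}
}

func main() {
	owned := object{Name: "cluster-role", Labels: map[string]string{
		ownerNamespaceLabel: "redpanda",
		ownerNameLabel:      "my-cluster",
	}}
	fmt.Println(mapToOwner(owned))
	fmt.Println(mapToOwner(object{Name: "unrelated"}))
}
```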

github-actions · 2025-04-02T02:05:24Z

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

andrewstucki requested review from RafalKorepta, chrisseto and gene-redpanda as code owners March 26, 2025 04:06

andrewstucki added the no-changelog label Mar 26, 2025

RafalKorepta reviewed Mar 26, 2025

andrewstucki requested a review from RafalKorepta March 26, 2025 14:59

RafalKorepta approved these changes Mar 26, 2025

chrisseto approved these changes Mar 26, 2025

github-actions bot added the stale label Apr 2, 2025

andrewstucki removed the stale label Apr 2, 2025

andrewstucki added 5 commits April 2, 2025 10:58

This adds the basic interfaces for StatefulSet-based lifecycle

2ddef7a

management for initially the v2 operator. Currently it is not leveraged in a controller, as that needs to be subsequently broken out from the original PoC.

Fix test after renaming package

cff0ca9

manually run go mod tidy

955d77c

parallelize tests

bdc79a9

Add some additional doc comments

82f6d67

andrewstucki force-pushed the as/unified-resources-interfaces branch from e1a13ff to 82f6d67 Compare April 2, 2025 15:01

andrewstucki added 3 commits April 2, 2025 11:28

update package path for helmette

1817eaa

Fix up RESTConfig changes

906726e

Update golden files

ee37491

andrewstucki merged commit 0ec1f09 into main Apr 2, 2025
12 checks passed

andrewstucki deleted the as/unified-resources-interfaces branch April 2, 2025 17:26


3 participants