
Conversation

@RafalKorepta
Contributor

This PR enhances the detection mechanism for Node Pool rolling upgrades. Previously, when a Node Pool rolling upgrade was detected, all Pods were marked with the ClusterUpdate Pod Status Condition, and the restarting field of the affected Node Pool in the Cluster Custom Resource Status was set to true.

If a modification to the Cluster Custom Resource occurred during an ongoing rolling upgrade (e.g., a change to the Redpanda container tag), it could trigger an update to the StatefulSet resource. In such cases, Pods that had already been restarted within the affected Node Pool were not restarted again, resulting in inconsistent state propagation (e.g., mismatched Redpanda container tags).

With this change, updates to the StatefulSet are explicitly distinguished from Node Pool rolling upgrades. When a StatefulSet update is required, all associated Pods are now marked or re-marked with the ClusterUpdate Pod Status Condition to ensure consistent rollout of the updated specification.
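For illustration, here is a minimal sketch of the marking step described above, written against the Kubernetes apps/v1 and core/v1 APIs. The interface and function names (specDiffer, podMarker, ensurePodsMarked, MarkPodForUpdate) are hypothetical stand-ins, not the operator's actual identifiers:

```go
package update

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// specDiffer abstracts the spec-level diff between the desired StatefulSet
// and the one currently in the cluster (status fields are ignored).
type specDiffer interface {
	ShouldUpdate(ctx context.Context, sts *appsv1.StatefulSet) (bool, error)
}

// podMarker abstracts setting the ClusterUpdate condition on a Pod.
type podMarker interface {
	MarkPodForUpdate(ctx context.Context, pod *corev1.Pod) error
}

// ensurePodsMarked re-marks every Pod whenever the StatefulSet spec changed,
// independently of any Node Pool rolling upgrade already in progress, so Pods
// that were restarted earlier in the rollout still pick up the newest spec.
func ensurePodsMarked(ctx context.Context, d specDiffer, m podMarker,
	sts *appsv1.StatefulSet, pods []corev1.Pod) error {
	stsChanged, err := d.ShouldUpdate(ctx, sts)
	if err != nil {
		return err
	}
	if !stsChanged {
		return nil
	}
	for i := range pods {
		if err := m.MarkPodForUpdate(ctx, &pods[i]); err != nil {
			return err
		}
	}
	return nil
}
```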

Contributor

@andrewstucki andrewstucki left a comment


Pretty sure that I'm following the thrust of the change (basically adding in a MarkPodsForUpdate when there's a diff in the patch calculation in the shouldUpdate method). Not super familiar with the patch diff calculation, but I'm assuming this only takes in spec-level fields? If so, this seems good to me, but just making sure this isn't going to cause an infinite loop if say it detects diffs on Status.

Overall though, it'd be really nice if we eventually moved this to the same logic as the lifecycle package (or just refactored it to use lifecycle directly) which basically does this diff detection by fetching and introspecting the intermediate ControllerRevision generated from the StatefulSet and associated with each pod to make sure it's up-to-date... which is pretty much what the built-in Kubernetes StatefulSet controller does for its diff detection.
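For context, the built-in StatefulSet controller's notion of "up to date" boils down to comparing each Pod's controller-revision-hash label against the revision the StatefulSet is rolling out. A minimal sketch of that check (an illustration, not the lifecycle package's actual code):

```go
package update

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// podIsUpToDate reports whether a Pod already runs the revision the
// StatefulSet is rolling out. The controller-revision-hash label is written
// by the StatefulSet controller from the ControllerRevision it generated for
// the current spec, so comparing it against Status.UpdateRevision is
// equivalent to fetching and inspecting that ControllerRevision directly.
func podIsUpToDate(sts *appsv1.StatefulSet, pod *corev1.Pod) bool {
	return pod.Labels[appsv1.ControllerRevisionHashLabelKey] == sts.Status.UpdateRevision
}
```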

Contributor

@chrisseto chrisseto left a comment


Great PR Desc / Commit message 🎉

return nil
}

if stsChanged {
Contributor


Maybe add a quick comment here indicating why the logic is this way so future editors don't have to reverse engineer the reasoning?

Contributor Author


Yes, will add a comment.

@RafalKorepta
Contributor Author

> Not super familiar with the patch diff calculation, but I'm assuming this only takes in spec-level fields?

Yes, it only takes spec-level fields into account, as per:

patch.IgnoreStatusFields(),
patch.IgnoreVolumeClaimTemplateTypeMetaAndStatus(),
utils.IgnoreAnnotation(redpandaAnnotatorKey),
utils.IgnoreAnnotation(labels.NodePoolSpecKey),

> If so, this seems good to me, but just making sure this isn't going to cause an infinite loop if say it detects diffs on Status.

There is no risk of an infinite loop when the StatefulSet Status changes, because Status fields are excluded from the patch calculation shown above.
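The quoted options appear to come from the k8s-objectmatcher patch package (plus two operator-specific annotation filters, omitted here). Under that assumption, a minimal sketch of why a Status-only change cannot retrigger the marking:

```go
package update

import (
	"github.com/banzaicloud/k8s-objectmatcher/patch"
	appsv1 "k8s.io/api/apps/v1"
)

// shouldUpdateSTS reports whether the desired StatefulSet differs from the
// observed one. Because status fields are stripped before the patch is
// calculated, a change that only touches Status yields an empty patch, so
// no Pods get (re-)marked and no update loop can start.
func shouldUpdateSTS(observed, desired *appsv1.StatefulSet) (bool, error) {
	result, err := patch.DefaultPatchMaker.Calculate(observed, desired,
		patch.IgnoreStatusFields(),
		patch.IgnoreVolumeClaimTemplateTypeMetaAndStatus(),
	)
	if err != nil {
		return false, err
	}
	return !result.IsEmpty(), nil
}
```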

> Overall though, it'd be really nice if we eventually moved this to the same logic as the lifecycle package (or just refactored it to use lifecycle directly) which basically does this diff detection by fetching and introspecting the intermediate ControllerRevision generated from the StatefulSet and associated with each pod to make sure it's up-to-date... which is pretty much what the built-in Kubernetes StatefulSet controller does for its diff detection.

I would love to do that, but given the time constraints and the lack of tests for the Cluster Custom Resource, I would like to leave the diff detection as is.

@RafalKorepta RafalKorepta force-pushed the rk/k8s-585/fix-node-pool-reconciliation branch from 8e8d9c3 to e561dfe on September 10, 2025 20:10
@RafalKorepta RafalKorepta enabled auto-merge (rebase) September 10, 2025 20:10
@RafalKorepta RafalKorepta force-pushed the rk/k8s-585/fix-node-pool-reconciliation branch 2 times, most recently from a037c0f to a4695d9 on September 12, 2025 12:06
@RafalKorepta RafalKorepta force-pushed the rk/k8s-585/fix-node-pool-reconciliation branch from a4695d9 to 8d93a2a on September 12, 2025 19:51
@RafalKorepta RafalKorepta merged commit 2f90055 into main Sep 12, 2025
11 checks passed
@RafalKorepta RafalKorepta deleted the rk/k8s-585/fix-node-pool-reconciliation branch September 13, 2025 17:51