
Conversation

@birdayz (Contributor) commented Oct 9, 2025

Problem

Pods can get stuck with the ClusterUpdate condition set but never restarted. The controller relied solely on the status.restarting flag to decide whether a rolling update should continue, but that flag can be lost to controller restarts, stale status reads, or status update conflicts with concurrent reportStatus() calls (roughly 1% of reconciles). This breaks idempotency: a fresh reconcile that sees restarting=false exits early even though pods still carry the ClusterUpdate condition, so the restart never completes.

Solution

Check actual pod state (the ClusterUpdate condition) as the authoritative source of truth in shouldUpdate(). If any pods still need a restart but the status flag says otherwise, trust the pods and continue the rolling update, re-establishing the status flag and completing the restart.

This makes the controller self-healing: pod conditions are the real state, while status flags are just derived, cached hints that get recomputed from reality if they go stale.
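
A minimal sketch of that check, assuming hypothetical names (clusterUpdateCondition, anyPodNeedsRestart) and a controller-runtime client; the operator's actual code isn't reproduced here:

```go
// Sketch only: the condition name, helper names, and signatures are
// assumptions, not the actual redpanda operator code.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterUpdateCondition is a hypothetical stand-in for the pod condition
// the operator sets while a rolling update is in flight.
const clusterUpdateCondition corev1.PodConditionType = "ClusterUpdate"

// anyPodNeedsRestart lists the cluster's pods and reports whether any of
// them still carries the ClusterUpdate condition. Pod conditions, not the
// cached status.restarting flag, are treated as the source of truth.
func anyPodNeedsRestart(ctx context.Context, c client.Client, namespace string, selector client.MatchingLabels) (bool, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(namespace), selector); err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == clusterUpdateCondition && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
	}
	return false, nil
}

// shouldUpdate continues the rolling update if either the cached status
// flag or the observed pod state says a restart is in progress.
func shouldUpdate(ctx context.Context, c client.Client, ns string, sel client.MatchingLabels, statusRestarting bool) (bool, error) {
	podsNeedRestart, err := anyPodNeedsRestart(ctx, c, ns, sel)
	if err != nil {
		return false, err
	}
	return statusRestarting || podsNeedRestart, nil
}
```

The point is that shouldUpdate() ORs the cached flag with the observed pod state, so a stale restarting=false can never short-circuit an in-flight rolling update.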
@RafalKorepta (Contributor) left a comment


LGTM, I would highly recommend creating a unit test.

@birdayz (Contributor, Author) commented Oct 9, 2025

Will add a test and fix the compile errors.

This test verifies that the controller correctly detects pods with
ClusterUpdate condition even when the status cache is stale. It covers
the idempotency fix where shouldUpdate() now checks actual pod conditions
as the source of truth, not just cached status flags.

The test simulates the ~1% case where a fresh reconcile sees stale status:
- Pods have ClusterUpdate condition set (authoritative state)
- Status flags say restarting=false (stale cached state)
- shouldUpdate() must still detect this and return nodePoolRestarting=true

Also fixes compilation errors in existing tests by adding context parameter
and fake Kubernetes client where needed.
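
A rough shape of such a test with controller-runtime's fake client; it reuses the hypothetical names from the sketch above rather than the PR's actual helpers:

```go
// Sketch of the stale-status test, assuming the helpers sketched above
// live in the same package.
package controller

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

func TestShouldUpdateDetectsStaleStatus(t *testing.T) {
	// The pod still carries the ClusterUpdate condition: authoritative state.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "redpanda-0",
			Namespace: "default",
			Labels:    map[string]string{"app": "redpanda"},
		},
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{
				{Type: clusterUpdateCondition, Status: corev1.ConditionTrue},
			},
		},
	}
	c := fake.NewClientBuilder().WithObjects(pod).Build()

	// The cached status says restarting=false: the stale hint.
	restarting, err := shouldUpdate(context.Background(), c, "default",
		map[string]string{"app": "redpanda"}, false)
	if err != nil {
		t.Fatal(err)
	}
	if !restarting {
		t.Fatal("expected shouldUpdate to trust pod conditions over stale status")
	}
}
```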
@birdayz merged commit d586b5b into main on Oct 9, 2025 (11 of 12 checks passed).
@chrisseto (Contributor) commented:

@RafalKorepta didn't you recently fix this or was it a similar but distinct issue?

@birdayz does this need to be backported to v25.1 or is cloud officially on v25.2 beta builds?

@RafalKorepta (Contributor) commented Oct 9, 2025

> does this need to be backported to v25.1 or is cloud officially on v25.2 beta builds?

Cloud will be on v25.2 just after https://github.com/redpanda-data/cloudv2/pull/23422, which is in the merge queue as of this writing.

> didn't you recently fix this or was it a similar but distinct issue?

#1075 is a different problem, but it also touches the rolling upgrade process.

@birdayz (Contributor, Author) commented Oct 9, 2025

I think we need a backport, but Rafal knows better.

@RafalKorepta (Contributor) commented:

I suggest opening a new PR in the cloud repo that includes this fix.

@RafalKorepta deleted the fix/idempotent-pod-restart branch on October 9, 2025 at 14:12.