
Conversation

@birdayz (Contributor) commented Oct 9, 2025

Problem

Pods can get stuck with the ClusterUpdate condition set but never restarted. The controller relied solely on the status.restarting flag to decide whether a rolling update should continue, but that flag can be lost to controller restarts, stale status reads, or status update conflicts with concurrent reportStatus() calls (roughly 1% of reconciles). This breaks idempotency: a fresh reconcile that sees restarting=false exits early even though pods still carry the ClusterUpdate condition, so the restart never completes.

Solution

Check actual pod state (the ClusterUpdate condition) as the authoritative source of truth in shouldUpdate(). If any pods still need a restart but the status flag says otherwise, trust the pods and continue the rolling update, re-establishing the status flag and completing the restart.

This makes the controller self-healing: pod conditions are the real state, while status flags are just derived, cached hints that get recomputed from reality if they go stale.
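
A minimal sketch of that check, assuming hypothetical names (clusterUpdateCondition, anyPodNeedsRestart) and a controller-runtime client; the operator's actual code isn't reproduced here:

```go
// Sketch only: the condition name, helper names, and signatures are
// assumptions, not the actual redpanda operator code.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterUpdateCondition is a hypothetical stand-in for the pod condition
// the operator sets while a rolling update is in flight.
const clusterUpdateCondition corev1.PodConditionType = "ClusterUpdate"

// anyPodNeedsRestart lists the cluster's pods and reports whether any of
// them still carries the ClusterUpdate condition. Pod conditions, not the
// cached status.restarting flag, are treated as the source of truth.
func anyPodNeedsRestart(ctx context.Context, c client.Client, namespace string, selector client.MatchingLabels) (bool, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(namespace), selector); err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == clusterUpdateCondition && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
	}
	return false, nil
}

// shouldUpdate continues the rolling update if either the cached status
// flag or the observed pod state says a restart is in progress.
func shouldUpdate(ctx context.Context, c client.Client, ns string, sel client.MatchingLabels, statusRestarting bool) (bool, error) {
	podsNeedRestart, err := anyPodNeedsRestart(ctx, c, ns, sel)
	if err != nil {
		return false, err
	}
	return statusRestarting || podsNeedRestart, nil
}
```

The point is that shouldUpdate() ORs the cached flag with the observed pod state, so a stale restarting=false can never short-circuit an in-flight rolling update.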
@RafalKorepta (Contributor) left a comment


LGTM, I would highly recommend creating a unit test.

@birdayz (Contributor, Author) commented Oct 9, 2025

Will add a test and fix the compile errors.

This test verifies that the controller correctly detects pods with
ClusterUpdate condition even when the status cache is stale. It covers
the idempotency fix where shouldUpdate() now checks actual pod conditions
as the source of truth, not just cached status flags.

The test simulates the ~1% case where a fresh reconcile sees stale status:
- Pods have ClusterUpdate condition set (authoritative state)
- Status flags say restarting=false (stale cached state)
- shouldUpdate() must still detect this and return nodePoolRestarting=true

Also fixes compilation errors in existing tests by adding context parameter
and fake Kubernetes client where needed.
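
A rough shape of such a test with controller-runtime's fake client; it reuses the hypothetical names from the sketch above rather than the PR's actual helpers:

```go
// Sketch of the stale-status test, assuming the helpers sketched above
// live in the same package.
package controller

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

func TestShouldUpdateDetectsStaleStatus(t *testing.T) {
	// The pod still carries the ClusterUpdate condition: authoritative state.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "redpanda-0",
			Namespace: "default",
			Labels:    map[string]string{"app": "redpanda"},
		},
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{
				{Type: clusterUpdateCondition, Status: corev1.ConditionTrue},
			},
		},
	}
	c := fake.NewClientBuilder().WithObjects(pod).Build()

	// The cached status says restarting=false: the stale hint.
	restarting, err := shouldUpdate(context.Background(), c, "default",
		map[string]string{"app": "redpanda"}, false)
	if err != nil {
		t.Fatal(err)
	}
	if !restarting {
		t.Fatal("expected shouldUpdate to trust pod conditions over stale status")
	}
}
```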
@birdayz merged commit d586b5b into main on Oct 9, 2025 (11 of 12 checks passed).
@chrisseto (Contributor) commented:

@RafalKorepta didn't you recently fix this or was it a similar but distinct issue?

@birdayz does this need to be backported to v25.1 or is cloud officially on v25.2 beta builds?

@RafalKorepta (Contributor) commented Oct 9, 2025

> does this need to be backported to v25.1 or is cloud officially on v25.2 beta builds?

Cloud will be on v25.2 just after https://github.com/redpanda-data/cloudv2/pull/23422, which is in the merge queue as of this writing.

> didn't you recently fix this or was it a similar but distinct issue?

#1075 is a different problem, but it also touches the rolling upgrade process.

@birdayz (Contributor, Author) commented Oct 9, 2025

I think we need a backport, but Rafal knows better.

@RafalKorepta (Contributor) commented:

I suggest opening a new PR in the cloud repo that includes this fix.

@RafalKorepta deleted the fix/idempotent-pod-restart branch on October 9, 2025 at 14:12.