
fix: handle startup-probe-phase pods in rolling update categorization #435

Open
gflarity wants to merge 2 commits into ai-dynamo:main from gflarity:rolling_update_stuck_400


@gflarity
Contributor

/kind bug

What this PR does / why we need it:

Fixes a bug where rolling updates get permanently stuck or prematurely complete when an old-hash pod is in the startup probe phase. computeUpdateWork had no category for pods with Started=false (startup probe not yet passed), causing them to silently fall through all classification branches.

This PR:

  • Adds a HasAnyContainerNotStarted utility to detect pods still in the startup probe phase
  • Adds explicit oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets to updateWork
  • Deletes all non-ready old-hash pods (pending, unhealthy, starting, uncategorized) immediately rather than only pending/unhealthy
  • Adds unit tests for computeUpdateWork categorization and HasAnyContainerNotStarted
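
As an illustration of the first bullet, here is a minimal sketch of what a check like `HasAnyContainerNotStarted` might look like. It is not the PR's actual code: `Pod` and `ContainerStatus` are simplified stand-ins for the Kubernetes `corev1` types, and treating a nil `Started` as "unknown, not counted" is an assumption made for this example.

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for corev1.ContainerStatus.
// In Kubernetes, Started is a *bool that is false until the startup
// probe has passed.
type ContainerStatus struct {
	Name    string
	Started *bool
	Ready   bool
}

// Pod is a simplified stand-in for corev1.Pod (hypothetical shape).
type Pod struct {
	Name              string
	ContainerStatuses []ContainerStatus
}

// HasAnyContainerNotStarted reports whether at least one container has
// an explicit Started=false status. A nil Started is skipped here; that
// choice is an assumption of this sketch, not confirmed by the PR.
func HasAnyContainerNotStarted(pod Pod) bool {
	for _, cs := range pod.ContainerStatuses {
		if cs.Started != nil && !*cs.Started {
			return true
		}
	}
	return false
}

func boolPtr(b bool) *bool { return &b }

func main() {
	starting := Pod{
		Name:              "worker-0",
		ContainerStatuses: []ContainerStatus{{Name: "main", Started: boolPtr(false)}},
	}
	ready := Pod{
		Name:              "worker-1",
		ContainerStatuses: []ContainerStatus{{Name: "main", Started: boolPtr(true), Ready: true}},
	}
	fmt.Println(HasAnyContainerNotStarted(starting)) // true
	fmt.Println(HasAnyContainerNotStarted(ready))    // false
}
```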

Which issue(s) this PR fixes:

Fixes #400

Special notes for your reviewer:

The root cause: when a second spec change arrives while a replacement pod from a previous change is still in its startup probe phase, the replacement pod (now old-hash) has Phase=Running, Started=false, Ready=false. It fails all existing predicates (not Pending, not Started-but-not-Ready, not Ready) and is dropped from the update work entirely. This causes either an infinite requeue loop or premature updateEndedAt marking.
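
The fall-through described above can be sketched as a classification function with an explicit "starting" branch and a catch-all. This is a hedged illustration only: `podState`, `bucket`, and `categorize` are hypothetical names, the pod state is flattened to three fields, and the real `computeUpdateWork` logic may order or define its predicates differently.

```go
package main

import "fmt"

// podState is a hypothetical, flattened view of the fields the
// categorization cares about.
type podState struct {
	Name    string
	Phase   string // e.g. "Pending", "Running"
	Started bool   // startup probe has passed
	Ready   bool
}

// bucket names mirror the categories listed in the PR description.
type bucket string

const (
	bucketPending       bucket = "pending"
	bucketStarting      bucket = "starting"
	bucketUnhealthy     bucket = "unhealthy"
	bucketReady         bucket = "ready"
	bucketUncategorized bucket = "uncategorized"
)

// categorize assigns every old-hash pod to exactly one bucket. Without
// the "starting" case and the default, a pod that is Running with
// Started=false and Ready=false would match nothing and be dropped.
func categorize(p podState) bucket {
	switch {
	case p.Phase == "Pending":
		return bucketPending
	case p.Phase == "Running" && !p.Started:
		return bucketStarting
	case p.Started && !p.Ready:
		return bucketUnhealthy
	case p.Ready:
		return bucketReady
	default:
		return bucketUncategorized
	}
}

func main() {
	// The pod from the root-cause description: a replacement pod that
	// became old-hash while still in its startup probe phase.
	p := podState{Name: "replacement", Phase: "Running", Started: false, Ready: false}
	fmt.Println(categorize(p)) // starting
}
```

The catch-all branch is the design point: any state the predicates do not anticipate still ends up in a bucket the update logic can act on, instead of disappearing from the work set.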

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@copy-pr-bot

copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

julienmancuso
julienmancuso previously approved these changes Feb 12, 2026


Development

Successfully merging this pull request may close these issues.

Rolling Update Gets Stuck When New Update Initiated During In-Progress Update

2 participants