@shmuel-runai

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce an API change?


Additional documentation e.g., enhancement proposals, usage docs, etc.:


gflarity and others added 11 commits December 3, 2025 16:14
Test Changes:
- Uncomment Test_RU10_RollingUpdateInsufficientResources to verify
  delete-first rolling update strategy when nodes are cordoned
- Uncomment Test_RU18_RollingUpdateWithPodCliqueScaleOutDuringUpdate
  to verify rolling updates work correctly with concurrent scale-out

Bug Fix (RU18 intermittent failure):
- Fix race condition in PodClique status reconciliation where
  UpdatedReplicas was incorrectly calculated after rolling update
  completion but before CurrentPodTemplateHash was updated
- Add IsLastPCLQUpdateCompleted() helper to detect the window between
  markRollingUpdateEnd() and mutateCurrentHashes()
- Update mutateUpdatedReplica() to use RollingUpdateProgress.PodTemplateHash
  when the update just completed (UpdateEndedAt is set)
- Update mutateCurrentHashes() to properly set CurrentPodTemplateHash and
  CurrentPodCliqueSetGenerationHash from RollingUpdateProgress

Root cause: once markRollingUpdateEnd() set UpdateEndedAt,
IsPCLQUpdateInProgress() returned false, so mutateUpdatedReplica() used
the stale CurrentPodTemplateHash (old hash) instead of the new hash from
RollingUpdateProgress. This resulted in UpdatedReplicas=0, which
prevented mutateCurrentHashes() from setting
CurrentPodCliqueSetGenerationHash, so the PCS controller timed out.
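
The window described above can be illustrated with a minimal Go sketch. It is not the operator's actual code: the types are simplified stand-ins that model only the fields named in this description, and the helper names (updateInProgress, lastUpdateCompleted, expectedHashBefore, expectedHashAfter, countUpdated, justCompletedStatus) are hypothetical. It assumes UpdatedReplicas is derived by comparing each pod's template hash against an expected hash.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified, hypothetical stand-ins for the real status types.
type RollingUpdateProgress struct {
	PodTemplateHash string
	UpdateEndedAt   *time.Time
}

type PodCliqueStatus struct {
	CurrentPodTemplateHash string
	RollingUpdateProgress  *RollingUpdateProgress
}

// updateInProgress mirrors the idea behind IsPCLQUpdateInProgress():
// it turns false as soon as UpdateEndedAt is set.
func updateInProgress(s PodCliqueStatus) bool {
	return s.RollingUpdateProgress != nil && s.RollingUpdateProgress.UpdateEndedAt == nil
}

// lastUpdateCompleted mirrors the idea behind IsLastPCLQUpdateCompleted():
// the update has ended but its progress record is still present.
func lastUpdateCompleted(s PodCliqueStatus) bool {
	return s.RollingUpdateProgress != nil && s.RollingUpdateProgress.UpdateEndedAt != nil
}

// expectedHashBefore models the pre-fix selection: once the update ends,
// it falls back to the stale current hash.
func expectedHashBefore(s PodCliqueStatus) string {
	if updateInProgress(s) {
		return s.RollingUpdateProgress.PodTemplateHash
	}
	return s.CurrentPodTemplateHash
}

// expectedHashAfter models the fix: the progress hash is also used in the
// window right after the update completed.
func expectedHashAfter(s PodCliqueStatus) string {
	if updateInProgress(s) || lastUpdateCompleted(s) {
		return s.RollingUpdateProgress.PodTemplateHash
	}
	return s.CurrentPodTemplateHash
}

// countUpdated counts pods whose template hash equals the expected hash.
func countUpdated(podHashes []string, expected string) int {
	n := 0
	for _, h := range podHashes {
		if h == expected {
			n++
		}
	}
	return n
}

// justCompletedStatus builds a status caught inside the window:
// UpdateEndedAt is set, CurrentPodTemplateHash is still the old hash.
func justCompletedStatus() PodCliqueStatus {
	ended := time.Now()
	return PodCliqueStatus{
		CurrentPodTemplateHash: "old",
		RollingUpdateProgress: &RollingUpdateProgress{
			PodTemplateHash: "new",
			UpdateEndedAt:   &ended,
		},
	}
}

func main() {
	s := justCompletedStatus()
	pods := []string{"new", "new", "new"} // every pod already runs the new template
	fmt.Printf("before fix: UpdatedReplicas=%d\n", countUpdated(pods, expectedHashBefore(s)))
	fmt.Printf("after fix: UpdatedReplicas=%d\n", countUpdated(pods, expectedHashAfter(s)))
}
```

With all three pods on the new template, the pre-fix selection reports 0 updated replicas and the post-fix selection reports 3, which matches the UpdatedReplicas=0 symptom described above.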

E2E Setup:
- Add Docker authentication support for image pulls from Docker Hub
- Skip pulling images that already exist locally
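
The "skip images that already exist locally" step could look roughly like the Go sketch below. This is an assumption about the approach, not the PR's implementation: the imageRuntime interface, dockerCLI, fakeRuntime, and ensureImages are hypothetical names, and the image tags are illustrative. The dockerCLI variant shells out to the docker CLI (docker image inspect, docker pull); Docker Hub authentication is not covered here.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// imageRuntime is a hypothetical abstraction over the container runtime so
// the skip logic can run without a Docker daemon.
type imageRuntime interface {
	Exists(image string) (bool, error)
	Pull(image string) error
}

// dockerCLI implements imageRuntime by invoking the docker CLI.
type dockerCLI struct{}

func (dockerCLI) Exists(image string) (bool, error) {
	err := exec.Command("docker", "image", "inspect", image).Run()
	if err == nil {
		return true, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// A non-zero exit is treated as "not present locally" (assumption:
		// other failures, such as an unreachable daemon, surface on pull).
		return false, nil
	}
	return false, err // docker binary missing or could not be started
}

func (dockerCLI) Pull(image string) error {
	return exec.Command("docker", "pull", image).Run()
}

// fakeRuntime is an in-memory stand-in used for the demonstration.
type fakeRuntime struct {
	local map[string]bool
}

func (f *fakeRuntime) Exists(image string) (bool, error) { return f.local[image], nil }

func (f *fakeRuntime) Pull(image string) error {
	f.local[image] = true
	return nil
}

// ensureImages pulls only the images that are missing locally and returns
// the list of images it actually pulled.
func ensureImages(rt imageRuntime, images []string) ([]string, error) {
	var pulled []string
	for _, img := range images {
		ok, err := rt.Exists(img)
		if err != nil {
			return pulled, fmt.Errorf("checking %s: %w", img, err)
		}
		if ok {
			continue // already present, skip the pull
		}
		if err := rt.Pull(img); err != nil {
			return pulled, fmt.Errorf("pulling %s: %w", img, err)
		}
		pulled = append(pulled, img)
	}
	return pulled, nil
}

func main() {
	rt := &fakeRuntime{local: map[string]bool{"busybox:1.36": true}}
	pulled, err := ensureImages(rt, []string{"busybox:1.36", "nginx:1.27"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pulled:", pulled)
}
```

Swapping fakeRuntime for dockerCLI{} would run the same logic against a real daemon; keeping the runtime behind an interface is a design choice that lets the skip behavior be checked in unit tests.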