fix: resolve deadlock when maxSurge>0 rolling update on single-replica LWS #779
veast wants to merge 1 commit into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: veast. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing the approval command in a comment.
Welcome @veast!
Hi @veast. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Force-pushed ca057f0 to 8b2e727.
fix: resolve deadlock when maxSurge>0 rolling update on single-replica LWS

When a LeaderWorkerSet has replicas=1 and maxSurge=1, triggering a rolling update caused the controller to immediately emit a "deleting surge replica" event and return replicas=1 (no surge was ever created), leaving the update permanently stuck with the StatefulSet at partition=1, replicas=1.

Root cause: in Case 2 of rollingUpdateParameters (a new rolling update is detected) the code called wantReplicas(lwsReplicas). With replicas=1 and maxSurge=1 the condition inside wantReplicas was:

    unreadyReplicas(1) <= maxSurge(1) → true

which jumped straight into the "release surge" branch and returned replicas=1. No surge replica was ever created, so the StatefulSet partition could never advance.

Fix: Case 2 now returns burstReplicas directly instead of going through wantReplicas. At the moment a new update is detected, all existing replicas are still running the old template (none are unready due to the update yet), so the correct action is to expand to lwsReplicas+maxSurge first. wantReplicas is only meaningful once stsReplicas==burstReplicas and the surge pods are being replaced.

A new integration test, "rolling update with maxSurge=1 and single replica creates surge before rolling", directly exercises the regression: it verifies that the leader StatefulSet expands to replicas=2 immediately after the update is triggered, then converges back to replicas=1 once all groups are ready.

Fixes: kubernetes-sigs#688

Signed-off-by: veast <veast@users.noreply.github.com>
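The deadlock described in the commit message can be sketched in a few lines of Go. This is a hypothetical, heavily simplified model of the logic, not the actual code from `leaderworkerset_controller.go`: the real `rollingUpdateParameters` and `wantReplicas` have different signatures and more cases.

```go
package main

import "fmt"

// wantReplicas mirrors the commit's description of the pre-fix helper:
// once the number of unready replicas fits within the surge budget, it
// enters the "release surge" branch and shrinks back to lwsReplicas.
// (Simplified sketch; signatures are hypothetical.)
func wantReplicas(unreadyReplicas, lwsReplicas, maxSurge int32) int32 {
	if unreadyReplicas <= maxSurge { // the "release surge" branch
		return lwsReplicas
	}
	return lwsReplicas + maxSurge // burstReplicas
}

func main() {
	const lwsReplicas, maxSurge = 1, 1
	burstReplicas := lwsReplicas + maxSurge

	// Case 2 (new rolling update detected), before the fix: the single
	// replica already satisfies unready(1) <= maxSurge(1), so the
	// controller asks for replicas=1 and the surge pod is never created.
	fmt.Println("before fix:", wantReplicas(1, lwsReplicas, maxSurge)) // before fix: 1

	// After the fix, Case 2 bypasses wantReplicas and returns
	// burstReplicas, expanding the StatefulSet to 2 so the update can roll.
	fmt.Println("after fix:", burstReplicas) // after fix: 2
}
```

With `replicas=1` the "release" answer and the desired replica count coincide, which is exactly why the bug only wedges small deployments: there is never a moment where the StatefulSet is larger than the target.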
Force-pushed 8b2e727 to 09a6e8d.
/retest
@veast: The following tests failed; say `/retest` to rerun all failed tests. Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
PR needs rebase.
What type of PR is this?
/kind bug
What this PR does / why we need it
Fixes a deadlock where a `LeaderWorkerSet` with `replicas=1` and `maxSurge=1` could never complete a rolling update.

Symptom: immediately after triggering an update, the controller emits a `"deleting surge replica <name>-1"` event and the StatefulSet stalls at `partition=1, replicas=1` forever. The surge pod is never actually created.
Root cause: in Case 2 of `rollingUpdateParameters` (a new rolling update is detected), the code called `wantReplicas(lwsReplicas)`. With `replicas=1` and `maxSurge=1` the condition inside `wantReplicas` was:

```
unreadyReplicas(1) <= maxSurge(1) → true
```

which jumped straight into the "release surge" branch and returned `replicas=1`. No surge replica was ever created, so the StatefulSet `partition` could never advance.

Fix: two interrelated changes:
1. Case 2 returns `burstReplicas` directly instead of going through `wantReplicas`. At the moment a new update is detected, all existing replicas are still running the old template (none are unready due to the update yet), so the correct action is to expand to `lwsReplicas + maxSurge` first. `wantReplicas` is only meaningful once `stsReplicas == burstReplicas` and the surge pods are being replaced.
2. The `wantReplicas` condition is tightened from `<=` to `<`. When `unreadyReplicas == maxSurge` exactly (e.g. `replicas=2, maxSurge=1`, one replica still unready), the surge pods should be kept alive until that last replica becomes ready. Using strict less-than ensures we only enter the shrink path when there is genuine headroom.
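The two changes above can be condensed into a small Go sketch. Everything here is a hypothetical simplification for illustration; the names follow the PR text, but the actual code in `leaderworkerset_controller.go` is structured differently.

```go
package main

import "fmt"

// Change 2: strict less-than. Surge pods are kept alive while
// unreadyReplicas == maxSurge; we shrink only with genuine headroom.
// (Hypothetical simplified signature.)
func wantReplicasFixed(unreadyReplicas, lwsReplicas, maxSurge int32) int32 {
	if unreadyReplicas < maxSurge { // was <= before this PR
		return lwsReplicas // headroom exists: release the surge
	}
	return lwsReplicas + maxSurge // keep burstReplicas
}

// Change 1: Case 2 (a new rolling update is detected) returns
// burstReplicas directly instead of consulting wantReplicas.
func rollingUpdateParametersCase2(lwsReplicas, maxSurge int32) int32 {
	return lwsReplicas + maxSurge
}

func main() {
	// replicas=2, maxSurge=1, one replica still unready:
	// unready(1) < maxSurge(1) is false, so the surge pod survives.
	fmt.Println(wantReplicasFixed(1, 2, 1)) // 3

	// New update on replicas=1, maxSurge=1: expand to 2 immediately,
	// so the surge replica is actually created before any deletion.
	fmt.Println(rollingUpdateParametersCase2(1, 1)) // 2
}
```

Under the old `<=`, the first call would have returned 2 and deleted the surge pod while a replica was still unready; the strict comparison closes that window.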
Which issue(s) this PR fixes
Fixes #688
Special notes for your reviewer
The logic change is small (two lines in `leaderworkerset_controller.go`), but the comments have been expanded to make the invariants clearer.
A new integration test, `"rolling update with maxSurge=1 and single replica creates surge before rolling"`, directly exercises the regression: it verifies that the leader StatefulSet expands to `replicas=2` immediately after the update is triggered (i.e. the surge is actually created), then converges back to `replicas=1` once all groups are ready.

Does this PR introduce a user-facing change?
Yes: `maxSurge`-based zero-downtime rolling updates now work correctly for single-replica `LeaderWorkerSet` objects.