Conversation

fabriziopandini
Member

What this PR does / why we need it:
This PR fixes two race conditions in the logic that is responsible for scaling down OldMSs when performing rollingRollout.
The two race conditions were surfaced and documented in the context of the work for #12804. More specifically:

  • The first fails when the MD controller is called twice in a row without the MS controller being triggered in between, e.g.
    • the first reconcile scales down ms1, 6-->5 (-1)
    • the second reconcile does not take into account the scale down already in progress; the unhealthy count is wrongly computed as -1 instead of 0, which leads to increasing the replica count instead of keeping it as it is (or scaling down), and then the safeguard below errors out.
  • The second fails when the MD controller is called twice in a row, e.g. when reconciling an MD with 6 replicas, MaxSurge=3, MaxUnavailable=1
    • when the current state is: ms1, 6/5 replicas << one is scaling down, but the scale down has not yet been processed by the MS controller; ms2, 3/3 replicas
    • the reconcile leads to: ms1, 6/1 replicas << it is further scaled down by 4, which brings the total number of available machines below the minimum allowed by MaxUnavailable, which should not happen

Notably after the fix:

  • All the rollout sequence tests with default rollout order are completed without any change from the old logic
  • It is now possible to run rollout sequence tests with random rollout order, increasing the number of tested scenarios from 9 to 918

Which issue(s) this PR fixes:
Part of #12291

/area machinedeployment

@k8s-ci-robot k8s-ci-robot added area/machinedeployment Issues or PRs related to machinedeployments cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2025
Contributor

@stmcginnis stmcginnis left a comment

Code-wise, everything looks fine to me. I'm being a little nitpicky about some docstring comments, thinking about what it will be like to read through this code at some point in the future and need to understand what is happening once the current context is lost.

// NOTE: we are scaling up unavailable machines first in order to increase chances for the rollout to progress;
// however, the MS controller might have different opinion on which machines to scale down.
// As a consequence, the scale down operation must continuously assess if reducing the number of replicas
// for an older MS could further impact availability under the assumption that any scale down could further impact availability (same as above).
Contributor

I'm having a hard time parsing this sentence.

Member Author

Rephrased + added an example, PTAL

Comment on lines 222 to 223
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
Contributor

I'm also having trouble parsing this sentence. I think you're saying we will scale down the MS, decrementing by totalScaleDownCount without going below 0?

Suggested change
- // Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
- // NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
+ // Then scale down old MS by totalScaleDownCount to get to zero.
+ // NOTE: also in this case, continuously assess if reducing the number of replicas could further impact availability,

@fabriziopandini
Member Author

@sbueringer @stmcginnis thanks for the first round of feedback.
I have revised the PR to add examples and improve the godoc, and also revisited part of the logic in scaleDownOldMSs.
PTAL

@fabriziopandini
Member Author

/test pull-cluster-api-e2e-main

@sbueringer
Member

Reviewing now. I'm also going to open a PR against your PR to fixup some minor doc findings

Signed-off-by: Stefan Büringer [email protected]
@sbueringer
Member

sbueringer commented Oct 7, 2025

PR with minor fixups. It should not contain any logic changes (just godoc / variable renames / etc), but please double check: fabriziopandini#314

🌱 Fix ScaleDownOldMS: Fix review findings
@sbueringer
Member

Thank you!

/lgtm
/approve

/hold
for other reviews

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 45c79a98bd7773b41dc97d99dcab2d8b613ecdac

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2025
@sbueringer
Member

/test pull-cluster-api-e2e-main

Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/machinedeployment Issues or PRs related to machinedeployments
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
  • tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
4 participants