🐛 Fix race conditions ScaleDownOldMS #12812
Code-wise, everything looks fine to me. I'm being a little nitpicky on some docstring comments, thinking about someone reading through this code in the future and needing to understand what is happening once the current context is lost.
// NOTE: we are scaling up unavailable machines first in order to increase chances for the rollout to progress;
// however, the MS controller might have different opinion on which machines to scale down.
// As a consequence, the scale down operation must continuously assess if reducing the number of replicas
// for an older MS could further impact availability under the assumption than any scale down could further impact availability (same as above).
I'm having a hard time parsing this sentence.
Rephrased + added an example, PTAL
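For readers who lack the thread's context, here is a minimal Go sketch of one possible reading of the comment under discussion. It is not the code from this PR: the `machineSet` type, the `conservativeScaleDown` function, and the `totalAvailable` / `minAvailable` parameters are hypothetical names introduced only for illustration. The sketch targets unavailable machines first, but charges every removed replica against the availability budget, because the MS controller may pick a different (available) machine to delete.

```go
package main

import "fmt"

// machineSet is a hypothetical, simplified stand-in for an old MachineSet.
type machineSet struct {
	name      string
	replicas  int32
	available int32
}

// conservativeScaleDown returns the new replica count for each old MS that
// can be scaled down. It only targets unavailable replicas, but pessimistically
// assumes every removed replica could have been an available one, so each
// removal consumes one unit of the availability budget.
// This is an illustrative assumption, not the controller's actual algorithm.
func conservativeScaleDown(oldMSs []machineSet, totalAvailable, minAvailable int32) map[string]int32 {
	budget := totalAvailable - minAvailable
	plan := map[string]int32{}
	for _, ms := range oldMSs {
		if budget <= 0 {
			break // any further scale down could breach minAvailable
		}
		unavailable := ms.replicas - ms.available
		if unavailable <= 0 {
			continue // nothing unavailable to remove from this MS
		}
		n := unavailable
		if n > budget {
			n = budget
		}
		plan[ms.name] = ms.replicas - n
		budget -= n // re-assess availability after every scale down
	}
	return plan
}

func main() {
	oldMSs := []machineSet{
		{name: "ms-a", replicas: 3, available: 1},
		{name: "ms-b", replicas: 2, available: 2},
		{name: "ms-c", replicas: 4, available: 2},
	}
	// 8 machines available overall, at least 5 must stay available.
	fmt.Println(conservativeScaleDown(oldMSs, 8, 5)) // prints: map[ms-a:1 ms-c:3]
}
```

In this example the budget of 3 is spent on two unavailable machines in `ms-a` and one in `ms-c`; the second unavailable machine in `ms-c` is left alone because removing it might, in the worst case, drop availability below the minimum.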
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
I'm also having trouble parsing this sentence. I think you're saying we will scale down the MS, decrementing by totalScaleDownCount without going below 0?
- // Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
- // NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
+ // Then scale down old MS by totalScaleDownCount to get to zero.
+ // NOTE: also in this case, continuously assess if reducing the number of replicas could further impact availability,
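The reviewer's reading ("decrementing by totalScaleDownCount without going below 0") can be sketched as follows. This is a simplified illustration, not the PR's implementation: `scaleDownByCount` is a hypothetical helper, old MachineSets are reduced to plain replica counts, and the availability check mentioned in the NOTE is deliberately left out.

```go
package main

import "fmt"

// scaleDownByCount walks the old MachineSets in order and removes replicas
// from each one until either the MS reaches zero or the remaining
// totalScaleDownCount is used up. It returns the new replica counts and the
// residual count that could not be applied.
// Hypothetical helper for illustration only.
func scaleDownByCount(replicas []int32, totalScaleDownCount int32) ([]int32, int32) {
	out := make([]int32, len(replicas))
	remaining := totalScaleDownCount
	for i, r := range replicas {
		n := r
		if n > remaining {
			n = remaining // never remove more than the residual count
		}
		if n < 0 {
			n = 0
		}
		out[i] = r - n // r - n is never below zero because n <= r
		remaining -= n
	}
	return out, remaining
}

func main() {
	newReplicas, residual := scaleDownByCount([]int32{3, 2, 4}, 6)
	fmt.Println(newReplicas, residual) // prints: [0 0 3] 0
}
```

With replica counts 3, 2, and 4 and a total of 6, the first two MachineSets reach zero and the third loses only the one replica left in the count.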
@sbueringer @stmcginnis thanks for the first round of feedback,
/test pull-cluster-api-e2e-main
Reviewing now. I'm also going to open a PR against your PR to fix up some minor doc findings
Signed-off-by: Stefan Büringer [email protected]
PR with minor fixups. It should not contain any logic changes (just godoc / variable renames / etc), but please double check: fabriziopandini#314
🌱 Fix ScaleDownOldMS: Fix review findings
151b53a to 40f23d9
Thank you! /lgtm /hold
LGTM label has been added. Git tree hash: 45c79a98bd7773b41dc97d99dcab2d8b613ecdac
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sbueringer
The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-cluster-api-e2e-main
What this PR does / why we need it:
This PR fixes two race conditions in the logic that is responsible for scaling down OldMSs when performing rollingRollout.
The two race conditions were surfaced and documented in the context of the work for #12804. More specifically:
Notably after the fix:
Which issue(s) this PR fixes:
Part of #12291
/area machinedeployment