fix: ensure Machines with delete-machine annotation are deleted first #4949
Conversation
Hi @mweibel. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4949      +/-   ##
==========================================
+ Coverage   51.14%   51.17%   +0.02%
==========================================
  Files         274      274
  Lines       24669    24690      +21
==========================================
+ Hits        12617    12634      +17
- Misses      11264    11267       +3
- Partials      788      789       +1

☔ View full report in Codecov by Sentry.
Looks good to me overall, but I am curious to know if your approach changes with Jonathan's comment 👆🏼, @mweibel.
Thanks for the reviews!

That's a very good question.

- Prioritize failed machines: that's debatable. It can make sense, but e.g. in the case of Windows nodes we sometimes have temporarily failed machines that reboot for some reason and then come back online. Maybe prioritizing machines marked for deletion would make more sense.
- Prioritize machines without the latest model: this is another question. In our use case it's even debatable whether we want to delete those at all, because we run batch workloads (each on their own machine). This means machines without the latest model don't bother us and we'd rather keep them online. That sounds like a separate issue which needs consideration, though.

I'd happily adjust the code to prioritize machines with the delete annotation - what are your opinions on this, given the additional context I provided?
About my previous comment: I just figured out that it actually does bother us. We're currently facing an issue with machines not getting updated and the VMSS refusing to scale because we have too many VM models at the same time (10 is apparently the limit). I will file a separate issue for that.
Reading the code again, I believe this is not what the code does - or do I misunderstand something? (Lines 146 to 151 in 52df930)
This code ensures that for each of those machine groups (failed, deleting, ready, without latest model), it prioritizes machines with the delete-machine annotation over unannotated ones, because each machine group is sorted using the sort function at lines 307 to 317 in 52df930.
Though indeed, we could think about adjusting this code to first delete machines with the annotation, regardless of group, and only afterwards potentially delete other machines (maybe only in the next reconcile iteration). Would that be what we want? I think that would make sense.
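For illustration, here is a minimal, self-contained sketch of the per-group ordering being described: machines carrying the cluster.x-k8s.io/delete-machine annotation are sorted to the front of their group. The machine struct and the hasDeleteAnnotation helper are simplified placeholders, not the actual CAPZ types or the sort function referenced above.

package main

import (
	"fmt"
	"sort"
)

// machine is a simplified stand-in for an AzureMachinePoolMachine.
type machine struct {
	providerID  string
	annotations map[string]string
}

// hasDeleteAnnotation reports whether the machine carries the
// cluster.x-k8s.io/delete-machine annotation.
func hasDeleteAnnotation(m machine) bool {
	_, ok := m.annotations["cluster.x-k8s.io/delete-machine"]
	return ok
}

// sortByDeleteAnnotation orders annotated machines before unannotated ones,
// mirroring the per-group sort discussed above.
func sortByDeleteAnnotation(machines []machine) {
	sort.SliceStable(machines, func(i, j int) bool {
		return hasDeleteAnnotation(machines[i]) && !hasDeleteAnnotation(machines[j])
	})
}

func main() {
	ms := []machine{
		{providerID: "vm-0"},
		{providerID: "vm-1", annotations: map[string]string{"cluster.x-k8s.io/delete-machine": "true"}},
	}
	sortByDeleteAnnotation(ms)
	fmt.Println(ms[0].providerID) // prints vm-1: the annotated machine is picked first
}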
(force-pushed e6f1522 to eb033f2)
/retest
I assume this needs #4939 to be merged to make all tests work.
Coming back to this thought: it makes sense to me that "delete annotations" should be respected over cleanup. I vote for that change. Will you be adding it via this PR?
We are migrating CAPZ tests onto community clusters, and as part of that effort created PR kubernetes/test-infra#32926 to remove …
/test pull-cluster-api-provider-azure-e2e-aks
(force-pushed e9bdf9d to 4447721, then 4447721 to b074410)
@mweibel: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
// if we have machines annotated with delete machine, remove them
if len(deleteAnnotatedMachines) > 0 {
	log.Info("delete annotated machines", "desiredReplicaCount", desiredReplicaCount, "maxUnavailable", maxUnavailable, "deleteAnnotatedMachines", getProviderIDs(deleteAnnotatedMachines))
	return deleteAnnotatedMachines, nil
}

// if we have failed or deleting machines, remove them
if len(failedMachines) > 0 || len(deletingMachines) > 0 {
	log.Info("failed or deleting machines", "desiredReplicaCount", desiredReplicaCount, "maxUnavailable", maxUnavailable, "failedMachines", getProviderIDs(failedMachines), "deletingMachines", getProviderIDs(deletingMachines))
I wasn't sure whether we should include the annotated machines in this condition (instead of making a separate one).
I decided against that approach because it would require a bit more code (filtering out duplicates) and the benefit didn't seem huge - in the next reconciliation the failed/deleting machines would be sorted out, too.
Please let me know what you think about this, thanks!
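As an aside, a hypothetical sketch of the rejected alternative - folding annotated machines into the failed/deleting selection and de-duplicating by provider ID. The machine struct is a placeholder; this is not the CAPZ implementation.

package main

import "fmt"

type machine struct{ providerID string }

// mergeUnique concatenates the groups while skipping machines already seen,
// which is the extra bookkeeping the comment above refers to.
func mergeUnique(groups ...[]machine) []machine {
	seen := map[string]bool{}
	out := []machine{}
	for _, g := range groups {
		for _, m := range g {
			if !seen[m.providerID] {
				seen[m.providerID] = true
				out = append(out, m)
			}
		}
	}
	return out
}

func main() {
	annotated := []machine{{"vm-1"}}
	failed := []machine{{"vm-1"}, {"vm-2"}}
	fmt.Println(len(mergeUnique(annotated, failed))) // 2, not 3: the duplicate is dropped
}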
@nawazkh @Jont828 Please review again. I updated the implementation to always prefer machines annotated with the delete-machine annotation, and removed the ordering of the other kinds of machines. Some e2e tests currently fail, but to me they look a bit flaky today. I'll wait until you've reviewed before I retest and potentially investigate whether my changes could have an impact on the e2e tests.
| "deletingMachines", len(deletingMachines), | ||
| ) | ||
|
|
||
| // if we have machines annotated with delete machine, remove them |
Suggested change:
- // if we have machines annotated with delete machine, remove them
+ // If we have machines with the delete machine annotation, remove them.
Overall the PR looks solid to me; I think the only open question is how to prioritize machines with the annotation vs. failed/outdated machines. I remember when I implemented this, we had some discussion about whether to prefer failed/outdated machines over the annotated ones, and it wasn't super clear which course was best. I ended up deleting outdated machines first to be consistent with the DockerMachinePool implementation, and to ensure that we don't get into a situation where we are unable to meet the ready replica count because a failed or outdated machine can't get deleted. If I understand correctly, does your use case benefit from giving priority to annotated machines instead?
/retest |
The code as it is in this PR would give priority to those, yes. I'm not certain about this either. I guess there are three ways to handle it:

Do you have any preference regarding which path to take?
In certain circumstances the deleteMachine annotation isn't yet propagated from the Machine to the AzureMachinePoolMachine. This can result in deleting the wrong machines. This change ensures the owner Machine's deleteMachine annotation is always considered, and those machines are deleted before other kinds of machines.
(force-pushed b074410 to 0b71e08)
@nawazkh @Jont828 After testing this change in production under high load (>800 Windows machines), I noticed an issue related to the discussion/open question above: when there are too many VMSS VMs in a Failed state, the VMSS is marked as Unhealthy, preventing further scale up/down. Therefore, I adjusted this PR to first delete failed/deleting machines (sorted by delete annotation) and only afterwards delete those with the delete-machine annotation.
Looks good to me, thank you for putting this together. |
LGTM label has been added. Git tree hash: 8d4532c838a358d39fbb0cc38faf598e94f3c2f7
Thanks for your hard work! And yep, that's why I originally had it prioritize deleting failing machines over the annotation. We want to make sure we can get out of an unhealthy state before worrying about autoscaling behavior. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jont828

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
As outlined in the linked issue: when a Machine has the delete annotation, CAPZ needs to prioritize those Machines or AzureMachinePoolMachines when scaling down. This is achieved by fetching the DeleteMachine annotation from the owner Machine before making scaling decisions.
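Conceptually, the owner-Machine lookup could look like the sketch below. The helper name is made up, and the import paths follow upstream Cluster API and CAPZ conventions but have not been verified against this PR's code.

package example

import (
	"context"

	infrav1exp "sigs.k8s.io/cluster-api-provider-azure/exp/api/v1beta1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ownerMachineWantsDeletion looks up the Machine that owns the given
// AzureMachinePoolMachine and reports whether it carries the
// cluster.x-k8s.io/delete-machine annotation, since that annotation may not
// yet be propagated to the AzureMachinePoolMachine itself.
func ownerMachineWantsDeletion(ctx context.Context, c client.Client, ampm *infrav1exp.AzureMachinePoolMachine) (bool, error) {
	owner, err := util.GetOwnerMachine(ctx, c, ampm.ObjectMeta)
	if err != nil || owner == nil {
		return false, err
	}
	_, ok := owner.Annotations[clusterv1.DeleteMachineAnnotation]
	return ok, nil
}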
Which issue(s) this PR fixes:
Fixes #4941
Special notes for your reviewer:
TODOs:
Release note: