Add IsNodeCandidateForScaleDown cloud provider interface #8531
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: elmiko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment. |
i'm still working on some clusterapi-specific unit tests, but they are quite challenging given the mocks that are needed. |
+1 to what @sbueringer is saying. also, this problem is currently confined to the clusterapi provider, but it could affect any provider that updates nodes in a similar fashion. i think we need to make the autoscaler smarter in these scenarios, where a cloud provider needs more control over which nodes are marked for removal during a maintenance window. |
Force-pushed from cae87c7 to 150cf78.
updated with @sbueringer's suggestions. |
Answered above |
This function allows cloud providers to specify when a node is not a good candidate for scale down. The check occurs before the autoscaler has begun to cordon, drain, and taint any node for scale down. Also adds a unit test for the prefiltering node processor.
The initial implementation of this function for clusterapi will return that a node is not a good candidate for scale down when it belongs to a MachineDeployment that is currently rolling out an upgrade.
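A rough sketch of the shape of the proposed method (the exact signature may differ from what the PR adds):

```go
// Sketch of the proposed CloudProvider addition; not the merged code.
package cloudprovider

import (
	apiv1 "k8s.io/api/core/v1"
)

type CloudProvider interface {
	// ...existing CloudProvider methods elided...

	// IsNodeCandidateForScaleDown reports whether the node may be
	// considered for scale down. It is consulted before the autoscaler
	// cordons, drains, or taints the node, so a provider can veto the
	// choice early (e.g. while a MachineDeployment rollout is in flight).
	IsNodeCandidateForScaleDown(node *apiv1.Node) (bool, error)
}
```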
Force-pushed from 150cf78 to a81913f.
updated again with the change i missed and the correction to the logic. |
Prod code changes lgtm from my side |
Adding this annotation to a node always prevents CA from removing it. How is this insufficient? The "if Cluster API is doing a rollout" part would have to be handled on the Cluster API side, of course. Since it's already cordoning and draining the node, it seems reasonable to assume it could add an annotation, too. |
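For reference, this presumably refers to the documented cluster-autoscaler.kubernetes.io/scale-down-disabled node annotation. A minimal client-go sketch of applying it (the helper name is illustrative):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const scaleDownDisabled = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// disableScaleDown annotates a node so Cluster Autoscaler will not remove it.
func disableScaleDown(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"metadata":{"annotations":{"` + scaleDownDisabled + `":"true"}}}`)
	_, err := c.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```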
At the point where the issue occurs, Cluster API is not even aware that the autoscaler is cordoning/tainting/draining the Node. cluster-autoscaler will cordon/taint/drain the node and then eventually, at the end in RemoveNodes, tell Cluster API to delete the Node/Machine. Up until that point Cluster API has no idea what the autoscaler has been doing with the Node; until then it assumes this Node/Machine is perfectly fine and usable. |
i think @sbueringer summed it up nicely: we need a way for the cloud provider to influence how the autoscaler will perform, or not perform, the steps prior to reducing the size of a node group. it is also worth noting that for cloud providers that may create more nodes than their node group size indicates, we need an automatic way for the provider to tell the autoscaler that a node should be excluded from consideration for scale down. although adding the annotation might help in some scenarios, i think we need a mechanism the provider can exercise that does not require an update to a resource object or the api server. |
Thanks for clarifying that the draining referred to in #8494 is part of the scale-down process, not the upgrade process. From the chart in #8494 I assumed Cluster API was also draining nodes before deleting the underlying machine as part of the upgrade (hence "CA observes (...) old nodes that have been cordoned/drained").
CA's job in a scale-down scenario is "choose which node will be deleted and make it happen". If something else overrides CA's choice and chooses which node will be removed, then CA can't do its job. I don't think there's any way to make CA smarter as in "allow it to do its job even when an external API is actively preventing it". Quite possibly the best it can do is to stop trying (disable scale-down) when it believes an external API will override its decisions anyway. This is exactly what this PR does, and to be clear, I don't object to this mitigation. My only objection is adding a new method, and one very specific to this scenario, to the CloudProvider interface. There are other ways to achieve the same effect, and I'd like to explore them before growing that interface. If adding the annotation isn't possible because Cluster API doesn't drain nodes before deleting machines (does it?), then perhaps this can be handled at the node group level? Does a MachineDeployment or a MachineSet correspond to a NodeGroup object? If yes, the option to disable scale-down can probably be added to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/config/autoscaling_options.go#L39. |
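A hypothetical sketch of what such a per-node-group option could look like; the ScaleDownDisabled field below does not exist in autoscaling_options.go and is shown only for illustration:

```go
package config

import "time"

// NodeGroupAutoscalingOptions holds per-node-group tuning (abridged to a
// few of the existing fields, plus one hypothetical addition).
type NodeGroupAutoscalingOptions struct {
	ScaleDownUtilizationThreshold float64
	ScaleDownUnneededTime         time.Duration
	ScaleDownUnreadyTime          time.Duration

	// ScaleDownDisabled would exclude every node in the group from
	// scale-down consideration (hypothetical field).
	ScaleDownDisabled bool
}
```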
CAPI does drain Nodes before it deletes Machines, but as mentioned above, CAPI is not aware that cluster-autoscaler is starting to delete the Node. This works something like this:
1. cluster-autoscaler cordons the Node
2. cluster-autoscaler taints the Node
3. cluster-autoscaler drains the Node
4. cluster-autoscaler tells Cluster API to delete the Node/Machine
Only after 4. is CAPI aware that this Node is in deletion; before that, for CAPI this Machine is a healthy Machine like every other. The problem we are trying to address is that during 4. the autoscaler makes assumptions about how it can delete a specific Machine ("by scaling down the MD and marking a Machine so it gets prioritized for deletion") that are not true during MachineDeployment (MD) rollouts. Because of that, we want to ensure that the autoscaler does not scale down a Machine while an MD is rolling out (including that it should not do steps 1.-3.). The CAPI Machine deletion process looks like this (very short version):
1. the Machine is marked for deletion
2. CAPI cordons and drains the Node
3. CAPI deletes the underlying infrastructure machine
4. CAPI deletes the Node object
Another orthogonal problem, in our opinion, is that with CAPI the autoscaler should never do steps 1.-3., as CAPI has its own logic for that. But the current PR just tries to solve the issue during MD rollouts for now, as that issue can have catastrophic consequences (scaling down an MD to 0 during a rollout).
We are attempting to tell CA: "this MD is going through a rollout, you should not scale down any Node/Machine of this MD during the rollout".
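For illustration, a rollout check on the clusterapi side could look roughly like the sketch below (field names are assumed from the MachineDeployment status; the actual condition in the PR may differ):

```go
package clusterapi

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// machineDeploymentIsRollingOut is a plausible heuristic: the MD is
// considered mid-rollout while some replicas are not yet updated or
// available.
func machineDeploymentIsRollingOut(md *unstructured.Unstructured) bool {
	specReplicas, found, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil || !found {
		return false
	}
	updated, _, _ := unstructured.NestedInt64(md.Object, "status", "updatedReplicas")
	available, _, _ := unstructured.NestedInt64(md.Object, "status", "availableReplicas")
	return updated < specReplicas || available < specReplicas
}
```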
I personally don't have a strong opinion on how it's implemented; I only want to be able to keep Nodes of an MD that is rolling out of the scale-down candidates, so we can avoid the scale down entirely (including steps 1.-3.). I think today we can only block in step 4., which is way too late, as the Node is already cordoned/tainted/drained at that point.
For the current case, a MachineDeployment corresponds to a NodeGroup object. |
Thanks for the detailed explanation, much appreciated!
This sounds like a node group option will work then. It can be checked at the same point as in the current PR.
Does Cluster API ensure that all pods from the node it selects in (2) will be able to run elsewhere in the cluster? If it does, then I guess it could work, although at this point CAPI may just as well do (4), too. If not, doing (4) without (2) would risk some workloads becoming unschedulable before triggering a scale-up, essentially just causing an unnecessary disruption. I believe the only scenario where it's guaranteed to work well is removing 1 empty node at a time, which is pretty limited. But as you said, this discussion is unrelated to this PR. |
I think you are right. The difference is that this only allows disabling scale down for an entire node group, not for single nodes. But our current use case is also only to disable it for an entire node group.
Yup, let's discuss this separately once we get to that |
we discussed this issue/pr at the sig office hours today; @towca suggested looking at the custom scale down processors as another possible option. i had looked at the scale down processors initially, but didn't arrive at a solution. i am going to investigate that path again, as i tend to agree with @aleksandra-malinowska's point about not adding complexity to the cloud provider interface if we don't need it. |
Given this discussion, should we still consider putting the logic into the capi provider's NodeGroupForNode() implementation? I see that a nil result is not handled by all of this API's callers; we could strengthen those paths, given nil is documented to mean the node should not be processed by cluster autoscaler. @towca @elmiko |
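A sketch of what that suggestion might look like (helper names are hypothetical, and today callers treat a nil node group simply as "unmanaged node", so every reference would need auditing):

```go
package clusterapi

import (
	apiv1 "k8s.io/api/core/v1"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

type provider struct{ /* fields elided */ }

// nodeIsInRollingMachineDeployment is a hypothetical helper that reports
// whether the node's MachineDeployment is mid-rollout.
func nodeIsInRollingMachineDeployment(node *apiv1.Node) bool { return false }

// nodeGroupFor is a hypothetical helper that resolves the node's group.
func (p *provider) nodeGroupFor(node *apiv1.Node) (cloudprovider.NodeGroup, error) {
	return nil, nil
}

// NodeGroupForNode sketches the idea: returning nil, nil would signal
// "cluster-autoscaler should not process this node".
func (p *provider) NodeGroupForNode(node *apiv1.Node) (cloudprovider.NodeGroup, error) {
	if nodeIsInRollingMachineDeployment(node) {
		return nil, nil
	}
	return p.nodeGroupFor(node)
}
```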
I think this still applies: #8531 (comment). Still seems like the riskiest option to me. I'm not sure I follow your suggestion. Just to highlight it again: we want to disable only scale down during an MD rollout (e.g. not scale up). |
no, we can't use |
i have been investigating an alternative implementation that would allow cloud providers to inject custom processors into the core autoscaler configuration. so far, it seems that the processors change would also require at least a refactor of how the autoscaler is constructed. i think the idea of allowing cloud providers to add custom processors is actually really powerful. i will sketch that out as well, and bring it up for discussion this week at the sig meeting. |
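for illustration, one possible shape (all names below are placeholders, not a final api):

```go
package builder

import (
	"k8s.io/autoscaler/cluster-autoscaler/processors"
)

// ProcessorInjector is a placeholder name for the idea: a cloud provider
// implementing it could swap default processors for its own when the
// autoscaler is constructed.
type ProcessorInjector interface {
	// InjectProcessors may replace entries in the default processor set
	// (AutoscalingProcessors is the existing struct that bundles them).
	InjectProcessors(defaults *processors.AutoscalingProcessors)
}
```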
we discussed this PR at the sig meeting yesterday. i am going to proceed with creating the alternate patch, i think it will give a better end result in terms of giving providers more functionality. i will test the alternate solution to ensure it continues to solve the problem. |
the refactor is looking good and does not add complexity to the cloud provider interface, although it does modify how cloud providers are constructed. i will post a PR once i finish some local testing. |
i have created #8583 as an alternate solution. it appears to solve the same issues in my local test environment and has the advantage of not introducing a new cloud provider interface function. i think i prefer the 8583 solution, but i will leave both PRs open for a few days to gather comments. if there are no objections, i will most likely close this PR by the end of the week. |
i am closing this in favor of #8583. i think the other PR is less complex and gives us more options in the future. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #8494
Special notes for your reviewer:
this is a challenging scenario to debug, please see the related issue.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: