Description
What problem are you trying to solve?
Node Auto Repair is a great feature for having Karpenter automatically terminate impaired nodes. However, the way it is built, with no ability to be configured, makes it unsuitable for more disruption-intolerant NodePools/clusters. It would be nice to be able to configure its behaviour per NodePool so it works within given constraints.
For context, a component inadvertently marked healthy nodes as unhealthy and, with this feature enabled, we terminated too many nodes.
The docs currently mention some hard-coded controls in place:
To prevent cascading failures, Karpenter includes safety mechanisms: it will not perform repairs if more than 20% of nodes in a NodePool are unhealthy, and for standalone NodeClaims, it evaluates this threshold against all nodes in the cluster.
I'd like to see these being extended and configurable per-nodepool with:
- The ability to attempt a graceful termination first. Most of the time this won't work for a truly unhealthy node, but if, say, 15m into the 30m wait Karpenter first tried gracefully terminating the node, it would at least attempt to resolve the node's issues before resorting to an ungraceful delete
- Configurable concurrent terminations per NodePool. Just as disruption budgets can be configured for other reasons, a budget for auto-repair terminations of, say, 1 at a time, once per 5m, would also protect workloads with PDB requirements (see the sketch after this list)
- Configuration of how long Karpenter waits before it "auto repairs" a node, per condition, per NodePool
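
To make the proposal concrete, here is a rough sketch of what such a per-NodePool configuration might look like. None of this exists in Karpenter today: the `Unhealthy` budget reason and the entire `repairPolicy` block (including `gracefulTerminationFirst`, `unhealthyConditions`, and `tolerationDuration`) are hypothetical names used purely for illustration.

```yaml
# Hypothetical sketch only: the "Unhealthy" budget reason and the
# "repairPolicy" block are NOT part of Karpenter's current API.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: disruption-intolerant
spec:
  disruption:
    budgets:
      # Allow at most one auto-repair termination at a time
      # (hypothetical "Unhealthy" reason; pacing such as "once per 5m"
      # would need an additional field).
      - nodes: "1"
        reasons: ["Unhealthy"]
  # Hypothetical per-NodePool auto-repair settings
  repairPolicy:
    # Attempt a graceful drain/delete before an ungraceful one
    gracefulTerminationFirst: true
    unhealthyConditions:
      - type: Ready
        status: "False"
        # How long to wait before repairing, per condition
        tolerationDuration: 30m
```

Reusing the existing disruption-budget mechanism for auto-repair terminations would keep the behaviour consistent with how other disruption reasons are already rate-limited.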
How important is this feature to you?
We likely won't be confident enough to enable this feature until these protections are in place, and would probably build something similar outside of Karpenter to handle terminating nodes in this way.
- Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment