Allow configuring Node Auto Repair per nodepool and extend the options #2811

@domgoodwin

Description

What problem are you trying to solve?
Node Auto Repair is a great feature that lets Karpenter automatically terminate impaired nodes. However, the way it's built, and the fact that it can't be configured, makes it unsuitable for more disruption-intolerant nodepools/clusters. It would be nice to be able to configure its behaviour per nodepool so that it works within given constraints.

For context, a component inadvertently marked healthy nodes as unhealthy, and we terminated too many nodes with this feature enabled.

The docs currently mention some hard-coded controls:

To prevent cascading failures, Karpenter includes safety mechanisms: it will not perform repairs if more than 20% of nodes in a NodePool are unhealthy, and for standalone NodeClaims, it evaluates this threshold against all nodes in the cluster.
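
The safety mechanism quoted above can be sketched roughly as follows. This is only an illustration of the documented 20% rule, not Karpenter's actual code: the function name `repairAllowed`, the integer comparison, and the handling of an empty scope are all assumptions.

```go
package main

import "fmt"

// thresholdPercent mirrors the 20% figure quoted from the docs.
const thresholdPercent = 20

// repairAllowed is a hypothetical helper (not a Karpenter function).
// It reports whether repairs may proceed given the number of unhealthy
// nodes and the total number of nodes in scope: a NodePool, or the whole
// cluster for standalone NodeClaims. Repairs are blocked only when MORE
// than 20% of the nodes are unhealthy.
func repairAllowed(unhealthy, total int) bool {
	if total <= 0 {
		// Assumption: with no nodes in scope there is nothing to repair.
		return false
	}
	// Integer form of unhealthy/total <= 20%, avoiding float rounding.
	return unhealthy*100 <= total*thresholdPercent
}

func main() {
	fmt.Println(repairAllowed(2, 10)) // prints: true  (exactly 20%)
	fmt.Println(repairAllowed(3, 10)) // prints: false (30% unhealthy)
}
```

Because the threshold is a constant here, every nodepool gets the same limit, which is the rigidity this issue is asking to relax.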

I'd like to see these extended and made configurable per nodepool, with:

  • The ability to attempt a graceful termination first. For a truly unhealthy node this usually won't work, but if Karpenter tried a graceful termination 15m into the 30m wait, it would at least attempt to resolve the node's issues before resorting to an ungraceful delete.
  • Configuration for concurrent terminations per nodepool. Just as we can configure disruption budgets for other reasons, a disruption budget for auto repair terminations (say, 1 at a time, once per 5m) could also protect workloads that have PDB requirements.
  • Configuration for how long Karpenter waits before it "auto repairs" a node, per condition, per nodepool.

How important is this feature to you?

We likely won't be confident enough to enable this feature until these protections are in place, and would probably build something similar outside of Karpenter to handle terminating nodes in this way.

Metadata

Assignees

No one assigned

    Labels

    kind/feature, needs-priority, needs-triage