Improve cluster-autoscaler integration

While investigating an [autoscaler issue](https://github.com/kubernetes/autoscaler/issues/8494), we identified a few areas where the autoscaler/CAPI integration could be improved.

We should look into the following areas:
* [ ] Find an initial stopgap solution for the CAPI controller and the autoscaler fighting over replicas during MD rollouts (see: https://github.com/kubernetes/autoscaler/issues/8494)
* [ ] Improve behavior during Machine deletion (including Node drain etc.)
  * Today autoscaler cordons/taints/drains Nodes before triggering Machine deletion
  * This means that the CAPI Machine deletion logic is not respected (pre-drain hooks, MachineDrainRules, drain observability, ...)
  * One idea: allow disabling cordon/taint/drain in the autoscaler. Options:
    * Add a global flag to disable drain
    * Extend the `CloudProvider` interface with a new method to allow disabling drain per node group
    * Extend the `GetOptions` method of the `NodeGroup` interface to allow disabling drain per node group
* [ ] Double-check that the autoscaler does not scale up too many Machines based on pending Pods
  * Not entirely sure, but it looks like we observed autoscaler scaling up twice within 12 seconds because of just 1 pending Pod
* [ ] Find a final solution for autoscaling during MD rollouts
  * [ ] Improve how the autoscaler triggers Machine deletion (the `delete-machine` annotation on MS level plus an MD scale down is a weak API, if it counts as an API at all)
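
To make the per-node-group option above more concrete, here is a rough Go sketch of how a drain opt-out could flow through a `GetOptions`-style method. Everything in it is hypothetical: `NodeGroupOptions`, its `SkipDrain` field, the trimmed-down `NodeGroup` interface, and `shouldDrain` are illustrative stand-ins, not the actual autoscaler types or an agreed design. It also assumes that a nil result from `GetOptions` means "use the defaults".

```go
package main

import "fmt"

// NodeGroupOptions is a hypothetical stand-in for the per-node-group
// options returned by GetOptions. SkipDrain does not exist in the
// autoscaler today; it is the field this sketch proposes.
type NodeGroupOptions struct {
	SkipDrain bool
}

// NodeGroup is a hypothetical, trimmed-down version of the autoscaler's
// NodeGroup interface, reduced to what this sketch needs.
type NodeGroup interface {
	Id() string
	GetOptions(defaults NodeGroupOptions) (*NodeGroupOptions, error)
}

// capiNodeGroup models a node group backed by a CAPI MachineDeployment.
// It opts out of autoscaler-side drain so that CAPI's own Machine
// deletion flow (pre-drain hooks, MachineDrainRules, ...) does the work.
type capiNodeGroup struct{ name string }

func (g capiNodeGroup) Id() string { return g.name }

func (g capiNodeGroup) GetOptions(defaults NodeGroupOptions) (*NodeGroupOptions, error) {
	opts := defaults
	opts.SkipDrain = true
	return &opts, nil
}

// otherNodeGroup models a provider that does not override anything.
// Returning nil is assumed here to mean "use the defaults".
type otherNodeGroup struct{ name string }

func (g otherNodeGroup) Id() string { return g.name }

func (g otherNodeGroup) GetOptions(NodeGroupOptions) (*NodeGroupOptions, error) {
	return nil, nil
}

// shouldDrain reports whether the autoscaler itself should
// cordon/taint/drain Nodes of this group before requesting deletion.
func shouldDrain(ng NodeGroup, defaults NodeGroupOptions) (bool, error) {
	opts, err := ng.GetOptions(defaults)
	if err != nil {
		return false, err
	}
	if opts == nil {
		opts = &defaults
	}
	return !opts.SkipDrain, nil
}

func main() {
	defaults := NodeGroupOptions{} // drain stays enabled by default
	for _, ng := range []NodeGroup{capiNodeGroup{"md-0"}, otherNodeGroup{"asg-0"}} {
		drain, err := shouldDrain(ng, defaults)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("node group %s: autoscaler drains = %v\n", ng.Id(), drain)
	}
}
```

Compared to a global flag, a per-node-group switch like this would let mixed setups keep the current behavior for node groups that rely on it, at the cost of touching the shared interface.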




Improve cluster-autoscaler integration #12762
