# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.

## Background

Cluster Autoscaler supports cluster-wide limits on resources (like total CPU and
memory) and per-node-group node count limits. Karpenter supports
setting [resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits).
Additionally, as noted in
the [AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
Karpenter does not support cluster-wide limits. These mechanisms are not
flexible enough for many use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limits implementations in both node autoscalers do
not support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.

## Proposal: The CapacityQuota API

We propose a new Kubernetes custom resource, CapacityQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node's eligibility for a provisioning operation will be checked against all
CapacityQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This should be
compatible with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits
and Karpenter’s NodePool limits: if an operation doesn’t violate any
CapacityQuota but violates one of the existing limiting mechanisms, it should
still be rejected.
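
To make the enforcement concrete, the sketch below shows one way an autoscaler
could evaluate a single quota before provisioning a node: sum the relevant
resources of the nodes the quota currently selects, add the candidate node, and
compare the result against the limits. This is only an illustration of the
intended behavior, written in Go; the function and helper names are assumptions
for this sketch, not part of the proposal. Counting nodes for a node-count
limit would work the same way and is omitted for brevity.

```go
// Sketch of the per-quota check a node autoscaler could run before
// provisioning a node. Names are illustrative only.
package quota

import corev1 "k8s.io/api/core/v1"

// exceedsQuota reports whether adding a node with the given capacity to the
// nodes already selected by a CapacityQuota would violate any of its limits.
// The autoscaler would run this for every CapacityQuota whose selector matches
// the candidate node, and still apply the existing cluster-wide and NodePool
// limits afterwards.
func exceedsQuota(limits corev1.ResourceList, selectedNodes []corev1.Node, candidateCapacity corev1.ResourceList) bool {
	used := corev1.ResourceList{}
	for _, n := range selectedNodes {
		add(used, n.Status.Capacity)
	}
	add(used, candidateCapacity)
	for name, limit := range limits {
		if sum, ok := used[name]; ok && sum.Cmp(limit) > 0 {
			return true // this limit would be exceeded
		}
	}
	return false
}

// add accumulates delta into total, resource by resource.
func add(total, delta corev1.ResourceList) {
	for name, qty := range delta {
		cur := total[name]
		cur.Add(qty)
		total[name] = cur
	}
}
```
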

### API Specification

A CapacityQuota object would look as follows:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines limits on the summed-up resources of the selected nodes.

This approach is highly flexible – adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.
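
For illustration, the Go types backing such a CRD could look roughly as
follows. This is a sketch meant to show the shape of the object; the field
names mirror the YAML above, but none of this is a final type definition.

```go
// Sketch of possible Go types for the CapacityQuota CRD; not a final API.
package v1beta1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CapacityQuota limits the total resources of an arbitrary set of nodes.
type CapacityQuota struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CapacityQuotaSpec   `json:"spec"`
	Status CapacityQuotaStatus `json:"status,omitempty"`
}

// CapacityQuotaSpec selects nodes and defines the limits applied to them.
type CapacityQuotaSpec struct {
	// Selector picks the nodes the limits apply to. An absent selector
	// means the quota applies to all nodes in the cluster.
	Selector *metav1.LabelSelector `json:"selector,omitempty"`
	// Limits caps the summed-up resources of the selected nodes.
	Limits CapacityQuotaLimits `json:"limits"`
}

// CapacityQuotaLimits holds the resource caps, e.g. cpu, memory,
// nvidia.com/gpu, or a special "nodes" resource counting the nodes themselves.
type CapacityQuotaLimits struct {
	Resources corev1.ResourceList `json:"resources,omitempty"`
}

// CapacityQuotaStatus reports current usage for observability (see the
// CapacityQuota Status section below).
type CapacityQuotaStatus struct {
	Used CapacityQuotaLimits `json:"used,omitempty"`
}
```
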

### Node as a Resource

The CapacityQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.

### CapacityQuota Status

For better observability, the CapacityQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage for the selected nodes, allowing users to quickly
check usage against the defined limits via `kubectl describe`. The controller can
run in a separate thread as a part of the node autoscaler component.

An example of the status field:

```yaml
status:
  used:
    resources:
      cpu: 32
      memory: 128Gi
      nodes: 50
```
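
The controller could compute the `used` field roughly as sketched below: list
the nodes matched by the quota’s selector, sum their capacity, count them, and
write the result to the status subresource. Only the aggregation step is shown;
the function name and the choice to report the count under a "nodes" resource
name are illustrative assumptions, not part of the proposal.

```go
// Sketch of how a controller could compute the `used` field of a
// CapacityQuota status. Writing to the status subresource is omitted.
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// computeUsed sums the capacity of the nodes selected by the quota's selector
// and reports the node count under the "nodes" resource name.
func computeUsed(selector *metav1.LabelSelector, nodes []corev1.Node) (corev1.ResourceList, error) {
	sel := labels.Everything() // no selector means the quota covers all nodes
	if selector != nil {
		var err error
		if sel, err = metav1.LabelSelectorAsSelector(selector); err != nil {
			return nil, err
		}
	}
	used := corev1.ResourceList{}
	count := 0
	for _, n := range nodes {
		if !sel.Matches(labels.Set(n.Labels)) {
			continue
		}
		count++
		for name, qty := range n.Status.Capacity {
			cur := used[name]
			cur.Add(qty)
			used[name] = cur
		}
	}
	used["nodes"] = *resource.NewQuantity(int64(count), resource.DecimalSI)
	return used, nil
}
```
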

## Alternatives considered

### Minimum limits support

The initial design, besides the maximum limits, also included minimum limits.
Minimum limits were supposed to affect node consolidation in the node
autoscalers: a consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control cloud provider
costs, to limit scaling of certain types of compute, or to control the
distribution of compute resources between teams working on the same cluster.
Minimum limits’ main use case is ensuring a baseline capacity for users’
workloads, for example to handle sudden spikes in traffic. However, minimum
limits defined as a minimum amount of resources in the cluster or a subset of
nodes do not guarantee that the workloads will be schedulable on those
resources. For example, two nodes with 2 CPUs each satisfy a minimum limit of
4 CPUs. If a user created a workload requesting 2 CPUs, that workload would not
fit on the existing nodes, making the baseline capacity effectively useless.
This scenario will be better handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide an exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.

Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.

### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes
resources: [LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/)
and [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, it operates on individual objects:
it doesn’t aggregate over all pods in the namespace, but only checks whether
each pod’s requests and limits are within the defined bounds.

ResourceQuota allows defining and limiting the aggregate resource consumption
per namespace. This includes limiting the total CPU, memory, and storage that
all pods and persistent volume claims within a namespace can request or consume.
It also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and CA does not guarantee that bin packing will yield
the optimal result, setting the quota to e.g. 64 CPUs does not mean that CA will
stop scaling at 64 CPUs.

Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while nodes are cluster-scoped. We
can’t use namespaced resources to limit the creation and deletion of
cluster-scoped resources.

## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```

### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful e.g. for organizations where multiple teams are running
workloads in a shared cluster, and these teams have separate sets of nodes. This
way, a cluster administrator can ensure that each team has a proper limit for
its resources and doesn’t starve the other teams. This story is partly
supported by Karpenter’s NodePool limits.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```

### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs, or limit the impact of using non-optimal
machines for my workloads.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```

### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested
in [https://github.com/kubernetes/autoscaler/issues/6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. The CapacityBuffer API covers this use case in a
more flexible way.

## Other CapacityQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.

#### **Maximum Windows Nodes**

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```

#### **Maximum NVIDIA T4 GPUs**

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```

#### **Cluster-wide Limits Excluding Control Plane Nodes**

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```