Commit b668556: Granular resource limits proposal
1 parent 48dfe75 · 1 file changed (+328 −0)

# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.

## Background

Cluster Autoscaler supports cluster-wide limits on resources (like total CPU and
memory) and per-node-group node count limits. Karpenter supports
setting [resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits).
Also, as mentioned in
the [AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
Karpenter does not support cluster-wide limits. This is not flexible enough for
many use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limits implementations in both node autoscalers do
not support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.

## Proposal: The AutoscalingResourceQuota API

We propose a new Kubernetes custom resource, AutoscalingResourceQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node’s eligibility for a provisioning operation will be checked against all
AutoscalingResourceQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This should be
compatible with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits
and Karpenter’s NodePool limits. Therefore, if the operation doesn’t violate any
AutoscalingResourceQuota, but violates an existing limiting mechanism, it should
be rejected.
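
The evaluation can be thought of as a simple admission check over summed node
resources. Below is a minimal, illustrative Go sketch of that check; the helper
name `allowsProvisioning` and its arguments are assumptions made for this
proposal, not an existing autoscaler API.

```go
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// allowsProvisioning reports whether adding the candidate node's resources on
// top of the current usage of the nodes selected by a quota stays within the
// quota's limits. Resources absent from the limits are not constrained.
func allowsProvisioning(limits, usage, candidate corev1.ResourceList) bool {
	for name, limit := range limits {
		total := resource.Quantity{}
		if used, ok := usage[name]; ok {
			total.Add(used)
		}
		if extra, ok := candidate[name]; ok {
			total.Add(extra)
		}
		// Reject as soon as any single limited resource would be exceeded.
		if total.Cmp(limit) > 0 {
			return false
		}
	}
	return true
}
```

The provisioning operation would be allowed only if this check passes for every
AutoscalingResourceQuota selecting the node, and if the existing cluster-wide or
NodePool limits are also satisfied.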

### API Specification

An AutoscalingResourceQuota object would look as follows:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines limits on the summed-up resources of the selected nodes.

This approach is highly flexible – adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.
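
For illustration, the corresponding Go API types could look roughly like the
sketch below. This is a sketch only; the package, type, and field names are
assumptions made for this proposal, not an existing API.

```go
package v1beta1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AutoscalingResourceQuota limits the total resources of a selected set of nodes.
type AutoscalingResourceQuota struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec AutoscalingResourceQuotaSpec `json:"spec"`
}

type AutoscalingResourceQuotaSpec struct {
	// Selector picks the nodes the limits apply to.
	// An absent selector selects all nodes in the cluster.
	Selector *metav1.LabelSelector `json:"selector,omitempty"`
	// Limits caps the summed-up resources of the selected nodes.
	Limits ResourceLimits `json:"limits"`
}

type ResourceLimits struct {
	// Resources maps a resource name (cpu, memory, nvidia.com/gpu, or the
	// synthetic "nodes" count) to its maximum total amount.
	Resources corev1.ResourceList `json:"resources"`
}
```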

### Node as a Resource

The AutoscalingResourceQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.

### AutoscalingResourceQuota Status

For better observability, the AutoscalingResourceQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage for the selected nodes, allowing users to quickly
check usage against the defined limits via `kubectl describe`. The controller can
run in a separate thread as a part of the node autoscaler component.

An example of the status field:

```yaml
status:
  usage:
    cpu: 32
    memory: 128Gi
    nodes: 50
```
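
One possible way for such a controller to compute the usage is sketched below,
assuming it lists all nodes and matches them against the quota’s selector. The
helper name `computeUsage` and the choice of summing node capacity (rather than
allocatable resources) are illustrative assumptions, not settled design.

```go
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// computeUsage sums the capacity of every node matched by the quota's selector
// and reports the matched node count under the synthetic "nodes" resource.
func computeUsage(selector *metav1.LabelSelector, nodes []corev1.Node) (corev1.ResourceList, error) {
	sel := labels.Everything() // no selector: the quota applies cluster-wide
	if selector != nil {
		var err error
		sel, err = metav1.LabelSelectorAsSelector(selector)
		if err != nil {
			return nil, err
		}
	}

	usage := corev1.ResourceList{}
	matched := int64(0)
	for _, node := range nodes {
		if !sel.Matches(labels.Set(node.Labels)) {
			continue
		}
		matched++
		for name, quantity := range node.Status.Capacity {
			sum := usage[name]
			sum.Add(quantity)
			usage[name] = sum
		}
	}
	usage["nodes"] = *resource.NewQuantity(matched, resource.DecimalSI)
	return usage, nil
}
```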

## Alternatives considered

### Minimum limits support

The initial design, besides the maximum limits, also included minimum limits.
Minimum limits were supposed to affect the node consolidation in the node
autoscalers. A consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first, it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control the cloud provider
costs, to limit scaling certain types of compute, or to control distribution of
compute resources between teams working on the same cluster. Minimum limits’
main use case is ensuring a baseline capacity for users’ workloads, for example
to handle sudden spikes in traffic. However, minimum limits defined as a minimum
amount of resources in the cluster or a subset of nodes do not guarantee that
the workloads will be schedulable on those resources. For example, two nodes
with 2 CPUs each satisfy the minimum limit of 4 CPUs. If a user created a
workload requesting 2 CPUs, that workload would not fit into existing nodes,
making the baseline capacity effectively useless. This scenario will be better
handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide an exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.

Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.

### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes
resources: [LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/)
and [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, its scope is limited to individual
objects: it does not look at all pods in the namespace together, but only checks
whether each pod’s requests and limits are within the defined bounds.

ResourceQuota allows defining and limiting the aggregate resource consumption
per namespace. This includes limiting the total CPU, memory, and storage that all
pods and persistent volume claims within a namespace can request or consume. It
also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that the CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and the CA does not guarantee that bin packing will
yield the optimal result, setting the quota to e.g. 64 CPUs does not mean that
the CA will stop scaling at 64 CPUs.

Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while nodes are cluster-scoped. We
can’t use namespaced resources to limit the creation and deletion of
cluster-scoped resources.

## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```

### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful e.g. for organizations where multiple teams are running
workloads in a shared cluster, and these teams have separate sets of nodes. This
way, a cluster administrator can ensure that each team has a proper limit for
its resources and does not starve the other teams. This story is partly
supported by Karpenter’s NodePool limits.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```

### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs, or limit the impact of using non-optimal
machines for my workloads.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```

### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested
in [https://github.com/kubernetes/autoscaler/issues/6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. The CapacityBuffer API covers this use case in a
more flexible way.

## Other AutoscalingResourceQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.

### Maximum Windows Nodes

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```

### Maximum NVIDIA T4 GPUs

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```

### Cluster-wide Limits Excluding Control Plane Nodes

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```
