# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.

## Background

Cluster Autoscaler (CAS) supports cluster-wide limits on resources (such as
total CPU and memory) and per-node-group node count limits. Karpenter supports
setting [resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits),
but, as noted in the
[AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
it does not support cluster-wide limits. This is not flexible enough for many
use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limit implementations in both node autoscalers do
not support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.

## Proposal: The CapacityQuota API

We propose a new Kubernetes custom resource, CapacityQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node’s eligibility for a provisioning operation will be checked against all
CapacityQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This should be
compatible with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits
and Karpenter’s NodePool limits. Therefore, if an operation doesn’t violate any
CapacityQuota but does violate one of the existing limiting mechanisms, it
should still be rejected.

### API Specification

A CapacityQuota object would look as follows:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines limits on the summed-up resources of the selected nodes.

This approach is highly flexible: adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.
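
To make the intended semantics concrete, here is a minimal sketch in Go (the
language of both autoscalers) of checking a candidate node against every quota
that selects it. All names and types here (`CapacityQuota` as plain maps,
`canProvision`, `usage`) are simplified illustrations, not actual autoscaler
code:

```go
package main

import "fmt"

// Simplified, illustrative types: a real implementation would use the
// generated CRD types and k8s.io/apimachinery label selectors.
type CapacityQuota struct {
	Name        string
	MatchLabels map[string]string // simplified selector (matchLabels only)
	Limits      map[string]int64  // e.g. "cpu": 64, "nodes": 8
}

// selects reports whether the quota's selector matches the node's labels.
// An empty selector selects every node (a cluster-wide quota).
func (q CapacityQuota) selects(nodeLabels map[string]string) bool {
	for k, v := range q.MatchLabels {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

// canProvision checks a candidate node against every CapacityQuota that
// selects it. The scale-up is permitted only if, for each such quota,
// current usage plus the node's resources stays within all declared limits.
func canProvision(quotas []CapacityQuota, usage map[string]map[string]int64,
	nodeLabels map[string]string, nodeResources map[string]int64) bool {
	for _, q := range quotas {
		if !q.selects(nodeLabels) {
			continue
		}
		for resource, limit := range q.Limits {
			if usage[q.Name][resource]+nodeResources[resource] > limit {
				return false // provisioning would violate this quota
			}
		}
	}
	return true
}

func main() {
	quotas := []CapacityQuota{{
		Name:        "max-e2-resources",
		MatchLabels: map[string]string{"example.cloud.com/machine-family": "e2"},
		Limits:      map[string]int64{"cpu": 32, "nodes": 8},
	}}
	usage := map[string]map[string]int64{
		"max-e2-resources": {"cpu": 30, "nodes": 7},
	}
	node := map[string]string{"example.cloud.com/machine-family": "e2"}
	// false: 30 + 4 CPUs would exceed the 32-CPU limit.
	fmt.Println(canProvision(quotas, usage, node, map[string]int64{"cpu": 4, "nodes": 1}))
}
```

Note that a cluster-wide quota is simply one with an empty selector, which
matches every node.
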
### Node as a Resource

The CapacityQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.

### CapacityQuota Status

For better observability, the CapacityQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage for the selected nodes, allowing users to quickly
check usage against the defined limits via `kubectl describe`. The controller can
run in a separate thread as a part of the node autoscaler component.

An example of the status field:

```yaml
status:
  used:
    resources:
      cpu: 32
      memory: 128Gi
      nodes: 50
```
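
A minimal sketch of the aggregation such a controller could perform, using
simplified stand-in types (a real implementation would list `v1.Node` objects
and patch the CapacityQuota status subresource):

```go
package main

import "fmt"

// Node is a simplified stand-in for v1.Node: just labels and a resource map.
type Node struct {
	Labels    map[string]string
	Resources map[string]int64 // e.g. "cpu": 4, "nodes": 1
}

// aggregateUsage sums the resources of all nodes matched by a quota's
// matchLabels selector; the result is what would be written to status.used.
func aggregateUsage(matchLabels map[string]string, nodes []Node) map[string]int64 {
	used := map[string]int64{}
	for _, n := range nodes {
		matched := true
		for k, v := range matchLabels {
			if n.Labels[k] != v {
				matched = false
				break
			}
		}
		if !matched {
			continue
		}
		for r, quantity := range n.Resources {
			used[r] += quantity
		}
	}
	return used
}

func main() {
	nodes := []Node{
		{Labels: map[string]string{"team": "a"}, Resources: map[string]int64{"cpu": 4, "nodes": 1}},
		{Labels: map[string]string{"team": "b"}, Resources: map[string]int64{"cpu": 8, "nodes": 1}},
	}
	// A real controller would run this periodically for every CapacityQuota
	// and patch the object's status subresource.
	fmt.Println(aggregateUsage(map[string]string{"team": "a"}, nodes)) // map[cpu:4 nodes:1]
}
```
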
## Alternatives considered

### Minimum limits support

The initial design, besides the maximum limits, also included minimum limits.
Minimum limits were supposed to affect node consolidation in the node
autoscalers: a consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control cloud provider
costs, to limit the scaling of certain types of compute, or to control the
distribution of compute resources between teams working on the same cluster.
Minimum limits’ main use case is ensuring a baseline capacity for users’
workloads, for example to handle sudden spikes in traffic. However, minimum
limits defined as a minimum amount of resources in the cluster or a subset of
nodes do not guarantee that the workloads will be schedulable on those
resources. For example, two nodes with 2 CPUs each satisfy a minimum limit of
4 CPUs, but a workload requesting 2 CPUs would not fit on either of those nodes
(each node’s allocatable is less than its capacity), making the baseline
capacity effectively useless. This scenario is better handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide an exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.
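
As a purely illustrative sketch of that idea (the field names below are
assumptions; the actual schema is defined in the linked CapacityBuffer
proposal), such a buffer could look like:

```yaml
# Illustrative shape only; see the CapacityBuffer proposal for the real schema.
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: CapacityBuffer
metadata:
  name: baseline-2cpu
spec:
  replicas: 1
  podTemplateRef:
    name: baseline-2cpu-template  # a PodTemplate whose container requests 2 CPUs
```
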
Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.

### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes
resources: [LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/)
and [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, it operates on one object at a time:
it does not aggregate across all pods in the namespace, but only checks whether
each pod’s requests and limits are within the defined bounds.
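
For illustration, a typical LimitRange bounds each container individually and
says nothing about the namespace’s aggregate usage:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: per-container-bounds
  namespace: team-a
spec:
  limits:
    - type: Container
      min:
        cpu: 100m
      max:
        cpu: "4"
      defaultRequest:
        cpu: 500m
```
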
ResourceQuota allows defining and limiting the aggregate resource consumption
per namespace. This includes limiting the total CPU, memory, and storage that
all pods and persistent volume claims within a namespace can request or consume.
It also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and CA does not guarantee that bin packing will yield
the optimal result, setting the quota to e.g. 64 CPUs does not mean that CA will
stop scaling at 64 CPUs.
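
For example, a ResourceQuota like the following caps the sum of pod CPU requests
in one namespace, but says nothing about node capacity; the autoscaler may still
provision nodes whose total capacity exceeds 64 CPUs:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-requests
  namespace: team-a
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 128Gi
```
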
Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while nodes are cluster-scoped. We
can’t use namespaced resources to limit the creation and deletion of
cluster-scoped resources.

## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```

### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful, e.g., for organizations where multiple teams run workloads
in a shared cluster and these teams have separate sets of nodes. This way, a
cluster administrator can ensure that each team has a proper limit for its
resources and doesn’t starve the other teams. This story is partly supported by
Karpenter’s NodePool limits.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```

### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs and limit the impact of running my workloads
on non-optimal machines.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```

### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested
in [kubernetes/autoscaler#6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. The CapacityBuffer API covers this use case in a
more flexible way.

## Other CapacityQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.

#### **Maximum Windows Nodes**

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```

#### **Maximum NVIDIA T4 GPUs**

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```

#### **Cluster-wide Limits Excluding Control Plane Nodes**

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```
