---
title: CRD Component Scheduler Estimation
authors:
- "@mszacillo"
- "@Dyex719"
reviewers:
- "@RainbowMango"
- "@XiShanYongYe-Chang"
- "@zhzhuang-zju"
approvers:
- "@RainbowMango"

create-date: 2024-06-17
---

# Multiple Pod Template Support

## Summary

Users may want to use Karmada for resource-aware scheduling of Custom Resources (CRDs). This is already possible if the CRD is comprised of a single podTemplate, which Karmada can parse if the user defines the ReplicaRequirements with this in mind. Resource-aware scheduling becomes more difficult, however, if the CRD is comprised of multiple podTemplates or pods with differing resource requirements.

In the case of [FlinkDeployments](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/), there are two podTemplates, representing the jobManager and the taskManagers. These components can have different resourceRequirements, which Karmada cannot currently distinguish while making maxReplica estimates. This is due to a limitation in the API definition of ReplicaRequirements, which assumes that all replicas scheduled by Karmada have the same resource request.

We could technically add up all the individual component requirements and put the sum into the replicaRequirements, but Karmada would treat this like a "super replica" and try to find a node in the destination cluster that could fit the entire replica. In many cases, this is simply not possible.

For this proposal, we would like to enhance the accurate estimator to account for complex CRDs with multiple podTemplates or components.

## Background on our Use-Case

Karmada will be used as an intelligent scheduler for FlinkDeployments. We aim to use the accurate estimator (with the ResourceQuota plugin enabled) to estimate whether a FlinkDeployment can be fully scheduled in the potential destination namespace. To make this estimation, we need to take into account all of the resource requirements of the components that will be scheduled by the Flink Operator. Once the CRD is scheduled by Karmada, the Flink Operator takes over the rest of the component scheduling, as seen below.

![Karmada-Scheduler](Karmada-Scheduler.png)

In the case of Flink, these components are the JobManager(s) and the TaskManager(s). Both components can be comprised of multiple pods, and the JM and TM frequently do not have the same resource requirements.

## Motivation

Karmada currently provides two methods of scheduling estimation:
1. The general estimator, which analyzes total cluster resources to determine scheduling.
2. The accurate estimator, which can inspect namespaced resource quotas and determine the number of potential replicas via the ResourceQuota plugin.

This proposal aims to improve the second method by allowing users to define components for their replica and provide precise resourceRequirements for each.

## Goals

- Provide a declarative pattern for defining the resourceRequests of individual replica components
- Allow more accurate scheduling estimates for CRDs

## Design Details

### API change

The main changes of this proposal are to the API definition of the ReplicaRequirements struct. We currently include the replicaCount and replicaRequirements as root-level attributes of the ResourceBindingSpec. The limitation here is that we are unable to define unique replicaRequirements when the resource has more than one podTemplate.

To address this, we can move the concept of replicas and replicaRequirements into a struct describing the individual resource's `Components`. Each `Component` will have a `Name`, a number of `Replicas`, and corresponding `replicaRequirements`. These basic fields are necessary to allow the accurate estimator to determine whether all components of the CRD replica will fit in the destination namespace.

The definition of ReplicaRequirements will stay the same, with the drawback that the user will need to define how Karmada interprets the individual components of the CRD. Karmada should also support a default component which will use one of the resource's podTemplates to find requirements.

```go
type ResourceBindingSpec struct {

	. . .

	// TotalReplicas represents the total number of replicas scheduled by this resource.
	// Each replica will be represented by exactly one component of the resource.
	TotalReplicas int32 `json:"totalReplicas,omitempty"`

	// Components defines the requirements of the individual components of the resource.
	// +optional
	Components []ComponentRequirements `json:"components,omitempty"`

	. . .
}

// ComponentRequirements is a unique representation of a resource's replica. For simple
// resources, like Deployments, there will only be one component, associated with the
// podTemplate in the Deployment definition.
//
// Complex resources can have multiple components controlled through different podTemplates.
// Each replica of the resource will fall into a component type with requirements defined
// by its relevant podTemplate.
type ComponentRequirements struct {

	// Name of this component.
	Name string `json:"name,omitempty"`

	// Replicas represents the replica number of the resource's component.
	// +optional
	Replicas int32 `json:"replicas,omitempty"`

	// ReplicaRequirements represents the requirements required by each replica of this component.
	// +optional
	ReplicaRequirements *ReplicaRequirements `json:"replicaRequirements,omitempty"`
}

// ReplicaRequirements represents the requirements required by each replica.
type ReplicaRequirements struct {

	// NodeClaim represents the node claim HardNodeAffinity, NodeSelector and Tolerations
	// required by each replica.
	// +optional
	NodeClaim *NodeClaim `json:"nodeClaim,omitempty"`

	// ResourceRequest represents the resources required by each replica.
	// +optional
	ResourceRequest corev1.ResourceList `json:"resourceRequest,omitempty"`

	// Namespace represents the resource's namespace.
	// +optional
	Namespace string `json:"namespace,omitempty"`

	// PriorityClassName represents the component's priorityClassName.
	// +optional
	PriorityClassName string `json:"priorityClassName,omitempty"`
}
```
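
For illustration, here is how the proposed fields might be populated for the FlinkDeployment use-case described above. This is a hypothetical sketch: the component names and resource values are assumptions, not part of the API definition.

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Hypothetical ResourceBindingSpec for a FlinkDeployment with one JobManager
// replica and two TaskManager replicas; all values are illustrative only.
var exampleSpec = ResourceBindingSpec{
	TotalReplicas: 3,
	Components: []ComponentRequirements{
		{
			Name:     "jobmanager",
			Replicas: 1,
			ReplicaRequirements: &ReplicaRequirements{
				ResourceRequest: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("1"),
					corev1.ResourceMemory: resource.MustParse("2Gi"),
				},
			},
		},
		{
			Name:     "taskmanager",
			Replicas: 2,
			ReplicaRequirements: &ReplicaRequirements{
				ResourceRequest: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("1"),
					corev1.ResourceMemory: resource.MustParse("1Gi"),
				},
			},
		},
	},
}
```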

### Overview of Code Changes

Besides the changes to the `ResourceBindingSpec` and the `ReplicaRequirements` API, we will need to change the accurate estimator's implementation, which can be found here: https://github.com/karmada-io/karmada/blob/5e354971c78952e4f992cc5e21ad3eddd8d6716e/pkg/estimator/server/estimate.go#L59, as well as the maxReplica estimation done by the ResourceQuota plugin.

Currently the accurate estimator calculates the maxReplica count by (see the sketch after this list):
1. Running the maxReplica calculation (`maxReplicas`) for each plugin enabled by the accurate estimator.
2. Looping through all nodes and summing up the number of replicas (`sumReplicas`) that can fit on each node. This accounts for the resource fragmentation issue.
3. Returning `Math.min(maxReplicas, sumReplicas)`.
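
A minimal sketch of this existing flow, with illustrative helper names (`estimateReplicas`, `runEstimatorPlugins`, `replicasFittingOnNode`, and `nodeInfo` are stand-ins, not the real identifiers in `estimate.go`):

```go
// estimateReplicas sketches the current three-step estimation flow.
func estimateReplicas(nodes []nodeInfo, request corev1.ResourceList) int32 {
	// Step 1: every enabled plugin (e.g. the ResourceQuota plugin) returns an
	// upper bound; the most restrictive bound becomes maxReplicas.
	maxReplicas := runEstimatorPlugins(request)

	// Step 2: sum per-node capacity, which accounts for fragmentation: a cluster
	// may have enough total resources even when no single node fits one replica.
	var sumReplicas int32
	for _, node := range nodes {
		sumReplicas += replicasFittingOnNode(node, request)
	}

	// Step 3: return the smaller of the two bounds.
	if sumReplicas < maxReplicas {
		return sumReplicas
	}
	return maxReplicas
}
```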

![Accurate-Scheduler-Steps](Accurate-Scheduler-Steps.png)

For the proposed implementation, please refer to the next section.

### Accurate Estimator Changes

`Assumption 1`: Resources with more than one replica will always be scheduled to the same cluster.
- This simplifies the scope of the problem, and accounts for the fact that it is non-trivial to schedule components of the same CRD across multiple clusters.

`Assumption 2`: MaxReplica estimation will use a sum of the resource requirements of every component's replicas.
- We could run a maxReplica estimation for each component as-is; the difficulty is determining whether all components can be scheduled on the same cluster. If we maintain a maxReplica estimate per component, not only is the estimation more complex, but we can run into edge cases where the components cannot fit on the same cluster together even though each could be scheduled individually.
- Once the maxReplica estimation is complete, we will return the unit to replicas by multiplying the result by the `totalReplicas` field.

Given the above assumptions, we will describe the maxReplica estimation in two parts:

1. The accurate estimator will create an estimate for the maxReplicas that can be scheduled. This estimate will depend on the number of components set for the resource. If the number of components is 1, then the estimation will be done the same way it is done today. If the number of components is greater than 1, then we will sum up the resources required by all replicas to see how many copies of the total CRD can fit in the available resources (a sketch follows the worked example below).
- `If components = 1`: maxReplicas = Math.min(Available CPU Resources / replica_cpu, Available Memory Resources / replica_memory)
- `If components > 1`: maxReplicas = (totalReplicas) * Math.min(Available CPU Resources / sum(replica_cpu), Available Memory Resources / sum(replica_memory))
- `Note`: We multiply the calculation by totalReplicas to bring the unit back to replicas. The calculation is done from the perspective of the entire CRD, but Karmada interprets replicas, so the maxReplica result will always be a multiple of totalReplicas.

Here is an example to illustrate the case in which components > 1. Let's assume we have a CRD with `totalReplicas` = 3 and two components = {component_1: {replicas: 1, cpu: 1, memory: 2GB}, component_2: {replicas: 2, cpu: 1, memory: 1GB}}. We are estimating the maxReplica count for a target cluster with a ResourceQuota that has 6 CPU and 8GB of memory available.

During maxReplica estimation, we will take the sum of all resource requirements for the CRD:

Total_CPU = component_1.replicas * (component_1.cpu) + component_2.replicas * (component_2.cpu) = (1 * 1) + (2 * 1) = 3 CPU.
Total_Memory = component_1.replicas * (component_1.memory) + component_2.replicas * (component_2.memory) = (1 * 2GB) + (2 * 1GB) = 4GB.

Now that we have resource totals, we can calculate how many copies of the total CRD can fit in the available resources:

maxReplica = totalReplicas * Math.min(RQ.cpu / Total_CPU, RQ.memory / Total_Memory) = (1 + 2) * Math.min(6 / 3, 8 / 4) = 3 * Math.min(2, 2) = 6 replicas (or 2 total CRDs).
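
As a sketch, the components > 1 calculation could look like the following (assuming the API types proposed above; the helper name is hypothetical and integer division intentionally rounds down):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// estimateComponentMaxReplicas sums the requests of every replica of every
// component ("one full CRD"), computes how many full CRDs fit in the available
// resources, and converts the result back to replicas.
func estimateComponentMaxReplicas(available corev1.ResourceList, components []ComponentRequirements, totalReplicas int32) int32 {
	totalCPU := resource.NewQuantity(0, resource.DecimalSI)
	totalMem := resource.NewQuantity(0, resource.BinarySI)
	for _, c := range components {
		cpu := c.ReplicaRequirements.ResourceRequest[corev1.ResourceCPU]
		mem := c.ReplicaRequirements.ResourceRequest[corev1.ResourceMemory]
		for i := int32(0); i < c.Replicas; i++ {
			totalCPU.Add(cpu)
			totalMem.Add(mem)
		}
	}
	if totalCPU.IsZero() || totalMem.IsZero() {
		return 0
	}

	// How many full CRDs fit, limited by the scarcer of CPU and memory.
	availCPU := available[corev1.ResourceCPU]
	availMem := available[corev1.ResourceMemory]
	fullCRDs := availCPU.MilliValue() / totalCPU.MilliValue()
	if byMem := availMem.Value() / totalMem.Value(); byMem < fullCRDs {
		fullCRDs = byMem
	}

	// Multiply by totalReplicas to bring the unit back to replicas.
	return int32(fullCRDs) * totalReplicas
}
```

Running this with the example values (Total_CPU = 3, Total_Memory = 4GB, quota of 6 CPU and 8GB) yields min(2, 2) = 2 full CRDs, i.e. 6 replicas.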

2. The accurate estimator will run a decision algorithm to verify that all of the components' replicas can fit into some combination of the available nodes.
- If the verification step returns `false`: we return maxReplicas = 0, as the CRD cannot be fully scheduled to the cluster.
- If the verification step returns `true`: we return the maxReplica estimate from step 1. The estimate may not be fully accurate, but we will be certain that `at least one` full CRD can be scheduled on the target cluster.

### Bin Packing Verification Step

Optimally packing component replicas with differing resource requirements into nodes of differing sizes is an NP-hard problem. That said, we aren't interested in an optimal packing (since Kubernetes will handle scheduling the pods), but rather in verifying that all replicas from the different components of the same resource can be scheduled on the available nodes in the target cluster. This is the decision version of the standard bin-packing problem.

Given that we are not looking for an optimal packing, we can use an approximation for our verification. A greedy approach with quite good performance for these types of problems is `first-fit decreasing`. We would (see the sketch below):

1. Sort all of our replicas in decreasing order (by cpu, memory, or cpu * memory).
- We can also sort the nodes in decreasing order of available resources.
2. For each replica, attempt to pack it into the first node that can fit it.
- If the node can fit the replica, we continue to the next replica (and update the node's available resources to reflect the packing).
- If the node cannot fit the replica, we move on to the next available node.

If at some point a replica cannot fit in any node, we return false. If we reach the end of our replicas and have packed them all, we return true without exploring other packing possibilities. The cost depends on how many replicas (k) we pack and how many nodes (n) we search: sorting the replicas costs O(k log k), and the greedy packing loop is O(k * n) in the worst case.
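
A minimal sketch of this verification, using simple illustrative types (a real implementation would work on the estimator's node info and may need more resource dimensions):

```go
import "sort"

// replica holds one pod's request; node holds a node's remaining free
// resources (cpu in millicores, memory in bytes). Illustrative types only.
type replica struct{ cpu, mem int64 }
type node struct{ cpu, mem int64 }

// canPackAll is the decision version of bin packing via first-fit decreasing:
// it reports whether every replica can be placed on some node, greedily.
func canPackAll(replicas []replica, nodes []node) bool {
	// Sort replicas in decreasing order (by cpu, breaking ties by memory).
	sort.Slice(replicas, func(i, j int) bool {
		if replicas[i].cpu != replicas[j].cpu {
			return replicas[i].cpu > replicas[j].cpu
		}
		return replicas[i].mem > replicas[j].mem
	})

	for _, r := range replicas {
		packed := false
		for i := range nodes {
			// First-fit: take the first node with enough free cpu and memory.
			if nodes[i].cpu >= r.cpu && nodes[i].mem >= r.mem {
				nodes[i].cpu -= r.cpu
				nodes[i].mem -= r.mem
				packed = true
				break
			}
		}
		if !packed {
			return false // this replica fits on no node: the CRD cannot fully fit
		}
	}
	return true // every replica was packed onto an available node
}
```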
