Commit d28b0bb

Take care of review comments.
1 parent e15a706 commit d28b0bb

4 files changed

+130
-14
lines changed

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
kep-number: 4540
2-
alpha:
2+
beta:
33
approver: "@soltysh"
44

keps/sig-node/4540-strict-cpu-reservation/README.md

Lines changed: 127 additions & 11 deletions
@@ -12,6 +12,8 @@
1212
- [Story 2](#story-2)
1313
- [Design Details](#design-details)
1414
- [Risks and Mitigations](#risks-and-mitigations)
15+
- [Archived Risk Mitigation (Option 1)](#archived-risk-mitigation-option-1)
16+
- [Archived Risk Mitigation (Option 2)](#archived-risk-mitigation-option-2)
1517
- [Test Plan](#test-plan)
1618
- [Prerequisite testing updates](#prerequisite-testing-updates)
1719
- [Unit tests](#unit-tests)
@@ -106,10 +108,6 @@ With the following Kubelet configuration:
106108
```yaml
107109
kind: KubeletConfiguration
108110
apiVersion: kubelet.config.k8s.io/v1beta1
109-
featureGates:
110-
...
111-
CPUManagerPolicyOptions: true
112-
CPUManagerPolicyBetaOptions: true
113111
cpuManagerPolicy: static
114112
cpuManagerPolicyOptions:
115113
strict-cpu-reservation: "true"
@@ -131,7 +129,7 @@ When `strict-cpu-reservation` is enabled:
131129

132130
### Risks and Mitigations
133131

134-
The feature is isolated to a specific policy option `strict-cpu-reservation` under `cpuManagerPolicyOptions` and is protected by feature gate `CPUManagerPolicyAlphaOptions` or `CPUManagerPolicyBetaOptions` before the feature graduates to `Stable` i.e. enabled by default.
132+
The feature is isolated to a specific policy option `strict-cpu-reservation` under `cpuManagerPolicyOptions` and is protected by feature gate `CPUManagerPolicyBetaOptions` before the feature graduates to `Stable` i.e. enabled by default.
135133

136134
A concern was raised about the feature's impact on best-effort workloads, i.e. workloads that do not have resource requests.
137135

@@ -141,13 +139,132 @@ The concern is, when the feature graduates to `Stable`, it will be enabled by de
141139

142140
However, this is exactly the intent of the feature: best-effort workloads have no KPI requirement and are meant to consume whatever CPU resources are left on the node, which includes being starved from time to time. Best-effort workloads are not scheduled to run on the `reservedSystemCPUs`, so they should not run on the `reservedSystemCPUs` and risk destabilizing the whole node.
143141

144-
We agree to start with the following node metrics of cpu pool sizes in Alpha and Beta stage to assess the actual impact in real deployment before revisiting if we need risk mitigation.
142+
Nevertheless, risk mitigation has been discussed in detail (see archived options below), and we agree to start with the following node metrics of CPU pool sizes in the Alpha and Beta stages to assess the actual impact in real deployments before revisiting whether risk mitigation is needed.
145143

146144
https://github.com/kubernetes/kubernetes/pull/127506
147145
- `cpu_manager_shared_pool_size_millicores`: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve
148146
- `cpu_manager_exclusive_cpu_allocation_count`: report exclusively allocated cores, counting full cores (e.g. 16)
149147

150148

149+
#### Archived Risk Mitigation (Option 1)
150+
151+
This option adds `numMinSharedCPUs` to the `strict-cpu-reservation` option as the minimum number of CPU cores that are not available for exclusive allocation, and exposes it to Kube-scheduler for enforcement.
152+
153+
In Kubelet, when `strict-cpu-reservation` is enabled as a policy option, we remove the reserved cores from the shared pool when calculating `DefaultCPUSet`, and remove the `MinSharedCPUs` from the list of cores available for exclusive allocation.
154+
155+
![MinSharedCPUs](./strict-cpu-allocation.png)
156+
157+
When `strict-cpu-reservation` is disabled:
158+
```console
159+
Total CPU cores: 64
160+
ReservedSystemCPUs: 6
161+
defaultCPUSet = Reserved (6) + 58 (available for exclusive allocation)
162+
```
163+
164+
When `strict-cpu-reservation` is enabled:
165+
```console
166+
Total CPU cores: 64
167+
ReservedSystemCPUs: 6
168+
MinSharedCPUs: 4
169+
defaultCPUSet = MinSharedCPUs (4) + 54 (available for exclusive allocation)
170+
```
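
The arithmetic in the two console examples above can be sketched as follows. This is only an illustration, not kubelet code: `poolSizes` and its parameters are hypothetical names, and the sketch assumes that `MinSharedCPUs` is ignored when the option is disabled, as in the first example.

```go
package main

import "fmt"

// poolSizes is a hypothetical helper (not part of kubelet). It returns the
// size of the default (shared) CPU set and the number of cores that remain
// available for exclusive allocation.
func poolSizes(total, reserved, minShared int, strict bool) (defaultSet, exclusive int, err error) {
	if total < 0 || reserved < 0 || minShared < 0 {
		return 0, 0, fmt.Errorf("core counts must be non-negative")
	}
	if !strict {
		// Option disabled: reserved cores stay in the default set and
		// minShared is ignored (an assumption of this sketch).
		if reserved > total {
			return 0, 0, fmt.Errorf("reserved (%d) exceeds total (%d)", reserved, total)
		}
		return total, total - reserved, nil
	}
	if reserved+minShared > total {
		return 0, 0, fmt.Errorf("reserved (%d) + minShared (%d) exceeds total (%d)", reserved, minShared, total)
	}
	// Option enabled: reserved cores are removed from the default set, and
	// minShared cores are withheld from exclusive allocation.
	defaultSet = total - reserved
	exclusive = defaultSet - minShared
	return defaultSet, exclusive, nil
}
```

With the numbers from the examples, `poolSizes(64, 6, 4, true)` yields a default set of 58 cores with 54 available for exclusive allocation, and `poolSizes(64, 6, 0, false)` yields 64 and 58.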
171+
172+
Prototype PR for the option is created:
173+
https://github.com/kubernetes/kubernetes/pull/123979/commits
174+
175+
Add `numMinSharedCPUs` as part of `strict-cpu-reservation` option in Kubelet configuration:
176+
177+
```yaml
178+
kind: KubeletConfiguration
179+
apiVersion: kubelet.config.k8s.io/v1beta1
180+
featureGates:
181+
...
182+
CPUManagerPolicyAlphaOptions: true
183+
cpuManagerPolicy: static
184+
cpuManagerPolicyOptions:
185+
strict-cpu-reservation: { "enable": "true", "numMinSharedCPUs": 4 }
186+
reservedSystemCPUs: "0,32,1,33,16,48"
187+
...
188+
```
189+
190+
In Node API, we add `exclusive-cpu` in Node Allocatable for Kube-scheduler to consume.
191+
192+
```
193+
"status": {
194+
"capacity": {
195+
"cpu": "64",
196+
"exclusive-cpu": "64",
197+
"ephemeral-storage": "832821572Ki",
198+
"hugepages-1Gi": "0",
199+
"hugepages-2Mi": "0",
200+
"memory": "196146004Ki",
201+
"pods": "110"
202+
},
203+
"allocatable": {
204+
"cpu": "58",
205+
"exclusive-cpu": "54",
206+
"ephemeral-storage": "767528359485",
207+
"hugepages-1Gi": "0",
208+
"hugepages-2Mi": "0",
209+
"memory": "186067796Ki",
210+
"pods": "110"
211+
},
212+
...
213+
```
214+
215+
In kube-scheduler, `ExclusiveMilliCPU` is added to the scheduler's `Resource` structure and the `NodeResourcesFit` plugin is extended to filter out nodes that cannot meet a pod's exclusive CPU request.
216+
217+
A new item `ExclusiveMilliCPU` is added in the scheduler `Resource` structure:
218+
219+
```
220+
// Resource is a collection of compute resource.
221+
type Resource struct {
222+
MilliCPU int64
223+
ExclusiveMilliCPU int64 // added
224+
Memory int64
225+
EphemeralStorage int64
226+
// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
227+
// explicitly as int, to avoid conversions and improve performance.
228+
AllowedPodNumber int
229+
// ScalarResources
230+
ScalarResources map[v1.ResourceName]int64
231+
}
232+
```
233+
234+
A new node fitting failure 'Insufficient exclusive cpu' is added in the `NodeResourcesFit` plugin:
235+
236+
```
237+
if podRequest.MilliCPU > 0 && podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU-nodeInfo.Requested.MilliCPU) {
238+
insufficientResources = append(insufficientResources, InsufficientResource{
239+
ResourceName: v1.ResourceCPU,
240+
Reason: "Insufficient cpu",
241+
Requested: podRequest.MilliCPU,
242+
Used: nodeInfo.Requested.MilliCPU,
243+
Capacity: nodeInfo.Allocatable.MilliCPU,
244+
})
245+
}
246+
if nodeInfo.Allocatable.ExclusiveMilliCPU > 0 { // added
247+
if podRequest.ExclusiveMilliCPU > 0 && podRequest.ExclusiveMilliCPU > (nodeInfo.Allocatable.ExclusiveMilliCPU-nodeInfo.Requested.ExclusiveMilliCPU) {
248+
insufficientResources = append(insufficientResources, InsufficientResource{
249+
ResourceName: v1.ResourceExclusiveCPU,
250+
Reason: "Insufficient exclusive cpu",
251+
Requested: podRequest.ExclusiveMilliCPU,
252+
Used: nodeInfo.Requested.ExclusiveMilliCPU,
253+
Capacity: nodeInfo.Allocatable.ExclusiveMilliCPU,
254+
})
255+
}
256+
}
257+
```
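
For readers who want to exercise the check outside the scheduler, the following is a minimal self-contained sketch of the same logic. The types here are simplified stand-ins that mirror only the fields used above, and `fits` is a hypothetical function name; none of this is the actual `NodeResourcesFit` code.

```go
package main

// Resource is a simplified stand-in for the scheduler's Resource structure,
// keeping only the two fields this sketch needs.
type Resource struct {
	MilliCPU          int64
	ExclusiveMilliCPU int64
}

// InsufficientResource is a simplified stand-in for the plugin's failure record.
type InsufficientResource struct {
	Reason    string
	Requested int64
	Used      int64
	Capacity  int64
}

// fits (hypothetical) returns the list of reasons why the pod does not fit.
// The exclusive check is skipped when the node reports no exclusive CPU.
func fits(pod, allocatable, requested Resource) []InsufficientResource {
	var out []InsufficientResource
	if pod.MilliCPU > 0 && pod.MilliCPU > allocatable.MilliCPU-requested.MilliCPU {
		out = append(out, InsufficientResource{
			Reason:    "Insufficient cpu",
			Requested: pod.MilliCPU,
			Used:      requested.MilliCPU,
			Capacity:  allocatable.MilliCPU,
		})
	}
	if allocatable.ExclusiveMilliCPU > 0 &&
		pod.ExclusiveMilliCPU > 0 &&
		pod.ExclusiveMilliCPU > allocatable.ExclusiveMilliCPU-requested.ExclusiveMilliCPU {
		out = append(out, InsufficientResource{
			Reason:    "Insufficient exclusive cpu",
			Requested: pod.ExclusiveMilliCPU,
			Used:      requested.ExclusiveMilliCPU,
			Capacity:  allocatable.ExclusiveMilliCPU,
		})
	}
	return out
}
```

For example, on a node with 58000m allocatable CPU and 54000m allocatable exclusive CPU, of which 50000m and 52000m are already requested, a pod asking for 4000m of exclusive CPU passes the plain CPU check but is rejected with "Insufficient exclusive cpu".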
258+
259+
#### Archived Risk Mitigation (Option 2)
260+
261+
The problem with `MinSharedCPUs` is that it creates another complication, similar to memory and hugepages: new resources versus overlapping resources, since exclusive-cpus is a subset of cpu.
262+
263+
Currently the noderesources scheduler plugin does not filter out the best-effort pods in the case there's no available CPU.
264+
265+
Another option is to force the cpu requests for best effort pods to 1 MilliCPU in kubelet for the purpose of resource availability checks (or, equivalently, check there's at least 1 MilliCPU allocatable). This option is meant to be simpler than option-1, but it can create runaway pods similar to that in https://github.com/kubernetes/kubernetes/issues/84869.
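
A rough sketch of the availability check described above, assuming it is expressed as a floor of 1 MilliCPU on the effective request. The function names `effectiveMilliCPU` and `admits` are hypothetical and do not correspond to existing kubelet code.

```go
package main

// effectiveMilliCPU (hypothetical) treats a pod with no CPU request, i.e. a
// best-effort pod, as requesting 1 MilliCPU for availability checks only.
func effectiveMilliCPU(requested int64) int64 {
	if requested <= 0 {
		return 1
	}
	return requested
}

// admits (hypothetical) reports whether the pod's effective request fits in
// the CPU that is still unrequested on the node.
func admits(allocatable, inUse, podRequest int64) bool {
	return effectiveMilliCPU(podRequest) <= allocatable-inUse
}
```

Under this sketch a best-effort pod is rejected only when the node has no unrequested CPU at all, for example `admits(58000, 58000, 0)` is false while `admits(58000, 57999, 0)` is true.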
266+
267+
151268
### Test Plan
152269

153270
[X] I/we understand the owners of the involved components may require updates to
@@ -175,7 +292,7 @@ No new integration tests for kubelet are planned.
175292
- CPU Manager works with `strict-cpu-reservation` policy option
176293

177294
- Basic functionality
178-
1. Enable `CPUManagerPolicyBetaOptions` feature gate and `strict-cpu-reservation` policy option.
295+
1. Enable `strict-cpu-reservation` policy option.
179296
2. Create a simple pod of Burstable QoS type.
180297
3. Verify the pod is not using the reserved CPU cores.
181298
4. Delete the pod.
@@ -228,10 +345,9 @@ The `/var/lib/kubelet/cpu_manager_state` needs to be removed when enabling or di
228345
Yes. Reserved CPU cores will be strictly used for system daemons and interrupt processing, and will no longer be available for workloads.
229346

230347
The feature is only enabled when all following conditions are met:
231-
1. The `static` `CPUManager` policy must be selected
232-
2. The `CPUManagerPolicyBetaOptions` feature gate must be enabled
233-
3. The new `strict-cpu-reservation` policy option must be selected
234-
4. The `reservedSystemCPUs` is not empty
348+
1. The `static` `CPUManager` policy is selected
349+
2. The `CPUManagerPolicyBetaOptions` feature gate is enabled and the `strict-cpu-reservation` policy option is selected
350+
3. The `reservedSystemCPUs` is not empty
235351

236352
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
237353

keps/sig-node/4540-strict-cpu-reservation/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ reviewers:
1010
- "@ffromani"
1111
- "@swatisehgal"
1212
approvers:
13-
- "@sig-node-tech-leads"
13+
- "@ffromani"
1414
see-also: []
1515
replaces: []
1616

@@ -31,7 +31,7 @@ milestone:
3131
# The following PRR answers are required at alpha release
3232
# List the feature gate name and the components for which it must be enabled
3333
feature-gates:
34-
- name: "CPUManagerPolicyBetaOptions"
34+
- name: "CPUManagerPolicyAlphaOptions"
3535
components:
3636
- kubelet
3737
disable-supported: true