With the following Kubelet configuration:

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  strict-cpu-reservation: "true"
```
### Risks and Mitigations

The feature is isolated to a specific policy option, `strict-cpu-reservation`, under `cpuManagerPolicyOptions`, and is protected by the `CPUManagerPolicyBetaOptions` feature gate until the feature graduates to `Stable`, i.e. becomes enabled by default.

A concern has been raised about the feature's impact on best-effort workloads, i.e. workloads that do not have resource requests.

The concern is that, when the feature graduates to `Stable`, it will be enabled by default.

However, this is exactly the feature intent: best-effort workloads have no KPI requirements and are meant to consume whatever CPU resources are left on the node, including being starved from time to time. Best-effort workloads are not meant to be scheduled onto the `reservedSystemCPUs`, so they shall not run on the `reservedSystemCPUs` and destabilize the whole node.

Nevertheless, risk mitigation has been discussed in detail (see the archived options below), and we agree to start with the following node metrics of CPU pool sizes in the Alpha and Beta stages, to assess the actual impact in real deployments before revisiting whether we need risk mitigation:

- `cpu_manager_shared_pool_size_millicores`: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve
- `cpu_manager_exclusive_cpu_allocation_count`: reports exclusively allocated cores, counting full cores (e.g. 16)
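
For illustration, these metrics can be consumed directly from the kubelet metrics endpoint. The sketch below is an assumption for illustration only: the endpoint URL and port, and whether the exported name carries a `kubelet_` prefix, are not specified by this KEP.

```go
// Illustrative consumer-side sketch (not part of the KEP): scrape the kubelet
// metrics endpoint and check that the reported shared pool is non-zero.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Assumes the read-only kubelet port is enabled; adjust URL/auth as needed.
	resp, err := http.Get("http://127.0.0.1:10255/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// Check both the bare name from this KEP and a possible kubelet_ prefixed name.
	for _, name := range []string{"cpu_manager_shared_pool_size_millicores", "kubelet_cpu_manager_shared_pool_size_millicores"} {
		if mf, ok := families[name]; ok && len(mf.GetMetric()) > 0 {
			shared := mf.GetMetric()[0].GetGauge().GetValue()
			fmt.Printf("%s = %vm\n", name, shared)
			if shared == 0 {
				fmt.Println("warning: shared pool is empty; best-effort pods will starve")
			}
		}
	}
}
```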
#### Archived Risk Mitigation (Option 1)

This option adds `numMinSharedCPUs` to the `strict-cpu-reservation` option as the minimum number of CPU cores not available for exclusive allocation, and exposes it to kube-scheduler for enforcement.

In Kubelet, when `strict-cpu-reservation` is enabled as a policy option, we remove the reserved cores from the shared pool when calculating the DefaultCPUSet, and remove the `MinSharedCPUs` from the list of cores available for exclusive allocation.

When `strict-cpu-reservation` is disabled:

```console
Total CPU cores: 64
ReservedSystemCPUs: 6
defaultCPUSet = Reserved (6) + 58 (available for exclusive allocation)
```

When `strict-cpu-reservation` is enabled:

```console
Total CPU cores: 64
ReservedSystemCPUs: 6
MinSharedCPUs: 4
defaultCPUSet = MinSharedCPUs (4) + 54 (available for exclusive allocation)
```
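
The same arithmetic can be expressed with CPU set operations. The following standalone sketch (illustration only, not the kubelet code path) uses `k8s.io/utils/cpuset` with an assumed layout of which core IDs are reserved and which form `MinSharedCPUs`:

```go
// Standalone sketch of the pool arithmetic above (illustration only).
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	var ids []int
	for i := 0; i < 64; i++ { // Total CPU cores: 64
		ids = append(ids, i)
	}
	all := cpuset.New(ids...)

	reserved := cpuset.New(0, 1, 2, 3, 4, 5) // ReservedSystemCPUs: 6 (assumed core IDs)
	minShared := cpuset.New(6, 7, 8, 9)      // MinSharedCPUs: 4 (assumed core IDs)

	// With strict-cpu-reservation, the reserved cores never enter the shared pool ...
	defaultCPUSet := all.Difference(reserved)
	// ... and MinSharedCPUs is withheld from exclusive allocation.
	exclusiveAllocatable := defaultCPUSet.Difference(minShared)

	fmt.Println("defaultCPUSet size:", defaultCPUSet.Size())                        // 58
	fmt.Println("available for exclusive allocation:", exclusiveAllocatable.Size()) // 54
}
```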

In the Node API, we add `exclusive-cpu` to Node Allocatable for kube-scheduler to consume.

```
"status": {
  "capacity": {
    "cpu": "64",
    "exclusive-cpu": "64",
    "ephemeral-storage": "832821572Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "196146004Ki",
    "pods": "110"
  },
  "allocatable": {
    "cpu": "58",
    "exclusive-cpu": "54",
    "ephemeral-storage": "767528359485",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "186067796Ki",
    "pods": "110"
  },
  ...
```

In kube-scheduler, `ExclusiveMilliCPU` is added to the scheduler's `Resource` structure and the `NodeResourcesFit` plugin is extended to filter out nodes that cannot meet the pod's exclusive CPU request.

A new field `ExclusiveMilliCPU` is added to the scheduler `Resource` structure:

```go
// Resource is a collection of compute resource.
type Resource struct {
	MilliCPU          int64
	ExclusiveMilliCPU int64 // added
	Memory            int64
	EphemeralStorage  int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
}
```

A new node fitting failure 'Insufficient exclusive cpu' is added in the `NodeResourcesFit` plugin:

```go
if podRequest.MilliCPU > 0 && podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU-nodeInfo.Requested.MilliCPU) {
```
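
The body of the existing check above (elided in this excerpt) records an `Insufficient cpu` entry for the node. The exclusive-cpu dimension would get an analogous check; the snippet below is a hypothetical sketch mirroring that pattern, not the literal patch, and assumes the `ExclusiveMilliCPU` fields and the plugin's `InsufficientResource` type shown earlier:

```go
// Hypothetical sketch only: an exclusive-cpu analogue of the cpu fit check above.
if podRequest.ExclusiveMilliCPU > 0 && podRequest.ExclusiveMilliCPU > (nodeInfo.Allocatable.ExclusiveMilliCPU-nodeInfo.Requested.ExclusiveMilliCPU) {
	insufficientResources = append(insufficientResources, InsufficientResource{
		ResourceName: "exclusive-cpu", // assumed resource name, matching the Node API example above
		Reason:       "Insufficient exclusive cpu",
		Requested:    podRequest.ExclusiveMilliCPU,
		Used:         nodeInfo.Requested.ExclusiveMilliCPU,
		Capacity:     nodeInfo.Allocatable.ExclusiveMilliCPU,
	})
}
```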

The problem with `MinSharedCPUs` is that it creates another complication, similar to memory and hugepages: new resources versus overlapping resources, since exclusive-cpu is a subset of cpu.

Currently the noderesources scheduler plugin does not filter out best-effort pods when there is no available CPU.

Another option is to force the CPU request of best-effort pods to 1 MilliCPU in kubelet for the purpose of resource availability checks (or, equivalently, to check that there is at least 1 MilliCPU allocatable). This option is meant to be simpler than Option 1, but it can create runaway pods similar to those in https://github.com/kubernetes/kubernetes/issues/84869.
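
A minimal sketch of that idea, with a hypothetical helper and parameter names (not merged code):

```go
// Hypothetical helper (illustration of the option above, not merged code): when
// checking node fit, treat a best-effort pod as if it requested 1 MilliCPU so that
// a node whose shared pool has been fully consumed by exclusive allocations
// rejects it.
func effectiveMilliCPURequest(requestedMilliCPU int64, isBestEffort bool) int64 {
	if isBestEffort && requestedMilliCPU == 0 {
		return 1 // minimum placeholder request for availability checks
	}
	return requestedMilliCPU
}
```

A node with fewer than 1 allocatable MilliCPU left would then fail the fit check for best-effort pods as well.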
### Test Plan

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

No new integration tests for kubelet are planned.

- CPU Manager works with `strict-cpu-reservation` policy option
- Basic functionality
  1. Enable the `strict-cpu-reservation` policy option.
  2. Create a simple pod of Burstable QoS type.
  3. Verify the pod is not using the reserved CPU cores (see the sketch after this list).
  4. Delete the pod.
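
A sketch of how step 3 could be checked, assuming the test can read the container's assigned cpuset (e.g. `/sys/fs/cgroup/cpuset.cpus` inside the pod) and knows the node's `reservedSystemCPUs`; the package and helper names are illustrative, not the actual e2e test code:

```go
// Illustrative helper for step 3 (assumption, not the actual e2e test code).
package verify

import "k8s.io/utils/cpuset"

// podUsesReservedCPUs reports whether any CPU in the container's cpuset
// (e.g. "6-13" read from /sys/fs/cgroup/cpuset.cpus inside the pod) overlaps the
// reservedSystemCPUs set (e.g. "0-5"). The test expects this to return false.
func podUsesReservedCPUs(containerCPUs, reservedSystemCPUs string) (bool, error) {
	podSet, err := cpuset.Parse(containerCPUs)
	if err != nil {
		return false, err
	}
	reservedSet, err := cpuset.Parse(reservedSystemCPUs)
	if err != nil {
		return false, err
	}
	return podSet.Intersection(reservedSet).Size() > 0, nil
}
```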

The `/var/lib/kubelet/cpu_manager_state` file needs to be removed when enabling or disabling the feature.

Yes. Reserved CPU cores will be strictly used for system daemons and interrupt processing, and will no longer be available for workloads.

The feature is only enabled when all of the following conditions are met (see the sketch after the list):

1. The `static` `CPUManager` policy is selected
2. The `CPUManagerPolicyBetaOptions` feature gate is enabled and the `strict-cpu-reservation` policy option is selected
3. The `reservedSystemCPUs` is not empty
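
Condensed into a sketch (hypothetical helper, not the actual kubelet code), the gating logic amounts to:

```go
// Hypothetical restatement of the gating conditions above (illustration only).
func strictCPUReservationActive(cpuManagerPolicy string, betaOptionsGateEnabled bool, policyOptions map[string]string, reservedSystemCPUs string) bool {
	return cpuManagerPolicy == "static" && // 1. static CPUManager policy selected
		betaOptionsGateEnabled && // 2a. CPUManagerPolicyBetaOptions feature gate enabled
		policyOptions["strict-cpu-reservation"] == "true" && // 2b. policy option selected
		reservedSystemCPUs != "" // 3. reservedSystemCPUs is not empty
}
```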
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?