Commit e15a706

Move KEP-4540 to Beta.

Signed-off-by: Jing Zhang <[email protected]>

1 parent 7b82e05 commit e15a706

File tree

3 files changed: +31 −154 lines changed

keps/sig-node/4540-strict-cpu-reservation/README.md

Lines changed: 28 additions & 150 deletions
@@ -12,8 +12,6 @@
 - [Story 2](#story-2)
 - [Design Details](#design-details)
 - [Risks and Mitigations](#risks-and-mitigations)
-  - [Archived Risk Mitigation (Option 1)](#archived-risk-mitigation-option-1)
-  - [Archived Risk Mitigation (Option 2)](#archived-risk-mitigation-option-2)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
@@ -42,20 +40,20 @@
 
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [x] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -73,15 +71,14 @@ With this KEP, a new `CPUManager` policy option `strict-cpu-reservation` is intr
 
 The static policy is used to reduce latency or improve performance. If you want to move system daemons or interrupt processing to dedicated cores, the obvious way is to use the `reservedSystemCPUs` option. But in the current implementation this isolation is applied only to guaranteed pods with integer CPU requests, not to burstable and best-effort pods (or to guaranteed pods with fractional CPU requests).
 Admission only compares the CPU requests against the allocatable CPUs. Since CPU limits are higher than requests, burstable and best-effort pods are allowed to use up the capacity of `reservedSystemCPUs`, which can starve host OS services in real-life deployments.
-Custom CPU allocation policies deployed as NRI plugins (e.g. Balloons) can separate infrastructure and workload into different CPU pools but they require extra software, additional tuning and reduced CPU pool size could affect performance of multi-threaded processes.
 
 ### Goals
 * Align scheduler and node view for Node Allocatable (total - reserved).
 * Ensure `reservedSystemCPUs` is only used by system daemons or interrupt processing, not by workloads.
 * Ensure no breaking changes for the `static` policy of `CPUManager`.
 
 ### Non-Goals
-* Change scheduler interface to sub-partition `cpu` resource (as described in the archived Risk Mitigation Option 1).
+* Change the interface between node and scheduler.
 
 ## Proposal
 
@@ -112,7 +109,7 @@ apiVersion: kubelet.config.k8s.io/v1beta1
 featureGates:
   ...
   CPUManagerPolicyOptions: true
-  CPUManagerPolicyAlphaOptions: true
+  CPUManagerPolicyBetaOptions: true
 cpuManagerPolicy: static
 cpuManagerPolicyOptions:
   strict-cpu-reservation: "true"
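
(As an illustrative sanity check — the node name is a placeholder and `jq` is assumed to be available — the active kubelet configuration can be read back through the API server's node proxy:)

```console
# Confirm the policy option is active on a node.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq .kubeletconfig.cpuManagerPolicyOptions
{
  "strict-cpu-reservation": "true"
}
```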
@@ -123,7 +120,7 @@ reservedSystemCPUs: "0,32,1,33,16,48"
 When `strict-cpu-reservation` is disabled:
 ```console
 # cat /var/lib/kubelet/cpu_manager_state
-{"policyName":"static","defaultCpuSet":"0-79","checksum":1241370203}
+{"policyName":"static","defaultCpuSet":"0-64","checksum":1241370203}
 ```
 
 When `strict-cpu-reservation` is enabled:
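
(The enabled output falls outside this hunk; for illustration, with CPUs 0-64 and `reservedSystemCPUs: "0,32,1,33,16,48"`, the reserved cores drop out of the default set — the checksum below is made up:)

```console
# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-64","checksum":3245684120}
```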
@@ -144,131 +141,11 @@ The concern is, when the feature graduates to `Stable`, it will be enabled by de
 
 However, this is exactly the feature intent: best-effort workloads have no KPI requirement, they are meant to consume whatever CPU resources are left on the node, including starving from time to time. Best-effort workloads are not scheduled against the `reservedSystemCPUs`, so they shall not run on the `reservedSystemCPUs` and destabilize the whole node.
 
-Nevertheless, risk mitigation has been discussed in detail (see archived options below) and we agree to start with the following node metrics of cpu pool sizes in the Alpha stage to assess the actual impact in real deployments before revisiting if we need risk mitigation.
+We agree to start with the following node metrics of cpu pool sizes in the Alpha and Beta stages to assess the actual impact in real deployments before revisiting whether we need risk mitigation.
 
 https://github.com/kubernetes/kubernetes/pull/127506
-- `cpu\_manager\_shared\_pool\_size\_millicores`: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve
-- `cpu\_manager\_exclusive\_cpu\_allocation\_count`: reports exclusively allocated cores, counting full cores (e.g. 16)
-
-
-#### Archived Risk Mitigation (Option 1)
-
-This option is to add `numMinSharedCPUs` in the `strict-cpu-reservation` option as the minimum number of CPU cores not available for exclusive allocation, and expose it to kube-scheduler for enforcement.
-
-In Kubelet, when `strict-cpu-reservation` is enabled as a policy option, we remove the reserved cores from the shared pool at the stage of calculating the DefaultCPUSet and remove the `MinSharedCPUs` from the list of available cores for exclusive allocation.
-
-![MinSharedCPUs](./strict-cpu-allocation.png)
-
-When `strict-cpu-reservation` is disabled:
-```console
-Total CPU cores: 64
-ReservedSystemCPUs: 6
-defaultCPUSet = Reserved (6) + 58 (available for exclusive allocation)
-```
-
-When `strict-cpu-reservation` is enabled:
-```console
-Total CPU cores: 64
-ReservedSystemCPUs: 6
-MinSharedCPUs: 4
-defaultCPUSet = MinSharedCPUs (4) + 54 (available for exclusive allocation)
-```
-
-A prototype PR for the option has been created:
-https://github.com/kubernetes/kubernetes/pull/123979/commits
-
-Add `numMinSharedCPUs` as part of the `strict-cpu-reservation` option in the Kubelet configuration:
-
-```yaml
-kind: KubeletConfiguration
-apiVersion: kubelet.config.k8s.io/v1beta1
-featureGates:
-  ...
-  CPUManagerPolicyOptions: true
-  CPUManagerPolicyAlphaOptions: true
-cpuManagerPolicy: static
-cpuManagerPolicyOptions:
-  strict-cpu-reservation: { "enable": "true", "numMinSharedCPUs": 4 }
-reservedSystemCPUs: "0,32,1,33,16,48"
-...
-```
-
-In the Node API, we add `exclusive-cpu` in Node Allocatable for kube-scheduler to consume.
-
-```
-"status": {
-    "capacity": {
-        "cpu": "64",
-        "exclusive-cpu": "64",
-        "ephemeral-storage": "832821572Ki",
-        "hugepages-1Gi": "0",
-        "hugepages-2Mi": "0",
-        "memory": "196146004Ki",
-        "pods": "110"
-    },
-    "allocatable": {
-        "cpu": "58",
-        "exclusive-cpu": "54",
-        "ephemeral-storage": "767528359485",
-        "hugepages-1Gi": "0",
-        "hugepages-2Mi": "0",
-        "memory": "186067796Ki",
-        "pods": "110"
-    },
-...
-```
-
-In kube-scheduler, `ExclusiveMilliCPU` is added in the scheduler's `Resource` structure and the `NodeResourcesFit` plugin is extended to filter out nodes that cannot meet a pod's exclusive CPU request.
-
-A new item `ExclusiveMilliCPU` is added in the scheduler `Resource` structure:
-
-```
-// Resource is a collection of compute resource.
-type Resource struct {
-	MilliCPU          int64
-	ExclusiveMilliCPU int64 // added
-	Memory            int64
-	EphemeralStorage  int64
-	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
-	// explicitly as int, to avoid conversions and improve performance.
-	AllowedPodNumber int
-	// ScalarResources
-	ScalarResources map[v1.ResourceName]int64
-}
-```
-
-A new node fitting failure 'Insufficient exclusive cpu' is added in the `NodeResourcesFit` plugin:
-
-```
-if podRequest.MilliCPU > 0 && podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU-nodeInfo.Requested.MilliCPU) {
-	insufficientResources = append(insufficientResources, InsufficientResource{
-		ResourceName: v1.ResourceCPU,
-		Reason:       "Insufficient cpu",
-		Requested:    podRequest.MilliCPU,
-		Used:         nodeInfo.Requested.MilliCPU,
-		Capacity:     nodeInfo.Allocatable.MilliCPU,
-	})
-}
-if nodeInfo.Allocatable.ExclusiveMilliCPU > 0 { // added
-	if podRequest.ExclusiveMilliCPU > 0 && podRequest.ExclusiveMilliCPU > (nodeInfo.Allocatable.ExclusiveMilliCPU-nodeInfo.Requested.ExclusiveMilliCPU) {
-		insufficientResources = append(insufficientResources, InsufficientResource{
-			ResourceName: v1.ResourceExclusiveCPU,
-			Reason:       "Insufficient exclusive cpu",
-			Requested:    podRequest.ExclusiveMilliCPU,
-			Used:         nodeInfo.Requested.ExclusiveMilliCPU,
-			Capacity:     nodeInfo.Allocatable.ExclusiveMilliCPU,
-		})
-	}
-}
-```
-
-#### Archived Risk Mitigation (Option 2)
-
-The problem with `MinSharedCPUs` is that it creates another complication like memory and hugepages: new resources vs overlapping resources, where exclusive-cpus is a subset of cpu.
-
-Currently the noderesources scheduler plugin does not filter out best-effort pods in the case there is no available CPU.
-
-Another option is to force the cpu requests for best-effort pods to 1 MilliCPU in kubelet for the purpose of resource availability checks (or, equivalently, check there's at least 1 MilliCPU allocatable). This option is meant to be simpler than option 1, but it can create runaway pods similar to those in https://github.com/kubernetes/kubernetes/issues/84869.
+- `cpu_manager_shared_pool_size_millicores`: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve
+- `cpu_manager_exclusive_cpu_allocation_count`: reports exclusively allocated cores, counting full cores (e.g. 16)
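
(For illustration, with made-up sample values and a placeholder node name, both gauges can be scraped from the kubelet metrics endpoint via the API server's node proxy:)

```console
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep ^cpu_manager_
cpu_manager_shared_pool_size_millicores 59000
cpu_manager_exclusive_cpu_allocation_count 6
```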
 
 
 ### Test Plan
@@ -298,7 +175,7 @@ No new integration tests for kubelet are planned.
 - CPU Manager works with `strict-cpu-reservation` policy option
 
 - Basic functionality
-  1. Enable `CPUManagerPolicyAlphaOptions` feature gate and `strict-cpu-reservation` policy option.
+  1. Enable `CPUManagerPolicyBetaOptions` feature gate and `strict-cpu-reservation` policy option.
   2. Create a simple pod of Burstable QoS type.
   3. Verify the pod is not using the reserved CPU cores.
   4. Delete the pod.
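
(A minimal burstable pod for step 2 might look like the sketch below; the pod name and image are illustrative — a CPU request with no limit yields Burstable QoS:)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-test
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 100m   # request without a limit -> Burstable QoS
```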
@@ -313,8 +190,9 @@ No new integration tests for kubelet are planned.
 
 #### Beta
 
-- [ ] Gather feedback from consumers of the new policy option.
-- [ ] Verify no major bugs reported in the previous cycle.
+- [X] Gather feedback from consumers of the new policy option.
+- [X] Verify no major bugs reported in the previous cycle.
+- [X] Ensure proper e2e tests are in place.
 
 #### GA
 
@@ -333,33 +211,33 @@ No changes needed.
 
 ### Feature Enablement and Rollback
 
-The `/var/lib/kubelet/cpu\_manager\_state` file needs to be removed when enabling or disabling the feature.
+The `/var/lib/kubelet/cpu_manager_state` file needs to be removed when enabling or disabling the feature.
 
 ###### How can this feature be enabled / disabled in a live cluster?
 
 - [X] Feature gate (also fill in values in `kep.yaml`)
-  - Feature gate name: `CPUManagerPolicyAlphaOptions`
+  - Feature gate name: `CPUManagerPolicyBetaOptions`
   - Components depending on the feature gate: `kubelet`
 - [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `strict-cpu-reservation`
   - Will enabling / disabling the feature require downtime of the control plane? No
-  - Will enabling / disabling the feature require downtime or reprovisioning of a node? No -- removing `/var/lib/kubelet/cpu\_manager\_state` and restarting kubelet are enough.
+  - Will enabling / disabling the feature require downtime or reprovisioning of a node? No -- removing `/var/lib/kubelet/cpu_manager_state` and restarting kubelet are enough.
 
 
 ###### Does enabling the feature change any default behavior?
 
 Yes. Reserved CPU cores will be strictly used for system daemons and interrupt processing, and will no longer be available for workloads.
 
 The feature is only enabled when all of the following conditions are met:
-1. The `CPUManagerPolicyAlphaOptions` feature gate must be enabled
-2. The `static` `CPUManager` policy must be selected
+1. The `static` `CPUManager` policy must be selected
+2. The `CPUManagerPolicyBetaOptions` feature gate must be enabled
 3. The new `strict-cpu-reservation` policy option must be selected
 4. The `reservedSystemCPUs` list is not empty
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
-Yes, the feature can be disabled by either:
-1. Disable feature gate `CPUManagerPolicyAlphaOptions` or remove `strict-cpu-reservation` from the list of `CPUManager` policy options
-2. Remove `/var/lib/kubelet/cpu\_manager\_state` and restart kubelet
+Yes, the feature can be disabled by:
+1. Disabling the feature gate `CPUManagerPolicyBetaOptions`, or removing `strict-cpu-reservation` from the list of `CPUManager` policy options
+2. Removing `/var/lib/kubelet/cpu_manager_state` and restarting kubelet
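
(As a sketch of step 2, assuming a systemd-managed kubelet; the unit name may differ by distribution:)

```console
# After removing strict-cpu-reservation from the kubelet configuration:
rm /var/lib/kubelet/cpu_manager_state
systemctl restart kubelet
```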
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
@@ -381,7 +259,7 @@ If the feature rollout fails, burstable and best-efforts continue to run on the
 If the feature rollback fails, burstable and best-effort pods continue not to run on the reserved CPU cores.
 In either case, existing workloads will not be affected.
 
-When enabling or disabling the feature, make sure `/var/lib/kubelet/cpu\_manager\_state` is removed before restarting kubelet, otherwise the kubelet restart could fail.
+When enabling or disabling the feature, make sure `/var/lib/kubelet/cpu_manager_state` is removed before restarting kubelet, otherwise the kubelet restart could fail.
 
 <!--
 Try to be as paranoid as possible - e.g., what if some components will restart
@@ -411,7 +289,7 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
-We manually test it in our internal environment and it works.
+We use the feature in our internal environment and it works.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
@@ -425,7 +303,7 @@ No.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-Inspect the `defaultCpuSet` in `/var/lib/kubelet/cpu\_manager\_state`:
+Inspect the `defaultCpuSet` in `/var/lib/kubelet/cpu_manager_state`:
 - When the feature is disabled, the reserved CPU cores are included in the `defaultCpuSet`.
 - When the feature is enabled, the reserved CPU cores are not included in the `defaultCpuSet`.
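
(For example, on the node from the earlier sketch — assuming `jq` is installed and `reservedSystemCPUs: "0,32,1,33,16,48"` — an enabled node would show the reserved cores missing from the set:)

```console
# jq -r .defaultCpuSet /var/lib/kubelet/cpu_manager_state
2-15,17-31,34-47,49-64
```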
 
@@ -447,9 +325,9 @@ This feature allows users to protect infrastructure services from bursty workloa
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-https://github.com/kubernetes/kubernetes/pull/127506:
-- `cpu\_manager\_shared\_pool\_size\_millicores`: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve
-- `cpu\_manager\_exclusive\_cpu\_allocation\_count`: reports exclusively allocated cores, counting full cores (e.g. 16)
+Monitor the following kubelet metrics:
+- `cpu_manager_shared_pool_size_millicores`: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve
+- `cpu_manager_exclusive_cpu_allocation_count`: reports exclusively allocated cores, counting full cores (e.g. 16)
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
keps/sig-node/4540-strict-cpu-reservation/kep.yaml

Lines changed: 3 additions & 4 deletions
@@ -8,20 +8,19 @@ status: implementable
 creation-date: 2024-03-06
 reviewers:
   - "@ffromani"
-  - "@klueska"
   - "@swatisehgal"
 approvers:
   - "@sig-node-tech-leads"
 see-also: []
 replaces: []
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.32"
+latest-milestone: "v1.33"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
@@ -32,7 +31,7 @@ milestone:
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates:
-  - name: "CPUManagerPolicyAlphaOptions"
+  - name: "CPUManagerPolicyBetaOptions"
     components:
       - kubelet
 disable-supported: true
Binary file not shown.
