Commit f5c4952

Merge pull request kubernetes#2904 from klueska/kep-2902
KEP-2902: Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them
2 parents 3160491 + 3fee079 commit f5c4952

3 files changed: +299 -0 lines changed

Lines changed: 3 additions & 0 deletions

kep-number: 2902
alpha:
  approver: "@johnbelamaric"

Lines changed: 256 additions & 0 deletions
# KEP-2902: Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Scalability](#scalability)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

Kubernetes 1.22 introduced a new framework for adding `CPUManager` policy options (#2625).
These new policy options allow one to tweak the behaviour of a given `CPUManager` policy without the need to introduce an entirely new policy.
Moreover, these policy options can build on one another, such that multiple tweaks can be made to a policy in an additive fashion.
The first option introduced in conjunction with #2625 allows one to ensure that only **full** CPUs are allocated to a container, rather than handing out individual hyperthreads from each CPU to different containers.
This KEP introduces a new `CPUManager` policy option to ensure that CPU allocations are evenly distributed across NUMA nodes in cases where more than one NUMA node is required to satisfy the allocation.

## Motivation

By default, the `CPUManager` will pack CPUs onto one NUMA node until it is filled, with any remaining CPUs simply spilling over to the next NUMA node.
This can cause undesired bottlenecks in parallel code relying on barriers (and similar synchronization primitives), as this type of code tends to run only as fast as its slowest worker (which is slowed down by the fact that fewer CPUs are available on at least one NUMA node).
By distributing CPUs evenly across NUMA nodes, application developers can more easily ensure that no single worker suffers from NUMA effects more than any other, improving the overall performance of these types of applications.

### Goals

* Enable parallel algorithms to run more efficiently when they request more CPUs than can be allocated by a single NUMA node

### Non-Goals

* Provide a general solution for all types of CPU distributions across NUMA nodes

## Proposal

We propose to add a new `CPUManager` policy option called `distribute-cpus-across-numa` to the `static` `CPUManager` policy.
When enabled, this will trigger the `CPUManager` to evenly distribute CPUs across NUMA nodes in cases where more than one NUMA node is required to satisfy the allocation.
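
To make the opt-in concrete, the sketch below shows one way such a policy option could be parsed out of the kubelet's free-form policy-options map into a typed flag on the static policy. The names used here (`staticPolicyOptions`, `parseStaticPolicyOptions`) are illustrative only and are not the kubelet's actual symbols:

```go
package main

import (
	"fmt"
	"strconv"
)

// DistributeCPUsAcrossNUMAOption is the name under which the proposed option
// would appear in the kubelet's map of CPUManager policy options.
const DistributeCPUsAcrossNUMAOption = "distribute-cpus-across-numa"

// staticPolicyOptions collects the tweaks layered on top of the static policy.
// Hypothetical type for illustration; not the kubelet's actual struct.
type staticPolicyOptions struct {
	DistributeCPUsAcrossNUMA bool
}

// parseStaticPolicyOptions turns the free-form option map from the kubelet
// configuration into a typed struct, rejecting unknown option names.
func parseStaticPolicyOptions(opts map[string]string) (staticPolicyOptions, error) {
	var parsed staticPolicyOptions
	for name, value := range opts {
		switch name {
		case DistributeCPUsAcrossNUMAOption:
			enabled, err := strconv.ParseBool(value)
			if err != nil {
				return parsed, fmt.Errorf("bad value %q for option %q: %v", value, name, err)
			}
			parsed.DistributeCPUsAcrossNUMA = enabled
		default:
			return parsed, fmt.Errorf("unknown CPUManager policy option: %q", name)
		}
	}
	return parsed, nil
}

func main() {
	opts, err := parseStaticPolicyOptions(map[string]string{
		DistributeCPUsAcrossNUMAOption: "true",
	})
	fmt.Printf("options: %+v, err: %v\n", opts, err)
}
```

Because the option is additive, it composes with other static-policy options (such as the existing full-CPU-only option) rather than replacing them.
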
### Risks and Mitigations

The risks of adding this new feature are quite low.
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself and by the `CPUManagerPolicyExperimentalOptions` feature gate (which is disabled by default).

| Risk | Impact | Mitigation |
| ------------------------------------------------- | ------ | ---------- |
| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. The workload will run but with CPU packing semantics - as it was before this new policy option was added. |

## Design Details

When `distribute-cpus-across-numa` is passed as a policy option, the following algorithm will be run to distribute CPUs across NUMA nodes instead of packing them:

```
For each NUMA node:
    * If all requested CPUs can be allocated from this single NUMA node;
      --> do the allocation

For each pair of NUMA nodes:
    * If the set of requested CPUs (modulo 2) can be evenly split across the 2 NUMA nodes; AND
    * Any remaining CPUs (after the modulo operation) can be striped across some subset of the NUMA nodes;
      --> do the allocation

For each 3-tuple of NUMA nodes:
    * If the set of requested CPUs (modulo 3) can be evenly distributed across the 3 NUMA nodes; AND
    * Any remaining CPUs (after the modulo operation) can be striped across some subset of the NUMA nodes;
      --> do the allocation

...

For the set of all NUMA nodes:
    * If the set of requested CPUs (modulo NUM_NUMA_NODES) can be evenly distributed across all NUMA nodes; AND
    * Any remaining CPUs (after the modulo operation) can be striped across some subset of the NUMA nodes;
      --> do the allocation
```

If none of the above conditions can be met, fall back to a best-effort fit of packing CPUs onto NUMA nodes wherever they fit.

NOTE: The striping operation after all CPUs have been evenly distributed will be performed such that the overall distribution of CPUs across those NUMA nodes remains as balanced as possible.
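
The following Go sketch models the search described above under a simplifying assumption: it tracks only per-NUMA free CPU *counts* rather than concrete CPU sets, and it stripes leftover CPUs onto the first nodes that can take them. It is not the kubelet implementation, just an illustration of the selection logic:

```go
package main

import (
	"fmt"
	"sort"
)

// distribute tries to spread "request" CPUs across the smallest number of
// NUMA nodes that allows an even split, mirroring the algorithm above.
// "free" maps NUMA node ID -> number of allocatable CPUs on that node.
// It returns a per-node CPU count, or ok=false if no balanced fit exists
// (the caller would then fall back to the default packed allocation).
func distribute(free map[int]int, request int) (alloc map[int]int, ok bool) {
	nodes := make([]int, 0, len(free))
	for id := range free {
		nodes = append(nodes, id)
	}
	sort.Ints(nodes)

	for k := 1; k <= len(nodes); k++ {
		base := request / k      // CPUs every chosen node must provide
		remainder := request % k // leftover CPUs, striped one per node
		for _, combo := range combinations(nodes, k) {
			alloc = map[int]int{}
			extras := remainder
			feasible := true
			for _, id := range combo {
				want := base
				if extras > 0 && free[id] > base {
					want++ // stripe one leftover CPU onto this node
					extras--
				}
				if free[id] < want {
					feasible = false
					break
				}
				alloc[id] = want
			}
			if feasible && extras == 0 {
				return alloc, true
			}
		}
	}
	return nil, false
}

// combinations returns every k-element subset of ids, preserving order.
func combinations(ids []int, k int) [][]int {
	if k == 0 {
		return [][]int{{}}
	}
	if len(ids) < k {
		return nil
	}
	var out [][]int
	for _, rest := range combinations(ids[1:], k-1) {
		out = append(out, append([]int{ids[0]}, rest...))
	}
	return append(out, combinations(ids[1:], k)...)
}

func main() {
	// Four NUMA nodes with 36 allocatable CPUs each; a request for 40 CPUs
	// cannot fit on one node, so it is split 20/20 across two nodes rather
	// than packed as 36 + 4.
	free := map[int]int{0: 36, 1: 36, 2: 36, 3: 36}
	fmt.Println(distribute(free, 40)) // map[0:20 1:20] true
}
```
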
### Test Plan

We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.
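
As a rough illustration of the kind of unit test that could be added, the sketch below checks the balanced-spread property on hard-coded per-NUMA allocations; in the real suite the allocations would come from the static policy itself rather than being hand-written:

```go
package cpumanager_test

import "testing"

// spreadIsBalanced reports whether a per-NUMA CPU allocation differs by at
// most one CPU between any two NUMA nodes that received CPUs.
func spreadIsBalanced(alloc map[int]int) bool {
	min, max := -1, -1
	for _, n := range alloc {
		if min == -1 || n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max-min <= 1
}

// TestDistributeAcrossNUMA sketches the shape of a unit test for the new
// policy option: given a request that cannot fit on one NUMA node, the
// resulting allocation should be evenly spread rather than packed.
func TestDistributeAcrossNUMA(t *testing.T) {
	cases := []struct {
		name  string
		alloc map[int]int
		want  bool
	}{
		{"evenly split across two nodes", map[int]int{0: 20, 1: 20}, true},
		{"one striped remainder CPU", map[int]int{0: 4, 1: 3}, true},
		{"packed allocation", map[int]int{0: 36, 1: 4}, false},
	}
	for _, c := range cases {
		if got := spreadIsBalanced(c.alloc); got != c.want {
			t.Errorf("%s: spreadIsBalanced() = %v, want %v", c.name, got, c.want)
		}
	}
}
```
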
### Graduation Criteria

#### Alpha

- [X] Implement the new policy option.
- [X] Ensure proper unit tests are in place.
- [X] Ensure proper e2e node tests are in place.

#### Beta

- [X] Gather feedback from consumers of the new policy option.
- [X] Verify no major bugs reported in the previous cycle.

#### GA

- [X] Allow time for feedback (1 year).
- [X] Make sure all risks have been addressed.

### Upgrade / Downgrade Strategy

We expect no impact. The new policy option is opt-in and orthogonal to the existing ones.

### Version Skew Strategy

No changes needed.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `CPUManagerPolicyExperimentalOptions`
  - Components depending on the feature gate: `kubelet`
- [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `distribute-cpus-across-numa`
  - Will enabling / disabling the feature require downtime of the control plane? No
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). Yes -- a kubelet restart is required.

###### Does enabling the feature change any default behavior?

No. In order to trigger any of the new logic, three things have to be true:
1. The `CPUManagerPolicyExperimentalOptions` feature gate must be enabled
1. The `static` `CPUManager` policy must be selected
1. The new `distribute-cpus-across-numa` policy option must be selected

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled by any of the following:
1. Disabling the `CPUManagerPolicyExperimentalOptions` feature gate
1. Switching the `CPUManager` policy to `none`
1. Removing `distribute-cpus-across-numa` from the list of `CPUManager` policy options

Existing workloads will continue to run uninterrupted, with any future workloads having their CPUs allocated according to the policy in place after the rollback.

###### What happens if we reenable the feature if it was previously rolled back?

No changes. Existing containers will not see their allocations changed. New containers will.

###### Are there any tests for feature enablement/disablement?

- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.

###### How can someone using this feature know that it is working for their instance?

In order to verify this feature is working, one should:
1. Pick a node with at least 2 NUMA nodes in your cluster
1. Ensure no other pods with exclusive CPUs are running on that node
1. Launch a pod with a `nodeSelector` to that node that has a single container in it
1. Run a `sleep infinity` command and request exclusive CPUs for the container in the amount of (1 + NUM\_CPUS\_PER\_NUMA\_NODE)
1. Verify that the CPUs allocated to the container are evenly distributed across 2 NUMA nodes instead of packed

To verify the list of CPUs allocated to the container, one can either:
- `exec` into the container and run `taskset -cp 1` (assuming this command is available in the container).
- Call the `GetCPUs()` method of the `CPUsProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider); a minimal client sketch is shown below.
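
For the second option, the in-process `CPUsProvider` is what backs the kubelet's pod-resources gRPC endpoint; from outside the kubelet process the same CPU assignments can be read with that endpoint's `List` call. Below is a minimal client sketch, assuming it runs directly on the node and that the pod-resources socket lives at its default path; adjust the socket path and module versions for your environment:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default pod-resources socket path; adjust if the kubelet is configured
	// with a different root directory.
	const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("connecting to kubelet pod-resources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("listing pod resources: %v", err)
	}

	// Print the exclusive CPU IDs assigned to every container; cross-check
	// these IDs against /sys/devices/system/node/node*/cpulist to see how
	// they map onto NUMA nodes.
	for _, pod := range resp.GetPodResources() {
		for _, ctr := range pod.GetContainers() {
			fmt.Printf("%s/%s: cpus=%v\n", pod.GetName(), ctr.GetName(), ctr.GetCpuIds())
		}
	}
}
```
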
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

There are no specific SLOs for this feature.
Parallel workloads will benefit from this feature in application-specific ways.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

None

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

None

###### Does this feature depend on any specific services running in the cluster?

This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field.
This has been available since `v1alpha2`.

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

The algorithm required to implement this feature could delay:
1. Pod admission time
2. The time it takes to launch each container after pod admission

This delay should be minimal.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No, the algorithm will run on a single `goroutine` with minimal memory requirements.

## Implementation History

- 2021-08-26: Initial KEP created
- 2021-08-30: Updates to fill out more sections, answer PRR questions
- 2021-09-08: Change feature gate from `CPUManagerPolicyOptions` to `CPUManagerPolicyExperimentalOptions`

Lines changed: 40 additions & 0 deletions
title: CPUManager Policy Option to Distribute CPUs Across NUMA Nodes Instead of Packing Them
kep-number: 2902
authors:
  - "@klueska"
owning-sig: sig-node
participating-sigs: []
status: implementable
creation-date: "2021-08-26"
reviewers:
  - "@fromani"
approvers:
  - "@sig-node-leads"
see-also:
  - "keps/sig-node/2625-cpumanager-policies-thread-placement"
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.23"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.23"
  beta: "v1.24"
  stable: "v1.25"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: "CPUManagerPolicyExperimentalOptions"
    components:
      - kubelet
disable-supported: true

# The following PRR answers are required at beta release
metrics: []
