Commit 32b4019

Initial commit of KEP-3327
1 parent 5000eab commit 32b4019

3 files changed

+296
-0
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 3327
2+
alpha:
3+
approver:
Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
1+
# KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
9+
- [Proposal](#proposal)
10+
- [Risks and Mitigations](#risks-and-mitigations)
11+
- [Design Details](#design-details)
12+
- [Test Plan](#test-plan)
13+
- [Graduation Criteria](#graduation-criteria)
14+
- [Alpha](#alpha)
15+
- [Beta](#beta)
16+
- [GA](#ga)
17+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
18+
- [Version Skew Strategy](#version-skew-strategy)
19+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
20+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
21+
- [Monitoring Requirements](#monitoring-requirements)
22+
- [Scalability](#scalability)
23+
- [Implementation History](#implementation-history)
24+
<!-- /toc -->
25+
26+
## Release Signoff Checklist
27+
28+
Items marked with (R) are required *prior to targeting to a milestone / release*.
29+
30+
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
31+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
32+
- [ ] (R) Design details are appropriately documented
33+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
34+
- [ ] e2e Tests for all Beta API Operations (endpoints)
35+
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
36+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
37+
- [ ] (R) Graduation criteria is in place
38+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
39+
- [ ] (R) Production readiness review completed
40+
- [ ] (R) Production readiness review approved
41+
- [ ] "Implementation History" section is up-to-date for milestone
42+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
43+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
44+
45+
[kubernetes.io]: https://kubernetes.io/
46+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
47+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
48+
[kubernetes/website]: https://git.k8s.io/website
49+
50+
## Summary
51+
52+
Kubernetes 1.22 introduced a CPUManager flag that enables CPUManager policy options (#2625), which let users customize the behavior of a policy to suit their workload requirements without introducing an entirely new policy. These policy options work together to ensure that an optimized CPU set is allocated to workloads running on a cluster. Two policy options already exist: full-pcpus-only (#2625) and distribute-cpus-across-numa (#2902). This KEP introduces a new CPUManager policy option that treats all CPUs on a socket as aligned. With it, the CPUManager sends a broader set of preferred hints to the TopologyManager, which increases the likelihood that the best hint is socket aligned with respect to CPUs and the other devices managed by the DeviceManager.
53+
54+
55+
## Motivation
56+
57+
As CPU architectures have evolved, the number of NUMA nodes per socket has increased. The devices managed by the DeviceManager may not be uniformly distributed across all NUMA nodes, so there are scenarios in which perfect NUMA alignment between devices and CPUs is not possible. When NUMA alignment is not possible, latency-sensitive applications want their resources to be aligned at least within the same socket. By default, the CPUManager prefers CPU allocations that require the minimum number of NUMA nodes. However, if the NUMA nodes selected for an allocation are spread across sockets, performance degrades. Ensuring that the selected NUMA nodes are socket aligned makes performance more predictable. The best possible alignment of CPUs with other resources (namely, those managed by the DeviceManager) is crucial for predictable performance in latency-sensitive applications.
58+
59+
### Goals
60+
* Align CPUs at the socket boundary by increasing the probability that the selected hint contains only NUMA nodes from the same socket, so that latency-sensitive applications and parallel algorithms run more efficiently and predictably.
61+
62+
### Non-Goals
63+
* Guarantee optimal NUMA allocation for cpu distribution.
64+
65+
## Proposal
66+
67+
We propose adding a new CPUManager policy option called align-by-socket to the static CPUManager policy. With this option enabled, the CPUManager will prefer hints that stay within the same socket (as opposed to just within the same NUMA node) whenever all requested CPUs can be allocated from a single socket.
68+
69+
### Risks and Mitigations
70+
71+
The risks of adding this new feature are quite low.
72+
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself and by the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
73+
74+
| Risk | Impact | Mitigation |
75+
| -------------------------------------------------| -------| ---------- |
76+
| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. Workloads will still run, but their CPU allocations may be spread across sockets even when an allocation within a single socket would have been possible. |
77+
78+
## Design Details
79+
80+
### Proposed Change
81+
82+
When align-by-socket is enabled as a policy option, the CPUManager’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
83+
84+
To achieve this, the following update is needed in generateCPUTopologyHints(), the helper that builds the hints returned by GetTopologyHints():
85+
```
func (p *staticPolicy) generateCPUTopologyHints(availableCPUs cpuset.CPUSet, reusableCPUs cpuset.CPUSet, request int) []topologymanager.TopologyHint {
	...

	// Loop back through all hints and update the 'Preferred' field based on
	// counting the number of bits set in the affinity mask and comparing it
	// to the minAffinitySize. Only those with an equal number of bits set (and
	// with a minimal set of numa nodes) will be considered preferred.
	for i := range hints {
		// With align-by-socket enabled, any hint whose NUMA nodes all belong
		// to the same socket is preferred, even if it spans more NUMA nodes
		// than minAffinitySize.
		if p.options.AlignBySocket && isSocketAligned(hints[i].NUMANodeAffinity) {
			hints[i].Preferred = true
			continue
		}
		if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
			hints[i].Preferred = true
		}
	}

	return hints
}
```
107+
At the end, we will have a list of hints with their `Preferred` fields set. These hints are then passed to the TopologyManager, whose job is to select the best hint. With align-by-socket enabled, the selected hint is more likely to be one whose CPUs are aligned by socket.
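
The snippet above calls `isSocketAligned` without defining it. The following is a minimal, self-contained sketch of what such a check could look like. It is an illustration only: the `hint` type, the `numaToSocket` map, the `markPreferred` function, and the use of a plain `uint64` bitmask are all assumptions made for this example and stand in for the kubelet's real `TopologyHint`, topology discovery, and bitmask types.

```go
package main

import (
	"fmt"
	"math/bits"
)

// numaToSocket is a hypothetical topology: 2 sockets with 2 NUMA nodes each.
var numaToSocket = map[int]int{0: 0, 1: 0, 2: 1, 3: 1}

// hint is a simplified stand-in for a topology hint. Bit i of affinity
// represents NUMA node i.
type hint struct {
	affinity  uint64
	preferred bool
}

// isSocketAligned reports whether every NUMA node set in mask belongs to the
// same socket. An empty mask or an unknown NUMA node is treated as not aligned.
func isSocketAligned(mask uint64) bool {
	socket := -1
	for node := 0; node < 64; node++ {
		if mask&(1<<uint(node)) == 0 {
			continue
		}
		s, ok := numaToSocket[node]
		if !ok {
			return false
		}
		if socket == -1 {
			socket = s
		} else if s != socket {
			return false
		}
	}
	return socket != -1
}

// markPreferred mirrors the loop shown in the KEP on the simplified types.
func markPreferred(hints []hint, alignBySocket bool, minAffinitySize int) {
	for i := range hints {
		if alignBySocket && isSocketAligned(hints[i].affinity) {
			hints[i].preferred = true
			continue
		}
		if bits.OnesCount64(hints[i].affinity) == minAffinitySize {
			hints[i].preferred = true
		}
	}
}

func main() {
	hints := []hint{{affinity: 0b0001}, {affinity: 0b0011}, {affinity: 0b0110}}
	markPreferred(hints, true, 1)
	for _, h := range hints {
		fmt.Printf("affinity=%04b preferred=%v\n", h.affinity, h.preferred)
	}
}
```

In this hypothetical topology, the hint covering NUMA nodes 0 and 1 becomes preferred even though it spans two NUMA nodes, because both sit on socket 0. The hint covering nodes 1 and 2 crosses the socket boundary and stays non-preferred.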
108+
109+
If the TopologyManager "single-numa-node" policy is enabled, the "align-by-socket" policy option is redundant, because an allocation guaranteed to come from a single NUMA node is by definition socket aligned. Hence, we will return an error if the "align-by-socket" policy option is enabled together with the TopologyManager single-numa-node policy.
110+
111+
The align-by-socket policy option works with the TopologyManager "best-effort" and "restricted" policies without any conflict.
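
The compatibility rules above can be sketched as a small validation function. This is a hedged illustration, not the kubelet's actual validation code: the function name `validateAlignBySocket`, its signature, and the representation of policy options as a `map[string]string` are assumptions made for this example. Only the option name and the TopologyManager policy names come from the text above.

```go
package main

import "fmt"

const (
	// Names taken from the KEP text.
	optionAlignBySocket  = "align-by-socket"
	policySingleNUMANode = "single-numa-node"
)

// validateAlignBySocket is a hypothetical helper. It rejects the combination
// of the align-by-socket option with the single-numa-node TopologyManager
// policy and accepts every other combination.
func validateAlignBySocket(cpuPolicyOptions map[string]string, topologyPolicy string) error {
	if cpuPolicyOptions[optionAlignBySocket] != "true" {
		return nil // option not enabled, nothing to check
	}
	if topologyPolicy == policySingleNUMANode {
		return fmt.Errorf("CPUManager policy option %q cannot be combined with TopologyManager policy %q",
			optionAlignBySocket, policySingleNUMANode)
	}
	return nil
}

func main() {
	opts := map[string]string{optionAlignBySocket: "true"}
	for _, tp := range []string{"best-effort", "restricted", "single-numa-node"} {
		fmt.Printf("%s: %v\n", tp, validateAlignBySocket(opts, tp))
	}
}
```

Failing fast at configuration time, rather than silently ignoring the option, makes the redundancy visible to the operator.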
112+
113+
### Test Plan
114+
115+
We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.
116+
117+
### Graduation Criteria
118+
119+
#### Alpha
120+
121+
- [X] Implement the new policy option.
122+
- [X] Ensure proper unit tests are in place.
123+
- [X] Ensure proper e2e node tests are in place.
124+
125+
#### Beta
126+
127+
- [X] Gather feedback from consumers of the new policy option.
128+
- [X] Verify no major bugs reported in the previous cycle.
129+
130+
#### GA
131+
132+
- [X] Allow time for feedback (1 year).
133+
- [X] Make sure all risks have been addressed.
134+
135+
### Upgrade / Downgrade Strategy
136+
137+
We expect no impact. The new policy option is opt-in and orthogonal to the existing ones.
138+
139+
### Version Skew Strategy
140+
141+
No changes needed
142+
143+
## Production Readiness Review Questionnaire
144+
145+
### Feature Enablement and Rollback
146+
147+
###### How can this feature be enabled / disabled in a live cluster?
148+
149+
- [X] Feature gate (also fill in values in `kep.yaml`)
150+
- Feature gate name: `CPUManagerPolicyAlphaOptions`
151+
- Components depending on the feature gate: `kubelet`
152+
- [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `align-by-socket`
153+
- Will enabling / disabling the feature require downtime of the control
154+
plane? No
155+
- Will enabling / disabling the feature require downtime or reprovisioning
156+
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
157+
Yes -- a kubelet restart is required.
158+
159+
###### Does enabling the feature change any default behavior?
160+
161+
No. In order to trigger any of the new logic, three things have to be true:
162+
1. The `CPUManagerPolicyAlphaOptions` feature gate must be enabled
163+
1. The `static` `CPUManager` policy must be selected
164+
1. The new `align-by-socket` policy option must be selected
165+
166+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
167+
168+
Yes, the feature can be disabled by either:
169+
1. Disabling the `CPUManagerPolicyAlphaOptions` feature gate
170+
1. Switching the `CPUManager` policy to `none`
171+
1. Removing `align-by-socket` from the list of `CPUManager` policy options
172+
173+
Existing workloads will continue to run uninterrupted, with any future workloads having their CPUs allocated according to the policy in place after the rollback.
174+
175+
###### What happens if we reenable the feature if it was previously rolled back?
176+
177+
No changes. Existing containers will not see their allocations changed. New containers will have their CPUs allocated according to the re-enabled policy option.
178+
179+
###### Are there any tests for feature enablement/disablement?
180+
181+
- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)
182+
183+
### Monitoring Requirements
184+
185+
###### How can an operator determine if the feature is in use by workloads?
186+
187+
Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.
188+
189+
###### How can someone using this feature know that it is working for their instance?
190+
191+
In order to verify this feature is working, one should:
192+
1. Pick a node with at least 2 sockets and 8 NUMA nodes.
193+
1. Ensure no other pods with exclusive CPUs are running on that node.
194+
1. Launch 2 pods, each with a single container and a nodeSelector targeting that node.
195+
1. In each container, run a `sleep infinity` command and request exclusive CPUs in the amount of (4*NUM_CPUS_PER_NUMA_NODE - 8).
196+
1. Verify that, for both pods, all allocated CPUs are within the same socket rather than distributed across sockets.
197+
198+
To verify the list of CPUs allocated to the container, one can either:
199+
- `exec` into the container and run `taskset -cp 1` (assuming this command is available in the container).
200+
- Call the `GetCPUs()` method of the `CPUsProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider).
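
Once the CPU list is in hand, the final check can be automated. The sketch below parses a Linux-style CPU list (for example `0-13`) and checks whether every CPU maps to the same socket. The function names and the `socketOf` lookup are assumptions made for this example. The demo assumes 16 consecutively numbered CPUs per socket, which does not hold on many real machines; on a real node the mapping could instead be read from `/sys/devices/system/cpu/cpuN/topology/physical_package_id`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a CPU list such as "0-3,8" into individual CPU IDs.
func parseCPUList(list string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(list), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid CPU list entry %q: %w", part, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("invalid CPU list entry %q: %w", part, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("invalid CPU range %q", part)
		}
		for cpu := start; cpu <= end; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

// onSingleSocket reports whether all CPUs map to the same socket according to
// the caller-supplied socketOf lookup. An empty list is reported as false.
func onSingleSocket(cpus []int, socketOf func(cpu int) int) bool {
	if len(cpus) == 0 {
		return false
	}
	first := socketOf(cpus[0])
	for _, cpu := range cpus[1:] {
		if socketOf(cpu) != first {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical layout: 16 consecutively numbered CPUs per socket.
	socketOf := func(cpu int) int { return cpu / 16 }
	for _, list := range []string{"0-13", "10-23"} {
		cpus, err := parseCPUList(list)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s on a single socket: %v\n", list, onSingleSocket(cpus, socketOf))
	}
}
```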
201+
202+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
203+
204+
There are no specific SLOs for this feature.
205+
Parallel workloads will benefit from this feature in application specific ways.
206+
207+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
208+
209+
None
210+
211+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
212+
213+
None
214+
215+
###### Does this feature depend on any specific services running in the cluster?
216+
217+
This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field.
218+
This has been available since `v1alpha2`.
219+
220+
### Scalability
221+
222+
###### Will enabling / using this feature result in any new API calls?
223+
224+
No
225+
226+
###### Will enabling / using this feature result in introducing new API types?
227+
228+
No
229+
230+
###### Will enabling / using this feature result in any new calls to the cloud provider?
231+
232+
No
233+
234+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
235+
236+
No
237+
238+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
239+
240+
The algorithm required to implement this feature could delay:
241+
1. Pod admission time
242+
2. The time it takes to launch each container after pod admission
243+
244+
This delay should be minimal.
245+
246+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
247+
248+
No, the algorithm will run on a single `goroutine` with minimal memory requirements.
249+
250+
## Implementation History
251+
252+
- 2022-06-02: Initial KEP created
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
title: CPUManager policy option to align CPUs by Socket instead of by NUMA node
2+
kep-number: 3327
3+
authors:
4+
- "@klueska"
5+
- "@sanjaychatterjee"
6+
- "@arpitsardhana"
7+
owning-sig: sig-node
8+
participating-sigs: []
9+
status: implementable
10+
creation-date: "2022-06-02"
11+
reviewers:
12+
approvers:
13+
- "@sig-node-leads"
14+
see-also:
15+
- "keps/sig-node/2902-cpumanager-distribute-cpus-policy-option"
16+
replaces: []
17+
18+
# The target maturity stage in the current dev cycle for this KEP.
19+
stage: alpha
20+
21+
# The most recent milestone for which work toward delivery of this KEP has been
22+
# done. This can be the current (upcoming) milestone, if it is being actively
23+
# worked on.
24+
latest-milestone: "v1.25"
25+
26+
# The milestone at which this feature was, or is targeted to be, at each stage.
27+
milestone:
28+
alpha: "v1.25"
29+
beta: "v1.26"
30+
stable: "v1.27"
31+
32+
# The following PRR answers are required at alpha release
33+
# List the feature gate name and the components for which it must be enabled
34+
feature-gates:
35+
- name: "CPUManagerPolicyAlphaOptions"
36+
components:
37+
- kubelet
38+
disable-supported: true
39+
40+
# The following PRR answers are required at beta release
41+
metrics: []
