Skip to content

Commit 7133fed

Browse files
committed
Updating KEP to add more details & align with latest KEP format
1 parent 3cef13f commit 7133fed

File tree

2 files changed

+251
-27
lines changed

2 files changed

+251
-27
lines changed

keps/sig-node/3327-align-by-socket/README.md

Lines changed: 249 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,12 @@
99
- [Proposal](#proposal)
1010
- [Risks and Mitigations](#risks-and-mitigations)
1111
- [Design Details](#design-details)
12+
- [Proposed Change](#proposed-change)
1213
- [Test Plan](#test-plan)
14+
- [Prerequisite testing updates](#prerequisite-testing-updates)
15+
- [Unit tests](#unit-tests)
16+
- [Integration tests](#integration-tests)
17+
- [e2e tests](#e2e-tests)
1318
- [Graduation Criteria](#graduation-criteria)
1419
- [Alpha](#alpha)
1520
- [Beta](#beta)
@@ -18,9 +23,14 @@
1823
- [Version Skew Strategy](#version-skew-strategy)
1924
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
2025
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
26+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
2127
- [Monitoring Requirements](#monitoring-requirements)
28+
- [Dependencies](#dependencies)
2229
- [Scalability](#scalability)
30+
- [Troubleshooting](#troubleshooting)
2331
- [Implementation History](#implementation-history)
32+
- [Drawbacks](#drawbacks)
33+
- [Alternatives](#alternatives)
2434
<!-- /toc -->
2535

2636
## Release Signoff Checklist
@@ -53,8 +63,7 @@ Starting with Kubernetes 1.22, a new `CPUManager` flag has facilitated the use o
5363
These policy options work together to ensure an optimized cpu set is allocated for workloads running on a cluster.
5464
The two policy options that already exist are `full-pcpus-only`(#2625) and `distribute-cpus-across-numa` (#2902).
5565
With this KEP, a new `CPUManager` policy option is introduced which ensures that all CPUs on a socket are considered to be aligned.
56-
Thus, the `CPUManager` will send a broader set of hints to `TopologyManager`, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by `DeviceManager`.
57-
66+
Thus, the `CPUManager` will send a broader set of preferred hints to `TopologyManager`, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by `DeviceManager`.
5867

5968
## Motivation
6069

@@ -75,7 +84,8 @@ The best possible alignment of CPUs with other resources(viz. Which are managed
7584

7685
## Proposal
7786

78-
We propose to add a new `CPUManager` policy option called align-by-socket to the static `CPUManager` policy. With this policy, the `CPUManager` will prefer those hints which are within the same socket (as opposed to just within the same NUMA node) if it is possible to have all CPUs allocated from the same socket.
87+
We propose to add a new `CPUManager` policy option called `align-by-socket` to the `static` policy of `CPUManager`.
88+
With this policy option, the `CPUManager` will prefer those hints in which all CPUs are within same socket in addition to exisiting hints which require minimum NUMA nodes. With this policy option CPUs will be considered aligned at socket boundary instead of NUMA boundary during allocation. Thus if best hint consist of NUMA nodes within one socket, `CPUManager` may try to assign available CPUs from all NUMA nodes of socket.
7989

8090
### Risks and Mitigations
8191

@@ -90,47 +100,148 @@ It is isolated to a specific policy option within the `CPUManager`, and is prote
90100

91101
### Proposed Change
92102

93-
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
103+
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s function `GetTopologyHints` will prefer hints which are socket aligned in addition to hints which require minimum number of NUMA nodes.
94104

95-
To achieve this, the following updates are needed to the GetTopologyHints() function:
96-
```
105+
To achieve this, the following updates are needed to the `generateCPUTopologyHints` function of `static` policy of `CPUManager`:
106+
107+
```go
97108
func (p *staticPolicy) generateCPUTopologyHints(availableCPUs cpuset.CPUSet, reusableCPUs cpuset.CPUSet, request int) []topologymanager.TopologyHint {
98109
...
99110

100-
// Loop back through all hints and update the 'Preferred' field based on
101-
// counting the number of bits sets in the affinity mask and comparing it
102-
// to the minAffinitySize. Only those with an equal number of bits set (and
103-
// with a minimal set of numa nodes) will be considered preferred.
104-
for i := range hints {
105-
if p.options.AlignBySocket && isSocketAligned(hints[i].NUMANodeAffinity) {
106-
hints[i].Preferred = true
107-
continue
108-
}
109-
if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
110-
hints[i].Preferred = true
111-
}
112-
}
111+
// Loop back through all hints and update the 'Preferred' field based on
112+
// counting the number of bits sets in the affinity mask and comparing it
113+
// to the minAffinitySize. Those with an equal number of bits set (and
114+
// with a minimal set of numa nodes) will be considered preferred.
115+
// If align-by-socket policy option is enabled, socket aligned hints are
116+
// also considered preferred.
117+
for i := range hints {
118+
if p.options.AlignBySocket && isSocketAligned(hints[i].NUMANodeAffinity) {
119+
hints[i].Preferred = true
120+
continue
121+
}
122+
if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
123+
hints[i].Preferred = true
124+
}
125+
}
113126

114127
return hints
115128
}
116-
117129
```
130+
118131
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
119132

120-
During CPU allocation, in function `allocatedCPUs()`, `alignedCPUs` will consist of CPUs which are socket aligned instead of all CPUs from NUMA nodes in `numaAffinity` hint when `align-by-socket` policy option is enabled.
121-
This will ensure that best effort to align CPUs by socket is made for alloction.
133+
During CPU allocation, in function `allocatedCPUs()`, `alignedCPUs` will consist of CPUs which are socket aligned instead of CPUs from NUMA nodes in `numaAffinity` hint when `align-by-socket` policy option is enabled.
122134

123-
In case `TopologyManager` `single-numa-node` policy is enabled, the policy option of `align-by-socket` is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of `align-by-socket` is enabled in conjunction with `TopologyManager` `single-numa-node` policy.
135+
```go
136+
func (p *staticPolicy) allocateCPUs(s state.State, numCPUs int, numaAffinity bitmask.BitMask, reusableCPUs cpuset.CPUSet) (cpuset.CPUSet, error) {
137+
...
138+
if numaAffinity != nil {
139+
alignedCPUs := cpuset.NewCPUSet()
140+
141+
bits := numaAffinity.GetBits()
142+
// If align-by-socket policy option is enabled, NUMA based hint is expanded to
143+
// socket aligned hint. It will ensure that first socket aligned available CPUs are
144+
// allocated before we try to find CPUs across socket to satify allocation request.
145+
if p.PolicyOptions.AlignBySocket {
146+
bits = p.topology.ExpandToFullSocketBits(bits)
147+
}
148+
for _, numaNodeID := range bits {
149+
alignedCPUs = alignedCPUs.Union(assignableCPUs.Intersection(p.topology.CPUDetails.CPUsInNUMANodes(numaNodeID)))
150+
}
151+
152+
...
153+
}
154+
```
155+
This will ensure that for purpose of allocation, CPUs are considered aligned at socket boundary rather than NUMA boundary
124156
125-
The policyOption `align-by-socket` can work in conjunction with `TopologyManager` `best-effort` and `restricted` policy without any conflict.
126-
Above policy options will work well for general case where number of NUMA nodes per socket are one or more.
157+
`align-by-socket` policy options will work well for general case where number of NUMA nodes per socket are one or more.
127158
In rare cases like `DualNumaMultiSocketPerNumaHT` where one NUMA can span multiple socket, above option is not applicable.
128-
We will error out in cases when `align-by-socket` is enabled when underlying topology consist of multiple socket per NUMA.
159+
We will error out in cases when `align-by-socket` is enabled and underlying topology consist of multiple socket per NUMA.
129160
We may address such scenarios in future if there is a usecase for it in real world.
130161
162+
In cases where number of NUMA nodes per socket is one or more as well as `TopologyManager` `single-numa-node` policy is enabled, the policy option of `align-by-socket` is redundant since allocation guarantees within the same NUMA are by definition socket aligned.
163+
Hence, we will error out in case the policy option of `align-by-socket` is enabled along with `TopologyManager` `single-numa-node` policy.
164+
165+
The policyOption `align-by-socket` can work in conjunction with `TopologyManager` `best-effort` and `restricted` policy without any conflict.
166+
131167
### Test Plan
132168
133-
We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.
169+
<!--
170+
**Note:** *Not required until targeted at a release.*
171+
The goal is to ensure that we don't accept enhancements with inadequate testing.
172+
173+
All code is expected to have adequate tests (eventually with coverage
174+
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
175+
when drafting this test plan.
176+
177+
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
178+
-->
179+
180+
[X] I/we understand the owners of the involved components may require updates to
181+
existing tests to make this code solid enough prior to committing the changes necessary
182+
to implement this enhancement.
183+
184+
##### Prerequisite testing updates
185+
186+
<!--
187+
Based on reviewers feedback describe what additional tests need to be added prior
188+
implementing this enhancement to ensure the enhancements have also solid foundations.
189+
-->
190+
191+
##### Unit tests
192+
193+
<!--
194+
In principle every added code should have complete unit test coverage, so providing
195+
the exact set of tests will not bring additional value.
196+
However, if complete unit test coverage is not possible, explain the reason of it
197+
together with explanation why this is acceptable.
198+
-->
199+
200+
<!--
201+
Additionally, for Alpha try to enumerate the core package you will be touching
202+
to implement this enhancement and provide the current unit coverage for those
203+
in the form of:
204+
- <package>: <date> - <current test coverage>
205+
The data can be easily read from:
206+
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
207+
208+
This can inform certain test coverage improvements that we want to do before
209+
extending the production code to implement this enhancement.
210+
-->
211+
212+
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/policy_static.go`: `06-13-2022` - `91.1`
213+
214+
##### Integration tests
215+
216+
<!--
217+
This question should be filled when targeting a release.
218+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
219+
220+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
221+
https://storage.googleapis.com/k8s-triage/index.html
222+
-->
223+
224+
- These cases will be added in the existing integration tests:
225+
- Feature gate enable/disable tests
226+
- `align-by-socket` policy option works as expected. When policy option is enabled
227+
- `generateCPUTopologyHints` prefers socket aligned hints in conjunction with hints with minimum NUMA nodes.
228+
- `allocateCPUs` allocated CPU at socket boundary.
229+
- Verify no significant performance degradation
230+
231+
##### e2e tests
232+
233+
<!--
234+
This question should be filled when targeting a release.
235+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
236+
237+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
238+
https://storage.googleapis.com/k8s-triage/index.html
239+
240+
We expect no non-infra related flakes in the last month as a GA graduation criteria.
241+
-->
242+
- These cases will be added in the existing e2e tests:
243+
- Feature gate enable/disable tests
244+
- `align-by-socket` policy option works as expected.
134245
135246
### Graduation Criteria
136247
@@ -198,6 +309,45 @@ No changes. Existing container will not see their allocation changed. New contai
198309
199310
- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)
200311
312+
### Rollout, Upgrade and Rollback Planning
313+
314+
<!--
315+
This section must be completed when targeting beta to a release.
316+
-->
317+
318+
###### How can a rollout or rollback fail? Can it impact already running workloads?
319+
320+
<!--
321+
Try to be as paranoid as possible - e.g., what if some components will restart
322+
mid-rollout?
323+
324+
Be sure to consider highly-available clusters, where, for example,
325+
feature flags will be enabled on some API servers and not others during the
326+
rollout. Similarly, consider large clusters and how enablement/disablement
327+
will rollout across nodes.
328+
-->
329+
330+
###### What specific metrics should inform a rollback?
331+
332+
<!--
333+
What signals should users be paying attention to when the feature is young
334+
that might indicate a serious problem?
335+
-->
336+
337+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
338+
339+
<!--
340+
Describe manual testing that was done and the outcomes.
341+
Longer term, we may want to require automated upgrade/rollback tests, but we
342+
are missing a bunch of machinery and tooling and can't do that now.
343+
-->
344+
345+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
346+
347+
<!--
348+
Even if applying deprecation policies, they may still surprise some users.
349+
-->
350+
201351
### Monitoring Requirements
202352
203353
###### How can an operator determine if the feature is in use by workloads?
@@ -235,6 +385,29 @@ None
235385
This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field.
236386
This has been available since `v1alpha2`.
237387
388+
### Dependencies
389+
390+
<!--
391+
This section must be completed when targeting beta to a release.
392+
-->
393+
394+
###### Does this feature depend on any specific services running in the cluster?
395+
396+
<!--
397+
Think about both cluster-level services (e.g. metrics-server) as well
398+
as node-level agents (e.g. specific version of CRI). Focus on external or
399+
optional services that are needed. For example, if this feature depends on
400+
a cloud provider API, or upon an external software-defined storage or network
401+
control plane.
402+
403+
For each of these, fill in the following—thinking about running existing user workloads
404+
and creating new ones, as well as about cluster-level services (e.g. DNS):
405+
- [Dependency name]
406+
- Usage description:
407+
- Impact of its outage on the feature:
408+
- Impact of its degraded performance or high-error rates on the feature:
409+
-->
410+
238411
### Scalability
239412
240413
###### Will enabling / using this feature result in any new API calls?
@@ -265,7 +438,56 @@ This delay should be minimal.
265438
266439
No, the algorithm will run on a single `goroutine` with minimal memory requirements.
267440
441+
### Troubleshooting
442+
443+
<!--
444+
This section must be completed when targeting beta to a release.
445+
446+
For GA, this section is required: approvers should be able to confirm the
447+
previous answers based on experience in the field.
448+
449+
The Troubleshooting section currently serves the `Playbook` role. We may consider
450+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
451+
details). For now, we leave it here.
452+
-->
453+
454+
###### How does this feature react if the API server and/or etcd is unavailable?
455+
456+
###### What are other known failure modes?
457+
458+
<!--
459+
For each of them, fill in the following information by copying the below template:
460+
- [Failure mode brief description]
461+
- Detection: How can it be detected via metrics? Stated another way:
462+
how can an operator troubleshoot without logging into a master or worker node?
463+
- Mitigations: What can be done to stop the bleeding, especially for already
464+
running user workloads?
465+
- Diagnostics: What are the useful log messages and their required logging
466+
levels that could help debug the issue?
467+
Not required until feature graduated to beta.
468+
- Testing: Are there any tests for failure mode? If not, describe why.
469+
-->
470+
471+
###### What steps should be taken if SLOs are not being met to determine the problem?
472+
268473
## Implementation History
269474
270475
- 2022-06-02: Initial KEP created
271476
- 2022-06-08: Addressed review comments on KEP
477+
478+
## Drawbacks
479+
480+
<!--
481+
Why should this KEP _not_ be implemented?
482+
-->
483+
`align-by-socket` policy option when enabled, it might result CPU allocation which may not be perfectly aligned by NUMA with other resources managed by `DeviceManager`.
484+
485+
## Alternatives
486+
487+
<!--
488+
What other approaches did you consider, and why did you rule them out? These do
489+
not need to be as detailed as the proposal, but should include enough
490+
information to express the idea and why it was not acceptable.
491+
-->
492+
`align-by-socket` can alternatively be introduced as policy of `TopologyManager` which can choose hints from hint provider which are socket aligned.
493+
However, it is concluded that fuction of `TopologyManager` is to act as arbitrator for hints provider and should make decisions based on hints provided rather than influencing the `BestHint` based on its own policy.

keps/sig-node/3327-align-by-socket/kep.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,8 @@ participating-sigs: []
99
status: implementable
1010
creation-date: "2022-06-02"
1111
reviewers:
12+
- "@swatisehgal"
13+
- "@fromanirh"
1214
approvers:
1315
- "@sig-node-leads"
1416
see-also:

0 commit comments

Comments
 (0)