Skip to content

Commit baa7ccd

Browse files
committed
Addressing review comments
1 parent 32b4019 commit baa7ccd

File tree

3 files changed

+34
-15
lines changed

3 files changed

+34
-15
lines changed
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
11
kep-number: 3327
22
alpha:
3-
approver:
3+
approver: "@johnbelamaric"

keps/sig-node/3327-align-by-socket/README.md

Lines changed: 32 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -32,10 +32,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
3232
- [ ] (R) Design details are appropriately documented
3333
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
3434
- [ ] e2e Tests for all Beta API Operations (endpoints)
35-
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
35+
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
3636
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
3737
- [ ] (R) Graduation criteria is in place
38-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
38+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
3939
- [ ] (R) Production readiness review completed
4040
- [ ] (R) Production readiness review approved
4141
- [ ] "Implementation History" section is up-to-date for milestone
@@ -49,37 +49,48 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4949

5050
## Summary
5151

52-
Starting with Kubernetes 1.22, a new CPUManager flag has facilitated the use of CPUManager Policy options(#2625) which enable users to customize their behavior based on workload requirements without having to introduce an entirely new policy. These policy options work together to ensure an optimized cpu set is allocated for workloads running on cluster. The two policy options that already exist are full-pcpus-only(#2625) and distribute-cpus-across-numa (#2902). With this KEP, new CPUManager policy option is introduced which ensures that all CPUs on a socket are considered to be aligned. Thus CPUManager will send a broader set of hints to TopologyManger, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by DeviceManager
52+
Starting with Kubernetes 1.22, a new `CPUManager` flag has facilitated the use of `CPUManager` Policy options(#2625) which enable users to customize their behavior based on workload requirements without having to introduce an entirely new policy.
53+
These policy options work together to ensure an optimized cpu set is allocated for workloads running on a cluster.
54+
The two policy options that already exist are `full-pcpus-only`(#2625) and `distribute-cpus-across-numa` (#2902).
55+
With this KEP, a new `CPUManager` policy option is introduced which ensures that all CPUs on a socket are considered to be aligned.
56+
Thus, the `CPUManager` will send a broader set of hints to `TopologyManager`, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by `DeviceManager`.
5357

5458

5559
## Motivation
5660

57-
With the evolution of CPU architectures, the number of NUMA nodes per socket has increased. The devices managed by DeviceManager may not be uniformly distributed across all NUMA nodes. Thus there can be scenarios where perfect alignment between devices and CPU may not be possible. Latency sensitive applications desire resources to be aligned at least within the same socket if NUMA alignment is not possible for optimal performance. By default, CPUManager prefers CPU allocation which requires a minimum number of NUMA nodes. However if NUMA nodes selected for allocation are spread across sockets, it results in degraded performance. By ensuring the selected NUMA nodes to be socket aligned, predictable performance can be achieved. The best possible alignment of CPUs with other resources(viz. Which are managed by device Manager) is crucial to guarantee predictable performance for latency sensitive applications.
61+
With the evolution of CPU architectures, the number of NUMA nodes per socket has increased.
62+
The devices managed by `DeviceManager` may not be uniformly distributed across all NUMA nodes.
63+
Thus there can be scenarios where perfect alignment between devices and CPU may not be possible.
64+
Latency sensitive applications desire resources to be aligned at least within the same socket if NUMA alignment is not possible for optimal performance.
65+
By default, the `CPUManager` prefers CPU allocations which require a minimum number of NUMA nodes.
66+
However, if the NUMA nodes selected for allocation are spread across sockets, it results in degraded performance.
67+
By ensuring the selected NUMA nodes are socket aligned, predictable performance can be achieved.
68+
The best possible alignment of CPUs with other resources(viz. Which are managed by `DeviceManager`) is crucial to guarantee predictable performance for latency sensitive applications.
5869

5970
### Goals
60-
* Ensure CPUs are aligned at socket boundary which will result in latency sensitive applications and parallel algorithms to run more efficiently in predictable fashion by increasing the probability of hint selection in which NUMA nodes are socket aligned.
71+
* Ensure CPUs are aligned at socket boundary rather than NUMA node boundary.
6172

6273
### Non-Goals
6374
* Guarantee optimal NUMA allocation for cpu distribution.
6475

6576
## Proposal
6677

67-
We propose to add a new CPUManager policy option called align-by-socket to the static CPUManager policy. With this policy, the CPUManager will prefer those hints which are within the same socket (as opposed to just within the same NUMA node) if it is possible to have all CPUs allocated from the same socket.
78+
We propose to add a new `CPUManager` policy option called align-by-socket to the static `CPUManager` policy. With this policy, the `CPUManager` will prefer those hints which are within the same socket (as opposed to just within the same NUMA node) if it is possible to have all CPUs allocated from the same socket.
6879

6980
### Risks and Mitigations
7081

7182
The risks of adding this new feature are quite low.
72-
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyOptions` feature gate (which is disabled by default).
83+
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
7384

7485
| Risk | Impact | Mitigation |
7586
| -------------------------------------------------| -------| ---------- |
7687
| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. The workload will run but CPU allocations can spread across socket in cases when allocation could have been within same socket |
7788

7889
## Design Details
7990

80-
### Proposed Change
91+
### Proposed Change
8192

82-
When align-by-socket is enabled as a policy option, the CPUManager’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
93+
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
8394

8495
To achieve this, the following updates are needed to the GetTopologyHints() function:
8596
```
@@ -106,9 +117,16 @@ func (p *staticPolicy) generateCPUTopologyHints(availableCPUs cpuset.CPUSet, reu
106117
```
107118
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
108119

109-
In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.
120+
During CPU allocation, in function `allocatedCPUs()`, `alignedCPUs` will consist of CPUs which are socket aligned instead of all CPUs from NUMA nodes in `numaAffinity` hint when `align-by-socket` policy option is enabled.
121+
This will ensure that best effort to align CPUs by socket is made for alloction.
110122

111-
The policyOption align-by-socket can work in conjunction with TopologyManager “best-effort” and “restricted” policy without any conflict.
123+
In case `TopologyManager` `single-numa-node` policy is enabled, the policy option of `align-by-socket` is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of `align-by-socket` is enabled in conjunction with `TopologyManager` `single-numa-node` policy.
124+
125+
The policyOption `align-by-socket` can work in conjunction with `TopologyManager` `best-effort` and `restricted` policy without any conflict.
126+
Above policy options will work well for general case where number of NUMA nodes per socket are one or more.
127+
In rare cases like `DualNumaMultiSocketPerNumaHT` where one NUMA can span multiple socket, above option is not applicable.
128+
We will error out in cases when `align-by-socket` is enabled when underlying topology consist of multiple socket per NUMA.
129+
We may address such scenarios in future if there is a usecase for it in real world.
112130

113131
### Test Plan
114132

@@ -159,14 +177,14 @@ No changes needed
159177
###### Does enabling the feature change any default behavior?
160178

161179
No. In order to trigger any of the new logic, three things have to be true:
162-
1. The `CPUManagerPolicyOptions` feature gate must be enabled
180+
1. The `CPUManagerPolicyAlphaOptions` feature gate must be enabled
163181
1. The `static` `CPUManager` policy must be selected
164182
1. The new `align-by-socket` policy option must be selected
165183

166184
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
167185

168186
Yes, the feature can be disabled by either:
169-
1. Disabling the `CPUManagerPolicyOptions` feature gate
187+
1. Disabling the `CPUManagerPolicyAlphaOptions` feature gate
170188
1. Switching the `CPUManager` policy to `none`
171189
1. Removing `align-by-socket` from the list of `CPUManager` policy options
172190

@@ -250,3 +268,4 @@ No, the algorithm will run on a single `goroutine` with minimal memory requireme
250268
## Implementation History
251269

252270
- 2022-06-02: Initial KEP created
271+
- 2022-06-08: Addressed review comments on KEP

keps/sig-node/3327-align-by-socket/kep.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ milestone:
3232
# The following PRR answers are required at alpha release
3333
# List the feature gate name and the components for which it must be enabled
3434
feature-gates:
35-
- name: "CPUManagerPolicyExperimentalOptions"
35+
- name: "CPUManagerPolicyAlphaOptions"
3636
components:
3737
- kubelet
3838
disable-supported: true

0 commit comments

Comments
 (0)