keps/sig-node/3327-align-by-socket/README.md
Lines changed: 32 additions & 13 deletions
@@ -32,10 +32,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
32
32
-[ ] (R) Design details are appropriately documented
33
33
-[ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
34
34
-[ ] e2e Tests for all Beta API Operations (endpoints)
35
-
-[ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
35
+
-[ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
36
36
-[ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
37
37
-[ ] (R) Graduation criteria is in place
38
-
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
38
+
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
39
39
-[ ] (R) Production readiness review completed
40
40
-[ ] (R) Production readiness review approved
41
41
-[ ] "Implementation History" section is up-to-date for milestone
@@ -49,37 +49,48 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
49
49
50
50
## Summary
51
51
52
-
Starting with Kubernetes 1.22, a new CPUManager flag has facilitated the use of CPUManager Policy options(#2625) which enable users to customize their behavior based on workload requirements without having to introduce an entirely new policy. These policy options work together to ensure an optimized cpu set is allocated for workloads running on cluster. The two policy options that already exist are full-pcpus-only(#2625) and distribute-cpus-across-numa (#2902). With this KEP, new CPUManager policy option is introduced which ensures that all CPUs on a socket are considered to be aligned. Thus CPUManager will send a broader set of hints to TopologyManger, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by DeviceManager
52
+
Starting with Kubernetes 1.22, a new `CPUManager` flag has facilitated the use of `CPUManager` policy options (#2625), which enable users to customize its behavior based on workload requirements without having to introduce an entirely new policy.
53
+
These policy options work together to ensure an optimized CPU set is allocated for workloads running on a cluster.
54
+
The two policy options that already exist are `full-pcpus-only` (#2625) and `distribute-cpus-across-numa` (#2902).
55
+
With this KEP, a new `CPUManager` policy option is introduced which ensures that all CPUs on a socket are considered to be aligned.
56
+
Thus, the `CPUManager` will send a broader set of hints to `TopologyManager`, increasing the likelihood that the best hint is socket aligned with respect to CPU and other devices managed by `DeviceManager`.
53
57
54
58
55
59
## Motivation
56
60
57
-
With the evolution of CPU architectures, the number of NUMA nodes per socket has increased. The devices managed by DeviceManager may not be uniformly distributed across all NUMA nodes. Thus there can be scenarios where perfect alignment between devices and CPU may not be possible. Latency sensitive applications desire resources to be aligned at least within the same socket if NUMA alignment is not possible for optimal performance. By default, CPUManager prefers CPU allocation which requires a minimum number of NUMA nodes. However if NUMA nodes selected for allocation are spread across sockets, it results in degraded performance. By ensuring the selected NUMA nodes to be socket aligned, predictable performance can be achieved. The best possible alignment of CPUs with other resources(viz. Which are managed by device Manager) is crucial to guarantee predictable performance for latency sensitive applications.
61
+
With the evolution of CPU architectures, the number of NUMA nodes per socket has increased.
62
+
The devices managed by `DeviceManager` may not be uniformly distributed across all NUMA nodes.
63
+
Thus there can be scenarios where perfect alignment between devices and CPU may not be possible.
64
+
For optimal performance, latency sensitive applications need resources to be aligned at least within the same socket when NUMA alignment is not possible.
65
+
By default, the `CPUManager` prefers CPU allocations which require a minimum number of NUMA nodes.
66
+
However, if the NUMA nodes selected for allocation are spread across sockets, it results in degraded performance.
67
+
By ensuring the selected NUMA nodes are socket aligned, predictable performance can be achieved.
68
+
The best possible alignment of CPUs with other resources (namely, those managed by `DeviceManager`) is crucial to guarantee predictable performance for latency sensitive applications.
58
69
59
70
### Goals
60
-
* Ensure CPUs are aligned at socket boundary which will result in latency sensitive applications and parallel algorithms to run more efficiently in predictable fashion by increasing the probability of hint selection in which NUMA nodes are socket aligned.
71
+
* Ensure CPUs are aligned at the socket boundary rather than the NUMA node boundary.
61
72
62
73
### Non-Goals
63
74
* Guarantee optimal NUMA allocation for cpu distribution.
64
75
65
76
## Proposal
66
77
67
-
We propose to add a new CPUManager policy option called align-by-socket to the static CPUManager policy. With this policy, the CPUManager will prefer those hints which are within the same socket (as opposed to just within the same NUMA node) if it is possible to have all CPUs allocated from the same socket.
78
+
We propose to add a new `CPUManager` policy option called align-by-socket to the static `CPUManager` policy. With this policy, the `CPUManager` will prefer those hints which are within the same socket (as opposed to just within the same NUMA node) if it is possible to have all CPUs allocated from the same socket.
68
79
69
80
### Risks and Mitigations
70
81
71
82
The risks of adding this new feature are quite low.
72
-
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyOptions` feature gate (which is disabled by default).
83
+
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. The workload will run, but CPU allocations can spread across sockets in cases where the allocation could have been within the same socket |
77
88
78
89
## Design Details
79
90
80
-
### Proposed Change
91
+
### Proposed Change
81
92
82
-
When align-by-socket is enabled as a policy option, the CPUManager’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
93
+
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
83
94
84
95
To achieve this, the following updates are needed to the GetTopologyHints() function:
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
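
To illustrate what "aligned by socket" means for a single hint, here is a minimal sketch in Go. It is not the kubelet implementation: the `numaToSocket` type and the `isSocketAligned` helper are hypothetical names introduced only for this example, and the topology is assumed to be available as a simple map from NUMA node ID to socket ID.

```go
package main

// numaToSocket is a hypothetical lookup table (not a kubelet type) that maps
// a NUMA node ID to the ID of the socket hosting it.
type numaToSocket map[int]int

// isSocketAligned reports whether every NUMA node in a hint belongs to the
// same socket. Empty hints and hints naming unknown NUMA nodes are treated
// as not aligned.
func isSocketAligned(topo numaToSocket, numaNodes []int) bool {
	if len(numaNodes) == 0 {
		return false
	}
	first, ok := topo[numaNodes[0]]
	if !ok {
		return false
	}
	for _, n := range numaNodes[1:] {
		socket, known := topo[n]
		if !known || socket != first {
			return false
		}
	}
	return true
}
```

With a topology of two sockets and two NUMA nodes per socket (`numaToSocket{0: 0, 1: 0, 2: 1, 3: 1}`), a hint covering NUMA nodes 0 and 1 would count as socket aligned, while a hint covering NUMA nodes 1 and 2 would not.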
108
119
109
-
In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.
120
+
During CPU allocation, in function `allocatedCPUs()`, when the `align-by-socket` policy option is enabled, `alignedCPUs` will consist of the CPUs which are socket aligned instead of all CPUs from the NUMA nodes in the `numaAffinity` hint.
121
+
This will ensure that a best effort is made to align CPUs by socket during allocation.
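
The widening of `alignedCPUs` described above can be sketched as follows. This is an illustration under stated assumptions, not the kubelet code: `cpuInfo` and `socketAlignedCPUs` are hypothetical names, and the topology is assumed to be a plain map from CPU ID to its NUMA node and socket.

```go
package main

import "sort"

// cpuInfo is a hypothetical per-CPU record (not a kubelet type).
type cpuInfo struct {
	NUMANode int
	Socket   int
}

// socketAlignedCPUs returns the sorted IDs of all CPUs that sit on any socket
// hosting at least one NUMA node from numaAffinity. The candidate set is
// therefore widened from "CPUs on the hinted NUMA nodes" to "CPUs on the
// sockets of the hinted NUMA nodes".
func socketAlignedCPUs(cpus map[int]cpuInfo, numaAffinity []int) []int {
	wantedNUMA := make(map[int]bool)
	for _, n := range numaAffinity {
		wantedNUMA[n] = true
	}
	// Collect the sockets that host the hinted NUMA nodes.
	sockets := make(map[int]bool)
	for _, info := range cpus {
		if wantedNUMA[info.NUMANode] {
			sockets[info.Socket] = true
		}
	}
	// Gather every CPU on those sockets.
	var aligned []int
	for id, info := range cpus {
		if sockets[info.Socket] {
			aligned = append(aligned, id)
		}
	}
	sort.Ints(aligned)
	return aligned
}
```

For example, if socket 0 hosts NUMA nodes 0 and 1 and the hint names only NUMA node 0, the sketch returns the CPUs of both NUMA nodes 0 and 1, since they share a socket.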
110
122
111
-
The policyOption align-by-socket can work in conjunction with TopologyManager “best-effort” and “restricted” policy without any conflict.
123
+
In case the `TopologyManager` `single-numa-node` policy is enabled, the `align-by-socket` policy option is redundant, since allocations guaranteed to be within the same NUMA node are by definition socket aligned. Hence, we will error out in case the `align-by-socket` policy option is enabled in conjunction with the `TopologyManager` `single-numa-node` policy.
124
+
125
+
The policy option `align-by-socket` can work in conjunction with the `TopologyManager` `best-effort` and `restricted` policies without any conflict.
126
+
The above policy options will work well for the general case, where the number of NUMA nodes per socket is one or more.
127
+
In rare cases like `DualNumaMultiSocketPerNumaHT`, where one NUMA node can span multiple sockets, the above option is not applicable.
128
+
We will error out when `align-by-socket` is enabled and the underlying topology consists of multiple sockets per NUMA node.
129
+
We may address such scenarios in the future if there is a real-world use case for them.
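
The two error conditions above could be checked along the following lines. This is a hedged sketch rather than the actual kubelet validation: `topologyInfo` and `validateAlignBySocket` are hypothetical names, the `TopologyManager` policy is passed as a plain string, and "multiple sockets per NUMA node" is approximated by comparing the socket count with the NUMA node count, which is a simplifying assumption.

```go
package main

import "fmt"

// topologyInfo is a hypothetical summary of the machine topology
// (not a kubelet type).
type topologyInfo struct {
	NumSockets   int
	NumNUMANodes int
}

// validateAlignBySocket returns an error for the two configurations in which
// align-by-socket is rejected: the single-numa-node TopologyManager policy,
// and topologies where a NUMA node spans more than one socket.
func validateAlignBySocket(alignBySocket bool, topologyManagerPolicy string, topo topologyInfo) error {
	if !alignBySocket {
		return nil
	}
	if topologyManagerPolicy == "single-numa-node" {
		return fmt.Errorf("align-by-socket cannot be combined with the %q TopologyManager policy", topologyManagerPolicy)
	}
	// Assumption: more sockets than NUMA nodes implies that at least one
	// NUMA node spans several sockets.
	if topo.NumSockets > topo.NumNUMANodes {
		return fmt.Errorf("align-by-socket requires at least one NUMA node per socket, got %d sockets and %d NUMA nodes", topo.NumSockets, topo.NumNUMANodes)
	}
	return nil
}
```

Under these assumptions, `best-effort` and `restricted` pass validation on a machine with 2 sockets and 4 NUMA nodes, while `single-numa-node` or a machine with 4 sockets and 2 NUMA nodes produces an error.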
112
130
113
131
### Test Plan
114
132
@@ -159,14 +177,14 @@ No changes needed
159
177
###### Does enabling the feature change any default behavior?
160
178
161
179
No. In order to trigger any of the new logic, three things have to be true:
162
-
1. The `CPUManagerPolicyOptions` feature gate must be enabled
180
+
1. The `CPUManagerPolicyAlphaOptions` feature gate must be enabled
163
181
1. The `static` `CPUManager` policy must be selected
164
182
1. The new `align-by-socket` policy option must be selected
165
183
166
184
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
167
185
168
186
Yes, the feature can be disabled by either:
169
-
1. Disabling the `CPUManagerPolicyOptions` feature gate
187
+
1. Disabling the `CPUManagerPolicyAlphaOptions` feature gate
170
188
1. Switching the `CPUManager` policy to `none`
171
189
1. Removing `align-by-socket` from the list of `CPUManager` policy options
172
190
@@ -250,3 +268,4 @@ No, the algorithm will run on a single `goroutine` with minimal memory requireme