You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -53,8 +63,7 @@ Starting with Kubernetes 1.22, a new `CPUManager` flag has facilitated the use o
53
63
These policy options work together to ensure an optimized cpu set is allocated for workloads running on a cluster.
54
64
The two policy options that already exist are `full-pcpus-only`(#2625) and `distribute-cpus-across-numa` (#2902).
55
65
With this KEP, a new `CPUManager` policy option is introduced which ensures that all CPUs on a socket are considered to be aligned.
56
-
Thus, the `CPUManager` will send a broader set of hints to `TopologyManager`, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by `DeviceManager`.
57
-
66
+
Thus, the `CPUManager` will send a broader set of preferred hints to `TopologyManager`, enabling the increased likelihood of the best hint to be socket aligned with respect to CPU and other devices managed by `DeviceManager`.
58
67
59
68
## Motivation
60
69
@@ -75,7 +84,8 @@ The best possible alignment of CPUs with other resources(viz. Which are managed
75
84
76
85
## Proposal
77
86
78
-
We propose to add a new `CPUManager` policy option called align-by-socket to the static `CPUManager` policy. With this policy, the `CPUManager` will prefer those hints which are within the same socket (as opposed to just within the same NUMA node) if it is possible to have all CPUs allocated from the same socket.
87
+
We propose to add a new `CPUManager` policy option called `align-by-socket` to the `static` policy of `CPUManager`.
88
+
With this policy option, the `CPUManager` will prefer those hints in which all CPUs are within same socket in addition to exisiting hints which require minimum NUMA nodes. With this policy option CPUs will be considered aligned at socket boundary instead of NUMA boundary during allocation. Thus if best hint consist of NUMA nodes within one socket, `CPUManager` may try to assign available CPUs from all NUMA nodes of socket.
79
89
80
90
### Risks and Mitigations
81
91
@@ -90,47 +100,148 @@ It is isolated to a specific policy option within the `CPUManager`, and is prote
90
100
91
101
### Proposed Change
92
102
93
-
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s GetTopologyHints() function will generate hints based on the sockets that a group of CPUs belong to, rather than the NUMA nodes they belong to.
103
+
When `align-by-socket` is enabled as a policy option, the `CPUManager`’s function `GetTopologyHints`will prefer hints which are socket aligned in addition to hints which require minimum number of NUMA nodes.
94
104
95
-
To achieve this, the following updates are needed to the GetTopologyHints() function:
96
-
```
105
+
To achieve this, the following updates are needed to the `generateCPUTopologyHints` function of `static` policy of `CPUManager`:
// Loop back through all hints and update the 'Preferred' field based on
101
-
// counting the number of bits sets in the affinity mask and comparing it
102
-
// to the minAffinitySize. Only those with an equal number of bits set (and
103
-
// with a minimal set of numa nodes) will be considered preferred.
104
-
for i := range hints {
105
-
if p.options.AlignBySocket && isSocketAligned(hints[i].NUMANodeAffinity) {
106
-
hints[i].Preferred = true
107
-
continue
108
-
}
109
-
if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
110
-
hints[i].Preferred = true
111
-
}
112
-
}
111
+
// Loop back through all hints and update the 'Preferred' field based on
112
+
// counting the number of bits sets in the affinity mask and comparing it
113
+
// to the minAffinitySize. Those with an equal number of bits set (and
114
+
// with a minimal set of numa nodes) will be considered preferred.
115
+
// If align-by-socket policy option is enabled, socket aligned hints are
116
+
// also considered preferred.
117
+
fori:=range hints {
118
+
if p.options.AlignBySocket && isSocketAligned(hints[i].NUMANodeAffinity) {
119
+
hints[i].Preferred = true
120
+
continue
121
+
}
122
+
if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
123
+
hints[i].Preferred = true
124
+
}
125
+
}
113
126
114
127
return hints
115
128
}
116
-
117
129
```
130
+
118
131
At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
119
132
120
-
During CPU allocation, in function `allocatedCPUs()`, `alignedCPUs` will consist of CPUs which are socket aligned instead of all CPUs from NUMA nodes in `numaAffinity` hint when `align-by-socket` policy option is enabled.
121
-
This will ensure that best effort to align CPUs by socket is made for alloction.
133
+
During CPU allocation, in function `allocatedCPUs()`, `alignedCPUs` will consist of CPUs which are socket aligned instead of CPUs from NUMA nodes in `numaAffinity` hint when `align-by-socket` policy option is enabled.
122
134
123
-
In case `TopologyManager``single-numa-node` policy is enabled, the policy option of `align-by-socket` is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of `align-by-socket` is enabled in conjunction with `TopologyManager``single-numa-node` policy.
This will ensure that for purpose of allocation, CPUs are considered aligned at socket boundary rather than NUMA boundary
124
156
125
-
The policyOption `align-by-socket` can work in conjunction with `TopologyManager``best-effort` and `restricted` policy without any conflict.
126
-
Above policy options will work well for general case where number of NUMA nodes per socket are one or more.
157
+
`align-by-socket` policy options will work well for general case where number of NUMA nodes per socket are one or more.
127
158
In rare cases like `DualNumaMultiSocketPerNumaHT` where one NUMA can span multiple socket, above option is not applicable.
128
-
We will error out in cases when `align-by-socket` is enabled when underlying topology consist of multiple socket per NUMA.
159
+
We will error out in cases when `align-by-socket` is enabled and underlying topology consist of multiple socket per NUMA.
129
160
We may address such scenarios in future if there is a usecase for it in real world.
130
161
162
+
In cases where number of NUMA nodes per socket is one or more as well as `TopologyManager``single-numa-node` policy is enabled, the policy option of `align-by-socket` is redundant since allocation guarantees within the same NUMA are by definition socket aligned.
163
+
Hence, we will error out in case the policy option of `align-by-socket` is enabled along with `TopologyManager``single-numa-node` policy.
164
+
165
+
The policyOption `align-by-socket` can work in conjunction with `TopologyManager``best-effort` and `restricted` policy without any conflict.
166
+
131
167
### Test Plan
132
168
133
-
We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.
169
+
<!--
170
+
**Note:** *Not required until targeted at a release.*
171
+
The goal is to ensure that we don't accept enhancements with inadequate testing.
172
+
173
+
All code is expected to have adequate tests (eventually with coverage
174
+
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
We expect no non-infra related flakes in the last month as a GA graduation criteria.
241
+
-->
242
+
- These cases will be added in the existing e2e tests:
243
+
- Feature gate enable/disable tests
244
+
- `align-by-socket` policy option works as expected.
134
245
135
246
### Graduation Criteria
136
247
@@ -198,6 +309,45 @@ No changes. Existing container will not see their allocation changed. New contai
198
309
199
310
- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)
200
311
312
+
### Rollout, Upgrade and Rollback Planning
313
+
314
+
<!--
315
+
This section must be completed when targeting beta to a release.
316
+
-->
317
+
318
+
###### How can a rollout or rollback fail? Can it impact already running workloads?
319
+
320
+
<!--
321
+
Try to be as paranoid as possible - e.g., what if some components will restart
322
+
mid-rollout?
323
+
324
+
Be sure to consider highly-available clusters, where, for example,
325
+
feature flags will be enabled on some API servers and not others during the
326
+
rollout. Similarly, consider large clusters and how enablement/disablement
327
+
will rollout across nodes.
328
+
-->
329
+
330
+
###### What specific metrics should inform a rollback?
331
+
332
+
<!--
333
+
What signals should users be paying attention to when the feature is young
334
+
that might indicate a serious problem?
335
+
-->
336
+
337
+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
338
+
339
+
<!--
340
+
Describe manual testing that was done and the outcomes.
341
+
Longer term, we may want to require automated upgrade/rollback tests, but we
342
+
are missing a bunch of machinery and tooling and can't do that now.
343
+
-->
344
+
345
+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
346
+
347
+
<!--
348
+
Even if applying deprecation policies, they may still surprise some users.
349
+
-->
350
+
201
351
### Monitoring Requirements
202
352
203
353
###### How can an operator determine if the feature is in use by workloads?
@@ -235,6 +385,29 @@ None
235
385
This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field.
236
386
This has been available since `v1alpha2`.
237
387
388
+
### Dependencies
389
+
390
+
<!--
391
+
This section must be completed when targeting beta to a release.
392
+
-->
393
+
394
+
###### Does this feature depend on any specific services running in the cluster?
395
+
396
+
<!--
397
+
Think about both cluster-level services (e.g. metrics-server) as well
398
+
as node-level agents (e.g. specific version of CRI). Focus on external or
399
+
optional services that are needed. For example, if this feature depends on
400
+
a cloud provider API, or upon an external software-defined storage or network
401
+
control plane.
402
+
403
+
For each of these, fill in the following—thinking about running existing user workloads
404
+
and creating new ones, as well as about cluster-level services (e.g. DNS):
405
+
- [Dependency name]
406
+
- Usage description:
407
+
- Impact of its outage on the feature:
408
+
- Impact of its degraded performance or high-error rates on the feature:
409
+
-->
410
+
238
411
### Scalability
239
412
240
413
###### Will enabling / using this feature result in any new API calls?
@@ -265,7 +438,56 @@ This delay should be minimal.
265
438
266
439
No, the algorithm will run on a single `goroutine` with minimal memory requirements.
267
440
441
+
### Troubleshooting
442
+
443
+
<!--
444
+
This section must be completed when targeting beta to a release.
445
+
446
+
For GA, this section is required: approvers should be able to confirm the
447
+
previous answers based on experience in the field.
448
+
449
+
The Troubleshooting section currently serves the `Playbook` role. We may consider
450
+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
451
+
details). For now, we leave it here.
452
+
-->
453
+
454
+
###### How does this feature react if the API server and/or etcd is unavailable?
455
+
456
+
###### What are other known failure modes?
457
+
458
+
<!--
459
+
For each of them, fill in the following information by copying the below template:
460
+
- [Failure mode brief description]
461
+
- Detection: How can it be detected via metrics? Stated another way:
462
+
how can an operator troubleshoot without logging into a master or worker node?
463
+
- Mitigations: What can be done to stop the bleeding, especially for already
464
+
running user workloads?
465
+
- Diagnostics: What are the useful log messages and their required logging
466
+
levels that could help debug the issue?
467
+
Not required until feature graduated to beta.
468
+
- Testing: Are there any tests for failure mode? If not, describe why.
469
+
-->
470
+
471
+
###### What steps should be taken if SLOs are not being met to determine the problem?
472
+
268
473
## Implementation History
269
474
270
475
- 2022-06-02: Initial KEP created
271
476
- 2022-06-08: Addressed review comments on KEP
477
+
478
+
## Drawbacks
479
+
480
+
<!--
481
+
Why should this KEP _not_ be implemented?
482
+
-->
483
+
`align-by-socket` policy option when enabled, it might result CPU allocation which may not be perfectly aligned by NUMA with other resources managed by `DeviceManager`.
484
+
485
+
## Alternatives
486
+
487
+
<!--
488
+
What other approaches did you consider, and why did you rule them out? These do
489
+
not need to be as detailed as the proposal, but should include enough
490
+
information to express the idea and why it was not acceptable.
491
+
-->
492
+
`align-by-socket` can alternatively be introduced as policy of `TopologyManager` which can choose hints from hint provider which are socket aligned.
493
+
However, it is concluded that fuction of `TopologyManager` is to act as arbitrator for hints provider and should make decisions based on hints provided rather than influencing the `BestHint` based on its own policy.
0 commit comments