Commit 06d6d0d

Merge pull request #5117 from swatisehgal/distribute-cpus-across-numa-to-beta
KEP-2902: Promote CPUManager policy option to distribute CPUs across NUMA nodes to Beta
2 parents 99716a5 + 8d095f4 commit 06d6d0d

3 files changed

+132
-10
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
11
kep-number: 2902
22
alpha:
33
approver: "@johnbelamaric"
4+
beta:
5+
approver: "@johnbelamaric"

keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/README.md

Lines changed: 114 additions & 4 deletions
@@ -9,7 +9,12 @@
99
- [Proposal](#proposal)
1010
- [Risks and Mitigations](#risks-and-mitigations)
1111
- [Design Details](#design-details)
12+
- [Compatibility with <code>full-pcpus-only</code> policy options](#compatibility-with-full-pcpus-only-policy-options)
1213
- [Test Plan](#test-plan)
14+
- [Prerequisite testing updates](#prerequisite-testing-updates)
15+
- [Unit tests](#unit-tests)
16+
- [Integration tests](#integration-tests)
17+
- [e2e tests](#e2e-tests)
1318
- [Graduation Criteria](#graduation-criteria)
1419
- [Alpha](#alpha)
1520
- [Beta](#beta)
@@ -18,8 +23,11 @@
1823
- [Version Skew Strategy](#version-skew-strategy)
1924
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
2025
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
26+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
2127
- [Monitoring Requirements](#monitoring-requirements)
28+
- [Dependencies](#dependencies)
2229
- [Scalability](#scalability)
30+
- [Troubleshooting](#troubleshooting)
2331
- [Implementation History](#implementation-history)
2432
<!-- /toc -->
2533

@@ -75,7 +83,7 @@ When enabled, this will trigger the `CPUManager` to evenly distribute CPUs acros
7583
### Risks and Mitigations
7684

7785
The risks of adding this new feature are quite low.
78-
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
86+
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyBetaOptions` feature gate (which is disabled by default).
7987

8088
| Risk | Impact | Mitigation |
8189
| -------------------------------------------------| -------| ---------- |
@@ -112,10 +120,39 @@ If none of the above conditions can be met, resort back to a best-effort fit of
112120

113121
NOTE: The striping operation after all CPUs have been evenly distributed will be performed such that the overall distribution of CPUs across those NUMA nodes remains as balanced as possible.
114122

123+
### Compatibility with `full-pcpus-only` policy options
124+
125+
| Compatibility | alpha | beta | GA |
126+
| --- | --- | --- | --- |
127+
| full-pcpus-only | x | x | x |
128+
115129
### Test Plan
116130

117131
We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.
118132

133+
[x] I/we understand the owners of the involved components may require updates to
134+
existing tests to make this code solid enough prior to committing the changes necessary
135+
to implement this enhancement.
136+
137+
##### Prerequisite testing updates
138+
139+
##### Unit tests
140+
141+
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20250205` - 85.5% of statements
142+
143+
##### Integration tests
144+
145+
Not Applicable as Kubelet features don't have integration tests. We use a mix of `e2e_node` and `e2e` tests.
146+
147+
##### e2e tests
148+
149+
Currently no e2e tests are present for this particular policy option. E2E tests will be added as part of Beta graduation.
150+
151+
The plan is to add e2e tests to cover the basic flows for cases below:
152+
1. `distribute-cpus-across-numa` option is enabled: The test will ensure that the allocated CPUs are distributed across NUMA nodes according to the policy.
153+
1. `distribute-cpus-across-numa` option is disabled: The test will verify that the allocated CPUs are packed according to the default behavior.
154+
1. Test how this option interacts with `full-pcpus-only` policy option (and test for it enabled and disabled).
155+
119156
### Graduation Criteria
120157

121158
#### Alpha
@@ -149,7 +186,9 @@ No changes needed
149186
###### How can this feature be enabled / disabled in a live cluster?
150187

151188
- [X] Feature gate (also fill in values in `kep.yaml`)
189+
- Feature gate name: `CPUManagerPolicyOptions`
152190
- Feature gate name: `CPUManagerPolicyAlphaOptions`
191+
- Feature gate name: `CPUManagerPolicyBetaOptions`
153192
- Components depending on the feature gate: `kubelet`
154193
- [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `distribute-cpus-across-numa`
155194
- Will enabling / disabling the feature require downtime of the control
@@ -161,14 +200,14 @@ No changes needed
161200
###### Does enabling the feature change any default behavior?
162201

163202
No. In order to trigger any of the new logic, three things have to be true:
164-
1. The `CPUManagerPolicyAlphaOptions` feature gate must be enabled
203+
1. The `CPUManagerPolicyBetaOptions` feature gate must be enabled
165204
1. The `static` `CPUManager` policy must be selected
166205
1. The new `distribute-cpus-across-numa` policy option must be selected
167206

168207
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
169208

170209
Yes, the feature can be disabled by either:
171-
1. Disabling the `CPUManagerPolicyAlphaOptions` feature gate
210+
1. Disabling the `CPUManagerPolicyBetaOptions` feature gate
172211
1. Switching the `CPUManager` policy to `none`
173212
1. Removing `distribute-cpus-across-numa` from the list of `CPUManager` policy options
174213

@@ -182,12 +221,34 @@ No changes. Existing container will not see their allocation changed. New contai
182221

183222
- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)
184223

224+
### Rollout, Upgrade and Rollback Planning
225+
226+
###### How can a rollout or rollback fail? Can it impact already running workloads?
227+
228+
- A rollout or rollback can fail if the feature gate and the policy option are not configured properly and kubelet fails to start.
229+
230+
###### What specific metrics should inform a rollback?
231+
232+
As part of the graduation of this feature, we plan to add the metric `cpu_manager_numa_allocation_spread` to see how the CPUs are distributed across NUMA nodes.
233+
It shows how CPUs are distributed across NUMA nodes and can indicate whether a rollback is needed.
234+
235+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
236+
237+
Not Applicable. This policy option only affects pods that meet certain conditions and are scheduled after the upgrade. Running pods will be unaffected
238+
by any change.
239+
240+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
241+
242+
No
243+
185244
### Monitoring Requirements
186245

187246
###### How can an operator determine if the feature is in use by workloads?
188247

189248
Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.
190249

250+
In addition to that, we can check the metric `cpu_manager_numa_allocation_spread` to determine how allocated CPUs are spread across NUMA nodes.
251+
191252
###### How can someone using this feature know that it is working for their instance?
192253

193254
In order to verify this feature is working, one should:
@@ -201,6 +262,31 @@ To verify the list of CPUs allocated to the container, one can either:
201262
- `exec` into the container and run `taskset -cp 1` (assuming this command is available in the container).
202263
- Call the `GetCPUs()` method of the `CPUsProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider).
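
Once the list of allocated CPUs is known, it can be checked against the node's NUMA topology. The following is a minimal sketch, not part of the KEP: the helper names `parse_cpu_list` and `count_per_numa_node` and the two-node topology are hypothetical. It assumes CPU lists use the Linux cpulist format (for example `0-3,8-11`), which is the format of the list portion of the `taskset -cp` output and of `/sys/devices/system/node/node<N>/cpulist`.

```python
def parse_cpu_list(text):
    """Expand a Linux cpulist string such as "0-3,8,10-11" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def count_per_numa_node(allocated, numa_cpulists):
    """Count how many of the allocated CPUs belong to each NUMA node."""
    allocated = set(allocated)
    return {
        node: len(allocated & set(parse_cpu_list(cpulist)))
        for node, cpulist in numa_cpulists.items()
    }


# Hypothetical topology: 2 NUMA nodes with 8 CPUs each.
topology = {0: "0-7", 1: "8-15"}

# Hypothetical affinity list reported for the container.
print(count_per_numa_node(parse_cpu_list("0-3,8-11"), topology))
```

An even count per NUMA node suggests the allocation was distributed. A count concentrated on one node suggests it was packed.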
203264

265+
Also, we can check the `cpu_manager_numa_allocation_spread` metric. We plan to add this metric to track how CPUs are distributed across NUMA zones
266+
with labels/buckets representing NUMA nodes (numa_node=0, numa_node=1, ..., numa_node=N).
267+
268+
With packed allocation (default, option off), the distribution should mostly be in numa_node=1, with a small tail to numa_node=2 (and possibly higher)
269+
in cases of severe fragmentation. Users can compare this spread metric with the `container_aligned_compute_resources_count` metric to determine
270+
if they are getting aligned packed allocation or just packed allocation due to implementation details.
271+
272+
For example, if a node has 2 NUMA nodes and a pod requests 8 CPUs (with no other pods requesting exclusive CPUs on the node), the metric would look like this:
273+
274+
cpu_manager_numa_allocation_spread{numa_node="0"} = 8
275+
cpu_manager_numa_allocation_spread{numa_node="1"} = 0
276+
277+
278+
When the option is enabled, we would expect a more even distribution of CPUs across NUMA nodes, with no sharp peaks as seen with packed allocation.
279+
Users can also check the `container_aligned_compute_resources_count` metric to assess resource alignment and system behavior.
280+
281+
In this case, the metric would show:
282+
cpu_manager_numa_allocation_spread{numa_node="0"} = 4
283+
cpu_manager_numa_allocation_spread{numa_node="1"} = 4
284+
285+
286+
Note: This example is simplified to clearly highlight the difference between the two cases. Existing pods may slightly skew the counts, but the general
287+
trend of peaks and troughs will still provide a good indication of CPU distribution across NUMA nodes, allowing users to determine if the policy option
288+
is enabled or not.
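
The two metric snapshots above can be reproduced with a toy model. This is only a sketch under simplifying assumptions and is not the kubelet's allocation algorithm: the functions `packed`, `distributed`, and `render` are hypothetical, the node is assumed to have 2 NUMA nodes with 8 free CPUs each, and the distributed case splits the request evenly across all NUMA nodes, falling back to packing when an even split does not fit.

```python
def packed(request, free):
    """Fill NUMA nodes in order until the request is satisfied."""
    alloc = {node: 0 for node in free}
    remaining = request
    for node in sorted(free):
        take = min(remaining, free[node])
        alloc[node] = take
        remaining -= take
    if remaining:
        raise ValueError("not enough free CPUs")
    return alloc


def distributed(request, free):
    """Split the request evenly across all NUMA nodes, else fall back to packing."""
    nodes = sorted(free)
    share, extra = divmod(request, len(nodes))
    alloc = {node: share + (1 if i < extra else 0) for i, node in enumerate(nodes)}
    if all(alloc[node] <= free[node] for node in nodes):
        return alloc
    # Best-effort fallback when an even split is not possible.
    return packed(request, free)


def render(alloc):
    """Format per-node counts in the style of the planned metric."""
    return [
        f'cpu_manager_numa_allocation_spread{{numa_node="{node}"}} = {count}'
        for node, count in sorted(alloc.items())
    ]


free_cpus = {0: 8, 1: 8}  # hypothetical: 8 free CPUs on each NUMA node
print("\n".join(render(packed(8, free_cpus))))
print("\n".join(render(distributed(8, free_cpus))))
```

If all free CPUs sit on one NUMA node, for example `{0: 8, 1: 0}`, the toy `distributed` function returns the same result as `packed`, which mirrors the best-effort fallback described in this KEP.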
289+
204290
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
205291

206292
There are no specific SLOs for this feature.
@@ -212,13 +298,20 @@ None
212298

213299
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
214300

215-
None
301+
Yes, as part of the graduation of this feature to Beta, we plan to add the `cpu_manager_numa_allocation_spread` metric
302+
to provide data on how the CPUs are distributed across NUMA nodes.
216303

217304
###### Does this feature depend on any specific services running in the cluster?
218305

219306
This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field.
220307
This has been available since `v1alpha2`.
221308

309+
### Dependencies
310+
311+
###### Does this feature depend on any specific services running in the cluster?
312+
313+
No
314+
222315
### Scalability
223316

224317
###### Will enabling / using this feature result in any new API calls?
@@ -249,9 +342,26 @@ This delay should be minimal.
249342

250343
No, the algorithm will run on a single `goroutine` with minimal memory requirements.
251344

345+
### Troubleshooting
346+
347+
###### How does this feature react if the API server and/or etcd is unavailable?
348+
349+
No impact. The behavior of the feature does not change when API Server and/or etcd is unavailable since the feature is node local.
350+
351+
###### What are other known failure modes?
352+
353+
Because of the existing distribution of allocated CPUs across NUMA nodes, a distributed allocation might not be possible, e.g. if all available CPUs are present
354+
on the same NUMA node.
355+
356+
In that case we resort back to a best-effort fit of packing CPUs into NUMA nodes wherever they can fit.
357+
358+
###### What steps should be taken if SLOs are not being met to determine the problem?
359+
252360
## Implementation History
253361

254362
- 2021-08-26: Initial KEP created
255363
- 2021-08-30: Updates to fill out more sections, answer PRR questions
256364
- 2021-09-08: Change feature gate from `CPUManagerPolicyOptions` to `CPUManagerPolicyExperimentalOptions`
257365
- 2021-10-11: Change feature gate from `CPUManagerPolicyExperimentalOptions` to `CPUManagerPolicyAlphaOptions`
366+
- 2025-01-30: KEP update for Beta graduation of the policy option
367+
- 2025-02-05: KEP update to the latest template

keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/kep.yaml

Lines changed: 16 additions & 6 deletions
@@ -2,39 +2,49 @@ title: CPUManager Policy Option to Distribute CPUs Across NUMA Nodes Instead of
22
kep-number: 2902
33
authors:
44
- "@klueska"
5+
- "@swatisehgal" # For Beta graduation
56
owning-sig: sig-node
67
participating-sigs: []
78
status: implementable
89
creation-date: "2021-08-26"
10+
last-updated: "2025-01-31"
911
reviewers:
10-
- "@fromani"
12+
- "@ffromani"
1113
approvers:
1214
- "@sig-node-tech-leads"
1315
see-also:
1416
- "keps/sig-node/2625-cpumanager-policies-thread-placement"
17+
- "keps/sig-node/3545-improved-multi-numa-alignment/"
18+
- "keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/"
19+
- "keps/sig-node/4540-strict-cpu-reservation"
20+
- "keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/"
21+
- "keps/sig-node/4800-cpumanager-split-uncorecache/"
1522
replaces: []
1623

1724
# The target maturity stage in the current dev cycle for this KEP.
18-
stage: alpha
25+
stage: beta
1926

2027
# The most recent milestone for which work toward delivery of this KEP has been
2128
# done. This can be the current (upcoming) milestone, if it is being actively
2229
# worked on.
23-
latest-milestone: "v1.23"
30+
latest-milestone: "v1.33"
2431

2532
# The milestone at which this feature was, or is targeted to be, at each stage.
2633
milestone:
2734
alpha: "v1.23"
28-
beta: "v1.24"
29-
stable: "v1.25"
35+
beta: "v1.33"
36+
stable: "v1.35"
3037

3138
# The following PRR answers are required at alpha release
3239
# List the feature gate name and the components for which it must be enabled
3340
feature-gates:
41+
- name: "CPUManagerPolicyOptions"
3442
- name: "CPUManagerPolicyAlphaOptions"
43+
- name: "CPUManagerPolicyBetaOptions"
3544
components:
3645
- kubelet
3746
disable-supported: true
3847

3948
# The following PRR answers are required at beta release
40-
metrics: []
49+
metrics:
50+
- cpu_manager_numa_allocation_spread
