### Risks and Mitigations
The risks of adding this new feature are quite low.
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself and by the `CPUManagerPolicyBetaOptions` feature gate (which is disabled by default).

If none of the above conditions can be met, resort back to a best-effort fit of packing CPUs into NUMA nodes wherever they can fit.

NOTE: The striping operation after all CPUs have been evenly distributed will be performed such that the overall distribution of CPUs across those NUMA nodes remains as balanced as possible.
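
To make this concrete, here is a minimal illustrative sketch (not the actual kubelet implementation) of the even-distribution-plus-striping idea; the function name and shape are hypothetical:

```go
package main

import "fmt"

// distributeAcrossNUMA gives every candidate NUMA node the same base number
// of CPUs, then stripes any remainder one CPU at a time so that per-node
// counts never differ by more than one.
func distributeAcrossNUMA(numCPUs int, numaNodes []int) map[int]int {
	cpusPerNode := make(map[int]int, len(numaNodes))
	base := numCPUs / len(numaNodes)
	remainder := numCPUs % len(numaNodes)
	for _, node := range numaNodes {
		cpusPerNode[node] = base
	}
	// Stripe the remainder so the overall distribution stays balanced.
	for i := 0; i < remainder; i++ {
		cpusPerNode[numaNodes[i]]++
	}
	return cpusPerNode
}

func main() {
	// 10 CPUs across 4 NUMA nodes -> map[0:3 1:3 2:2 3:2]
	fmt.Println(distributeAcrossNUMA(10, []int{0, 1, 2, 3}))
}
```
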
### Compatibility with the `full-pcpus-only` policy option

| Compatibility | alpha | beta | GA |
| --- | --- | --- | --- |
| full-pcpus-only | x | x | x |

### Test Plan
We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates
##### Unit tests

- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20250205` - 85.5% of statements

##### Integration tests
Not Applicable as Kubelet features don't have integration tests. We use a mix of `e2e_node` and `e2e` tests.
##### e2e tests
Currently no e2e tests are present for this particular policy option. E2E tests will be added as part of Beta graduation.

The plan is to add e2e tests to cover the basic flows for the cases below (a sketch of the core balance check follows the list):

1. `distribute-cpus-across-numa` option is enabled: The test will ensure that the allocated CPUs are distributed across NUMA nodes according to the policy.
1. `distribute-cpus-across-numa` option is disabled: The test will verify that the allocated CPUs are packed according to the default behavior.
1. Test how this option interacts with the `full-pcpus-only` policy option (testing with it both enabled and disabled).
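
The core balance assertion such tests could perform is sketched below (helper names are hypothetical; the real tests will be built on the e2e framework, and balance is only expected across the NUMA nodes an allocation is meant to span):

```go
package main

import "fmt"

// numaSpread counts how many of a container's allocated CPUs landed on each
// NUMA node, given a cpu->NUMA-node mapping (e.g. read from
// /sys/devices/system/node on the node under test).
func numaSpread(cpuIDs []int, cpuToNUMA map[int]int) map[int]int {
	spread := make(map[int]int)
	// Start every NUMA node at zero so packed allocations are visible.
	for _, node := range cpuToNUMA {
		spread[node] = 0
	}
	for _, cpu := range cpuIDs {
		spread[cpuToNUMA[cpu]]++
	}
	return spread
}

// isBalanced reports whether per-node counts differ by at most one, which is
// what we expect when `distribute-cpus-across-numa` is enabled.
func isBalanced(spread map[int]int) bool {
	lo, hi := -1, -1
	for _, n := range spread {
		if lo == -1 || n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return hi-lo <= 1
}

func main() {
	cpuToNUMA := map[int]int{0: 0, 1: 0, 2: 1, 3: 1}
	fmt.Println(isBalanced(numaSpread([]int{0, 1, 2, 3}, cpuToNUMA))) // true  (2 vs 2)
	fmt.Println(isBalanced(numaSpread([]int{0, 1, 2}, cpuToNUMA)))    // true  (2 vs 1)
	fmt.Println(isBalanced(numaSpread([]int{0, 1}, cpuToNUMA)))       // false (2 vs 0, packed)
}
```
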
### Graduation Criteria
#### Alpha
###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `CPUManagerPolicyBetaOptions`
  - Components depending on the feature gate: `kubelet`
- [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `distribute-cpus-across-numa` (see the example configuration below)
- Will enabling / disabling the feature require downtime of the control plane?
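
For illustration, the kubelet configuration change referenced above could look like this (a sketch; adjust values such as the feature gate to your cluster version):

```yaml
# Illustrative KubeletConfiguration fragment (not a complete configuration).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
```
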
###### Does enabling the feature change any default behavior?

No. In order to trigger any of the new logic, three things have to be true:

1. The `CPUManagerPolicyBetaOptions` feature gate must be enabled
1. The `static` `CPUManager` policy must be selected
1. The new `distribute-cpus-across-numa` policy option must be selected

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the feature can be disabled by any of the following:

1. Disabling the `CPUManagerPolicyBetaOptions` feature gate
1. Switching the `CPUManager` policy to `none`
1. Removing `distribute-cpus-across-numa` from the list of `CPUManager` policy options

- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)
### Rollout, Upgrade and Rollback Planning
###### How can a rollout or rollback fail? Can it impact already running workloads?
- A rollout or rollback can fail if the feature gate and the policy option are not configured properly and kubelet fails to start.
###### What specific metrics should inform a rollback?

As part of the graduation of this feature, we plan to add a `cpu_manager_numa_allocation_spread` metric to show how CPUs are distributed across NUMA nodes. This can be used to observe the CPU distribution across NUMA nodes and will provide an indication of whether a rollback is needed.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not Applicable. This policy option only affects pods that meet certain conditions and are scheduled after the upgrade. Running pods will be unaffected by any change.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
### Monitoring Requirements
###### How can an operator determine if the feature is in use by workloads?
Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.

In addition to that, we can check the metric `cpu_manager_numa_allocation_spread` to determine how allocated CPUs are spread across NUMA nodes.
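
For example, the kubelet configuration can be inspected through the node's `configz` endpoint (assuming `kubectl` access to the cluster and `jq` installed):

```sh
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
  | jq '.kubeletconfig | {cpuManagerPolicy, cpuManagerPolicyOptions}'
```
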
###### How can someone using this feature know that it is working for their instance?
In order to verify this feature is working, one should:

To verify the list of CPUs allocated to the container, one can either:

- `exec` into the container and run `taskset -cp 1` (assuming this command is available in the container).
- Call the `GetCPUs()` method of the `CPUsProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider) (a minimal client sketch follows).
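
For the second approach, a minimal client sketch (assuming the default podresources socket path) could list the allocated CPU IDs like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default kubelet podresources socket; adjust for your environment.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	// Print the exclusively allocated CPU IDs per container.
	for _, pod := range resp.GetPodResources() {
		for _, ctr := range pod.GetContainers() {
			fmt.Printf("%s/%s: CPUs %v\n", pod.GetName(), ctr.GetName(), ctr.GetCpuIds())
		}
	}
}
```
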

Also, we can check the `cpu_manager_numa_allocation_spread` metric. We plan to add this metric to track how CPUs are distributed across NUMA zones, with labels/buckets representing NUMA nodes (numa_node=0, numa_node=1, ..., numa_node=N).

With packed allocation (default, option off), the distribution should mostly be in numa_node=1, with a small tail to numa_node=2 (and possibly higher) in cases of severe fragmentation. Users can compare this spread metric with the `container_aligned_compute_resources_count` metric to determine if they are getting aligned packed allocation or just packed allocation due to implementation details.

For example, if a node has 2 NUMA nodes and a pod requests 8 CPUs (with no other pods requesting exclusive CPUs on the node), the metric would look like this:
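
(The sample below is an illustrative sketch, assuming a simple per-NUMA-node counter; the exact metric type and labels may differ in the final implementation.)

```
# Option disabled (packed): all 8 CPUs come from a single NUMA node.
cpu_manager_numa_allocation_spread{numa_node="0"} 8
cpu_manager_numa_allocation_spread{numa_node="1"} 0

# Option enabled (distributed): the 8 CPUs are split evenly across both nodes.
cpu_manager_numa_allocation_spread{numa_node="0"} 4
cpu_manager_numa_allocation_spread{numa_node="1"} 4
```
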

Note: This example is simplified to clearly highlight the difference between the two cases. Existing pods may slightly skew the counts, but the general trend of peaks and troughs will still provide a good indication of CPU distribution across NUMA nodes, allowing users to determine if the policy option is enabled or not.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
There are no specific SLOs for this feature.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes, as part of graduation of this feature to Beta, we plan to add the `cpu_manager_numa_allocation_spread` metric to provide data on how CPUs are distributed across NUMA nodes.

###### Does this feature depend on any specific services running in the cluster?

This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field. This has been available since `v1alpha2`.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No

### Scalability
###### Will enabling / using this feature result in any new API calls?
No, the algorithm will run on a single `goroutine` with minimal memory requirements.
### Troubleshooting
###### How does this feature react if the API server and/or etcd is unavailable?
No impact. The behavior of the feature does not change when API Server and/or etcd is unavailable since the feature is node local.
###### What are other known failure modes?

Because of the existing distribution of CPU resources across NUMA nodes, a distributed allocation might not be possible, e.g. if all available CPUs are present on the same NUMA node.

In that case we resort back to a best-effort fit of packing CPUs into NUMA nodes wherever they can fit.

###### What steps should be taken if SLOs are not being met to determine the problem?
## Implementation History
- 2021-08-26: Initial KEP created
- 2021-08-30: Updates to fill out more sections, answer PRR questions
- 2021-09-08: Change feature gate from `CPUManagerPolicyOptions` to `CPUManagerPolicyExperimentalOptions`
- 2021-10-11: Change feature gate from `CPUManagerPolicyExperimentalOptions` to `CPUManagerPolicyAlphaOptions`
- 2025-01-30: KEP update for Beta graduation of the policy option