keps/sig-node/4800-cpumanager-split-uncorecache/README.md
Lines changed: 17 additions & 17 deletions
@@ -290,15 +290,16 @@ N/A. This feature requires a e2e test for testing.
290
290
#### Alpha
291
291
292
292
- Feature implemented behind a feature gate flag option
293
-
- E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
294
-
- Providing a metric to verify uncore cache alignment will be required to graduate to beta.
293
+
- Test cases created for feature
295
294
296
295
#### Beta
297
296
298
-
- Address bug fixes and missing features: ability to schedule odd-integer CPUs for uncore cache alignment
299
-
- Add tests to ensure functional compatibility with existing CPUManager options
300
-
- Add tests to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
297
+
- Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
298
+
- Add missing feature: sort uncore caches by largest quantity of available CPUs instead of numerical order
299
+
- Add test cases to ensure functional compatibility with existing CPUManager options
300
+
- Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
301
301
- Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
302
+
- Add metric for uncore cache alignment and incorporate to E2E tests
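The Beta item about sorting uncore caches by the largest quantity of available CPUs, rather than by numerical order, can be illustrated with a minimal Go sketch. The `uncoreCache` type, its fields, and the `sortByAvailableCPUs` helper are hypothetical names chosen for illustration only; they are not the kubelet's internal types, and the tie-break on ascending cache ID is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// uncoreCache is a hypothetical illustration type, not a kubelet type.
type uncoreCache struct {
	ID       int   // uncore cache ID as reported by the topology
	FreeCPUs []int // CPU IDs in this cache that are still unallocated
}

// sortByAvailableCPUs returns a copy of caches ordered by the number of
// free CPUs, largest first. Ties fall back to ascending cache ID (an
// assumption made here so the ordering is deterministic).
func sortByAvailableCPUs(caches []uncoreCache) []uncoreCache {
	out := make([]uncoreCache, len(caches))
	copy(out, caches)
	sort.SliceStable(out, func(i, j int) bool {
		if len(out[i].FreeCPUs) != len(out[j].FreeCPUs) {
			return len(out[i].FreeCPUs) > len(out[j].FreeCPUs)
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	caches := []uncoreCache{
		{ID: 0, FreeCPUs: []int{2, 3}},
		{ID: 1, FreeCPUs: []int{8, 9, 10, 11, 12, 13, 14, 15}},
		{ID: 2, FreeCPUs: []int{16, 17, 18, 19}},
	}
	for _, c := range sortByAvailableCPUs(caches) {
		fmt.Printf("uncore cache %d: %d free CPUs\n", c.ID, len(c.FreeCPUs))
	}
}
```

Under numerical ordering, a request would be tried against cache 0 first even though it has only two free CPUs; ordering by free CPU count tries the emptiest cache first, which makes it more likely that a container fits within a single uncore cache.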
302
303
303
304
### Upgrade / Downgrade Strategy
304
305
@@ -338,7 +339,6 @@ you need any help or guidance.
338
339
339
340
Enabling this feature requires enabling the CPUManager feature gates for the static policy in the Kubelet configuration file and adding the policy option for uncore cache alignment.
340
341
341
-
342
342
###### How can this feature be enabled / disabled in a live cluster?
343
343
344
344
For `CPUManager`, changing from the `none` policy to the `static` policy cannot be done dynamically because of the `cpu_manager_state` file. The node needs to be drained and the policy checkpoint file (`cpu_manager_state`) needs to be removed before restarting the Kubelet. This feature specifically relies on the `static` policy being enabled.
@@ -368,10 +368,9 @@ Feature will be enabled. Proper drain of node and restart of kubelet required. F
368
368
369
369
###### Are there any tests for feature enablement/disablement?
370
370
371
-
Option is not enabled dynamically. To enable/disable option, cpu_manager_state must be removed and kubelet must be restarted.
372
-
Unit tests will be implemented to test if the feature is enabled/disabled.
373
-
The E2E node serial suite can be used to test the enablement/disablement of the feature since it allows the kubelet to be restarted.
374
-
371
+
E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
372
+
Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
373
+
See PR#130133 (https://github.com/kubernetes/kubernetes/pull/130133)
375
374
376
375
### Rollout, Upgrade and Rollback Planning
377
376
@@ -381,12 +380,13 @@ This section must be completed when targeting beta to a release.
381
380
382
381
###### How can a rollout or rollback fail? Can it impact already running workloads?
383
382
384
-
Kubelet restarts are not expected to impact existing CPU assignments to already running workloads
385
-
383
+
Rollout/rollback should not fail since the feature is hidden behind feature gates and will not be enabled by default.
384
+
Enabling the feature will require the Kubelet to restart, introducing potential for kubelet to fail to start or crash, which can affect existing workloads.
385
+
In response, drain the node and restart the kubelet.
386
386
387
387
###### What specific metrics should inform a rollback?
388
388
389
-
Increased pod startup time/latency
389
+
The `AlignedUncoreCache` metric can be tracked to detect issues in the cpuset allocation and to determine whether a rollback is necessary.
390
390
391
391
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
392
392
@@ -405,7 +405,7 @@ Reference CPUID info in podresources API to be able to verify assignment.
405
405
###### How can an operator determine if the feature is in use by workloads?
406
406
407
407
Reference podresources API to determine CPU assignment and CacheID assignment per container.
408
-
Use proposed 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
408
+
Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
409
409
410
410
###### How can someone using this feature know that it is working for their instance?
411
411
@@ -417,16 +417,16 @@ Reference podresources API to determine CPU assignment.
417
417
418
418
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
419
419
420
-
Measure the time to deploy pods under default settings and compare to the time to deploy pods with align-by-uncorecache enabled. Time difference should be negligible.
420
+
The cpuset allocation should span the fewest uncore caches possible on the node.
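One way to reason about this SLO is to count how many distinct uncore caches a container's cpuset touches and compare that with the smallest count that could hold the request. The Go sketch below is illustrative only: `uncoreCachesSpanned`, `minCachesNeeded`, and the CPU-to-cache map are hypothetical, and the lower bound assumes every uncore cache has the same number of CPUs and is fully free.

```go
package main

import "fmt"

// uncoreCachesSpanned counts the distinct uncore caches touched by cpus.
// cpuToCache is a hypothetical CPU ID -> uncore cache ID lookup; CPUs
// missing from the map are ignored.
func uncoreCachesSpanned(cpus []int, cpuToCache map[int]int) int {
	seen := make(map[int]struct{})
	for _, cpu := range cpus {
		if id, ok := cpuToCache[cpu]; ok {
			seen[id] = struct{}{}
		}
	}
	return len(seen)
}

// minCachesNeeded is ceil(n / cpusPerCache): the fewest caches that could
// hold n CPUs, assuming uniform cache sizes and fully free caches.
func minCachesNeeded(n, cpusPerCache int) int {
	return (n + cpusPerCache - 1) / cpusPerCache
}

func main() {
	// Assumed topology: 16 CPUs, 8 CPUs per uncore cache.
	cpuToCache := make(map[int]int)
	for cpu := 0; cpu < 16; cpu++ {
		cpuToCache[cpu] = cpu / 8
	}

	aligned := []int{0, 1, 2, 3} // all within uncore cache 0
	split := []int{6, 7, 8, 9}   // straddles uncore caches 0 and 1

	fmt.Println(uncoreCachesSpanned(aligned, cpuToCache), minCachesNeeded(len(aligned), 8))
	fmt.Println(uncoreCachesSpanned(split, cpuToCache), minCachesNeeded(len(split), 8))
}
```

An allocation meets the objective in this simplified model when the spanned count equals the lower bound; on a partially allocated node the achievable minimum can be higher.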
421
421
422
422
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
423
423
424
424
- Metrics
425
-
- `topology_manager_admission_duration_ms`, which measures the duration of the admission process performed by Topology Manager.
425
+
- `container_aligned_compute_resources_count` can be used to determine uncore cache alignment.
426
426
427
427
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
428
428
429
-
Utilized proposed 'container_aligned_compute_resources_count' in PR#127155 to be extended for uncore cache alignment count.
429
+
No.
430
430
431
431
<!--
432
432
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,