keps/sig-node/4800-cpumanager-split-uncorecache/README.md
+17 -18 (17 additions & 18 deletions)
@@ -262,7 +262,7 @@ The `prefer-align-cpus-by-uncorecache` feature will be enabled and tested indivi
 - `full-pcpus-only`
 - Topology Manager NUMA Affinity

-The following CPU Topologies are representative of various uncore cache architectures and will be added to policy_test.go and represented in the unit testing.
+The following CPU Topologies are representative of various uncore cache architectures and will be added to [policy_test.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_test.go) and represented in the unit testing.
@@ -279,27 +279,25 @@ N/A. This feature requires a e2e test for testing.

 ##### e2e tests

-- For e2e testing, checks will be added to determine if the node has a split uncore cache topology. If node does not meet the requirement to have multiple uncore caches, the added tests will be skipped.
-- e2e testing should cover the deployment of a pod that is following uncore cache alignment. CPU assignment can be determined by podresources API and programmatically cross-referenced to sysfs topology information to determine proper uncore cache alignment.
-- For e2e testing, guaranteed pods will be deployed with various CPU size requirements on our own baremetal instances across different vendor architectures and confirming the CPU assignments to uncore cache core groupings. This feature is intended for baremetal only and not cloud instances.
-- Update CI to test GCP instances of different architectures utilizing uncore cache alignment feature.
-
+- [should update alignment counters when pod successfully run taking less than uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
+- [should update alignment counters when pod successfully run taking a full uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
+- [should not update alignment counters when pod successfully run taking more than a uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
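The e2e description in this hunk mentions cross-referencing a container's CPU assignment against sysfs topology to confirm uncore cache alignment. A minimal Go sketch of that check follows. It is not the code in `cpu_manager_metrics_test.go`: the function names and the `cpuToCache` map are hypothetical, and the map is assumed to have been built elsewhere (for example from the per-CPU cache `id` files under sysfs), with the CPU list assumed to come from the podresources API.

```go
package main

import (
	"fmt"
	"sort"
)

// uncoreCachesUsed returns the sorted, de-duplicated uncore cache IDs touched
// by the given CPUs. cpuToCache is a hypothetical map from logical CPU ID to
// uncore (last-level) cache ID; how it is populated is outside this sketch.
func uncoreCachesUsed(cpus []int, cpuToCache map[int]int) ([]int, error) {
	seen := map[int]struct{}{}
	for _, cpu := range cpus {
		cache, ok := cpuToCache[cpu]
		if !ok {
			return nil, fmt.Errorf("cpu %d has no known uncore cache", cpu)
		}
		seen[cache] = struct{}{}
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids, nil
}

// isUncoreAligned reports whether every CPU in the set shares one uncore cache.
// An empty CPU set is treated as not aligned.
func isUncoreAligned(cpus []int, cpuToCache map[int]int) (bool, error) {
	ids, err := uncoreCachesUsed(cpus, cpuToCache)
	if err != nil {
		return false, err
	}
	return len(ids) == 1, nil
}

func main() {
	// Toy topology (assumed): CPUs 0-3 share cache 0, CPUs 4-7 share cache 1.
	topo := map[int]int{0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

	aligned, _ := isUncoreAligned([]int{0, 1, 2}, topo)
	fmt.Println(aligned) // true: all CPUs sit on cache 0

	aligned, _ = isUncoreAligned([]int{3, 4}, topo)
	fmt.Println(aligned) // false: the set spans caches 0 and 1
}
```

Because the feature is best-effort, a `false` result here would indicate a spread allocation rather than an error.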

 ### Graduation Criteria

 #### Alpha

 - Feature implemented behind a feature gate flag option
-- Test cases created for feature
+- Add unit test coverage
+- Added metrics to cover observability needs
+- Added e2e tests for metrics

 #### Beta

 - Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
-- Add missing feature: sort uncore caches by largest quantity of available CPUs instead of numerical order
 - Add test cases to ensure functional compatibility with existing CPUManager options
 - Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
-- Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
-- Add metric for uncore cache alignment and incorporate to E2E tests
+- Add E2E test coverage for feature
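One Beta item in this hunk describes sorting uncore caches by the largest quantity of available CPUs instead of by numerical order. The Go sketch below illustrates that ordering only. It is not the kubelet's implementation: `sortCachesByFreeCPUs` and its input map (cache ID to free CPU count) are hypothetical, and breaking ties by ascending cache ID is an assumption made to keep the result deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// sortCachesByFreeCPUs returns uncore cache IDs ordered by descending number
// of free CPUs. Ties fall back to ascending cache ID (an assumption of this
// sketch, chosen so the order is stable across runs).
func sortCachesByFreeCPUs(free map[int]int) []int {
	ids := make([]int, 0, len(free))
	for id := range free {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if free[ids[i]] != free[ids[j]] {
			return free[ids[i]] > free[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	// Hypothetical node state: cache ID -> CPUs still available on that cache.
	free := map[int]int{0: 2, 1: 8, 2: 8, 3: 0}
	fmt.Println(sortCachesByFreeCPUs(free)) // [1 2 0 3]
}
```

Trying the emptiest caches first makes it more likely that a request fits inside a single uncore cache, which is the outcome the best-effort policy aims for.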

 ### Upgrade / Downgrade Strategy

@@ -370,7 +368,7 @@ Feature will be enabled. Proper drain of node and restart of kubelet required. F

 E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
 Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
-See PR#130133(https://github.com/kubernetes/kubernetes/pull/130133)
+See [cpu_manager_metrics_test.go](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go)

 ### Rollout, Upgrade and Rollback Planning

@@ -380,13 +378,13 @@ This section must be completed when targeting beta to a release.

 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-Rollout/rollback should not fail since the feature is hidden behind feature gates and will not be enabled by default.
-Enabling the feature will require the Kubelet to restart, introducing potential for kubelet to fail to start or crash, which can affect existing workloads.
-In response, drain the node and restart the kubelet.
+This feature is a best-effort alignment of CPUs to uncore caches that requires a kubelet restart that must not affect running workloads. No changes needed to cpu_manager_state file.
+A rollout may fail based upon existing workloads that create fragmented uncore caches on the node, potentially resulting in CPUset distribution across multiple caches based upon the CPU quantity requirements and the best-effort policy.
+Metrics below can help the user track alignment, but a rollback will not help because the feature is not a strict alignment to uncore caches, but a best-effort to reduce shared uncore caches.

 ###### What specific metrics should inform a rollback?

-`AlignedUncoreCache` metric can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.
+`kubelet_container_aligned_compute_resources_count` and `container_aligned_compute_resources_failure_count` metrics can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -405,7 +403,7 @@ Reference CPUID info in podresources API to be able to verify assignment.
 ###### How can an operator determine if the feature is in use by workloads?

 Reference podresources API to determine CPU assignment and CacheID assignment per container.
-Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155(https://github.com/kubernetes/kubernetes/pull/127155).
+Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See [kubelet/metrics/metrics.go](https://github.com/kubernetes/kubernetes/blob/8f1f17a04f62ab64ebe4f0b9d7f5f799bf56a0d9/pkg/kubelet/metrics/metrics.go#L135).

 ###### How can someone using this feature know that it is working for their instance?

@@ -417,12 +415,13 @@ Reference podresources API to determine CPU assignment.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-CPUset allocation should be on the fewest amount of uncore caches as possible on the node.
+In default Kubernetes installation, 99th percentile per cluster-day <= X
+This feature is best-effort and will not cause failed admission, but can introduce admission delay.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 - Metrics
-  - `container_aligned_compute_resource_count` can be used to determine Uncore Cache alignment
+  - `topology_manager_admission_duration_ms` can be used to determine pod admission time

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?