Skip to content

Commit effb5cb

Browse files
committed
respond to feedback
1 parent 388cb73 commit effb5cb

File tree

1 file changed

+17
-18
lines changed
  • keps/sig-node/4800-cpumanager-split-uncorecache

1 file changed

+17
-18
lines changed

keps/sig-node/4800-cpumanager-split-uncorecache/README.md

Lines changed: 17 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -262,7 +262,7 @@ The `prefer-align-cpus-by-uncorecache` feature will be enabled and tested indivi
262262
- `full-pcpus-only`
263263
- Topology Manager NUMA Affinity
264264

265-
The following CPU Topologies are representative of various uncore cache architectures and will be added to policy_test.go and represented in the unit testing.
265+
The following CPU Topologies are representative of various uncore cache architectures and will be added to [policy_test.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_test.go) and represented in the unit testing.
266266

267267
- 1P AMD EPYC 7702P 64C (smt-on/off) NPS=1, 16 uncore cache instances/socket
268268
- 2P AMD EPYC 7303 32C (smt-on/off) NPS=1, 4 uncore cache instances/socket
@@ -279,27 +279,25 @@ N/A. This feature requires a e2e test for testing.
279279

280280
##### e2e tests
281281

282-
- For e2e testing, checks will be added to determine if the node has a split uncore cache topology. If node does not meet the requirement to have multiple uncore caches, the added tests will be skipped.
283-
- e2e testing should cover the deployment of a pod that is following uncore cache alignment. CPU assignment can be determined by podresources API and programatically cross-referenced to syfs topology information to determine proper uncore cache alignment.
284-
- For e2e testing, guaranteed pods will be deployed with various CPU size requirements on our own baremetal instances across different vendor architectures and confirming the CPU assignments to uncore cache core groupings. This feature is intended for baremetal only and not cloud instances.
285-
- Update CI to test GCP instances of different architectures utilizing uncore cache alignment feature.
286-
282+
- [should update alignment counters when pod successfully run taking less than uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
283+
- [should update alignment counters when pod successfully run taking a full uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
284+
- [should not update alignment counters when pod successfully run taking more than a uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
287285

288286
### Graduation Criteria
289287

290288
#### Alpha
291289

292290
- Feature implemented behind a feature gate flag option
293-
- Test cases created for feature
291+
- Add unit test coverage
292+
- Added metrics to cover observability needs
293+
- Added e2e tests for metrics
294294

295295
#### Beta
296296

297297
- Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
298-
- Add missing feature: sort uncore caches by largest quantity of available CPUs instead of numerical order
299298
- Add test cases to ensure functional compatibility with existing CPUManager options
300299
- Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
301-
- Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
302-
- Add metric for uncore cache alignment and incorporate to E2E tests
300+
- Add E2E test coverage for feature
303301

304302
### Upgrade / Downgrade Strategy
305303

@@ -370,7 +368,7 @@ Feature will be enabled. Proper drain of node and restart of kubelet required. F
370368

371369
E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
372370
Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
373-
See PR#130133 (https://github.com/kubernetes/kubernetes/pull/130133)
371+
See [cpu_manager_metrics_test.go](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go)
374372

375373
### Rollout, Upgrade and Rollback Planning
376374

@@ -380,13 +378,13 @@ This section must be completed when targeting beta to a release.
380378

381379
###### How can a rollout or rollback fail? Can it impact already running workloads?
382380

383-
Rollout/rollback should not fail since the feature is hidden behind feature gates and will not be enabled by default.
384-
Enabling the feature will require the Kubelet to restart, introducing potential for kubelet to fail to start or crash, which can affect existing workloads.
385-
In response, drain the node and restart the kubelet.
381+
This feature is a best-effort alignment of CPUs to uncore caches that requires a kubelet restart that must not affect running workloads. No changes needed to cpu_manager_state file.
382+
A rollout may fail based upon existing workloads that create fragmented uncore caches on the node, potentially resulting in CPUset distribution across multiple caches based upon the CPU quantity requirements and the best-effort policy.
383+
Metrics below can help the user track alignment, but a rollback will not help because the feature is not a strict alignment to uncore caches, but a best-effort to reduce shared uncore caches.
386384

387385
###### What specific metrics should inform a rollback?
388386

389-
`AlignedUncoreCache` metric can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.
387+
`kubelet_container_aligned_compute_resources_count` and `container_aligned_compute_resources_failure_count` metric can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.
390388

391389
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
392390

@@ -405,7 +403,7 @@ Reference CPUID info in podresources API to be able to verify assignment.
405403
###### How can an operator determine if the feature is in use by workloads?
406404

407405
Reference podresources API to determine CPU assignment and CacheID assignment per container.
408-
Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
406+
Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See [kubelet/metrics/metrics.go](https://github.com/kubernetes/kubernetes/blob/8f1f17a04f62ab64ebe4f0b9d7f5f799bf56a0d9/pkg/kubelet/metrics/metrics.go#L135).
409407

410408
###### How can someone using this feature know that it is working for their instance?
411409

@@ -417,12 +415,13 @@ Reference podresources API to determine CPU assignment.
417415

418416
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
419417

420-
CPUset allocation should be on the fewest amount of uncore caches as possible on the node.
418+
In default Kubernetes installation, 99th percentile per cluster-day <= X
419+
This feature is best-effort and will not cause failed admission, but can introduce admission delay.
421420

422421
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
423422

424423
- Metrics
425-
- `container_aligned_compute_resource_count` can be used to determine Uncore Cache alignment
424+
- `topology_manager_admission_duration_ms` can be used to determine pod admission time
426425

427426
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
428427

0 commit comments

Comments
 (0)