
 ### Risks and Mitigations

-TBD
+Bugs in the cpumanager can cause the kubelet to crash, or workloads to start with incorrect pinning.
+This can be mitigated with comprehensive testing and by improving the observability of the system
+(see metrics).
+
+While the cpumanager core policy has seen no changes other than bugfixes in quite a while,
+we introduced the [cpumanager policy options framework](https://github.com/fromanirh/enhancements/blob/master/keps/sig-node/2625-cpumanager-policies-thread-placement/README.md)
+to enable fine tuning of the static policy.
+This area is more active, so bugs introduced with policy options can cause the kubelet to crash.
+To mitigate this risk, we make sure each policy option can be disabled independently and
+is not coupled with the others, avoiding cascading failures.
+Graduation and testing criteria are deferred to the KEPs tracking the implementation of these features.
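+
+As an illustration of this independence, a minimal sketch of a `KubeletConfiguration` fragment
+enabling a single policy option; `full-pcpus-only` is one of the existing options, and removing
+its entry disables just that behavior without affecting the core static policy:
+
+```yaml
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+cpuManagerPolicy: static
+cpuManagerPolicyOptions:
+  # each option is an independent key; dropping it disables only this behavior
+  # (may require the CPUManagerPolicyOptions feature gate, depending on version)
+  full-pcpus-only: "true"
+```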

 ## Design Details

@@ -530,7 +540,11 @@ Already running workload will not be affected if the node state is steady

 ###### What specific metrics should inform a rollback?

-Pod creation errors on a node-by-node basis.
| 543 | +"cpu_manager_pinning_errors_total". It must be noted that even in fully healthy system there are known benign condition |
| 544 | +that can cause CPU allocation failures. Few selected examples are: |
| 545 | + |
| 546 | +- requesting odd numbered cores (not a full physical core) when the cpumanager is configured with the `full-pcpus-only` option |
| 547 | +- requesting NUMA-aligned cores, with Topology Manager enabled. |
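+
+A minimal sketch of a Prometheus alert built on this metric, assuming the kubelet metrics
+endpoint is scraped; the rate window, duration, and alert name are illustrative, not prescriptive:
+
+```yaml
+groups:
+- name: cpumanager
+  rules:
+  - alert: CPUManagerPinningErrors
+    # sustained pinning failures; rule out the benign conditions listed above
+    # before treating this as a rollback signal
+    expr: rate(cpu_manager_pinning_errors_total[10m]) > 0
+    for: 15m
+    labels:
+      severity: warning
+```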

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ ... @@

 ### Monitoring Requirements

-Monitor the pod admission counter
-Monitor the pods not going running after successful schedule
+Monitor the metrics:
+- "cpu_manager_pinning_requests_total"
+- "cpu_manager_pinning_errors_total"

 ###### How can an operator determine if the feature is in use by workloads?

-The operator need to inspect the node and verify the cpu pinning assignment either checking the cgroups on the node
-or accessing the podresources API of the kubelet.
+In order for pods to request exclusive CPU allocation and pinning, they need to match
+all the following criteria:
+- the pod QoS must be "guaranteed"
+- the CPU resource limit (`pod.spec.containers[].resources.limits.cpu`) must be an integer.

-###### How can someone using this feature know that it is working for their instance?
+On top of that, at the kubelet level:
+- the cpumanager policy must be `static`.

+If all the criteria are met, then the feature is in use by workloads.
+
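+A minimal sketch of a pod that satisfies these criteria (the name and image are illustrative):
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pinned-workload    # hypothetical name
+spec:
+  containers:
+  - name: app
+    image: registry.k8s.io/pause:3.9
+    resources:
+      # requests == limits with an integer CPU count => guaranteed QoS,
+      # eligible for exclusive CPUs when the kubelet uses the static policy
+      requests:
+        cpu: "2"
+        memory: "256Mi"
+      limits:
+        cpu: "2"
+        memory: "256Mi"
+```
+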
+###### How can someone using this feature know that it is working for their instance?

 - [X] Other (treat as last resort)
- - Details: the containers need to check the cpu set they are allowed to run; in addition, node agents (e.g. node_exporter)
- can report the CPU assignment
+ - Details: check the kubelet metric `cpu_manager_pinning_requests_total`

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-- N/A
| 585 | +"cpu_manager_pinning_requests_total" and "cpu_manager_pinning_errors_total" |
| 586 | +We need to find a careful balance here because we don't want to leak hardware details, or in general informations |
| 587 | +dependent on the worker node hardware configuration (example, even if arguable extreme, is the processor core layout). |
| 588 | + |
| 589 | +It is possible to infer which pod would trigger a CPU pinning from the |
| 590 | +[pod resources request](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy) |
| 591 | +but adding these two metrics is both very cheap and helping for the observability of the system. |
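+
+A sketch of that ratio as a Prometheus recording rule; the rule name and window are illustrative:
+
+```yaml
+groups:
+- name: cpumanager-slo
+  rules:
+  - record: instance:cpu_manager_pinning_error_ratio:rate10m
+    # fraction of pinning requests that failed over the last 10 minutes
+    expr: |
+      rate(cpu_manager_pinning_errors_total[10m])
+        /
+      rate(cpu_manager_pinning_requests_total[10m])
+```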

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-- [ ] Other (treat as last resort)
- - Details:
- a operator should check that pods go running correctly and the cpu pinning is performed. The latter can
- be checked by inspecting the cgroups at node level.
+- [X] Metrics
+  - Metric name:
+    - cpu_manager_pinning_requests_total
+    - cpu_manager_pinning_errors_total

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-No, because all the metrics we were aware of leaked hardware details.
-All of the metrics experimented by consumers of the feature so far require to expose hardware details of the
-worker nodes, and are dependent on the worker node hardware configuration (e.g. processor core layout).
+- "cpu_manager_pinning_requests_total"
+- "cpu_manager_pinning_errors_total"
+
+These metrics will be added before moving to GA
+([issue](https://github.com/kubernetes/kubernetes/issues/112854),
+ [PR](https://github.com/kubernetes/kubernetes/pull/112855)).
+

 ### Dependencies
