# Deploying the NodeResourceTopology plugins

This document contains notes and outlines the challenges related to running the NodeResourceTopology plugin.
The intent is to share awareness about the challenges and how to overcome them.
**The content here applies to plugins running with the overreserve cache enabled.**

Last updated: December 2023 - last released version: 0.27.8

## As a secondary scheduler

Running the NodeResourceTopology plugin as part of a secondary scheduler is currently the recommended approach.
This way only the workloads which want stricter NUMA alignment can opt in, while any other workload, including all
the infrastructure pods, keeps running as usual using the battle-tested main kubernetes scheduler.

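For illustration, a workload opts in simply by pointing its pod spec at the secondary scheduler. The sketch below assumes a scheduler profile named `topology-aware-scheduler`; this name is a placeholder and must match the `schedulerName` configured in your secondary scheduler deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-aligned-app
spec:
  # Opt in to the secondary scheduler running the NodeResourceTopology plugins.
  # "topology-aware-scheduler" is a placeholder profile name.
  schedulerName: topology-aware-scheduler
  containers:
  - name: app
    image: registry.example.com/numa-app:latest   # placeholder image
    resources:
      # Guaranteed QoS with integral CPUs, so the resources can be NUMA-aligned.
      requests:
        cpu: "4"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 4Gi
```
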
### Interference with the main scheduler

For reasons explained in detail in the
[cache design docs](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/cache/docs/NUMA-aware-scheduler-side-reserve-plugin.md#the-reconciliation-condition-a-node-state-definition),
the NRT plugin needs to constantly compute the expected node state in terms of which pods are running there.
It is not possible (nor actually desirable) to completely isolate worker nodes while the NRT code runs in the secondary
scheduler. The main scheduler is always allowed to schedule pods on all the nodes, if nothing else to manage infra
pods (pods needed by k8s itself to work). It is theoretically possible to achieve this form of isolation, but
the setup would be complex and likely fragile, so this option has never been fully explored.

When both the main scheduler and the secondary scheduler bind pods to nodes, the latter needs to deal with the fact
that pods can be scheduled outside its own control; thus, the expected computed state needs to be robust against
this form of interference. The logic to deal with this interference is implemented in the
[foreign pods][1] [management](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/cache/foreign_pods.go).

While NRT runs as a secondary scheduler, this form of interference is practically unavoidable.
The effect of the interference is a slower scheduling cycle, because the NRT plugin needs to compute the
expected node state more often and possibly invalidate partial computations, beginning again from scratch,
which takes more time.

In order to minimize the recomputation without compromising accuracy, we introduced a cache tuning option,
[ForeignPodsDetect](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/apis/config/v1/types.go#L179).
Currently, the foreign pods detection algorithm is triggered by default when any foreign pod is detected.
Testing is in progress to move the default to trigger only for foreign pods which request exclusive resources,
because only exclusive resources can have NUMA locality and thus contribute to the accounting done by the plugin.

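A minimal sketch of how this knob can be set in the scheduler configuration follows. The field and value names are assumptions derived from the `NodeResourceTopologyMatchArgs` type linked above; verify them against the release you actually deploy.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: topology-aware-scheduler   # placeholder name, match your deployment
  plugins:
    filter:
      enabled:
      - name: NodeResourceTopologyMatch
    score:
      enabled:
      - name: NodeResourceTopologyMatch
  pluginConfig:
  - name: NodeResourceTopologyMatch
    args:
      # overreserve cache: periodically resync the expected node state
      cacheResyncPeriodSeconds: 5
      cache:
        # trigger the foreign pods detection only for pods requesting
        # exclusive resources (value name assumed, see the linked types.go)
        foreignPodsDetect: OnlyExclusiveResources
```
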
### Efficient computation of the expected node states

In order to compute the expected node state, by default the plugin considers all the pods running on a node, regardless
of their resource requests. This is wasteful and arguably incorrect (albeit not harmful), because only exclusive resources
can possibly have NUMA affinity. Containers which don't ask for exclusive resources don't contribute to per-NUMA resource
accounting, and can be safely ignored.

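As a concrete illustration, only pods like the sketch below are relevant for per-NUMA accounting: assuming the kubelet CPU manager runs with the `static` policy, a Guaranteed pod requesting integral CPUs (and, where applicable, devices) is granted exclusive resources, while a burstable pod with fractional CPU requests is not. The image and device resource names are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exclusive-resources-example
spec:
  containers:
  - name: app
    image: registry.example.com/numa-app:latest   # placeholder image
    resources:
      # requests == limits with an integral CPU count -> Guaranteed QoS,
      # exclusive CPUs under the kubelet static CPU manager policy, plus an
      # exclusively assigned device: this container counts for per-NUMA accounting.
      requests:
        cpu: "2"
        memory: 2Gi
        example.com/device: "1"   # placeholder device resource name
      limits:
        cpu: "2"
        memory: 2Gi
        example.com/device: "1"
```
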
Minimizing the set of pods considered for the state reduces the churn and makes the state change less often.
This decreases the scheduler load and gives better performance, because it minimizes the time scheduling waits
for recomputation. Thus, we introduced a cache tuning option,
[ResyncMethod](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/apis/config/v1/types.go#L186).
Testing is in progress to switch the default and only compute the state using containers requiring exclusive resources.

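This option lives next to the foreign pods detection knob shown earlier; a sketch of the relevant `pluginConfig` fragment follows, with the value name assumed from the linked types and to be verified against your release.

```yaml
  pluginConfig:
  - name: NodeResourceTopologyMatch
    args:
      cache:
        # recompute the expected node state only from containers which
        # request exclusive resources (value name assumed, see the linked types.go)
        resyncMethod: OnlyExclusiveResources
```
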
## As part of the main scheduler

The NodeResourceTopology code is meant for eventual merge into core kubernetes.
The process is still ongoing and requires some dependencies to be sorted out first.

### Patching the main scheduler

Until the NodeResourceTopology code (and the noderesourcetopology API) becomes part of the core k8s codebase, users
willing to run just one scheduler need to patch their codebase and deploy an updated binary.
While we don't support this flow yet, we're also *not* aware of any specific limitation preventing it.

### Handling partial NRT data

A common reason to want to use just one scheduler is to minimize the moving parts and, perhaps even stronger, to
avoid partitioning the worker node set. In this case, NRT data may be available only for a subset of nodes, which
are the ones meant to run the NUMA-alignment-requiring workload.

By design, the NodeResourceTopology Filter plugin
[allows nodes which don't have NRT data attached](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/filter.go#L212).
This is a prerequisite to make the NodeResourceTopology plugin work at all in cases like this, with partial NRT
availability.

The NodeResourceTopology Score plugin is designed to
[always prefer nodes with NRT data available](https://github.com/kubernetes-sigs/scheduler-plugins/pull/685),
everything else being equal.

The combined effect of the Filter and Score design is meant to make the NodeResourceTopology plugin work in this scenario.
However, up until now the recommendation was to partition the worker nodes if possible, so this scenario is less tested.

The reader is advised to test this scenario carefully while we keep promoting it and add more tests to ensure adequate coverage.

### Efficient computation of the expected node states

The same considerations apply as for the case of running as a secondary scheduler, because computing the expected node
state is a pillar of the `overreserve` cache functionality.

---

[1]: A pod is "foreign" when it is known by the scheduler only _ex post_, after it was scheduled; the scheduling decision was
     made by a different scheduler and this instance needs to deal with that.