Commit 464b399

Merge pull request #686 from ffromani/nrt-doc-secondary-scheduler

nrt: doc: add notes and rationale about deploying

2 parents b3d5e20 + dfe8fde

2 files changed: +96 -1 lines changed

pkg/noderesourcetopology/README.deploy.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Deploying the NodeResourceTopology plugins

This document contains notes about, and outlines the challenges of, running the NodeResourceTopology plugin.
The intent is to share awareness of the challenges and how to overcome them.
**The content here applies to plugins running with the overreserve cache enabled.**

Last updated: December 2023 - last released version: 0.27.8

## As a secondary scheduler

Running the NodeResourceTopology plugin as part of a secondary scheduler is currently the recommended approach.
This way only the workloads which want stricter NUMA alignment opt in, while every other workload, including all
the infrastructure pods, keeps running as usual through the battle-tested main kubernetes scheduler.

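To make the opt-in concrete, here is a minimal sketch in Go, using the upstream API types, of a pod that selects the secondary scheduler. The scheduler name `topology-aware-scheduler` and the image are assumptions; use whatever names your deployment actually exposes.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// numaAwarePod builds a pod that opts in to the secondary, NUMA-aware scheduler.
// Pods that do not set SchedulerName keep using the default kube-scheduler.
func numaAwarePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "numa-sensitive-app"},
		Spec: corev1.PodSpec{
			// Assumed, deployment-specific name of the secondary scheduler profile.
			SchedulerName: "topology-aware-scheduler",
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/app:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					// Integral CPU request equal to the limit (guaranteed QoS),
					// so the CPUs can be exclusively assigned and NUMA-aligned.
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("4"),
						corev1.ResourceMemory: resource.MustParse("8Gi"),
					},
					Limits: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("4"),
						corev1.ResourceMemory: resource.MustParse("8Gi"),
					},
				},
			}},
		},
	}
}

func main() {
	pod := numaAwarePod()
	fmt.Printf("pod %q will be handled by scheduler %q\n", pod.Name, pod.Spec.SchedulerName)
}
```
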
### Interference with the main scheduler

For reasons explained in detail in the
[cache design docs](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/cache/docs/NUMA-aware-scheduler-side-reserve-plugin.md#the-reconciliation-condition-a-node-state-definition)
the NRT plugin needs to constantly compute the expected node state in terms of which pods are running there.
It is not possible (nor actually desirable) to completely isolate worker nodes while the NRT code runs in the secondary
scheduler. The main scheduler is always allowed to schedule pods on all the nodes, if nothing else to manage infra
pods (pods needed by k8s itself to work). It is theoretically possible to achieve this form of isolation, but
the setup would be complex and likely fragile, so this option has never been fully explored.

When both the main scheduler and the secondary scheduler bind pods to nodes, the latter needs to deal with the fact
that pods can be scheduled outside its own control; thus, the expected computed state needs to be robust against
this form of interference. The logic to deal with this interference is implemented in the
[foreign pods][1] [management](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/cache/foreign_pods.go).

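As a rough illustration of the concept (a simplified sketch, not the actual logic in `foreign_pods.go`), a pod can be treated as foreign when it is already bound to a node but none of the scheduler profiles served by this instance made that decision; the profile names below are assumptions.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isForeignPod is a simplified sketch: a pod counts as "foreign" when it is
// already assigned to a node, but none of the profiles served by this
// scheduler instance made that decision.
func isForeignPod(pod *corev1.Pod, ownedProfiles map[string]bool) bool {
	if pod.Spec.NodeName == "" {
		return false // not bound yet, nothing to reconcile
	}
	return !ownedProfiles[pod.Spec.SchedulerName]
}

func main() {
	// Assumed profile name for the secondary scheduler.
	owned := map[string]bool{"topology-aware-scheduler": true}

	infraPod := &corev1.Pod{}
	infraPod.Spec.SchedulerName = "default-scheduler"
	infraPod.Spec.NodeName = "worker-0"

	fmt.Println(isForeignPod(infraPod, owned)) // true: bound by the main scheduler
}
```
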
While NRT runs as a secondary scheduler, this form of interference is practically unavoidable.
The effect of the interference is slower scheduling cycles, because the NRT plugin needs to compute the
expected node state more often and possibly invalidate partial computations, beginning again from scratch,
which takes more time.

In order to minimize the recomputation without compromising accuracy, we introduced a cache tuning option,
[ForeignPodsDetect](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/apis/config/v1/types.go#L179).
Currently, the foreign pods detection logic triggers by default when any foreign pod is detected.
Testing is in progress to change the default so that it triggers only for foreign pods which request exclusive resources,
because only these exclusive resources can have NUMA locality and thus contribute to the accounting done by the plugin.

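For illustration, here is a hedged sketch of what "requesting exclusive resources" can mean in practice: integral CPUs with requests equal to limits (candidates for exclusive CPU assignment), or non-core, device-like resources. This is an assumption made for the example, not the plugin's actual check.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// requestsExclusiveResources is a simplified sketch: only containers asking
// for integral CPUs with requests == limits, or for non-core resources
// (typically devices), can have NUMA locality and thus matter for the
// per-NUMA accounting.
func requestsExclusiveResources(pod *corev1.Pod) bool {
	for _, cnt := range pod.Spec.Containers {
		cpuReq := cnt.Resources.Requests[corev1.ResourceCPU]
		cpuLim := cnt.Resources.Limits[corev1.ResourceCPU]
		if !cpuReq.IsZero() && cpuReq.Cmp(cpuLim) == 0 && cpuReq.Value()*1000 == cpuReq.MilliValue() {
			return true // integral, exclusive CPUs
		}
		for name := range cnt.Resources.Requests {
			if name != corev1.ResourceCPU && name != corev1.ResourceMemory &&
				name != corev1.ResourceEphemeralStorage {
				return true // device-like resource, assumed NUMA-affine
			}
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{}
	pod.Spec.Containers = []corev1.Container{{
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("2")},
			Limits:   corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("2")},
		},
	}}
	fmt.Println(requestsExclusiveResources(pod)) // true
}
```
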
### Efficient computation of the expected node states

In order to compute the expected node state, by default the plugin considers all the pods running on a node, regardless
of their resource requests. This is wasteful and arguably incorrect (albeit not harmful) because only exclusive resources
can possibly have NUMA affinity. So containers which don't ask for exclusive resources don't contribute to per-NUMA resource
accounting, and can be safely ignored.

Minimizing the set of pods considered for the state reduces the churn and makes the state change less often.
This decreases the scheduler load and gives better performance, because it minimizes the time during which scheduling waits
for recomputation. Thus, we introduced a cache tuning option,
[ResyncMethod](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/apis/config/v1/types.go#L186).
Testing is in progress to switch the default and compute the state using only containers requiring exclusive resources.

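To make the idea concrete, the sketch below aggregates an expected node state only from the pods that a predicate (such as the one sketched earlier) marks as relevant; it is an illustration under assumed, simplified signatures, not the `overreserve` cache code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// expectedNodeUsage sums the container requests of the pods that matter for
// per-NUMA accounting. Skipping irrelevant pods means the resulting state
// changes less often, so the cache needs fewer resyncs.
func expectedNodeUsage(pods []*corev1.Pod, relevant func(*corev1.Pod) bool) corev1.ResourceList {
	usage := corev1.ResourceList{}
	for _, pod := range pods {
		if !relevant(pod) {
			continue // e.g. best-effort infra pods: no exclusive resources
		}
		for _, cnt := range pod.Spec.Containers {
			for name, qty := range cnt.Resources.Requests {
				total := usage[name]
				total.Add(qty)
				usage[name] = total
			}
		}
	}
	return usage
}

func main() {
	pod := &corev1.Pod{}
	pod.Spec.Containers = []corev1.Container{{
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("4")},
		},
	}}
	// Trivial predicate for the example: treat every pod as relevant.
	usage := expectedNodeUsage([]*corev1.Pod{pod}, func(*corev1.Pod) bool { return true })
	fmt.Println(usage.Cpu().String()) // 4
}
```
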
## As part of the main scheduler

The NodeResourceTopology code is meant for eventual merge into core kubernetes.
The process is still ongoing and requires some dependencies to be sorted out first.

### Patching the main scheduler

Until the NodeResourceTopology code (and the noderesourcetopology API) becomes part of the core k8s codebase, users
willing to run just one scheduler need to patch their codebase and deploy an updated binary.
While we don't support this flow yet, we're also *not* aware of any specific limitation preventing it.

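One plausible shape for such a patched deployment (shown purely as a sketch; the exact package paths and the plugin factory signature depend on the kubernetes and scheduler-plugins versions in use) is a kube-scheduler `main` that registers the plugin out-of-tree:

```go
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"sigs.k8s.io/scheduler-plugins/pkg/noderesourcetopology"
)

// A kube-scheduler binary that also knows about the NodeResourceTopologyMatch
// plugin. Deployed in place of the stock scheduler, it lets a single scheduler
// serve both regular and NUMA-aware profiles.
func main() {
	command := app.NewSchedulerCommand(
		// Name and New are the plugin's registered name and factory function;
		// check the release you build against for the exact symbols.
		app.WithPlugin(noderesourcetopology.Name, noderesourcetopology.New),
	)
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
```
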
### Handling partial NRT data

A common reason to want to use just one scheduler is to minimize the moving parts and, perhaps even more importantly, to
avoid partitioning the worker node set. In this case, NRT data may be available only for a subset of nodes, namely
the ones meant to run the NUMA-alignment-requiring workload.

By design, the NodeResourceTopology Filter plugin
[allows nodes which don't have NRT data attached](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/filter.go#L212).
This is a prerequisite to make the NodeResourceTopology plugin work at all in cases like this, with partial NRT
availability.

The NodeResourceTopology Score plugin is designed to
[always prefer nodes with NRT data available](https://github.com/kubernetes-sigs/scheduler-plugins/pull/685),
everything else being equal.

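The following is a simplified illustration of these two behaviors under assumed helper signatures, not the actual Filter and Score code:

```go
package main

import "fmt"

// hasNRT records which nodes expose a NodeResourceTopology object.
type hasNRT map[string]bool

// admitNode mirrors the Filter design: a node without NRT data is never
// rejected for NUMA reasons, because there is no per-NUMA data to check.
func admitNode(nrt hasNRT, node string, fitsPerNUMA func(string) bool) bool {
	if !nrt[node] {
		return true
	}
	return fitsPerNUMA(node)
}

// preferNRTNodes mirrors the Score design: everything else being equal,
// nodes that do expose NRT data get a small bonus.
func preferNRTNodes(nrt hasNRT, node string, baseScore int64) int64 {
	if nrt[node] {
		return baseScore + 1
	}
	return baseScore
}

func main() {
	nrt := hasNRT{"numa-worker-0": true, "generic-worker-0": false}
	alwaysFits := func(string) bool { return true }

	for _, node := range []string{"numa-worker-0", "generic-worker-0"} {
		fmt.Println(node, admitNode(nrt, node, alwaysFits), preferNRTNodes(nrt, node, 50))
	}
}
```
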
The combined effect of the Filter and Score design is meant to make the NodeResourceTopology plugin work in this scenario.
However, up until now the recommendation was to partition the worker nodes if possible, so this scenario is less tested.

The reader is advised to test this scenario carefully, while we keep promoting it and adding more tests to ensure adequate coverage.

### Efficient computation of the expected node states

The same considerations as for running as a secondary scheduler apply, because computing the expected node
state is a pillar of the `overreserve` cache functionality.

---

[1]: A pod is "foreign" when it is known to the scheduler only _ex post_, after it was scheduled; the scheduling decision was
made by a different scheduler and this instance needs to deal with that.

pkg/noderesourcetopology/README.md

Lines changed: 3 additions & 1 deletion
@@ -15,6 +15,8 @@ Document capturing the NodeResourceTopology API Custom Resource Definition Stand
 
 ## Tutorial
 
+For more specific notes about how to deploy this plugin, please check [README.deploy.md](README.deploy.md)
+
 ### Expectation
 
 In case the cumulative count of node resource allocatable appear to be the same for both the nodes in the cluster, topology aware scheduler plugin uses the CRD instance corresponding to the nodes to obtain the resource topology information to make a topology-aware scheduling decision.
@@ -86,7 +88,7 @@ The quality of the scheduling decisions of the "NodeResourceTopologyMatch" filte
 When deployed on large clusters, or when facing high pod churn, or both, it's often impractical or impossible to have frequent enough updates, and the scheduler plugins
 may run with stale data, leading to suboptimal scheduling decisions.
 Using the Reserve plugin, the "NodeResourceTopologyMatch" Filter and Score can use a pessimistic overreserving cache which prevents these suboptimal decisions at the cost
-of leaving pods pending longer. This cache is described in detail in [the docs/ directory](docs/).
+of leaving pods pending longer. This cache is described in detail in [the cache/docs/ directory](cache/docs/).
 
 To enable the cache, you need to **both** enable the Reserve plugin and to set the `cacheResyncPeriodSeconds` config options. Values less than 5 seconds are not recommended
 for performance reasons.
