Commit 4a4a40f

TELCODOCS-374: Updates for CNF-3107 NUMA-aware scheduling

1 parent a698aed commit 4a4a40f

13 files changed: +1149 −0 lines

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions

@@ -2270,6 +2270,8 @@ Topics:
       Distros: openshift-origin,openshift-enterprise
     - Name: Using Topology Manager
       File: using-topology-manager
+    - Name: Scheduling NUMA-aware workloads
+      File: cnf-numa-aware-scheduling
       Distros: openshift-origin,openshift-enterprise
     - Name: Scaling the Cluster Monitoring Operator
       File: scaling-cluster-monitoring-operator
(New image file, 87.8 KB)
Lines changed: 23 additions & 0 deletions

// Module included in the following assemblies:
//
// *scalability_and_performance/cnf-numa-aware-scheduling.adoc

:_content-type: CONCEPT
[id="cnf-about-numa-aware-scheduling_{context}"]
= About NUMA-aware scheduling

Non-Uniform Memory Access (NUMA) is a compute platform architecture that allows different CPUs to access different regions of memory at different speeds. NUMA resource topology refers to the locations of CPUs, memory, and PCI devices relative to each other in the compute node. Co-located resources are said to be in the same _NUMA zone_. For high-performance applications, the cluster needs to process pod workloads in a single NUMA zone.

NUMA architecture allows a CPU with multiple memory controllers to use any available memory across CPU complexes, regardless of where the memory is located. This allows for increased flexibility at the expense of performance. A workload that uses memory outside its CPU's NUMA zone runs slower than one processed entirely in a single NUMA zone. Similarly, for I/O-constrained workloads, a network interface in a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, such as telecommunications workloads, cannot operate to specification under these conditions. NUMA-aware scheduling aligns the requested cluster compute resources (CPUs, memory, devices) in the same NUMA zone to process latency-sensitive or high-performance workloads efficiently. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.
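The zone layout of a particular host can be inspected directly from Linux sysfs, without extra tooling. The following is a minimal sketch that lists each NUMA zone and the CPUs attached to it; run it on the node itself, for example in an `oc debug node/<node-name>` shell:

[source,terminal]
----
# Print each NUMA zone and the CPU list that belongs to it (Linux sysfs).
for zone in /sys/devices/system/node/node*; do
  echo "$(basename "$zone"): CPUs $(cat "$zone/cpulist")"
done
----

On a single-socket host this typically prints a single `node0` line; hosts with multiple sockets or sub-NUMA clustering print one line per zone.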

The scheduling logic of the default {product-title} pod scheduler considers the available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to degraded or unpredictable performance. For example, runaway pod creation with `Topology Affinity Error` statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads because it does not know whether the pod's requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.

The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate the shortcomings of the default {product-title} pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.

.NUMA-aware scheduling overview
image::216_OpenShift_Topology-aware_Scheduling_0222.png[Diagram of NUMA-aware scheduling that shows how the various components interact with each other in the cluster]

NodeResourceTopology API:: The `NodeResourceTopology` API describes the available NUMA zone resources in each compute node.
NUMA-aware scheduler:: The NUMA-aware secondary scheduler receives information about the available NUMA zones from the `NodeResourceTopology` API and schedules high-performance workloads on a node where they can be optimally processed.
Node topology exporter:: The node topology exporter exposes the available NUMA zone resources for each compute node to the `NodeResourceTopology` API. The node topology exporter daemon tracks the resource allocation from the kubelet by using the `PodResources` API.
PodResources API:: The `PodResources` API is local to each node and exposes the resource topology and available resources to the kubelet.
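As a rough sketch of what the `NodeResourceTopology` API exposes, the following fragment follows the field names of the upstream noderesourcetopology API; the object name, zone name, and all values are invented for illustration and may differ by release:

[source,yaml]
----
apiVersion: topology.node.k8s.io/v1alpha1
kind: NodeResourceTopology
metadata:
  name: worker-0                       # one object per compute node (illustrative)
topologyPolicies:
  - SingleNUMANodeContainerLevel
zones:
  - name: node-0                       # a NUMA zone on the node (illustrative)
    type: Node
    resources:
      - name: cpu                      # per-zone accounting (illustrative values)
        capacity: "40"
        allocatable: "38"
        available: "30"
----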
Lines changed: 146 additions & 0 deletions

// Module included in the following assemblies:
//
// *scalability_and_performance/cnf-numa-aware-scheduling.adoc

:_module-type: PROCEDURE
[id="cnf-checking-numa-aware-scheduler-logs_{context}"]
= Checking the NUMA-aware scheduler logs

Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the `spec.logLevel` field of the `NUMAResourcesScheduler` resource. Acceptable values are `Normal`, `Debug`, and `Trace`, with `Trace` being the most verbose option.
10+
11+
[NOTE]
12+
====
13+
To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.
14+
====
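For reference, the log level is a single field in the `NUMAResourcesScheduler` spec; the most verbose setting is expressed as the following fragment (only the field in question is shown):

[source,yaml]
----
spec:
  logLevel: Trace
----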

.Prerequisites

* Install the OpenShift CLI (`oc`).
* Log in as a user with `cluster-admin` privileges.

.Procedure

. Delete the currently running `NUMAResourcesScheduler` resource:

.. Get the active `NUMAResourcesScheduler` by running the following command:
+
[source,terminal]
----
$ oc get NUMAResourcesScheduler
----
+
.Example output
[source,terminal]
----
NAME                     AGE
numaresourcesscheduler   90m
----

.. Delete the secondary scheduler resource by running the following command:
+
[source,terminal]
----
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
----
+
.Example output
[source,terminal]
----
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
----
. Save the following YAML in the file `nro-scheduler-debug.yaml`. This example changes the log level to `Debug`:
+
[source,yaml]
----
apiVersion: nodetopology.openshift.io/v1alpha1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.10"
  logLevel: Debug
----

. Create the updated `Debug` logging `NUMAResourcesScheduler` resource by running the following command:
+
[source,terminal]
----
$ oc create -f nro-scheduler-debug.yaml
----
+
.Example output
[source,terminal]
----
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
----
.Verification steps

. Check that the NUMA-aware scheduler was successfully deployed:

.. Run the following command to check that the CRD is created successfully:
+
[source,terminal]
----
$ oc get crd | grep numaresourcesschedulers
----
+
.Example output
[source,terminal]
----
NAME                                                CREATED AT
numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
----

.. Check that the new custom scheduler is available by running the following command:
+
[source,terminal]
----
$ oc get numaresourcesschedulers.nodetopology.openshift.io
----
+
.Example output
[source,terminal]
----
NAME                     AGE
numaresourcesscheduler   3h26m
----

. Check that the logs for the scheduler show the increased log level:

.. Get the list of pods running in the `openshift-numaresources` namespace by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-numaresources
----
+
.Example output
[source,terminal]
----
NAME                                               READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
----

.. Get the logs for the secondary scheduler pod by running the following command:
+
[source,terminal]
----
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
----
+
.Example output
[source,terminal]
----
...
I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
----

modules/cnf-creating-nrop-cr.adoc

Lines changed: 88 additions & 0 deletions
// Module included in the following assemblies:
//
// *scalability_and_performance/cnf-numa-aware-scheduling.adoc

:_module-type: PROCEDURE
[id="cnf-creating-nrop-cr_{context}"]
= Creating the NUMAResourcesOperator custom resource

After you have installed the NUMA Resources Operator, create the `NUMAResourcesOperator` custom resource (CR) that instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.

.Prerequisites

* Install the OpenShift CLI (`oc`).
* Log in as a user with `cluster-admin` privileges.
* Install the NUMA Resources Operator.

.Procedure
. Create the `MachineConfigPool` custom resource that enables custom kubelet configurations for worker nodes:

.. Save the following YAML in the `nro-machineconfig.yaml` file:
+
[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  labels:
    cnf-worker-tuning: enabled
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
----

.. Create the `MachineConfigPool` CR by running the following command:
+
[source,terminal]
----
$ oc create -f nro-machineconfig.yaml
----
. Create the `NUMAResourcesOperator` custom resource:
50+
51+
.. Save the following YAML in the `nrop.yaml` file:
52+
+
53+
[source,yaml]
54+
----
55+
apiVersion: nodetopology.openshift.io/v1alpha1
56+
kind: NUMAResourcesOperator
57+
metadata:
58+
name: numaresourcesoperator
59+
spec:
60+
nodeGroups:
61+
- machineConfigPoolSelector:
62+
matchLabels:
63+
pools.operator.machineconfiguration.openshift.io/worker: "" <1>
64+
----
65+
<1> Should match the label applied to worker nodes in the related `MachineConfigPool` CR.
66+
67+
.. Create the `NUMAResourcesOperator` CR by running the following command:
68+
+
69+
[source,terminal]
70+
----
71+
$ oc create -f nrop.yaml
72+
----
73+
74+
.Verification
75+
76+
Verify that the NUMA Resources Operator deployed successfully by running the following command:
77+
78+
[source,terminal]
79+
----
80+
$ oc get numaresourcesoperators.nodetopology.openshift.io
81+
----
82+
83+
.Example output
84+
[source,terminal]
85+
----
86+
NAME AGE
87+
numaresourcesoperator 10m
88+
----
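Once the Operator's daemon sets are running, the node topology exporter starts publishing per-node `NodeResourceTopology` objects. As an additional, hedged check, you can list those objects; the API group shown here is the upstream `topology.node.k8s.io` group and may vary by release:

[source,terminal]
----
$ oc get noderesourcetopologies.topology.node.k8s.io
----

One object per worker node covered by the configured node group is expected.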
Lines changed: 112 additions & 0 deletions

// Module included in the following assemblies:
//
// *scalability_and_performance/cnf-numa-aware-scheduling.adoc

:_module-type: PROCEDURE
[id="cnf-deploying-the-numa-aware-scheduler_{context}"]
= Deploying the NUMA-aware secondary pod scheduler

After you install the NUMA Resources Operator, do the following to deploy the NUMA-aware secondary pod scheduler:

* Configure the pod admittance policy for the required machine profile
* Create the required machine config pool
* Deploy the NUMA-aware secondary scheduler

.Prerequisites

* Install the OpenShift CLI (`oc`).
* Log in as a user with `cluster-admin` privileges.
* Install the NUMA Resources Operator.

.Procedure
. Create the `KubeletConfig` custom resource that configures the pod admittance policy for the machine profile:

.. Save the following YAML in the `nro-kubeletconfig.yaml` file:
+
[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cnf-worker-tuning
spec:
  machineConfigPoolSelector:
    matchLabels:
      cnf-worker-tuning: enabled
  kubeletConfig:
    cpuManagerPolicy: "static"
    cpuManagerReconcilePeriod: "5s"
    reservedSystemCPUs: "0,1"
    memoryManagerPolicy: "Static"
    evictionHard:
      memory.available: "100Mi"
    kubeReserved:
      memory: "512Mi"
    reservedMemory:
      - numaNode: 0
        limits:
          memory: "1124Mi"
    systemReserved:
      memory: "512Mi"
    topologyManagerPolicy: "single-numa-node" <1>
----
<1> `topologyManagerPolicy` must be set to `single-numa-node`.

.. Create the `KubeletConfig` custom resource (CR) by running the following command:
+
[source,terminal]
----
$ oc create -f nro-kubeletconfig.yaml
----
. Create the `NUMAResourcesScheduler` custom resource that deploys the NUMA-aware custom pod scheduler:

.. Save the following YAML in the `nro-scheduler.yaml` file:
+
[source,yaml]
----
apiVersion: nodetopology.openshift.io/v1alpha1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.10"
----

.. Create the `NUMAResourcesScheduler` CR by running the following command:
+
[source,terminal]
----
$ oc create -f nro-scheduler.yaml
----
.Verification

Verify that the required resources deployed successfully by running the following command:

[source,terminal]
----
$ oc get all -n openshift-numaresources
----

.Example output
[source,terminal]
----
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          13m
pod/numaresourcesoperator-worker-dvj4n                  2/2     Running   0          16m
pod/numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16m
pod/secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          16m

NAME                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/numaresourcesoperator-worker   2         2         2       2            2           node-role.kubernetes.io/worker=   16m

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/numaresources-controller-manager   1/1     1            1           13m
deployment.apps/secondary-scheduler                1/1     1            1           16m

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/numaresources-controller-manager-7575848485   1         1         1       13m
replicaset.apps/secondary-scheduler-56994cf6cf                1         1         1       16m
----
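The kubelet configuration above sets `topologyManagerPolicy: single-numa-node`, which performs NUMA alignment for pods in the guaranteed QoS class. The following pod spec is a minimal sketch of a workload that both requests the secondary scheduler and qualifies as guaranteed QoS because its requests equal its limits. The pod name and image are illustrative, and `topo-aware-scheduler` is an assumption; confirm the deployed scheduler name in the `NUMAResourcesScheduler` status:

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: numa-aligned-app                  # illustrative name
spec:
  schedulerName: topo-aware-scheduler     # assumed scheduler name; verify on your cluster
  containers:
  - name: app
    image: registry.example.com/example/numa-app:latest  # illustrative image
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
      limits:                             # equal to requests, so the pod is guaranteed QoS
        cpu: "4"
        memory: 2Gi
----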
