---
title: Control Topology Management Policies on a node
-
reviewers:
- ConnorDoyle
- klueska
- lmdaly
- nolancon
- bg-chun
-
content_type: task
min-kubernetes-server-version: v1.18
weight: 150
@@ -26,7 +24,7 @@ In order to extract the best performance, optimizations related to CPU isolation
device locality are required. However, in Kubernetes, these optimizations are handled by a
disjoint set of components.

- _Topology Manager_ is a Kubelet component that aims to coordinate the set of components that are
+ _Topology Manager_ is a kubelet component that aims to coordinate the set of components that are
responsible for these optimizations.

## {{% heading "prerequisites" %}}
@@ -38,24 +36,24 @@ responsible for these optimizations.
## How topology manager works

Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes make
- resource allocation decisions independently of each other. This can result in undesirable
- allocations on multiple-socketed systems, performance/latency sensitive applications will suffer
- due to these undesirable allocations. Undesirable in this case meaning for example, CPUs and
- devices being allocated from different NUMA Nodes thus, incurring additional latency.
+ resource allocation decisions independently of each other. This can result in undesirable
+ allocations on multiple-socketed systems, and performance/latency sensitive applications will suffer
+ due to these undesirable allocations. Undesirable in this case meaning, for example, CPUs and
+ devices being allocated from different NUMA Nodes, thus incurring additional latency.

- The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet
+ The Topology Manager is a kubelet component, which acts as a source of truth so that other kubelet
components can make topology aligned resource allocation choices.

The Topology Manager provides an interface for components, called *Hint Providers*, to send and
- receive topology information. Topology Manager has a set of node level policies which are
+ receive topology information. The Topology Manager has a set of node level policies which are
explained below.

- The Topology manager receives Topology information from the *Hint Providers* as a bitmask denoting
+ The Topology Manager receives topology information from the *Hint Providers* as a bitmask denoting
NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform
a set of operations on the hints provided and converge on the hint determined by the policy to
- give the optimal result, if an undesirable hint is stored the preferred field for the hint will be
+ give the optimal result. If an undesirable hint is stored, the preferred field for the hint will be
set to false. In the current policies, preferred is the narrowest preferred mask.
- The selected hint is stored as part of the Topology Manager. Depending on the policy configured
+ The selected hint is stored as part of the Topology Manager. Depending on the policy configured,
the pod can be accepted or rejected from the node based on the selected hint.
The hint is then stored in the Topology Manager for use by the *Hint Providers* when making the
resource allocation decisions.
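As a purely illustrative sketch (not a real Kubernetes API object), the information carried by hints and the merged result for one container on a two-NUMA-node machine could be pictured like this:

```yaml
# Illustration only: hypothetical hints gathered for a single container.
cpuManagerHints:
- numaNodes: [0]        # CPUs available on NUMA node 0
  preferred: true
- numaNodes: [0, 1]     # spreading across both nodes is possible but not preferred
  preferred: false
deviceManagerHints:
- numaNodes: [0]        # e.g. the requested NIC is attached to NUMA node 0
  preferred: true
mergedHint:
  numaNodes: [0]        # narrowest affinity that satisfies every provider
  preferred: true
```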
@@ -64,28 +62,28 @@ resource allocation decisions.

The Topology Manager currently:

- - Aligns Pods of all QoS classes.
- - Aligns the requested resources that Hint Provider provides topology hints for.
+ - aligns Pods of all QoS classes.
+ - aligns the requested resources that Hint Provider provides topology hints for.

If these conditions are met, the Topology Manager will align the requested resources.

- In order to customise how this alignment is carried out, the Topology Manager provides two
- distinct knobs: `scope` and `policy`.
+ In order to customize how this alignment is carried out, the Topology Manager provides two
+ distinct options: `scope` and `policy`.

- The `scope` defines the granularity at which you would like resource alignment to be performed
- (e.g. at the `pod` or `container` level). And the `policy` defines the actual strategy used to
- carry out the alignment (e.g. `best-effort`, `restricted`, `single-numa-node`, etc.).
+ The `scope` defines the granularity at which you would like resource alignment to be performed,
+ for example, at the `pod` or `container` level. And the `policy` defines the actual policy used to
+ carry out the alignment, for example, `best-effort`, `restricted`, and `single-numa-node`.

Details on the various `scopes` and `policies` available today can be found below.
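As a minimal sketch of where these two settings live, assuming you configure the kubelet through its configuration file, the relevant `KubeletConfiguration` fields look roughly like this (the values shown are only an example):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Granularity of alignment: "container" (the default) or "pod"
topologyManagerScope: container
# Alignment strategy: "none" (default), "best-effort", "restricted", or "single-numa-node"
topologyManagerPolicy: best-effort
```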

{{< note >}}
To align CPU resources with other requested resources in a Pod spec, the CPU Manager should be
enabled and proper CPU Manager policy should be configured on a Node.
- See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/).
+ See [Control CPU Management Policies on the Node](/docs/tasks/administer-cluster/cpu-management-policies/).
{{< /note >}}

{{< note >}}
To align memory (and hugepages) resources with other requested resources in a Pod spec, the Memory
- Manager should be enabled and proper Memory Manager policy should be configured on a Node. Examine
+ Manager should be enabled and proper Memory Manager policy should be configured on a Node. Refer to
[Memory Manager](/docs/tasks/administer-cluster/memory-manager/) documentation.
{{< /note >}}
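For illustration, a kubelet configuration fragment that enables both managers alongside the Topology Manager might look like the sketch below; note that the `static` CPU Manager policy and the `Static` Memory Manager policy also need CPU and memory reservations that are omitted here (see the pages linked above):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # requires a CPU reservation, e.g. reservedSystemCPUs (not shown)
memoryManagerPolicy: Static     # requires reservedMemory settings (not shown)
topologyManagerPolicy: single-numa-node
```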
@@ -116,7 +114,8 @@ scope, for example the `pod` scope.

### `pod` scope

- To select the `pod` scope, set `topologyManagerScope` in the [kubelet configuration file](/docs/tasks/administer-cluster/kubelet-config-file/) to `pod`.`
+ To select the `pod` scope, set `topologyManagerScope` in the
+ [kubelet configuration file](/docs/tasks/administer-cluster/kubelet-config-file/) to `pod`.
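A minimal sketch of that setting, repeating the kubelet configuration fragment shown earlier with the scope switched to `pod`:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerScope: pod
```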

This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the
Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers)
@@ -127,8 +126,8 @@ alignments produced by the Topology Manager on different occasions:
* all containers can be and are allocated to a shared set of NUMA nodes.

The total amount of particular resource demanded for the entire pod is calculated according to
- [effective requests/limits](/docs/concepts/workloads/pods/init-containers/#resources) formula, and
- thus, this total value is equal to the maximum of:
+ [effective requests/limits](/docs/concepts/workloads/pods/init-containers/#resource-sharing-within-containers)
+ formula, and thus, this total value is equal to the maximum of:

* the sum of all app container requests,
* the maximum of init container requests,
@@ -147,12 +146,12 @@ is present among possible allocations. Reconsider the example above:
* whereas a set containing more NUMA nodes - it results in pod rejection (because instead of one
NUMA node, two or more NUMA nodes are required to satisfy the allocation).

- To recap, Topology Manager first computes a set of NUMA nodes and then tests it against Topology
+ To recap, the Topology Manager first computes a set of NUMA nodes and then tests it against the Topology
Manager policy, which either leads to the rejection or admission of the pod.

## Topology manager policies

- Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag,
+ The Topology Manager supports four allocation policies. You can set a policy via a kubelet flag,
`--topology-manager-policy`. There are four supported policies:

* `none` (default)
@@ -161,7 +160,7 @@ Topology Manager supports four allocation policies. You can set a policy via a K
* `single-numa-node`

{{< note >}}
- If Topology Manager is configured with the **pod** scope, the container, which is considered by
+ If the Topology Manager is configured with the **pod** scope, the container, which is considered by
the policy, is reflecting requirements of the entire pod, and thus each container from the pod
will result with **the same** topology alignment decision.
{{< /note >}}
@@ -175,21 +174,21 @@ This is the default policy and does not perform any topology alignment.
For each container in a Pod, the kubelet, with `best-effort` topology management policy, calls
each Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
- preferred, Topology Manager will store this and admit the pod to the node anyway.
+ preferred, the Topology Manager will store this and admit the pod to the node anyway.

The *Hint Providers* can then use this information when making the
resource allocation decision.

### `restricted` policy {#policy-restricted}

For each container in a Pod, the kubelet, with `restricted` topology management policy, calls each
- Hint Provider to discover their resource availability. Using this information, the Topology
+ Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
- preferred, Topology Manager will reject this pod from the node. This will result in a pod in a
+ preferred, the Topology Manager will reject this pod from the node. This will result in a pod entering a
`Terminated` state with a pod admission failure.

Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to
- reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of
+ reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeployment of
the pod. An external control loop could be also implemented to trigger a redeployment of pods that
have the `Topology Affinity` error.
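Since the paragraph above recommends a Deployment so that pods rejected with a `Topology Affinity` error are replaced, a minimal sketch could look like the following; the name, image, and resource values are placeholders, not part of the original page:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: numa-aligned-app            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: numa-aligned-app
  template:
    metadata:
      labels:
        app: numa-aligned-app
    spec:
      containers:
      - name: app
        image: registry.example/app:1.0   # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 2Gi
```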
@@ -199,16 +198,16 @@ resource allocation decision.
### `single-numa-node` policy {#policy-single-numa-node}

For each container in a Pod, the kubelet, with `single-numa-node` topology management policy,
- calls each Hint Provider to discover their resource availability. Using this information, the
- Topology Manager determines if a single NUMA Node affinity is possible. If it is, Topology
+ calls each Hint Provider to discover their resource availability. Using this information, the
+ Topology Manager determines if a single NUMA Node affinity is possible. If it is, Topology
Manager will store this and the *Hint Providers* can then use this information when making the
- resource allocation decision. If, however, this is not possible then the Topology Manager will
+ resource allocation decision. If, however, this is not possible then the Topology Manager will
reject the pod from the node. This will result in a pod in a `Terminated` state with a pod
admission failure.

Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to
- reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of
- the Pod.An external control loop could be also implemented to trigger a redeployment of pods
+ reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeployment of
+ the Pod. An external control loop could be also implemented to trigger a redeployment of pods
that have the `Topology Affinity` error.

## Topology manager policy options
@@ -218,6 +217,7 @@ Support for the Topology Manager policy options requires `TopologyManagerPolicyO
(it is enabled by default).

You can toggle groups of options on and off based upon their maturity level using the following feature gates:
+
* `TopologyManagerPolicyBetaOptions` default enabled. Enable to show beta-level options.
* `TopologyManagerPolicyAlphaOptions` default disabled. Enable to show alpha-level options.
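For illustration, these gates can be set through the `featureGates` field of the kubelet configuration file; whether you need to change them at all depends on your Kubernetes version and which options you want to use:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManagerPolicyOptions: true       # enabled by default
  TopologyManagerPolicyBetaOptions: true   # enabled by default
  TopologyManagerPolicyAlphaOptions: true  # disabled by default; needed for alpha-level options
```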
@@ -230,42 +230,42 @@ this policy option is visible by default provided that the `TopologyManagerPolic
`TopologyManagerPolicyBetaOptions` [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
are enabled.

- The topology manager is not aware by default of NUMA distances, and does not take them into account when making
+ The Topology Manager is not aware by default of NUMA distances, and does not take them into account when making
Pod admission decisions. This limitation surfaces in multi-socket, as well as single-socket multi NUMA systems,
and can cause significant performance degradation in latency-critical execution and high-throughput applications
- if the topology manager decides to align resources on non-adjacent NUMA nodes.
+ if the Topology Manager decides to align resources on non-adjacent NUMA nodes.

If you specify the `prefer-closest-numa-nodes` policy option, the `best-effort` and `restricted`
policies favor sets of NUMA nodes with shorter distance between them when making admission decisions.

You can enable this option by adding `prefer-closest-numa-nodes=true` to the Topology Manager policy options.
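In the kubelet configuration file this corresponds roughly to the following sketch; the `topologyManagerPolicyOptions` field takes string values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: best-effort
topologyManagerPolicyOptions:
  prefer-closest-numa-nodes: "true"
```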

- By default (without this option), Topology Manager aligns resources on either a single NUMA node or,
+ By default (without this option), the Topology Manager aligns resources on either a single NUMA node or,
in the case where more than one NUMA node is required, using the minimum number of NUMA nodes.

### `max-allowable-numa-nodes` (beta) {#policy-option-max-allowable-numa-nodes}

- The `max-allowable-numa-nodes` option is beta since Kubernetes 1.31. In Kubernetes {{< skew currentVersion >}}
+ The `max-allowable-numa-nodes` option is beta since Kubernetes 1.31. In Kubernetes {{< skew currentVersion >}},
this policy option is visible by default provided that the `TopologyManagerPolicyOptions` and
`TopologyManagerPolicyBetaOptions` [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
are enabled.

The time to admit a pod is tied to the number of NUMA nodes on the physical machine.
- By default, Kubernetes does not run a kubelet with the topology manager enabled, on any (Kubernetes) node where
+ By default, Kubernetes does not run a kubelet with the Topology Manager enabled, on any (Kubernetes) node where
more than 8 NUMA nodes are detected.

{{< note >}}
- If you select the the `max-allowable-numa-nodes` policy option, nodes with more than 8 NUMA nodes can
- be allowed to run with the topology manager enabled. The Kubernetes project only has limited data on the impact
- of using the topology manager on (Kubernetes) nodes with more than 8 NUMA nodes. Because of that
+ If you select the `max-allowable-numa-nodes` policy option, nodes with more than 8 NUMA nodes can
+ be allowed to run with the Topology Manager enabled. The Kubernetes project only has limited data on the impact
+ of using the Topology Manager on (Kubernetes) nodes with more than 8 NUMA nodes. Because of that
lack of data, using this policy option with Kubernetes {{< skew currentVersion >}} is **not** recommended and is
at your own risk.
{{< /note >}}

You can enable this option by adding `max-allowable-numa-nodes=true` to the Topology Manager policy options.

Setting a value of `max-allowable-numa-nodes` does not (in and of itself) affect the
- latency of pod admission, but binding a Pod to a (Kubernetes) node with many NUMA does does have an impact.
+ latency of pod admission, but binding a Pod to a (Kubernetes) node with many NUMA does have an impact.
Future, potential improvements to Kubernetes may improve Pod admission performance and the high
latency that happens as the number of NUMA nodes increases.
@@ -296,10 +296,10 @@ spec:

This pod runs in the `Burstable` QoS class because requests are less than limits.

- If the selected policy is anything other than `none`, Topology Manager would consider these Pod
+ If the selected policy is anything other than `none`, the Topology Manager would consider these Pod
specifications. The Topology Manager would consult the Hint Providers to get topology hints.
In the case of the `static`, the CPU Manager policy would return default topology hint, because
- these Pods do not have explicitly request CPU resources.
+ these Pods do not explicitly request CPU resources.

```yaml
spec:
@@ -320,7 +320,6 @@ spec:
This pod with integer CPU request runs in the `Guaranteed` QoS class because `requests` are equal
to `limits`.

-
```yaml
spec:
  containers:
@@ -380,10 +379,10 @@ assignments.

## Known limitations

- 1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes
+ 1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes,
there will be a state explosion when trying to enumerate the possible NUMA affinities and
generating their hints. See [`max-allowable-numa-nodes`](#policy-option-max-allowable-numa-nodes)
(beta) for more options.

- 2. The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail
- on the node due to the Topology Manager.
+ 1. The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail
+ on the node due to the Topology Manager.