
Commit 829dee0

Author: Tim Bannister (committed)
Wrap text for Pod Topology Spread Constraints
Wrapping helps localization teams pick up and work with changes.
1 parent 311cdc3 commit 829dee0

1 file changed, +87 -27 lines changed


content/en/docs/concepts/scheduling-eviction/topology-spread-constraints.md

Lines changed: 87 additions & 27 deletions
@@ -7,7 +7,11 @@ weight: 40
 
 <!-- overview -->
 
-You can use _topology spread constraints_ to control how {{< glossary_tooltip text="Pods" term_id="Pod" >}} are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
+You can use _topology spread constraints_ to control how
+{{< glossary_tooltip text="Pods" term_id="Pod" >}} are spread across your cluster
+among failure-domains such as regions, zones, nodes, and other user-defined topology
+domains. This can help to achieve high availability as well as efficient resource
+utilization.
 
 
 <!-- body -->
@@ -16,7 +20,9 @@ You can use _topology spread constraints_ to control how {{< glossary_tooltip te
 
 ### Node Labels
 
-Topology spread constraints rely on node labels to identify the topology domain(s) that each Node is in. For example, a Node might have labels: `node=node1,zone=us-east-1a,region=us-east-1`
+Topology spread constraints rely on node labels to identify the topology
+domain(s) that each Node is in. For example, a Node might have labels:
+`node=node1,zone=us-east-1a,region=us-east-1`
 
 Suppose you have a 4-node cluster with the following labels:
 
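To make that concrete, here is a minimal sketch of how the example labels above might appear on a Node object; the node name and label values simply mirror the example, and on a real cluster the kubelet and cloud provider populate many labels automatically:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    node: node1          # custom per-node label used by the examples on this page
    zone: us-east-1a     # topology domain: zone
    region: us-east-1    # topology domain: region
```

In practice you would usually attach such labels with `kubectl label nodes node1 zone=us-east-1a region=us-east-1` rather than editing the Node object directly.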
@@ -48,7 +54,9 @@ graph TB
 class zoneA,zoneB cluster;
 {{< /mermaid >}}
 
-Instead of manually applying labels, you can also reuse the [well-known labels](/docs/reference/labels-annotations-taints/) that are created and populated automatically on most clusters.
+Instead of manually applying labels, you can also reuse the
+[well-known labels](/docs/reference/labels-annotations-taints/) that are created and populated
+automatically on most clusters.
 
 ## Spread Constraints for Pods
 
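As a sketch of that option, assuming a cluster where the standard topology labels are already populated, a constraint can reference one of the well-known label keys instead of a hand-applied `zone` label (this is a fragment of a Pod's `spec`, and the `foo: bar` selector is just the example label used later on this page):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # well-known zone label, set automatically on most clusters
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
```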
@@ -70,7 +78,9 @@ spec:
 labelSelector: <object>
 ```
 
-You can define one or multiple `topologySpreadConstraint` to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. The fields are:
+You can define one or multiple `topologySpreadConstraint` to instruct the
+kube-scheduler how to place each incoming Pod in relation to the existing Pods across
+your cluster. The fields are:
 
 - **maxSkew** describes the degree to which Pods may be unevenly distributed.
 It must be greater than zero. Its semantics differs according to the value of `whenUnsatisfiable`:
@@ -104,15 +114,24 @@ You can define one or multiple `topologySpreadConstraint` to instruct the kube-s
 in order to use it.
 {{< /note >}}
 
-- **topologyKey** is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
+- **topologyKey** is the key of node labels. If two Nodes are labelled with this key
+and have identical values for that label, the scheduler treats both Nodes as being
+in the same topology. The scheduler tries to place a balanced number of Pods into
+each topology domain.
 
 - **whenUnsatisfiable** indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
 - `DoNotSchedule` (default) tells the scheduler not to schedule it.
 - `ScheduleAnyway` tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
 
-- **labelSelector** is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain. See [Label Selectors](/docs/concepts/overview/working-with-objects/labels/#label-selectors) for more details.
+- **labelSelector** is used to find matching Pods. Pods
+that match this label selector are counted to determine the
+number of Pods in their corresponding topology domain.
+See [Label Selectors](/docs/concepts/overview/working-with-objects/labels/#label-selectors)
+for more details.
 
-When a Pod defines more than one `topologySpreadConstraint`, those constraints are ANDed: The kube-scheduler looks for a node for the incoming Pod that satisfies all the constraints.
+When a Pod defines more than one `topologySpreadConstraint`, those constraints are
+ANDed: The kube-scheduler looks for a node for the incoming Pod that satisfies all
+the constraints.
 
 You can read more about this field by running `kubectl explain Pod.spec.topologySpreadConstraints`.
 
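Putting the four fields together, a complete Pod manifest could look roughly like the sketch below. The Pod name, the `foo: bar` label, and the pause container are illustrative stand-ins; the `{{< codenew >}}` files referenced on this page remain the canonical examples.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    foo: bar                            # matched by the labelSelector below
spec:
  topologySpreadConstraints:
    - maxSkew: 1                        # tolerated difference in matching Pods between domains
      topologyKey: zone                 # node label key that defines a topology domain
      whenUnsatisfiable: DoNotSchedule  # keep the Pod pending rather than break the skew
      labelSelector:
        matchLabels:
          foo: bar                      # which existing Pods are counted per domain
  containers:
    - name: pause
      image: k8s.gcr.io/pause:3.1       # placeholder container; any image works
```

With `DoNotSchedule` this is a hard constraint; the examples further down show the softer `ScheduleAnyway` variant.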
@@ -142,9 +161,14 @@ If we want an incoming Pod to be evenly spread with existing Pods across zones,
 
 {{< codenew file="pods/topology-spread-constraints/one-constraint.yaml" >}}
 
-`topologyKey: zone` implies the even distribution will only be applied to the nodes which have label pair "zone:&lt;any value&gt;" present. `whenUnsatisfiable: DoNotSchedule` tells the scheduler to let it stay pending if the incoming Pod can't satisfy the constraint.
+`topologyKey: zone` implies the even distribution will only be applied to the
+nodes which have label pair "zone:&lt;any value&gt;" present. `whenUnsatisfiable:
+DoNotSchedule` tells the scheduler to let it stay pending if the incoming Pod can't
+satisfy the constraint.
 
-If the scheduler placed this incoming Pod into "zoneA", the Pods distribution would become [3, 1], hence the actual skew is 2 (3 - 1) - which violates `maxSkew: 1`. In this example, the incoming Pod can only be placed into "zoneB":
+If the scheduler placed this incoming Pod into "zoneA", the Pods distribution would
+become [3, 1], hence the actual skew is 2 (3 - 1) - which violates `maxSkew: 1`. In
+this example, the incoming Pod can only be placed into "zoneB":
 
 {{<mermaid>}}
 graph BT
@@ -189,13 +213,21 @@ graph BT
 
 You can tweak the Pod spec to meet various kinds of requirements:
 
-- Change `maxSkew` to a bigger value like "2" so that the incoming Pod can be placed into "zoneA" as well.
-- Change `topologyKey` to "node" so as to distribute the Pods evenly across nodes instead of zones. In the above example, if `maxSkew` remains "1", the incoming Pod can only be placed onto "node4".
-- Change `whenUnsatisfiable: DoNotSchedule` to `whenUnsatisfiable: ScheduleAnyway` to ensure that the incoming Pod is always schedulable (suppose other scheduling APIs are satisfied). However, it's preferred to be placed onto the topology domain which has fewer matching Pods. (Be aware that this preferability is jointly normalized with other internal scheduling priorities like resource usage ratio, etc.)
+- Change `maxSkew` to a bigger value like "2" so that the incoming Pod can be placed
+into "zoneA" as well.
+- Change `topologyKey` to "node" so as to distribute the Pods evenly across nodes
+instead of zones. In the above example, if `maxSkew` remains "1", the incoming
+Pod can only be placed onto "node4".
+- Change `whenUnsatisfiable: DoNotSchedule` to `whenUnsatisfiable: ScheduleAnyway`
+to ensure that the incoming Pod is always schedulable (suppose other scheduling APIs
+are satisfied). However, it's preferred to be placed into the topology domain which
+has fewer matching Pods. (Be aware that this preferability is jointly normalized
+with other internal scheduling priorities like resource usage ratio, etc.)
 
 ### Example: Multiple TopologySpreadConstraints
 
-This builds upon the previous example. Suppose you have a 4-node cluster where 3 Pods labeled `foo:bar` are located in node1, node2 and node3 respectively:
+This builds upon the previous example. Suppose you have a 4-node cluster where 3
+Pods labeled `foo:bar` are located in node1, node2 and node3 respectively:
 
 {{<mermaid>}}
 graph BT
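The third tweak in the list above, switching to a soft constraint, would look roughly like this fragment of a Pod's `spec` (the `foo: bar` selector is again the assumed example label):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: ScheduleAnyway   # prefer, but do not require, keeping the skew within maxSkew
    labelSelector:
      matchLabels:
        foo: bar
```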
@@ -220,7 +252,10 @@ You can use 2 TopologySpreadConstraints to control the Pods spreading on both zo
 
 {{< codenew file="pods/topology-spread-constraints/two-constraints.yaml" >}}
 
-In this case, to match the first constraint, the incoming Pod can only be placed into "zoneB"; while in terms of the second constraint, the incoming Pod can only be placed onto "node4". Then the results of 2 constraints are ANDed, so the only viable option is to place on "node4".
+In this case, to match the first constraint, the incoming Pod can only be placed into
+"zoneB"; while in terms of the second constraint, the incoming Pod can only be placed
+onto "node4". Then the results of 2 constraints are ANDed, so the only viable option
+is to place on "node4".
 
 Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:
 
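A hedged sketch of such a two-constraint spec, mirroring the scenario above rather than reproducing the referenced two-constraints.yaml verbatim, could be:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone               # first constraint: spread across zones
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
    - maxSkew: 1
      topologyKey: node               # second constraint: spread across individual nodes
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
  containers:
    - name: pause
      image: k8s.gcr.io/pause:3.1     # placeholder container
```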
@@ -243,13 +278,18 @@ graph BT
 class zoneA,zoneB cluster;
 {{< /mermaid >}}
 
-If you apply "two-constraints.yaml" to this cluster, you will notice "mypod" stays in `Pending` state. This is because: to satisfy the first constraint, "mypod" can only be placed into "zoneB"; while in terms of the second constraint, "mypod" can only be placed onto "node2". Then a joint result of "zoneB" and "node2" returns nothing.
+If you apply "two-constraints.yaml" to this cluster, you will notice "mypod" stays in
+`Pending` state. This is because: to satisfy the first constraint, "mypod" can only be placed
+into "zoneB"; while in terms of the second constraint, "mypod" can only be placed onto
+"node2". Then a joint result of "zoneB" and "node2" returns nothing.
 
-To overcome this situation, you can either increase the `maxSkew` or modify one of the constraints to use `whenUnsatisfiable: ScheduleAnyway`.
+To overcome this situation, you can either increase the `maxSkew` or modify one of
+the constraints to use `whenUnsatisfiable: ScheduleAnyway`.
 
 ### Interaction With Node Affinity and Node Selectors
 
-The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has `spec.nodeSelector` or `spec.affinity.nodeAffinity` defined.
+The scheduler will skip the non-matching nodes from the skew calculations if the
+incoming Pod has `spec.nodeSelector` or `spec.affinity.nodeAffinity` defined.
 
 ### Example: TopologySpreadConstraints with NodeAffinity
 
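For the conflicting 3-node cluster described in the hunk above, the second remedy (relaxing one constraint to `whenUnsatisfiable: ScheduleAnyway`) could be sketched as the following `spec` fragment; only the per-node constraint changes:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule    # zone spreading stays a hard requirement
    labelSelector:
      matchLabels:
        foo: bar
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: ScheduleAnyway   # node spreading becomes a preference, so the joint result is no longer empty
    labelSelector:
      matchLabels:
        foo: bar
```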
@@ -287,11 +327,17 @@ class n5 k8s;
 class zoneC cluster;
 {{< /mermaid >}}
 
-and you know that "zoneC" must be excluded. In this case, you can compose the yaml as below, so that "mypod" will be placed into "zoneB" instead of "zoneC". Similarly `spec.nodeSelector` is also respected.
+and you know that "zoneC" must be excluded. In this case, you can compose the yaml
+as below, so that "mypod" will be placed into "zoneB" instead of "zoneC".
+Similarly `spec.nodeSelector` is also respected.
 
 {{< codenew file="pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml" >}}
 
-The scheduler doesn't have prior knowledge of all the zones or other topology domains that a cluster has. They are determined from the existing nodes in the cluster. This could lead to a problem in autoscaled clusters, when a node pool (or node group) is scaled to zero nodes and the user is expecting them to scale up, because, in this case, those topology domains won't be considered until there is at least one node in them.
+The scheduler doesn't have prior knowledge of all the zones or other topology domains
+that a cluster has. They are determined from the existing nodes in the cluster. This
+could lead to a problem in autoscaled clusters, when a node pool (or node group) is
+scaled to zero nodes and the user is expecting them to scale up, because, in this case,
+those topology domains won't be considered until there is at least one node in them.
 
 ### Other Noticeable Semantics
 
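A hedged sketch of that composition, combining one spread constraint with `nodeAffinity` so that "zoneC" is excluded, might look as follows; it mirrors the shape of the referenced one-constraint-with-nodeaffinity.yaml rather than quoting it, and the names and labels are the assumed examples from this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          foo: bar
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: NotIn
                values:
                  - zoneC             # nodes in zoneC are neither candidates nor counted for skew
  containers:
    - name: pause
      image: k8s.gcr.io/pause:3.1     # placeholder container
```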
@@ -301,10 +347,21 @@ There are some implicit conventions worth noting here:
 
 - The scheduler will bypass the nodes without `topologySpreadConstraints[*].topologyKey` present. This implies that:
 
-1. the Pods located on those nodes do not impact `maxSkew` calculation - in the above example, suppose "node1" does not have label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into "zoneA".
-2. the incoming Pod has no chances to be scheduled onto such nodes - in the above example, suppose a "node5" carrying label `{zone-typo: zoneC}` joins the cluster, it will be bypassed due to the absence of label key "zone".
-
-- Be aware of what will happen if the incoming Pod's `topologySpreadConstraints[*].labelSelector` doesn't match its own labels. In the above example, if we remove the incoming Pod's labels, it can still be placed into "zoneB" since the constraints are still satisfied. However, after the placement, the degree of imbalance of the cluster remains unchanged - it's still zoneA having 2 Pods which hold label {foo:bar}, and zoneB having 1 Pod which holds label {foo:bar}. So if this is not what you expect, we recommend the workload's `topologySpreadConstraints[*].labelSelector` to match its own labels.
+1. the Pods located on those nodes do not impact `maxSkew` calculation - in the
+above example, suppose "node1" does not have label "zone", then the 2 Pods will
+be disregarded, hence the incoming Pod will be scheduled into "zoneA".
+2. the incoming Pod has no chances to be scheduled onto such nodes -
+in the above example, suppose a "node5" carrying label `{zone-typo: zoneC}`
+joins the cluster, it will be bypassed due to the absence of label key "zone".
+
+- Be aware of what will happen if the incoming Pod's
+`topologySpreadConstraints[*].labelSelector` doesn't match its own labels. In the
+above example, if we remove the incoming Pod's labels, it can still be placed into
+"zoneB" since the constraints are still satisfied. However, after the placement,
+the degree of imbalance of the cluster remains unchanged - it's still zoneA
+having 2 Pods which hold label {foo:bar}, and zoneB having 1 Pod which holds
+label {foo:bar}. So if this is not what you expect, we recommend the workload's
+`topologySpreadConstraints[*].labelSelector` to match its own labels.
 
 ### Cluster-level default constraints
 
@@ -405,15 +462,18 @@ scheduled - more packed or more scattered.
 For finer control, you can specify topology spread constraints to distribute
 Pods across different topology domains - to achieve either high availability or
 cost-saving. This can also help on rolling update workloads and scaling out
-replicas smoothly. See
+replicas smoothly.
+See
 [Motivation](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/895-pod-topology-spread#motivation)
 for more details.
 
 ## Known Limitations
 
-- There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in imbalanced Pods distribution.
-You can use [Descheduler](https://github.com/kubernetes-sigs/descheduler) to rebalance the Pods distribution.
-- Pods matched on tainted nodes are respected. See [Issue 80921](https://github.com/kubernetes/kubernetes/issues/80921)
+- There's no guarantee that the constraints remain satisfied when Pods are removed. For
+example, scaling down a Deployment may result in imbalanced Pods distribution.
+You can use [Descheduler](https://github.com/kubernetes-sigs/descheduler) to rebalance the Pods distribution.
+- Pods matched on tainted nodes are respected.
+See [Issue 80921](https://github.com/kubernetes/kubernetes/issues/80921).
 
 ## {{% heading "whatsnext" %}}
 