
Commit bd85794

Blog introducing PodTopologySpread

Huang-Wei authored and mrbobbytables committed
1 parent 652e361 commit bd85794

File tree: 5 files changed (+193, -1 lines)

Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
---
title: "Introducing PodTopologySpread"
date: 2020-05-05
slug: introducing-podtopologyspread
url: /blog/2020/05/Introducing-PodTopologySpread
---

**Authors:** Wei Huang (IBM), Aldo Culquicondor (Google)
Managing Pod distribution across a cluster is hard. The well-known Kubernetes
features for Pod affinity and anti-affinity allow some control of Pod placement
in different topologies. However, these features only solve part of the Pod
distribution use cases: either place an unlimited number of Pods in a single
topology, or disallow two Pods from co-locating in the same topology. In
between these two extreme cases, there is a common need to distribute the Pods
evenly across topologies, so as to achieve better cluster utilization and high
availability of applications.

The PodTopologySpread scheduling plugin (originally proposed as EvenPodsSpread)
was designed to fill that gap. We promoted it to beta in 1.18.

## API changes

A new field `topologySpreadConstraints` is introduced in the Pod's spec API:

```
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>
```

As this API is embedded in the Pod's spec, you can use this feature in all the
high-level workload APIs, such as Deployment, DaemonSet, StatefulSet, and so on.

Let's look at an example cluster to understand this API.

![API](/images/blog/2020-05-05-introducing-podtopologyspread/api.png)

- **labelSelector** is used to find matching Pods. For each topology, we count
  the number of Pods that match this label selector. In the above example,
  given the labelSelector "app: foo", the matching number in "zone1" is 2,
  while the number in "zone2" is 0.
- **topologyKey** is the key that defines a topology in the Nodes' labels. In
  the above example, some Nodes are grouped into "zone1" if they have the label
  "zone=zone1", while the other ones are grouped into "zone2".
- **maxSkew** describes the maximum degree to which Pods can be unevenly
  distributed. In the above example:
  - if we put the incoming Pod in "zone1", the skew on "zone1" will become 3
    (3 Pods matched in "zone1"; global minimum of 0 Pods matched in "zone2"),
    which violates the "maxSkew: 1" constraint.
  - if the incoming Pod is placed in "zone2", the skew on "zone2" is 0 (1 Pod
    matched in "zone2"; global minimum of 1 Pod matched in "zone2" itself),
    which satisfies the "maxSkew: 1" constraint. Note that the skew is
    calculated per qualified Node, rather than as a global skew.
- **whenUnsatisfiable** specifies what action should be taken when "maxSkew"
  can't be satisfied:
  - `DoNotSchedule` (default) tells the scheduler not to schedule the Pod.
    It's a hard constraint.
  - `ScheduleAnyway` tells the scheduler to still schedule the Pod while
    prioritizing Nodes that reduce the skew. It's a soft constraint.

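To make the fields concrete, here is a sketch of a Pod manifest that encodes
the constraint from the figure above. The Pod name, the bare `zone` label key,
and the pause container are illustrative assumptions; many clusters use the
well-known `topology.kubernetes.io/zone` label as the topology key instead.

```
apiVersion: v1
kind: Pod
metadata:
  name: mypod          # hypothetical name, for illustration only
  labels:
    app: foo           # matched by the labelSelector below
spec:
  topologySpreadConstraints:
  - maxSkew: 1                        # zones may differ by at most 1 matching Pod
    topologyKey: zone                 # assumes Nodes are labeled "zone=zone1" / "zone=zone2"
    whenUnsatisfiable: DoNotSchedule  # hard constraint: keep the Pod Pending rather than add skew
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1       # placeholder image
```

With the cluster state shown in the figure, this constraint filters out the
Nodes in "zone1" and leaves only "zone2" as a candidate.
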
## Advanced usage

As the feature name "PodTopologySpread" implies, the basic usage of this
feature is to run your workload in an absolutely even manner (maxSkew=1), or a
relatively even manner (maxSkew>=2). See the [official
document](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
for more details.

In addition to this basic usage, there are some advanced usage examples that
enable your workloads to benefit from high availability and better cluster
utilization.

### Usage along with NodeSelector / NodeAffinity

You may have noticed that there is no "topologyValues" field to limit which
topologies the Pods are going to be scheduled to. By default, the scheduler
searches all Nodes and groups them by "topologyKey". Sometimes this may not be
the ideal case. For instance, suppose there is a cluster with Nodes labeled
"env=prod", "env=staging" and "env=qa", and now you want to evenly place Pods
in the "qa" environment across zones. Is that possible?

The answer is yes. You can leverage the NodeSelector or NodeAffinity API spec.
Under the hood, the PodTopologySpread feature will **honor** that and calculate
the spread constraints among the Nodes that satisfy the selectors.

![Advanced-Usage-1](/images/blog/2020-05-05-introducing-podtopologyspread/advanced-usage-1.png)

As illustrated above, you can specify `spec.affinity.nodeAffinity` to limit the
"searching scope" to the "qa" environment, and within that scope, the Pod will
be scheduled to a zone which satisfies the topologySpreadConstraints. In this
case, it's "zone2".

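A rough sketch of this combination in a Pod spec, reusing the `env` and `zone`
labels described above (the Pod name and container are illustrative
placeholders):

```
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    app: foo
spec:
  affinity:
    nodeAffinity:      # limit the "searching scope" to Nodes labeled env=qa
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values: ["qa"]
  topologySpreadConstraints:   # then spread evenly across zones within that scope
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
```
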
### Multiple TopologySpreadConstraints

It's intuitive to understand how a single TopologySpreadConstraint works. What
about multiple TopologySpreadConstraints? Internally, each
TopologySpreadConstraint is calculated independently, and the result sets are
merged to generate the eventual result set - i.e., the suitable Nodes.

In the following example, we want to schedule a Pod to a cluster with 2
requirements at the same time:

- place the Pod evenly with matching Pods across zones
- place the Pod evenly with matching Pods across nodes

![Advanced-Usage-2](/images/blog/2020-05-05-introducing-podtopologyspread/advanced-usage-2.png)

For the first constraint, there are 3 Pods in zone1 and 2 Pods in zone2, so the
incoming Pod can only be placed in zone2 to satisfy the "maxSkew=1" constraint.
In other words, the result set is {nodeX, nodeY}.

For the second constraint, there are too many Pods on nodeB and nodeX, so the
incoming Pod can only be placed on nodeA or nodeY. In other words, the result
set is {nodeA, nodeY}.

Now we can conclude that the only qualified Node is nodeY - the intersection of
the sets {nodeX, nodeY} (from the first constraint) and {nodeA, nodeY} (from
the second constraint).

Using multiple TopologySpreadConstraints is powerful, but be sure to understand
the difference from the preceding "NodeSelector/NodeAffinity" example: with
multiple constraints, each result set is calculated independently and the
results are then intersected; with NodeSelector/NodeAffinity, the
topologySpreadConstraints are calculated only on the Nodes that pass the node
constraints.

Instead of using "hard" constraints in all topologySpreadConstraints, you can
also combine "hard" constraints and "soft" constraints to adapt to more diverse
cluster situations, as in the sketch below.

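A minimal sketch of such a mix, assuming Nodes carry a `zone` label and the
standard `kubernetes.io/hostname` label (the label selector and keys here are
illustrative):

```
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone                     # hard: zones must stay within a skew of 1
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # soft: prefer, but don't require, even spread per node
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: foo
```

Because the two constraints use different topologyKeys, they don't run into
the validation rule described in the note below.
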
{{< note >}}
If two TopologySpreadConstraints are applied for the same {topologyKey,
whenUnsatisfiable} tuple, the Pod creation will be blocked with a validation
error.
{{< /note >}}

## PodTopologySpread defaults

PodTopologySpread is a Pod-level API. As such, to use the feature, workload
authors need to be aware of the underlying topology of the cluster, and then
specify proper `topologySpreadConstraints` in the Pod spec for every workload.
While the Pod-level API gives the most flexibility, it is also possible to
specify cluster-level defaults.

The default PodTopologySpread constraints allow you to specify spreading for
all the workloads in the cluster, tailored for its topology. The constraints
can be specified by an operator/admin as PodTopologySpread plugin arguments in
the [scheduling profile configuration
API](/docs/reference/scheduling/profiles/) when starting kube-scheduler.

A sample configuration could look like this:

```
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: example.com/rack
              whenUnsatisfiable: ScheduleAnyway
```

When configuring default constraints, label selectors must be left empty.
kube-scheduler will deduce the label selectors from the Pod's membership in
Services, ReplicationControllers, ReplicaSets or StatefulSets. Pods can always
override the default constraints by providing their own through the PodSpec.

{{< note >}}
When using default PodTopologySpread constraints, it is recommended to disable
the old DefaultPodTopologySpread plugin.
{{< /note >}}

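As a sketch, disabling it could be expressed in the same scheduling profile
configuration. This assumes the legacy spreading logic is registered as a score
plugin named `DefaultPodTopologySpread` in your scheduler version, so
double-check the plugin name against your release:

```
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - plugins:
      score:
        disabled:
          - name: DefaultPodTopologySpread   # assumed name of the legacy spreading scorer
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: example.com/rack
              whenUnsatisfiable: ScheduleAnyway
```
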
## Wrap-up

PodTopologySpread allows you to define spreading constraints for your workloads
with a flexible and expressive Pod-level API. In the past, workload authors used
Pod AntiAffinity rules to force or hint the scheduler to run a single Pod per
topology domain. In contrast, the new PodTopologySpread constraints allow Pods
to specify skew levels that can be required (hard) or desired (soft). The
feature can be paired with Node selectors and Node affinity to limit the
spreading to specific domains. Pod spreading constraints can be defined for
different topologies such as hostnames, zones, regions, racks, etc.

Lastly, cluster operators can define default constraints to be applied to all
Pods. This way, Pods don't need to be aware of the underlying topology of the
cluster.

content/en/docs/concepts/workloads/pods/pod-topology-spread-constraints.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ Instead of manually applying labels, you can also reuse the [well-known labels](

The only change in this file: the code fence opening the example after "The
field `pod.spec.topologySpreadConstraints` is introduced in 1.16 as below:"
drops its `yaml` language hint; the example itself (apiVersion: v1, kind: Pod,
metadata: ...) is unchanged.
Three binary image files (66.5 KB, 69.3 KB, 42.6 KB) added under
/images/blog/2020-05-05-introducing-podtopologyspread/ (api.png,
advanced-usage-1.png, advanced-usage-2.png).
