---
title: "Introducing PodTopologySpread"
date: 2020-05-05
slug: introducing-podtopologyspread
url: /blog/2020/05/Introducing-PodTopologySpread
---

**Author:** Wei Huang (IBM), Aldo Culquicondor (Google)

Managing Pod distribution across a cluster is hard. The well-known Kubernetes
features for Pod affinity and anti-affinity allow some control of Pod placement
in different topologies. However, these features only solve part of the Pod
distribution use cases: they either place an unlimited number of Pods in a
single topology, or disallow two Pods from co-locating in the same topology. In
between these two extreme cases, there is a common need to distribute Pods
evenly across topologies, so as to achieve better cluster utilization and high
availability of applications.

The PodTopologySpread scheduling plugin (originally proposed as EvenPodsSpread)
was designed to fill that gap. We promoted it to beta in 1.18.

## API changes

A new field `topologySpreadConstraints` is introduced in the Pod's spec API:

```
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>
```

As this API is embedded in a Pod's spec, you can use this feature in all the
high-level workload APIs, such as Deployment, DaemonSet, StatefulSet, etc.

Let's see an example of a cluster to understand this API. Suppose the cluster
has Nodes in two zones, "zone1" and "zone2"; two Pods labeled "app: foo" are
already running in "zone1", while "zone2" has none. The incoming Pod carries a
single topology spread constraint with "maxSkew: 1", "topologyKey: zone" and a
labelSelector matching "app: foo".

- **labelSelector** is used to find matching Pods. For each topology, we count
  the number of Pods that match this label selector. In the above example, given
  the labelSelector "app: foo", the matching number in "zone1" is 2, while the
  number in "zone2" is 0.
- **topologyKey** is the key that defines a topology in the Nodes' labels. In
  the above example, some Nodes are grouped into "zone1" if they have the label
  "zone=zone1", while other ones are grouped into "zone2".
- **maxSkew** describes the maximum degree to which Pods can be unevenly
  distributed. In the above example:
  - if we put the incoming Pod in "zone1", the skew on "zone1" will become 3 (3
    Pods matched in "zone1"; global minimum of 0 Pods matched in "zone2"), which
    violates the "maxSkew: 1" constraint.
  - if the incoming Pod is placed in "zone2", the skew on "zone2" is 0 (1 Pod
    matched in "zone2"; global minimum of 1 Pod matched in "zone2" itself),
    which satisfies the "maxSkew: 1" constraint. Note that the skew is
    calculated per qualified Node, rather than as a global skew.
- **whenUnsatisfiable** specifies what action should be taken when "maxSkew"
  can't be satisfied:
  - `DoNotSchedule` (default) tells the scheduler not to schedule it. It's a
    hard constraint.
  - `ScheduleAnyway` tells the scheduler to still schedule it while prioritizing
    Nodes that reduce the skew. It's a soft constraint.
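
Putting these fields together, a minimal Pod manifest for the example above
might look like the following sketch (the Pod name and container image are
illustrative placeholders):

```
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    app: foo
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: app
    image: nginx # illustrative image
```

With this constraint, the scheduler can only place the incoming Pod in "zone2".
The same stanza can be embedded in the `spec.template.spec` of a Deployment or
any other workload API.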

## Advanced usage

As the feature name "PodTopologySpread" implies, the basic usage of this feature
is to run your workload in an absolutely even manner (maxSkew=1), or a
relatively even manner (maxSkew>=2). See the [official
documentation](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
for more details.

In addition to this basic usage, there are some advanced usage examples that
enable your workloads to benefit from high availability and efficient cluster
utilization.

### Usage along with NodeSelector / NodeAffinity

You may have noticed that we didn't add a "topologyValues" field to limit which
topologies the Pods are going to be scheduled to. By default, the scheduler
searches all Nodes and groups them by "topologyKey". Sometimes this may not be
the ideal case. For instance, suppose there is a cluster with Nodes labeled
"env=prod", "env=staging" and "env=qa", and now you want to place Pods evenly
across zones within the "qa" environment only; is that possible?

The answer is yes. You can leverage the NodeSelector or NodeAffinity API spec.
Under the hood, the PodTopologySpread feature will **honor** that and calculate
the spread constraints among the nodes that satisfy the selectors.

As illustrated above, you can specify `spec.affinity.nodeAffinity` to limit the
"searching scope" to the "qa" environment, and within that scope, the Pod will
be scheduled to a zone that satisfies the topologySpreadConstraints. In this
case, it's "zone2".
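
As a sketch of that combination (assuming the Nodes carry an "env" label as
described; the Pod name and container image are illustrative):

```
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    app: foo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - qa
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: app
    image: nginx # illustrative image
```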

### Multiple TopologySpreadConstraints

It's intuitive to understand how a single TopologySpreadConstraint works. What
about multiple TopologySpreadConstraints? Internally, each
TopologySpreadConstraint is calculated independently, and the result sets are
merged to generate the eventual result set - i.e., suitable Nodes.

In the following example, we want to schedule a Pod onto a cluster with two
requirements at the same time:

- place the Pod evenly with Pods across zones
- place the Pod evenly with Pods across nodes

Suppose the cluster has two zones: "zone1" containing nodeA and nodeB, and
"zone2" containing nodeX and nodeY, with Pods matching the labelSelector already
running on those Nodes.

For the first constraint, there are 3 Pods in zone1 and 2 Pods in zone2, so the
incoming Pod can only be placed in zone2 to satisfy the "maxSkew=1" constraint.
In other words, the result set is {nodeX, nodeY}.

For the second constraint, there are too many Pods on nodeB and nodeX, so the
incoming Pod can only be placed on nodeA or nodeY.

Now we can conclude the only qualified Node is nodeY - from the intersection of
the sets {nodeX, nodeY} (from the first constraint) and {nodeA, nodeY} (from the
second constraint).
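
A sketch of the Pod spec for this example could look like the following,
assuming the Nodes carry a "zone" label as before and the standard
`kubernetes.io/hostname` label for the node-level constraint (the Pod name and
container image are illustrative):

```
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    app: foo
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: app
    image: nginx # illustrative image
```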

Using multiple TopologySpreadConstraints is powerful, but be sure to understand
the difference from the preceding "NodeSelector/NodeAffinity" example: the
former calculates the result sets independently and then intersects them, while
the latter calculates the topologySpreadConstraints based on the Nodes that pass
the node constraints' filtering.

Instead of using "hard" constraints in all topologySpreadConstraints, you can
also combine "hard" constraints with "soft" constraints to adapt to more diverse
cluster situations.
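
For instance, a variant of the previous example could keep the zone-level spread
as a hard constraint while relaxing the node-level spread to a soft one (only
the topologySpreadConstraints stanza of the Pod spec is shown):

```
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule # hard: reject Nodes that would push the zone skew above 1
    labelSelector:
      matchLabels:
        app: foo
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway # soft: prefer Nodes that reduce the skew, but don't block
    labelSelector:
      matchLabels:
        app: foo
```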

{{< note >}}
If two TopologySpreadConstraints are applied to the same {topologyKey,
whenUnsatisfiable} tuple, the Pod creation will be rejected with a validation
error.
{{< /note >}}

## PodTopologySpread defaults

PodTopologySpread is a Pod-level API. As such, to use the feature, workload
authors need to be aware of the underlying topology of the cluster, and then
specify proper `topologySpreadConstraints` in the Pod spec for every workload.
While the Pod-level API gives the most flexibility, it is also possible to
specify cluster-level defaults.

The default PodTopologySpread constraints allow you to specify spreading for all
the workloads in the cluster, tailored for its topology. The constraints can be
specified by an operator/admin as PodTopologySpread plugin arguments in the
[scheduling profile configuration
API](/docs/reference/scheduling/profiles/) when starting
kube-scheduler.

A sample configuration could look like this:

```
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints:
      - maxSkew: 1
        topologyKey: example.com/rack
        whenUnsatisfiable: ScheduleAnyway
```

When configuring default constraints, label selectors must be left empty.
kube-scheduler will deduce the label selectors from the Pod's membership in
Services, ReplicationControllers, ReplicaSets or StatefulSets. Pods can always
override the default constraints by providing their own through the PodSpec.

{{< note >}}
When using default PodTopologySpread constraints, it is recommended to disable
the old DefaultPodTopologySpread plugin.
{{< /note >}}
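
As a sketch, extending the sample configuration above, the legacy plugin
(registered as DefaultPodTopologySpread in the 1.18 plugin registry; verify the
name against your scheduler version) can be disabled in the same profile:

```
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints:
      - maxSkew: 1
        topologyKey: example.com/rack
        whenUnsatisfiable: ScheduleAnyway
  plugins:
    score:
      disabled:
      - name: DefaultPodTopologySpread
```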

## Wrap-up

PodTopologySpread allows you to define spreading constraints for your workloads
with a flexible and expressive Pod-level API. In the past, workload authors used
Pod AntiAffinity rules to force or hint the scheduler to run a single Pod per
topology domain. In contrast, the new PodTopologySpread constraints allow Pods
to specify skew levels that can be required (hard) or desired (soft). The
feature can be paired with Node selectors and Node affinity to limit the
spreading to specific domains. Pod spreading constraints can be defined for
different topologies such as hostnames, zones, regions, racks, etc.

Lastly, cluster operators can define default constraints to be applied to all
Pods. This way, Pods don't need to be aware of the underlying topology of the
cluster.