Commit db09024 (2 parents: 45b73b8 + 7d90a4b)

Merge pull request #5033 from mochizuki875/matchlabelkeys-to-podtopologyspread

KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it

1 file changed: +135 additions, -38 deletions

keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades/README.md
@@ -87,7 +87,10 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Story 1](#story-1)
 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
 - [Risks and Mitigations](#risks-and-mitigations)
+  - [Possible misuse](#possible-misuse)
+  - [The update to labels specified at <code>matchLabelKeys</code> isn't supported](#the-update-to-labels-specified-at-matchlabelkeys-isnt-supported)
 - [Design Details](#design-details)
+  - [[v1.33] design change and a safe upgrade path](#v133-design-change-and-a-safe-upgrade-path)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
@@ -109,6 +112,8 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
+  - [use pod generateName](#use-pod-generatename)
+  - [implement MatchLabelKeys in only either the scheduler plugin or kube-apiserver](#implement-matchlabelkeys-in-only-either-the-scheduler-plugin-or-kube-apiserver)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

@@ -179,10 +184,13 @@ which spreading is applied using a LabelSelector. This means the user should
 know the exact label key and value when defining the pod spec.

 This KEP proposes a complementary field to LabelSelector named `MatchLabelKeys` in
-`TopologySpreadConstraint` which represent a set of label keys only. The scheduler
-will use those keys to look up label values from the incoming pod; and those key-value
-labels are ANDed with `LabelSelector` to identify the group of existing pods over
+`TopologySpreadConstraint`, which represents a set of label keys only.
+At pod creation, kube-apiserver will use those keys to look up label values from the incoming pod,
+and those key-value labels will be merged with the existing `LabelSelector` to identify the group of existing pods over
 which the spreading skew will be calculated.
+Note that in case `MatchLabelKeys` is supported in the cluster-level default constraints
+(see https://github.com/kubernetes/kubernetes/issues/129198), kube-scheduler will also handle it separately.
+

 The main case that this new way for identifying pods will enable is constraining
 skew spreading calculation to happen at the revision level in Deployments during
@@ -292,11 +300,33 @@ How will UX be reviewed, and by whom?

 Consider including folks who also work outside the SIG or subproject.
 -->
+#### Possible misuse
+
 In addition to using `pod-template-hash` added by the Deployment controller,
 users can also provide the customized key in `MatchLabelKeys` to identify
 which pods should be grouped. If so, the user needs to ensure that it is
 correct and not duplicated with other unrelated workloads.

+#### The update to labels specified at `matchLabelKeys` isn't supported
+
+`MatchLabelKeys` is handled and merged into `LabelSelector` at _a pod's creation_.
+This means the feature doesn't support label updates, even though a user
+could update a label specified at `matchLabelKeys` after the pod's creation.
+In such cases, the update of the label isn't reflected in the merged `LabelSelector`,
+even though users might expect it to be.
+In the documentation, we'll state that it's not recommended to use `matchLabelKeys` with labels that might be updated.
+
+Also, we assume the risk is acceptably low because:
+1. It's fairly unlikely to happen because pods are usually managed by another resource (e.g., a Deployment),
+and updating the pod template's labels on a Deployment recreates pods instead of directly updating the labels on existing pods.
+Also, even if users somehow use bare pods (which is not recommended in the first place),
+there's usually only a tiny window between a pod's creation and its scheduling, which makes this risk even rarer,
+unless many pods often get stuck unschedulable for a long time in the cluster (which is not recommended)
+or the labels specified at `matchLabelKeys` are frequently updated (which we'll declare as not recommended).
+2. If it happens, `selfMatchNum` will be 0 while both `matchNum` and `minMatchNum` are retained.
+Consequently, depending on the current number of matching pods in the domain, `matchNum` - `minMatchNum` might be bigger than `maxSkew`,
+and the pod(s) could be unschedulable.
+But it does not mean that the unfortunate pods would be unschedulable forever.
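To make the arithmetic in the risk above concrete, here is a minimal runnable sketch of the PodTopologySpread-style feasibility check; the numbers are hypothetical and this is not the actual scheduler source:

```go
package main

import "fmt"

func main() {
	// Assumed scenario: maxSkew=1, and a label referenced by matchLabelKeys was
	// updated after the pod's creation, so the pod no longer matches the
	// labelSelector that was merged at creation time (selfMatchNum = 0).
	maxSkew := 1
	matchNum := 3     // hypothetical: pods matching the merged selector in the candidate domain
	minMatchNum := 1  // hypothetical: fewest matching pods across eligible domains
	selfMatchNum := 0 // the updated pod no longer counts itself

	// A domain is infeasible under DoNotSchedule when the skew it would produce
	// exceeds maxSkew.
	skew := matchNum + selfMatchNum - minMatchNum
	fmt.Println(skew > maxSkew) // true: the pod is rejected for this domain
}
```

As the risk notes, this state is not permanent: once the counts change (for example, pods in the crowded domain terminate), `matchNum - minMatchNum` can drop back within `maxSkew` and the pod becomes schedulable again.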

 ## Design Details

@@ -307,15 +337,10 @@ required) or even code snippets. If there's any ambiguity about HOW your
 proposal will be implemented, this is the place to discuss them.
 -->

-A new field named `MatchLabelKeys` will be added to `TopologySpreadConstraint`.
+A new optional field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`.
 Currently, when scheduling a pod, the `LabelSelector` defined in the pod is used
 to identify the group of pods over which spreading will be calculated.
-`MatchLabelKeys` adds another constraint to how this group of pods is identified:
-the scheduler will use those keys to look up label values from the incoming pod;
-and those key-value labels are ANDed with `LabelSelector` to select the group of
-existing pods over which spreading will be calculated.
-
-A new field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`:
+`MatchLabelKeys` adds another constraint to how this group of pods is identified.
 ```go
 type TopologySpreadConstraint struct {
   MaxSkew int32
@@ -333,27 +358,67 @@ type TopologySpreadConstraint struct {
 }
 ```

-Examples of use are as follows:
+When a Pod is created, kube-apiserver will obtain the labels from the pod
+by the keys in `matchLabelKeys`, and those key-value labels will be merged into the `LabelSelector`
+of `TopologySpreadConstraint`.
+
+For example, when this sample Pod is created,
+
 ```yaml
-topologySpreadConstraints:
-  - maxSkew: 1
-    topologyKey: kubernetes.io/hostname
-    whenUnsatisfiable: DoNotSchedule
-    matchLabelKeys:
-      - app
-      - pod-template-hash
+apiVersion: v1
+kind: Pod
+metadata:
+  name: sample
+  labels:
+    app: sample
+...
+topologySpreadConstraints:
+  - maxSkew: 1
+    topologyKey: kubernetes.io/hostname
+    whenUnsatisfiable: DoNotSchedule
+    labelSelector: {}
+    matchLabelKeys: # ADDED
+      - app
+```
+
+kube-apiserver modifies the `labelSelector` like the following:
+
+```diff
+topologySpreadConstraints:
+  - maxSkew: 1
+    topologyKey: kubernetes.io/hostname
+    whenUnsatisfiable: DoNotSchedule
+    labelSelector:
++     matchExpressions:
++       - key: app
++         operator: In
++         values:
++           - sample
+    matchLabelKeys:
+      - app
 ```

-The scheduler plugin `PodTopologySpread` will obtain the labels from the pod
-labels by the keys in `matchLabelKeys`. The obtained labels will be merged
-to `labelSelector` of `topologySpreadConstraints` to filter and group pods.
-The pods belonging to the same group will be part of the spreading in
-`PodTopologySpread`.
+In addition, kube-scheduler will handle `matchLabelKeys` within the cluster-level default constraints
+in the scheduler configuration in the future (see https://github.com/kubernetes/kubernetes/issues/129198).

 Finally, the feature will be guarded by a new feature flag. If the feature is
-disabled, the field `matchLabelKeys` is preserved if it was already set in the
-persisted Pod object, otherwise it is silently dropped; moreover, kube-scheduler
-will ignore the field and continue to behave as before.
+disabled, the field `matchLabelKeys` and the corresponding `labelSelector` are preserved
+if they were already set in the persisted Pod object; otherwise, the creation of a new Pod with the field
+will be rejected by kube-apiserver.
+Also, kube-scheduler will ignore `matchLabelKeys` in the cluster-level default constraints configuration.
+
+### [v1.33] design change and a safe upgrade path
+Previously, kube-scheduler just internally handled `matchLabelKeys` before calculating the scheduling results.
+But we changed the implementation design to the current form to align the design with PodAffinity's `matchLabelKeys`.
+(See the detailed discussion in [the alternative section](#implement-matchlabelkeys-in-only-either-the-scheduler-plugin-or-kube-apiserver).)
+
+However, this implementation change could break `matchLabelKeys` of unscheduled pods created before the upgrade
+because kube-apiserver only handles `matchLabelKeys` at pod creation, that is,
+it doesn't handle `matchLabelKeys` of existing unscheduled pods.
+So, for a safe upgrade path from v1.32 to v1.33, kube-scheduler will handle not only `matchLabelKeys`
+from the default constraints, but also those on all incoming pods during v1.33.
+We're going to change kube-scheduler to only handle `matchLabelKeys` from the default constraints at v1.34 for efficiency,
+assuming kube-apiserver handles `matchLabelKeys` of all incoming pods.
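Stepping back to the creation-time mutation shown in the YAML diff above, here is a minimal sketch of the merge; the helper name `mergeMatchLabelKeys` is hypothetical and this is not the actual code in `pkg/registry/core/pod/strategy.go`:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mergeMatchLabelKeys appends, for each key in matchLabelKeys that exists in
// the pod's labels, an `In` requirement carrying the pod's own value to the
// constraint's labelSelector (hypothetical helper, sketch only).
func mergeMatchLabelKeys(podLabels map[string]string, c *v1.TopologySpreadConstraint) {
	for _, key := range c.MatchLabelKeys {
		value, ok := podLabels[key]
		if !ok {
			continue // keys missing from the pod's labels are skipped
		}
		if c.LabelSelector == nil {
			c.LabelSelector = &metav1.LabelSelector{}
		}
		c.LabelSelector.MatchExpressions = append(c.LabelSelector.MatchExpressions,
			metav1.LabelSelectorRequirement{
				Key:      key,
				Operator: metav1.LabelSelectorOpIn,
				Values:   []string{value},
			})
	}
}

func main() {
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "sample",
			Labels: map[string]string{"app": "sample"},
		},
		Spec: v1.PodSpec{
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       "kubernetes.io/hostname",
				WhenUnsatisfiable: v1.DoNotSchedule,
				LabelSelector:     &metav1.LabelSelector{},
				MatchLabelKeys:    []string{"app"},
			}},
		},
	}
	for i := range pod.Spec.TopologySpreadConstraints {
		mergeMatchLabelKeys(pod.Labels, &pod.Spec.TopologySpreadConstraints[i])
	}
	// Prints a selector equivalent to: app in (sample)
	fmt.Printf("%+v\n", pod.Spec.TopologySpreadConstraints[0].LabelSelector)
}
```

This mirrors the YAML example above: the empty `labelSelector: {}` gains a `matchExpressions` entry `app in (sample)` taken from the pod's own labels.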

 ### Test Plan

@@ -400,8 +465,9 @@ This can inform certain test coverage improvements that we want to do before
 extending the production code to implement this enhancement.
 -->

-- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `06-07` - `86%`
-- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go`: `06-07` - `73.1%`
+- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1)` - `87.5%`
+- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go`: `2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1)` - `84.8%`
+- `k8s.io/kubernetes/pkg/registry/core/pod/strategy.go`: `2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1)` - `65%`

 ##### Integration tests

@@ -532,7 +598,9 @@ enhancement:

 In the event of an upgrade, kube-apiserver will start to accept and store the field `MatchLabelKeys`.

-In the event of a downgrade, kube-scheduler will ignore `MatchLabelKeys` even if it was set.
+In the event of a downgrade, kube-apiserver will reject pod creation with `matchLabelKeys` in `TopologySpreadConstraint`.
+But for existing pods, we leave `matchLabelKeys` and the generated `LabelSelector` in place even after the downgrade.
+kube-scheduler will ignore `MatchLabelKeys` if it was set in the cluster-level default constraints configuration.

 ### Version Skew Strategy

@@ -548,7 +616,11 @@ enhancement:
 - Will any other components on the node change? For example, changes to CSI,
   CRI or CNI may require updating that component before the kubelet.
 -->
-N/A
+
+There's no version skew issue.
+
+We changed the implementation design between v1.33 and v1.34, but we designed the change not to involve any version skew issue,
+as described in [[v1.33] design change and a safe upgrade path](#v133-design-change-and-a-safe-upgrade-path).

 ## Production Readiness Review Questionnaire

@@ -619,8 +691,10 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
 The feature can be disabled in Alpha and Beta versions by restarting
 kube-apiserver and kube-scheduler with feature-gate off.
 One caveat is that pods that used the feature will continue to have the
-MatchLabelKeys field set even after disabling the feature gate,
-however kube-scheduler will not take the field into account.
+MatchLabelKeys field set and the corresponding LabelSelector even after
+disabling the feature gate.
+In terms of Stable versions, users can choose to opt out by not setting
+the matchLabelKeys field.

 ###### What happens if we reenable the feature if it was previously rolled back?
 Newly created pods need to follow this policy when scheduling. Old pods will
@@ -659,7 +733,8 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
-It won't impact already running workloads because it is an opt-in feature in scheduler.
+It won't impact already running workloads because it is an opt-in feature in kube-apiserver
+and kube-scheduler.
 But during a rolling upgrade, if some apiservers have not enabled the feature, they will not
 be able to accept and store the field "MatchLabelKeys" and the pods associated with these
 apiservers will not be able to use this feature. As a result, pods belonging to the
@@ -765,7 +840,7 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->

 - [x] Other (treat as last resort)
-  - Details: We can determine if this feature is being used by checking deployments that have only `MatchLabelKeys` set in `TopologySpreadConstraint` and no `LabelSelector`. These Deployments will strictly adhere to TopologySpread after both deployment and rolling upgrades if the feature is being used.
+  - Details: We can determine if this feature is being used by checking pods that have only `MatchLabelKeys` set in `TopologySpreadConstraint`.
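A hedged sketch of that check with client-go, assuming in-cluster credentials are available (use `clientcmd` with a kubeconfig otherwise); the program simply lists pods whose constraints set `matchLabelKeys`:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.TopologySpreadConstraints {
			if len(c.MatchLabelKeys) > 0 {
				fmt.Printf("%s/%s uses matchLabelKeys: %v\n", pod.Namespace, pod.Name, c.MatchLabelKeys)
			}
		}
	}
}
```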

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

@@ -896,8 +971,12 @@ Think about adding additional work or introducing new steps in between

 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
-Yes. there is an additional work: the scheduler will use the keys in `matchLabelKeys` to look up label values from the pod and AND with `LabelSelector`.
-Maybe result in a very samll impact in scheduling latency which directly contributes to pod-startup-latency SLO.
+Yes, there is additional work:
+kube-apiserver uses the keys in `matchLabelKeys` to look up label values from the pod
+and changes the `LabelSelector` according to them.
+kube-scheduler also handles `matchLabelKeys` if the cluster-level default constraints have it.
+The impact on the latency of pod creation requests in kube-apiserver and on the scheduling latency
+should be negligible.

 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

@@ -937,7 +1016,7 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?
 If the API server and/or etcd is not available, this feature will not be available.
-This is because the scheduler needs to update the scheduling results to the pod via the API server/etcd.
+This is because the kube-scheduler needs to update the scheduling results to the pod via the API server/etcd.

 ###### What are other known failure modes?

@@ -963,7 +1042,7 @@ N/A
 - Check the metric `schedule_attempts_total{result="error|unschedulable"}` to determine if the number
   of attempts increased. If increased, You need to determine the cause of the failure by the event of
   the pod. If it's caused by plugin `PodTopologySpread`, You can further analyze this problem by looking
-  at the scheduler log.
+  at the kube-scheduler log.


 ## Implementation History
@@ -981,6 +1060,7 @@ Major milestones might include:
 - 2022-03-17: Initial KEP
 - 2022-06-08: KEP merged
 - 2023-01-16: Graduate to Beta
+- 2025-01-23: Change the implementation design to be aligned with PodAffinity's `matchLabelKeys`

 ## Drawbacks

@@ -996,11 +1076,28 @@ not need to be as detailed as the proposal, but should include enough
 information to express the idea and why it was not acceptable.
 -->

+### use pod generateName
 Use `pod.generateName` to distinguish new/old pods that belong to the
 revisions of the same workload in scheduler plugin. It's decided not to
 support because of the following reason: scheduler needs to ensure universal
 and scheduler plugin shouldn't have special treatment for any labels/fields.

+### implement MatchLabelKeys in only either the scheduler plugin or kube-apiserver
+Technically, we can implement this feature within the PodTopologySpread plugin only,
+merging the key-value labels corresponding to `MatchLabelKeys` into `LabelSelector` internally
+within the plugin before calculating the scheduling results.
+This is the actual implementation up to v1.32.
+But it may confuse users because this behavior would differ from PodAffinity's `MatchLabelKeys`.
+
+Also, we cannot implement this feature only within kube-apiserver because that would make it
+impossible to handle `MatchLabelKeys` within the cluster-level default constraints
+in the scheduler configuration in the future (see https://github.com/kubernetes/kubernetes/issues/129198).
+
+So we decided to go with the design that implements this feature within both
+the PodTopologySpread plugin and kube-apiserver.
+Although the final design has the downside of requiring us to maintain two implementations handling `MatchLabelKeys`,
+each implementation is simple, and we regard the risk of increased maintenance overhead as fairly low.
+
 ## Infrastructure Needed (Optional)

 <!--
