@@ -659,13 +665,18 @@ feature flags will be enabled on some API servers and not others during the
659
665
rollout. Similarly, consider large clusters and how enablement/disablement
660
666
will rollout across nodes.
661
667
-->
668
+
It won't impact already running workloads because it is an opt-in feature.
669
+
662
670
663
671
###### What specific metrics should inform a rollback?
664
672
665
673
<!--
666
674
What signals should users be paying attention to when the feature is young
667
675
that might indicate a serious problem?
668
676
-->
677
+
- If the metric `schedule_attempts_total{result="error|unschedulable"}` increases significantly after pods using this feature are added.
678
+
- If the metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` rises above 100ms at the 90th percentile after pods using this feature are added.
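As a rough illustration of the second signal, the sketch below estimates the 90th percentile from cumulative histogram buckets and compares it with the 100ms threshold. It is a simplified model in the spirit of PromQL's `histogram_quantile`, not the scheduler's or Prometheus's own code; the function names and the sample bucket values are assumptions made up for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound_seconds, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            # No interpolation is possible into the +Inf bucket or an empty one.
            if math.isinf(bound) or in_bucket == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


def exceeds_threshold(buckets, threshold_seconds=0.1, q=0.9):
    """Return True when the estimated q-quantile is above the threshold."""
    return histogram_quantile(q, buckets) > threshold_seconds


# Hypothetical bucket samples (illustrative values only).
healthy = [(0.01, 50), (0.05, 80), (0.1, 95), (0.5, 100), (float("inf"), 100)]
degraded = [(0.01, 10), (0.05, 20), (0.1, 50), (0.5, 100), (float("inf"), 100)]
```

With these made-up samples, `exceeds_threshold(healthy)` is false and `exceeds_threshold(degraded)` is true, which is the kind of change that would prompt a closer look before deciding on a rollback.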
679
+
669
680
670
681
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
671
682
@@ -674,12 +685,60 @@ Describe manual testing that was done and the outcomes.
674
685
Longer term, we may want to require automated upgrade/rollback tests, but we
675
686
are missing a bunch of machinery and tooling and can't do that now.
676
687
-->
688
+
Yes, it was tested manually by following the steps below, and it worked as intended.
689
+
1. create a Kubernetes v1.26 cluster with 3 nodes where the `MatchLabelKeysInPodTopologySpread` feature is disabled.
690
+
2. deploy a Deployment with the following YAML:
691
+
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 12
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
        matchLabelKeys:
        - pod-template-hash
```
720
+
3. pods spread across nodes as 4/4/4
721
+
4. update the deployment nginx image to `nginx:1.15.0`
722
+
5. pods spread across nodes as 5/4/3
723
+
6. delete deployment nginx
724
+
7. upgrade the Kubernetes cluster to v1.27 (at master branch) with `MatchLabelKeysInPodTopologySpread` enabled.
725
+
8. deploy the nginx Deployment as in step 2
726
+
9. pods spread across nodes as 4/4/4
727
+
10. update the deployment nginx image to `nginx:1.15.0`
728
+
11. pods spread across nodes as 4/4/4
729
+
12. delete deployment nginx
730
+
13. downgrade the Kubernetes cluster to v1.26 where the `MatchLabelKeysInPodTopologySpread` feature is enabled.
731
+
14. deploy the nginx Deployment as in step 2
732
+
15. pods spread across nodes as 4/4/4
733
+
16. update the deployment nginx image to `nginx:1.15.0`
734
+
17. pods spread across nodes as 4/4/4
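The difference between the spreads observed in the steps above can be pictured with a small model. The sketch below assumes that, when `matchLabelKeys` is honored, the incoming pod's values for those keys are ANDed into the `labelSelector` before matching pods are counted per node, so pods from an older ReplicaSet revision no longer count toward the skew. This is a simplified illustration and not the scheduler's implementation; the helper names, node names, and pod data are hypothetical.

```python
def matching_pods_per_node(pods, nodes, label_selector, incoming_labels, match_label_keys=()):
    """Count pods per node that match the effective selector.

    Effective selector = label_selector AND the incoming pod's values for each
    key in match_label_keys (keys missing from the incoming pod are ignored).
    """
    selector = dict(label_selector)
    for key in match_label_keys:
        if key in incoming_labels:
            selector[key] = incoming_labels[key]
    counts = {node: 0 for node in nodes}
    for pod in pods:
        if all(pod["labels"].get(k) == v for k, v in selector.items()):
            counts[pod["node"]] += 1
    return counts


def skew(counts):
    """Skew is the gap between the most and least loaded topology domains."""
    return max(counts.values()) - min(counts.values())


def make_pods(node, revision, n):
    """Build n hypothetical pods of one Deployment revision on one node."""
    return [{"node": node, "labels": {"foo": "bar", "pod-template-hash": revision}}
            for _ in range(n)]


# Hypothetical mid-rollout state: some old-revision pods are still running.
nodes = ["n1", "n2", "n3"]
pods = (make_pods("n1", "old", 2) + make_pods("n2", "old", 1)
        + make_pods("n1", "new", 1) + make_pods("n2", "new", 1) + make_pods("n3", "new", 1))
incoming = {"foo": "bar", "pod-template-hash": "new"}

without_keys = matching_pods_per_node(pods, nodes, {"foo": "bar"}, incoming)
with_keys = matching_pods_per_node(pods, nodes, {"foo": "bar"}, incoming,
                                   match_label_keys=["pod-template-hash"])
```

In this made-up state, `without_keys` counts pods of both revisions and yields a skew of 2, while `with_keys` counts only the new revision and yields a skew of 0. That is consistent with new-revision pods ending up 4/4/4 when the feature is enabled, whereas pods from the old revision can distort placement when it is disabled.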
677
735
678
736
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
679
737
680
738
<!--
681
739
Even if applying deprecation policies, they may still surprise some users.
682
740
-->
741
+
No.
683
742
684
743
### Monitoring Requirements
685
744
@@ -694,6 +753,7 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
694
753
checking if there are objects with field X set) may be a last resort. Avoid
695
754
logs or events for this purpose.
696
755
-->
756
+
Operators can query pods that have the `pod.spec.topologySpreadConstraints.matchLabelKeys` field set to determine whether workloads are using the feature.
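One way to run such a query is to filter a pod list client-side. The sketch below assumes the input is a pod list in the JSON shape returned by `kubectl get pods --all-namespaces -o json`; the function name and the sample pods are hypothetical and exist only for illustration.

```python
import json


def pods_using_match_label_keys(pod_list):
    """Return "namespace/name" for pods whose spread constraints set matchLabelKeys."""
    names = []
    for pod in pod_list.get("items", []):
        constraints = pod.get("spec", {}).get("topologySpreadConstraints") or []
        # A non-empty matchLabelKeys on any constraint means the feature is in use.
        if any(c.get("matchLabelKeys") for c in constraints):
            meta = pod.get("metadata", {})
            names.append(f'{meta.get("namespace", "default")}/{meta.get("name")}')
    return names


# Hypothetical pod list, trimmed to the fields this check reads.
sample = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "default", "name": "nginx-a"},
     "spec": {"topologySpreadConstraints": [
       {"maxSkew": 1, "topologyKey": "kubernetes.io/hostname",
        "matchLabelKeys": ["pod-template-hash"]}]}},
    {"metadata": {"namespace": "default", "name": "nginx-b"},
     "spec": {"topologySpreadConstraints": [
       {"maxSkew": 1, "topologyKey": "kubernetes.io/hostname"}]}},
    {"metadata": {"namespace": "kube-system", "name": "coredns"},
     "spec": {}}
  ]
}
""")

in_use = pods_using_match_label_keys(sample)
```

For real clusters, the same function could be fed the parsed output of the `kubectl` command above; an empty result would suggest no running pod relies on the field.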
697
757
698
758
###### How can someone using this feature know that it is working for their instance?
699
759
@@ -706,13 +766,8 @@ and operation of this feature.
706
766
Recall that end users cannot usually observe component logs or access metrics.
707
767
-->
708
768
709
-
- [ ] Events
710
-
- Event Reason:
711
-
- [ ] API .status
712
-
- Condition name:
713
-
- Other field:
714
-
- [ ] Other (treat as last resort)
715
-
- Details:
769
+
- [x] Other (treat as last resort)
770
+
- Details: We can determine if the feature is being used by comparing the expected and actual scheduling results.
716
771
717
772
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
718
773
@@ -730,26 +785,27 @@ high level (needs more precise definitions) those may be things like:
730
785
These goals will help you determine what you need to measure (SLIs) in the next
731
786
question.
732
787
-->
788
+
The metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` stays at or below 100ms at the 90th percentile.
733
789
734
790
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?