keps/sig-scheduling/3521-pod-scheduling-readiness/README.md
Lines changed: 59 additions & 16 deletions
@@ -140,9 +140,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 <!--
 **Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -797,6 +797,10 @@ What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
 
+A rollback might be considered if the metric `scheduler_pending_pods{queue="gated"}` stays at a
+high watermark for a long time. Unless intentional, this may reveal that some controllers forgot
+to remove the Pods' scheduling gates, which keeps those Pods in a pending state.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
 <!--
@@ -805,12 +809,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
+Upgrade and rollback will be tested manually prior to beta launch.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
 
+No.
+
 ### Monitoring Requirements
 
 <!--
@@ -820,13 +828,16 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 
-Add a label `notReady` to the existing metric `pending_pods` to distinguish unschedulable Pods:
+A new "queue" type `gated` is added to the existing metric `scheduler_pending_pods` to distinguish
+gated Pods from unschedulable Pods:
 
-- `unschedulable` (existing): scheduler tried but cannot find any Node to host the Pod
-- `notReady` (new): scheduler respect the Pod's present `schedulingGates` and hence not schedule it
+- `scheduler_pending_pods{queue="unschedulable"}` (existing): the scheduler tried but could not find
+  any Node to host the Pod
+- `scheduler_pending_pods{queue="gated"}` (new): the scheduler respects the Pod's non-empty
+  `schedulingGates` and hence does not schedule it
 
 Moreover, to explicitly indicate a Pod's scheduling-unready state, a condition
-`{type:PodScheduled, reason:WaitingForGates}` is introduced.
+`{type:PodScheduled, reason:SchedulingGated}` is introduced.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
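As an illustration of reading the `gated` queue metric described in the hunk above, here is a minimal Go sketch that scans Prometheus text-format output for `scheduler_pending_pods{queue="gated"}`. The helper name `gatedPendingPods` and the sample payload are hypothetical, and the sketch assumes the series carries only the `queue` label in exactly this form; a real scrape may need a proper exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gatedPendingPods scans Prometheus text-format output and returns the value
// of scheduler_pending_pods{queue="gated"}. The bool is false when the series
// is absent. Assumption: the series has exactly this single-label form.
func gatedPendingPods(metricsText string) (float64, bool) {
	const prefix = `scheduler_pending_pods{queue="gated"}`
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		// The sample value is the first whitespace-separated field after the series name.
		fields := strings.Fields(strings.TrimPrefix(line, prefix))
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Hypothetical scrape output, for illustration only.
	sample := `# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 0
scheduler_pending_pods{queue="gated"} 3
scheduler_pending_pods{queue="unschedulable"} 1
`
	v, ok := gatedPendingPods(sample)
	fmt.Println(v, ok)
}
```

A value that stays non-zero for a long time is the signal the text associates with controllers that never remove their scheduling gates.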
@@ -836,6 +847,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+- observe a non-zero value for the metric `scheduler_pending_pods{queue="gated"}`
+- observe a non-empty value in a Pod's `.spec.schedulingGates` field
+
 ###### How can someone using this feature know that it is working for their instance?
 
 <!--
@@ -854,6 +868,11 @@ Recall that end users cannot usually observe component logs or access metrics.
 
 - [x] API .spec
   - Other field: `schedulingGates`
+- [x] Events
+  - Event Type: PodScheduled
+  - Event Status: False
+  - Event Reason: SchedulingGated
+  - Event Message: Scheduling is blocked due to non-empty scheduling gates
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
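The user-visible signals listed in the hunk above (a non-empty `.spec.schedulingGates` plus a `PodScheduled` entry with status `False` and reason `SchedulingGated`) can be checked against a Pod's JSON. The following Go sketch is an assumption-laden illustration: the `pod` struct models only the fields it reads, the helper name `isSchedulingGated` is hypothetical, the gate name `example.com/foo` is made up, and the `PodScheduled` entry is assumed to appear under `.status.conditions`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod models only the fields this sketch reads; it is not the full Pod type.
type pod struct {
	Spec struct {
		SchedulingGates []struct {
			Name string `json:"name"`
		} `json:"schedulingGates"`
	} `json:"spec"`
	Status struct {
		Conditions []struct {
			Type   string `json:"type"`
			Status string `json:"status"`
			Reason string `json:"reason"`
		} `json:"conditions"`
	} `json:"status"`
}

// isSchedulingGated reports whether the Pod JSON has a non-empty
// .spec.schedulingGates and a PodScheduled=False condition whose reason
// is SchedulingGated.
func isSchedulingGated(podJSON []byte) (bool, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return false, err
	}
	if len(p.Spec.SchedulingGates) == 0 {
		return false, nil
	}
	for _, c := range p.Status.Conditions {
		if c.Type == "PodScheduled" && c.Status == "False" && c.Reason == "SchedulingGated" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical Pod payload, for illustration only.
	sample := []byte(`{
  "spec": {"schedulingGates": [{"name": "example.com/foo"}]},
  "status": {"conditions": [
    {"type": "PodScheduled", "status": "False", "reason": "SchedulingGated"}
  ]}
}`)
	gated, err := isSchedulingGated(sample)
	fmt.Println(gated, err)
}
```

Requiring both the field and the condition keeps the check strict; a looser variant could treat a non-empty `schedulingGates` field alone as sufficient.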
@@ -872,18 +891,17 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
 
+N/A.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?