@@ -76,6 +76,7 @@ tags, and then generate with `hack/update-toc.sh`.
76
76
- [ Motivation] ( #motivation )
77
77
- [ Goals] ( #goals )
78
78
- [ Non-Goals] ( #non-goals )
79
+ - [ Future Work] ( #future-work )
79
80
- [ Proposal] ( #proposal )
80
81
- [ User Stories (Optional)] ( #user-stories-optional )
81
82
- [ Story 1] ( #story-1 )
@@ -140,9 +141,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
140
141
- [ ] (R) [ all GA Endpoints] ( https://github.com/kubernetes/community/pull/1806 ) must be hit by [ Conformance Tests] ( https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md )
141
142
- [ ] (R) Production readiness review completed
142
143
- [ ] (R) Production readiness review approved
143
- - [ ] "Implementation History" section is up-to-date for milestone
144
- - [ ] User-facing documentation has been created in [ kubernetes/website] , for publication to [ kubernetes.io]
145
- - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
144
+ - [x ] "Implementation History" section is up-to-date for milestone
145
+ - [x ] User-facing documentation has been created in [ kubernetes/website] , for publication to [ kubernetes.io]
146
+ - [x ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
146
147
147
148
<!--
148
149
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -228,6 +229,11 @@ and make progress.
228
229
- Enforce updating the Pod's conditions to expose more context for scheduling-paused Pods.
229
230
- Focus on in-house use-cases of the Enqueue extension point.
230
231
232
+ ### Future Work
233
+
234
+ - Permission control on individual schedulingGate (via
235
+ [ fine-grained permissions] ( https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4 ) )
236
+
231
237
## Proposal
232
238
233
239
<!--
@@ -307,10 +313,16 @@ API Server will return a validation error.
307
313
| unscheduled Pod<br>(nil <tt>nodeName</tt>) | ✅ create<br>❌ update | ✅ create<br>✅ update |
308
314
| scheduled Pod<br>(non-nil <tt>nodeName</tt>) | ❌ create<br>❌ update | ✅ create<br>✅ update |
309
315
310
- - ** New field disabled in Alpha but not scheduler extension:** In Alpha, the new Pod field is disabled
311
- by default. However, the scheduler's extension point is activated no matter the feature gate is enabled
312
- or not. This enables scheduler plugin developers to tryout the feature even in Alpha, by crafting
313
- different enqueue plugins and wire with custom fields or conditions.
316
+ - ** New field disabled in Alpha but not scheduler extension:** At a high level, this feature contains
317
+ the following parts:
318
+
319
+ - a. Pod's spec, feature gate, and other misc bits describing the field ` schedulingGates `
320
+ - b. the ` SchedulingGates ` scheduler plugin
321
+ - c. the underlying scheduler framework, i.e., the new PreEnqueue extension point
322
+
323
+ If the feature is disabled, parts ` a ` and ` b ` are disabled, but ` c ` stays enabled regardless. This implies
324
+ a scheduler plugin developer can leverage part ` c ` _ only_ to craft their own ` a' ` and ` b' ` to
325
+ accomplish their own feature set.
314
326
315
327
- ** New phase literal in kubectl:** To provide better UX, we're going to add a new phase literal
316
328
` SchedulingPaused ` to the "phase" column of ` kubectl get pod ` . This new literal indicates whether it's
@@ -521,10 +533,9 @@ The following scenarios need to be covered in integration tests:
521
533
can be moved back to activeQ when ` .spec.schedulingGates ` is all cleared
522
534
- Ensure no significant performance degradation
523
535
524
- - ` test/integration/scheduler/queue_test.go ` : Will add new tests.
525
- - ` test/integration/scheduler/plugins/plugins_test.go ` : Will add new tests.
526
- - ` test/integration/scheduler/enqueue/enqueue_test.go ` : Will add new tests.
527
- - ` test/integration/scheduler_perf/scheduler_perf_test.go ` : https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
536
+ - ` test/integration/scheduler/queue_test.go#TestSchedulingGates ` : [ k8s-triage] ( https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestSchedulingGates )
537
+ - ` test/integration/scheduler/plugins/plugins_test.go#TestPreEnqueuePlugin ` : [ k8s-triage] ( https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestPreEnqueuePlugin )
538
+ - ` test/integration/scheduler_perf/scheduler_perf_test.go ` : will be added in Beta. (https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling )
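
The gating behavior these integration tests exercise can be illustrated with a small model. This is a simplified Python sketch, not the kube-scheduler's Go implementation: the class `SchedulingQueue`, the function `scheduling_gates_plugin`, and the gate name `example.com/quota-check` are hypothetical names chosen for illustration. It only shows the idea that a PreEnqueue check keeps a Pod out of activeQ while `.spec.schedulingGates` is non-empty.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pod:
    name: str
    scheduling_gates: List[str] = field(default_factory=list)


# A PreEnqueue plugin (modeled as a function) returns True when the Pod
# may enter activeQ.
PreEnqueuePlugin = Callable[[Pod], bool]


def scheduling_gates_plugin(pod: Pod) -> bool:
    """Admit the Pod only when it has no scheduling gates left."""
    return len(pod.scheduling_gates) == 0


class SchedulingQueue:
    """Hypothetical, heavily simplified stand-in for the scheduler queue."""

    def __init__(self, plugins: List[PreEnqueuePlugin]):
        self.plugins = plugins
        self.active: List[str] = []  # Pods eligible for a scheduling cycle
        self.gated: List[str] = []   # Pods held back by a PreEnqueue plugin

    def add_or_update(self, pod: Pod) -> None:
        # Remove any previous placement, then re-evaluate every plugin.
        for q in (self.active, self.gated):
            if pod.name in q:
                q.remove(pod.name)
        admitted = all(plugin(pod) for plugin in self.plugins)
        (self.active if admitted else self.gated).append(pod.name)


queue = SchedulingQueue([scheduling_gates_plugin])
pod = Pod("demo", scheduling_gates=["example.com/quota-check"])
queue.add_or_update(pod)          # held back while a gate is present
gated_before = list(queue.gated)

pod.scheduling_gates.clear()      # a controller removes the last gate
queue.add_or_update(pod)          # the update moves the Pod to activeQ
```

A custom plugin could be modeled the same way by appending another function to the `plugins` list; the Pod is admitted only when every plugin agrees.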
528
539
529
540
##### e2e tests
530
541
@@ -538,15 +549,15 @@ https://storage.googleapis.com/k8s-triage/index.html
538
549
We expect no non-infra related flakes in the last month as a GA graduation criteria.
539
550
-->
540
551
541
- Create a test with the following sequences:
552
+ An e2e test was created in Alpha with the following sequence:
542
553
543
- - Provision a cluster with feature gate ` PodSchedulingReadiness=true ` (we may need to setup a testgrid
544
- for when it's alpha)
545
554
- Create a Pod with non-nil ` .spec.schedulingGates ` .
546
555
- Wait for 15 seconds to ensure (and then verify) it did not get scheduled.
547
556
- Clear the Pod's ` .spec.schedulingGates ` field.
548
557
- Wait for 5 seconds for the Pod to be scheduled; otherwise error the e2e test.
549
558
559
+ In beta, it will be enabled by default in the [ k8s-triage] ( https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=Feature%3APodSchedulingReadiness ) dashboard.
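
The e2e sequence above boils down to two bounded waits: one that must time out (the gated Pod stays unscheduled) and one that must succeed (the Pod is scheduled after its gates are cleared). The following Python sketch models that flow against a fake cluster, assuming a 1-second poll interval. `wait_for` and `FakeCluster` are hypothetical helpers written for this illustration and are not part of the Kubernetes e2e framework.

```python
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout_s: float, interval_s: float,
             sleep: Callable[[float], None]) -> bool:
    """Poll `condition` until it is true or `timeout_s` has elapsed."""
    waited = 0.0
    while waited < timeout_s:
        if condition():
            return True
        sleep(interval_s)
        waited += interval_s
    return condition()


class FakeCluster:
    """Hypothetical stand-in for a cluster: binds the Pod once its gates are empty."""

    def __init__(self):
        self.gates = ["example.com/foo"]
        self.node_name = None

    def tick(self, _seconds: float) -> None:
        # Replaces time.sleep; each tick lets the fake scheduler run once.
        if not self.gates:
            self.node_name = "node-1"

    def is_scheduled(self) -> bool:
        return self.node_name is not None


cluster = FakeCluster()
# The gated Pod must stay unscheduled for the whole 15-second window.
scheduled_while_gated = wait_for(cluster.is_scheduled, 15, 1, cluster.tick)
# Clear the gates, then expect the Pod to be scheduled within 5 seconds.
cluster.gates = []
scheduled_after_clear = wait_for(cluster.is_scheduled, 5, 1, cluster.tick)
```

Passing the sleep function in as a parameter keeps the sketch fast to run; a real test would wait on the wall clock instead.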
560
+
550
561
### Graduation Criteria
551
562
552
563
<!--
@@ -622,8 +633,6 @@ in back-to-back releases.
622
633
#### Beta
623
634
624
635
- Feature enabled by default.
625
- - Permission control on individual schedulingGate is applicable (via
626
- [ fine-grained permissions] ( https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4 ) ).
627
636
- Gather feedback from developers and out-of-tree plugins.
628
637
- Benchmark tests passed, and there is no performance degradation.
629
638
- Update documents to reflect the changes.
@@ -770,7 +779,7 @@ You can take a look at one potential example of such test in:
770
779
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
771
780
-->
772
781
773
- Appropriate tests will be added in Alpha . See [ Test Plan ] ( #test-plan ) for more details.
782
+ Appropriate tests have been added to the integration tests. See [ Integration tests ] ( #integration-tests ) for more details.
774
783
775
784
### Rollout, Upgrade and Rollback Planning
776
785
@@ -790,13 +799,26 @@ rollout. Similarly, consider large clusters and how enablement/disablement
790
799
will rollout across nodes.
791
800
-->
792
801
802
+ It shouldn't impact already running workloads. It's an opt-in feature, and users need to set
803
+ ` .spec.schedulingGates ` field to use this feature.
804
+
805
+ When this feature is disabled by the feature flag, the already created Pod's ` .spec.schedulingGates `
806
+ field is preserved; however, the newly created Pod's ` .spec.schedulingGates ` field is silently dropped.
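
The preserve-or-drop rule described above can be written as a small function. This is a Python model of the described behavior only; the function name `drop_disabled_scheduling_gates` and the dict-based Pod spec are assumptions made for illustration, and the real API server logic is written in Go.

```python
from typing import Optional


def drop_disabled_scheduling_gates(new_spec: dict, old_spec: Optional[dict],
                                   feature_enabled: bool) -> dict:
    """Return the spec to persist under the rule described above.

    `old_spec` is None on create and the stored spec on update.
    """
    in_use = bool(old_spec and old_spec.get("schedulingGates"))
    if feature_enabled or in_use:
        return new_spec
    # Feature off and no pre-existing gates: silently drop the field.
    return {k: v for k, v in new_spec.items() if k != "schedulingGates"}


gates = [{"name": "example.com/foo"}]  # hypothetical gate name

# Create with the feature disabled: the field is dropped.
created = drop_disabled_scheduling_gates(
    {"containers": [], "schedulingGates": gates}, None, feature_enabled=False)

# Update of a Pod that already carries gates: the field is preserved.
updated = drop_disabled_scheduling_gates(
    {"containers": [], "schedulingGates": gates},
    {"containers": [], "schedulingGates": gates},
    feature_enabled=False)
```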
807
+
793
808
###### What specific metrics should inform a rollback?
794
809
795
810
<!--
796
811
What signals should users be paying attention to when the feature is young
797
812
that might indicate a serious problem?
798
813
-->
799
814
815
+ A rollback might be considered if the metric ` scheduler_pending_pods{queue="gated"} ` stays at a
816
+ high watermark for a long time since it may indicate that some controllers are not properly handling
817
+ removing the scheduling gates, which causes the pods to stay in the pending state.
818
+
819
+ Another indicator for rollback is when the 90th-percentile value of the metric ` scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"} `
820
+ steadily exceeds 100ms.
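
To make the 100ms criterion concrete, the sketch below estimates a percentile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is a common way to read histogram metrics. It is a minimal Python illustration: the bucket boundaries and counts are made-up sample data, not real output of `scheduler_plugin_execution_duration_seconds`, and the function is written for this example rather than taken from any monitoring library.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: (upper bound in seconds, observations).
sample = [(0.01, 50), (0.05, 80), (0.1, 85), (0.5, 100), (math.inf, 100)]
p90 = histogram_quantile(0.9, sample)
rollback_signal = p90 > 0.1  # the 100ms threshold named above
```

A single reading above the threshold is not enough by itself; the text asks for the value to exceed 100ms steadily, so the check would be evaluated over a sustained window.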
821
+
800
822
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
801
823
802
824
<!--
@@ -805,12 +827,20 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
805
827
are missing a bunch of machinery and tooling and can't do that now.
806
828
-->
807
829
830
+ It will be tested manually prior to beta launch.
831
+
832
+ <<UNRESOLVED >>
833
+ Add detailed scenarios and result here, and cc @wojtek-t .
834
+ <</UNRESOLVED >>
835
+
808
836
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
809
837
810
838
<!--
811
839
Even if applying deprecation policies, they may still surprise some users.
812
840
-->
813
841
842
+ No.
843
+
814
844
### Monitoring Requirements
815
845
816
846
<!--
@@ -820,13 +850,19 @@ For GA, this section is required: approvers should be able to confirm the
820
850
previous answers based on experience in the field.
821
851
-->
822
852
823
- Add a label ` notReady ` to the existing metric ` pending_pods ` to distinguish unschedulable Pods:
853
+ A new "queue" type ` gated ` is added to the existing metric ` scheduler_pending_pods ` to distinguish
854
+ unschedulable Pods:
824
855
825
- - ` unschedulable ` (existing): scheduler tried but cannot find any Node to host the Pod
826
- - ` notReady ` (new): scheduler respect the Pod's present ` schedulingGates ` and hence not schedule it
856
+ - ` scheduler_pending_pods{queue="unschedulable"} ` (existing): scheduler tried but cannot find any
857
+ Node to host the Pod
858
+ - ` scheduler_pending_pods{queue="gated"} ` (new): scheduler respects the Pod's present ` schedulingGates `
859
+ and hence does not schedule it
860
+
861
+ The metric ` scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"} ` provides a histogram
862
+ showing the Nth-percentile execution duration of the SchedulingGates plugin.
827
863
828
864
Moreover, to explicitly indicate a Pod's scheduling-unready state, a condition
829
- ` {type:PodScheduled, reason:WaitingForGates } ` is introduced.
865
+ ` {type:PodScheduled, reason:SchedulingGated } ` is introduced.
830
866
831
867
###### How can an operator determine if the feature is in use by workloads?
832
868
@@ -836,6 +872,10 @@ checking if there are objects with field X set) may be a last resort. Avoid
836
872
logs or events for this purpose.
837
873
-->
838
874
875
+ - observe a non-zero value for the metric ` scheduler_pending_pods{queue="gated"} `
876
+ - observe entries for the metric ` scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"} `
877
+ - observe a non-empty value in a Pod's ` .spec.schedulingGates ` field
878
+
839
879
###### How can someone using this feature know that it is working for their instance?
840
880
841
881
<!--
@@ -854,6 +894,11 @@ Recall that end users cannot usually observe component logs or access metrics.
854
894
855
895
- [x] API .spec
856
896
- Other field: ` schedulingGates `
897
+ - [x] Events
898
+ - Event Type: PodScheduled
899
+ - Event Status: False
900
+ - Event Reason: SchedulingGated
901
+ - Event Message: Scheduling is blocked due to non-empty scheduling gates
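
A client could detect the gated state by inspecting the values listed above. The following Python sketch assumes they surface as an entry in the Pod's `.status.conditions`, in line with the `{type:PodScheduled, reason:SchedulingGated}` condition this document introduces. The helper name `is_scheduling_gated` and the sample Pod dicts are hypothetical.

```python
def is_scheduling_gated(pod: dict) -> bool:
    """Report whether a Pod carries PodScheduled=False with reason SchedulingGated."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "PodScheduled"
        and c.get("status") == "False"
        and c.get("reason") == "SchedulingGated"
        for c in conditions
    )


gated_pod = {
    "spec": {"schedulingGates": [{"name": "example.com/foo"}]},
    "status": {"conditions": [{
        "type": "PodScheduled",
        "status": "False",
        "reason": "SchedulingGated",
        "message": "Scheduling is blocked due to non-empty scheduling gates",
    }]},
}
scheduled_pod = {
    "spec": {"nodeName": "node-1"},
    "status": {"conditions": [{"type": "PodScheduled", "status": "True"}]},
}
```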
857
902
858
903
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
859
904
@@ -872,18 +917,18 @@ These goals will help you determine what you need to measure (SLIs) in the next
872
917
question.
873
918
-->
874
919
920
+ N/A.
921
+
875
922
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
876
923
877
924
<!--
878
925
Pick one more of these and delete the rest.
879
926
-->
880
927
881
- - [ ] Metrics
882
- - Metric name:
883
- - [ Optional] Aggregation method:
884
- - Components exposing the metric:
885
- - [ ] Other (treat as last resort)
886
- - Details:
928
+ - [x] Metrics
929
+ - Metric name: scheduler_pending_pods{queue="gated"}
930
+ - Metric name: scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}
931
+ - Components exposing the metric: kube-scheduler
887
932
888
933
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
889
934
@@ -892,12 +937,16 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
892
937
implementation difficulties, etc.).
893
938
-->
894
939
940
+ N/A.
941
+
895
942
### Dependencies
896
943
897
944
<!--
898
945
This section must be completed when targeting beta to a release.
899
946
-->
900
947
948
+ N/A.
949
+
901
950
###### Does this feature depend on any specific services running in the cluster?
902
951
903
952
<!--
@@ -915,6 +964,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
915
964
- Impact of its degraded performance or high-error rates on the feature:
916
965
-->
917
966
967
+ N/A.
968
+
918
969
### Scalability
919
970
920
971
<!--
@@ -942,6 +993,9 @@ Focusing mostly on:
942
993
heartbeats, leader election, etc.)
943
994
-->
944
995
996
+ The feature itself doesn't generate API calls. However, the API Server is expected to receive
997
+ additional update/patch requests from external controllers that mutate the scheduling gates.
998
+
945
999
###### Will enabling / using this feature result in introducing new API types?
946
1000
947
1001
<!--
@@ -951,6 +1005,8 @@ Describe them, providing:
951
1005
- Supported number of objects per namespace (for namespace-scoped objects)
952
1006
-->
953
1007
1008
+ No.
1009
+
954
1010
###### Will enabling / using this feature result in any new calls to the cloud provider?
955
1011
956
1012
<!--
@@ -959,6 +1015,8 @@ Describe them, providing:
959
1015
- Estimated increase:
960
1016
-->
961
1017
1018
+ No.
1019
+
962
1020
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
963
1021
964
1022
<!--
@@ -968,6 +1026,11 @@ Describe them, providing:
968
1026
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
969
1027
-->
970
1028
1029
+ - No for existing API objects that don't use this feature.
1030
+ - For API objects that use this feature:
1031
+ - API type: Pod
1032
+ - Estimated increase in size: new field ` .spec.schedulingGates ` of about 64 bytes (in the case of 2 scheduling gates)
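
As a rough sanity check of that order of magnitude, the sketch below measures the compact JSON encoding of a spec fragment with two gates. The gate names are hypothetical, and the JSON size is only an approximation: the encoding actually used for storage may differ, and longer gate names increase the size accordingly.

```python
import json

# Hypothetical spec fragment with two scheduling gates.
spec_fragment = {
    "schedulingGates": [
        {"name": "example.com/foo"},
        {"name": "example.com/bar"},
    ]
}

# Compact separators avoid counting optional whitespace.
encoded = json.dumps(spec_fragment, separators=(",", ":"))
size_bytes = len(encoded.encode("utf-8"))
```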
1033
+
971
1034
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
972
1035
973
1036
<!--
@@ -979,6 +1042,8 @@ Think about adding additional work or introducing new steps in between
979
1042
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
980
1043
-->
981
1044
1045
+ No.
1046
+
982
1047
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
983
1048
984
1049
<!--
@@ -991,6 +1056,8 @@ This through this both in small and large cases, again with respect to the
991
1056
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
992
1057
-->
993
1058
1059
+ No.
1060
+
994
1061
### Troubleshooting
995
1062
996
1063
<!--
@@ -1006,6 +1073,12 @@ details). For now, we leave it here.
1006
1073
1007
1074
###### How does this feature react if the API server and/or etcd is unavailable?
1008
1075
1076
+ During the downtime of API server and/or etcd:
1077
+
1078
+ - Running workloads that don't need to remove their scheduling gates function well.
1079
+ - Running workloads that need to update their scheduling gates will stay in scheduling gated state
1080
+ as API requests will be rejected.
1081
+
1009
1082
###### What are other known failure modes?
1010
1083
1011
1084
<!--
@@ -1021,8 +1094,13 @@ For each of them, fill in the following information by copying the below templat
1021
1094
- Testing: Are there any tests for failure mode? If not, describe why.
1022
1095
-->
1023
1096
1097
+ In a highly-available cluster, if there are version-skewed API Servers, some update requests
1098
+ may get accepted and some may get rejected.
1099
+
1024
1100
###### What steps should be taken if SLOs are not being met to determine the problem?
1025
1101
1102
+ N/A.
1103
+
1026
1104
## Implementation History
1027
1105
1028
1106
<!--
@@ -1036,7 +1114,8 @@ Major milestones might include:
1036
1114
- when the KEP was retired or superseded
1037
1115
-->
1038
1116
1039
- - 2022-09-16 - Initial Proposal
1117
+ - 2022-09-16: Initial KEP
1118
+ - 2023-01-14: Graduate the feature to Beta
1040
1119
1041
1120
## Drawbacks
1042
1121
@@ -1052,11 +1131,11 @@ not need to be as detailed as the proposal, but should include enough
1052
1131
information to express the idea and why it was not acceptable.
1053
1132
-->
1054
1133
1055
- - Define a boolean filed ` .spec.schedulingPaused ` . Its value is optionally initialized to ` True ` to
1134
+ Define a boolean field ` .spec.schedulingPaused ` . Its value is optionally initialized to ` True ` to
1056
1135
indicate it's not scheduling-ready, and flipped to ` False ` (by a controller) afterwards to trigger
1057
1136
this Pod's scheduling cycle.
1058
1137
1059
- This approach is not chosen because it cannot support multiple independent controllers to control
1138
+ This approach is not chosen because it cannot support multiple independent controllers to control
1060
1139
specific aspects of a Pod's scheduling readiness.
1061
1140
1062
1141
## Infrastructure Needed (Optional)