Commit ac5dae5

Merge pull request kubernetes#3739 from Huang-Wei/pod-readiness-gate-beta
KEP-3521: Graduate PodSchedulingReadiness to beta
2 parents 6ba9584 + 462309e commit ac5dae5

3 files changed

+114
-33
lines changed

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,5 @@
11
kep-number: 3521
22
alpha:
33
approver: "@wojtek-t"
4+
beta:
5+
approver: "@wojtek-t"

keps/sig-scheduling/3521-pod-scheduling-readiness/README.md

Lines changed: 109 additions & 30 deletions
@@ -76,6 +76,7 @@ tags, and then generate with `hack/update-toc.sh`.
7676
- [Motivation](#motivation)
7777
- [Goals](#goals)
7878
- [Non-Goals](#non-goals)
79+
- [Future Work](#future-work)
7980
- [Proposal](#proposal)
8081
- [User Stories (Optional)](#user-stories-optional)
8182
- [Story 1](#story-1)
@@ -140,9 +141,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
140141
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
141142
- [ ] (R) Production readiness review completed
142143
- [ ] (R) Production readiness review approved
143-
- [ ] "Implementation History" section is up-to-date for milestone
144-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
145-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
144+
- [x] "Implementation History" section is up-to-date for milestone
145+
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
146+
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
146147

147148
<!--
148149
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -228,6 +229,11 @@ and make progress.
228229
- Enforce updating the Pod's conditions to expose more context for scheduling-paused Pods.
229230
- Focus on in-house use-cases of the Enqueue extension point.
230231

232+
### Future Work
233+
234+
- Permission control on individual schedulingGate (via
235+
[fine-grained permissions](https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4))
236+
231237
## Proposal
232238

233239
<!--
@@ -307,10 +313,16 @@ API Server will return a validation error.
307313
| unscheduled Pod<br>(nil <tt>nodeName</tt>) | ✅ create<br>❌ update | ✅ create<br>✅ update |
308314
| scheduled Pod<br>(non-nil <tt>nodeName</tt>) | ❌ create<br>❌ update | ✅ create<br>✅ update |
309315

310-
- **New field disabled in Alpha but not scheduler extension:** In Alpha, the new Pod field is disabled
311-
by default. However, the scheduler's extension point is activated no matter the feature gate is enabled
312-
or not. This enables scheduler plugin developers to tryout the feature even in Alpha, by crafting
313-
different enqueue plugins and wire with custom fields or conditions.
316+
- **New field disabled in Alpha but not scheduler extension:** At a high level, this feature contains
317+
the following parts:
318+
319+
- a. Pod's spec, feature gate, and other misc bits describing the field `schedulingGates`
320+
- b. the `SchedulingGates` scheduler plugin
321+
- c. the underlying scheduler framework, i.e., the new PreEnqueue extension point
322+
323+
If the feature is disabled, parts `a` and `b` are disabled, but `c` remains enabled anyway. This implies
324+
a scheduler plugin developer can leverage part `c` _only_ to craft their own `a'` and `b'` to
325+
accomplish their own feature set.
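
To illustrate how parts `b` and `c` relate, here is a minimal sketch in Go. It does not use the real scheduler framework: `Pod`, `PreEnqueuePlugin`, `gatesPlugin`, and `admit` are hypothetical, simplified stand-ins, and the real PreEnqueue extension point has a different signature. The idea shown is only that a Pod enters the active queue when every PreEnqueue plugin admits it.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for the real Pod type.
type Pod struct {
	Name            string
	SchedulingGates []string // mirrors the gate names in .spec.schedulingGates
}

// PreEnqueuePlugin is a hypothetical, simplified view of part "c":
// a hook that decides whether a Pod may enter the active queue.
type PreEnqueuePlugin interface {
	Name() string
	// PreEnqueue returns nil to admit the Pod, or an error explaining why it is held back.
	PreEnqueue(p *Pod) error
}

// gatesPlugin plays the role of part "b": it holds a Pod while any gate remains.
type gatesPlugin struct{}

func (gatesPlugin) Name() string { return "SchedulingGates" }

func (gatesPlugin) PreEnqueue(p *Pod) error {
	if len(p.SchedulingGates) == 0 {
		return nil
	}
	return fmt.Errorf("waiting for scheduling gates: %v", p.SchedulingGates)
}

// admit runs every plugin in order; the Pod is admitted only if all of them agree.
func admit(p *Pod, plugins []PreEnqueuePlugin) (bool, string) {
	for _, pl := range plugins {
		if err := pl.PreEnqueue(p); err != nil {
			return false, pl.Name() + ": " + err.Error()
		}
	}
	return true, ""
}

func main() {
	plugins := []PreEnqueuePlugin{gatesPlugin{}}
	pod := &Pod{Name: "demo", SchedulingGates: []string{"example.com/quota"}}

	ok, reason := admit(pod, plugins)
	fmt.Println(ok, reason)

	pod.SchedulingGates = nil // a controller clears the gate
	ok, _ = admit(pod, plugins)
	fmt.Println(ok)
}
```

A developer building their own `a'` and `b'` would, under these assumptions, supply another `PreEnqueuePlugin` implementation that reads a custom field or condition instead of `SchedulingGates`.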
314326

315327
- **New phase literal in kubectl:** To provide better UX, we're going to add a new phase literal
316328
`SchedulingPaused` to the "phase" column of `kubectl get pod`. This new literal indicates whether it's
@@ -521,10 +533,9 @@ The following scenarios need to be covered in integration tests:
521533
can be moved back to activeQ when `.spec.schedulingGates` is all cleared
522534
- Ensure no significant performance degradation
523535

524-
- `test/integration/scheduler/queue_test.go`: Will add new tests.
525-
- `test/integration/scheduler/plugins/plugins_test.go`: Will add new tests.
526-
- `test/integration/scheduler/enqueue/enqueue_test.go`: Will add new tests.
527-
- `test/integration/scheduler_perf/scheduler_perf_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
536+
- `test/integration/scheduler/queue_test.go#TestSchedulingGates`: [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestSchedulingGates)
537+
- `test/integration/scheduler/plugins/plugins_test.go#TestPreEnqueuePlugin`: [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestPreEnqueuePlugin)
538+
- `test/integration/scheduler_perf/scheduler_perf_test.go`: will add in Beta. (https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling)
528539

529540
##### e2e tests
530541

@@ -538,15 +549,15 @@ https://storage.googleapis.com/k8s-triage/index.html
538549
We expect no non-infra related flakes in the last month as a GA graduation criteria.
539550
-->
540551

541-
Create a test with the following sequences:
552+
An e2e test was created in Alpha with the following sequence:
542553

543-
- Provision a cluster with feature gate `PodSchedulingReadiness=true` (we may need to setup a testgrid
544-
for when it's alpha)
545554
- Create a Pod with non-nil `.spec.schedulingGates`.
546555
- Wait for 15 seconds to ensure (and then verify) it did not get scheduled.
547556
- Clear the Pod's `.spec.schedulingGates` field.
548557
- Wait for 5 seconds for the Pod to be scheduled; otherwise error the e2e test.
549558

559+
In beta, it will be enabled by default in the [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=Feature%3APodSchedulingReadiness) dashboard.
560+
550561
### Graduation Criteria
551562

552563
<!--
@@ -622,8 +633,6 @@ in back-to-back releases.
622633
#### Beta
623634

624635
- Feature enabled by default.
625-
- Permission control on individual schedulingGate is applicable (via
626-
[fine-grained permissions](https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4)).
627636
- Gather feedback from developers and out-of-tree plugins.
628637
- Benchmark tests passed, and there is no performance degradation.
629638
- Update documents to reflect the changes.
@@ -770,7 +779,7 @@ You can take a look at one potential example of such test in:
770779
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
771780
-->
772781

773-
Appropriate tests will be added in Alpha. See [Test Plan](#test-plan) for more details.
782+
Appropriate tests have been added in the integration tests. See [Integration tests](#integration-tests) for more details.
774783

775784
### Rollout, Upgrade and Rollback Planning
776785

@@ -790,13 +799,26 @@ rollout. Similarly, consider large clusters and how enablement/disablement
790799
will rollout across nodes.
791800
-->
792801

802+
It shouldn't impact already running workloads. It's an opt-in feature, and users need to set
803+
`.spec.schedulingGates` field to use this feature.
804+
805+
When this feature is disabled by the feature flag, the already created Pod's `.spec.schedulingGates`
806+
field is preserved; however, the newly created Pod's `.spec.schedulingGates` field is silently dropped.
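
The disabled-gate behavior described above can be sketched as follows. This is an illustration only, not the API server's actual code: `PodSpec` and `dropDisabledGates` are hypothetical names, and the sketch assumes that a create request carries no previously stored object (`oldSpec == nil`).

```go
package main

import "fmt"

// PodSpec is a hypothetical, minimal stand-in for the real Pod spec.
type PodSpec struct {
	SchedulingGates []string
}

// dropDisabledGates clears the gates on an incoming spec when the feature is
// disabled, unless the already stored object uses the field.
// On create, oldSpec is nil.
func dropDisabledGates(featureEnabled bool, newSpec, oldSpec *PodSpec) {
	if featureEnabled {
		return
	}
	// Preserve the field for Pods that were created while the feature was on.
	if oldSpec != nil && len(oldSpec.SchedulingGates) > 0 {
		return
	}
	newSpec.SchedulingGates = nil
}

func main() {
	created := &PodSpec{SchedulingGates: []string{"example.com/foo"}}
	dropDisabledGates(false, created, nil)
	fmt.Println("gates after create with feature off:", len(created.SchedulingGates))

	stored := &PodSpec{SchedulingGates: []string{"example.com/foo"}}
	updated := &PodSpec{SchedulingGates: []string{"example.com/foo"}}
	dropDisabledGates(false, updated, stored)
	fmt.Println("gates after update of existing Pod:", len(updated.SchedulingGates))
}
```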
807+
793808
###### What specific metrics should inform a rollback?
794809

795810
<!--
796811
What signals should users be paying attention to when the feature is young
797812
that might indicate a serious problem?
798813
-->
799814

815+
A rollback might be considered if the metric `scheduler_pending_pods{queue="gated"}` stays at a
816+
high watermark for a long time since it may indicate that some controllers are not properly handling
817+
removing the scheduling gates, which causes the pods to stay in pending state.
818+
819+
Another indicator for rollback is the 90th-percentile value of the metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}`
820+
steadily exceeding 100ms.
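
As a rough illustration of the second indicator, the sketch below estimates a quantile from cumulative histogram buckets by linear interpolation inside the matching bucket, similar in spirit to what Prometheus's `histogram_quantile` does, and compares the result with the 100ms threshold. The `bucket` type, the function names, and the sample bucket counts are all made up for the example, and the "steadily" part (evaluating over time) is not modeled.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: the number of observations
// whose duration is <= UpperBound (in seconds).
type bucket struct {
	UpperBound float64
	Count      float64
}

// quantile estimates the q-quantile (0 < q <= 1) from cumulative buckets,
// interpolating linearly inside the bucket that contains the target rank.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return math.NaN()
	}
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].UpperBound < sorted[j].UpperBound })

	total := sorted[len(sorted)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i].Count >= rank })

	// The +Inf bucket has no finite width; fall back to the last finite bound.
	if math.IsInf(sorted[i].UpperBound, 1) {
		if i == 0 {
			return math.NaN()
		}
		return sorted[i-1].UpperBound
	}
	lower, prev := 0.0, 0.0
	if i > 0 {
		lower, prev = sorted[i-1].UpperBound, sorted[i-1].Count
	}
	return lower + (sorted[i].UpperBound-lower)*(rank-prev)/(sorted[i].Count-prev)
}

// exceedsRollbackThreshold applies the 100ms threshold named in the text.
func exceedsRollbackThreshold(p90Seconds float64) bool {
	return p90Seconds > 0.1
}

func main() {
	// Invented sample data, not real scheduler measurements.
	sample := []bucket{{0.01, 50}, {0.1, 80}, {1, 100}, {math.Inf(1), 100}}
	p90 := quantile(0.9, sample)
	fmt.Printf("p90=%.3fs exceeds=%v\n", p90, exceedsRollbackThreshold(p90))
}
```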
821+
800822
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
801823

802824
<!--
@@ -805,12 +827,20 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
805827
are missing a bunch of machinery and tooling and can't do that now.
806828
-->
807829

830+
It will be tested manually prior to beta launch.
831+
832+
<<UNRESOLVED>>
833+
Add detailed scenarios and result here, and cc @wojtek-t.
834+
<</UNRESOLVED>>
835+
808836
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
809837

810838
<!--
811839
Even if applying deprecation policies, they may still surprise some users.
812840
-->
813841

842+
No.
843+
814844
### Monitoring Requirements
815845

816846
<!--
@@ -820,13 +850,19 @@ For GA, this section is required: approvers should be able to confirm the
820850
previous answers based on experience in the field.
821851
-->
822852

823-
Add a label `notReady` to the existing metric `pending_pods` to distinguish unschedulable Pods:
853+
A new "queue" type `gated` is added to the existing metric `scheduler_pending_pods` to distinguish
854+
unschedulable Pods:
824855

825-
- `unschedulable` (existing): scheduler tried but cannot find any Node to host the Pod
826-
- `notReady` (new): scheduler respect the Pod's present `schedulingGates` and hence not schedule it
856+
- `scheduler_pending_pods{queue="unschedulable"}` (existing): scheduler tried but cannot find any
857+
Node to host the Pod
858+
- `scheduler_pending_pods{queue="gated"}` (new): the scheduler respects the Pod's present `schedulingGates`
859+
and hence does not schedule it
860+
861+
The metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}` provides a histogram
862+
showing the Nth-percentile execution duration of the SchedulingGates plugin.
827863

828864
Moreover, to explicitly indicate a Pod's scheduling-unready state, a condition
829-
`{type:PodScheduled, reason:WaitingForGates}` is introduced.
865+
`{type:PodScheduled, reason:SchedulingGated}` is introduced.
830866

831867
###### How can an operator determine if the feature is in use by workloads?
832868

@@ -836,6 +872,10 @@ checking if there are objects with field X set) may be a last resort. Avoid
836872
logs or events for this purpose.
837873
-->
838874

875+
- observe non-zero value for the metric `scheduler_pending_pods{queue="gated"}`
876+
- observe entries for the metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}`
877+
- observe non-empty value in a Pod's `.spec.schedulingGates` field
878+
839879
###### How can someone using this feature know that it is working for their instance?
840880

841881
<!--
@@ -854,6 +894,11 @@ Recall that end users cannot usually observe component logs or access metrics.
854894

855895
- [x] API .spec
856896
- Other field: `schedulingGates`
897+
- [x] Events
898+
- Event Type: PodScheduled
899+
- Event Status: False
900+
- Event Reason: SchedulingGated
901+
- Event Message: Scheduling is blocked due to non-empty scheduling gates
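
A client could check for this state roughly as sketched below. The `PodCondition` struct here is a hypothetical, reduced stand-in for the real condition type, and `isSchedulingGated` is an illustrative helper, not an existing API; only the type, status, and reason values come from the list above.

```go
package main

import "fmt"

// PodCondition is a hypothetical, reduced stand-in for a Pod status condition.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

// isSchedulingGated reports whether the conditions show that scheduling
// is currently blocked by scheduling gates.
func isSchedulingGated(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == "PodScheduled" && c.Status == "False" && c.Reason == "SchedulingGated" {
			return true
		}
	}
	return false
}

func main() {
	conds := []PodCondition{{Type: "PodScheduled", Status: "False", Reason: "SchedulingGated"}}
	fmt.Println("gated:", isSchedulingGated(conds))
}
```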
857902

858903
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
859904

@@ -872,18 +917,18 @@ These goals will help you determine what you need to measure (SLIs) in the next
872917
question.
873918
-->
874919

920+
N/A.
921+
875922
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
876923

877924
<!--
878925
Pick one more of these and delete the rest.
879926
-->
880927

881-
- [ ] Metrics
882-
- Metric name:
883-
- [Optional] Aggregation method:
884-
- Components exposing the metric:
885-
- [ ] Other (treat as last resort)
886-
- Details:
928+
- [x] Metrics
929+
- Metric name: scheduler_pending_pods{queue="gated"}
930+
- Metric name: scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}
931+
- Components exposing the metric: kube-scheduler
887932

888933
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
889934

@@ -892,12 +937,16 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
892937
implementation difficulties, etc.).
893938
-->
894939

940+
N/A.
941+
895942
### Dependencies
896943

897944
<!--
898945
This section must be completed when targeting beta to a release.
899946
-->
900947

948+
N/A.
949+
901950
###### Does this feature depend on any specific services running in the cluster?
902951

903952
<!--
@@ -915,6 +964,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
915964
- Impact of its degraded performance or high-error rates on the feature:
916965
-->
917966

967+
N/A.
968+
918969
### Scalability
919970

920971
<!--
@@ -942,6 +993,9 @@ Focusing mostly on:
942993
heartbeats, leader election, etc.)
943994
-->
944995

996+
The feature itself doesn't generate API calls. But it's expected the API Server would receive
997+
additional update/patch requests to mutate the scheduling gates, by external controllers.
998+
945999
###### Will enabling / using this feature result in introducing new API types?
9461000

9471001
<!--
@@ -951,6 +1005,8 @@ Describe them, providing:
9511005
- Supported number of objects per namespace (for namespace-scoped objects)
9521006
-->
9531007

1008+
No.
1009+
9541010
###### Will enabling / using this feature result in any new calls to the cloud provider?
9551011

9561012
<!--
@@ -959,6 +1015,8 @@ Describe them, providing:
9591015
- Estimated increase:
9601016
-->
9611017

1018+
No.
1019+
9621020
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
9631021

9641022
<!--
@@ -968,6 +1026,11 @@ Describe them, providing:
9681026
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
9691027
-->
9701028

1029+
- No for existing API objects that don't use this feature.
1030+
- For API objects that use this feature:
1031+
- API type: Pod
1032+
- Estimated increase in size: new field `.spec.schedulingGates`, about 64 bytes (in the case of 2 scheduling gates)
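
For a feel of where such an estimate comes from, the sketch below measures the JSON encoding of a list of gates. It is an approximation under stated assumptions: `podSchedulingGate` is a local stand-in with a single `name` field, the gate names are invented, and the stored size in a real cluster depends on the serialization format and on the actual gate names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podSchedulingGate is a local stand-in for one entry of .spec.schedulingGates.
type podSchedulingGate struct {
	Name string `json:"name"`
}

// gatesJSONSize returns the number of bytes the gates occupy when encoded as JSON.
func gatesJSONSize(names ...string) (int, error) {
	gates := make([]podSchedulingGate, 0, len(names))
	for _, n := range names {
		gates = append(gates, podSchedulingGate{Name: n})
	}
	b, err := json.Marshal(gates)
	if err != nil {
		return 0, err
	}
	return len(b), nil
}

func main() {
	// Invented gate names, used only to get an order of magnitude.
	n, err := gatesJSONSize("example.com/foo", "example.com/bar")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Printf("two gates serialize to %d bytes of JSON\n", n)
}
```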
1033+
9711034
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
9721035

9731036
<!--
@@ -979,6 +1042,8 @@ Think about adding additional work or introducing new steps in between
9791042
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
9801043
-->
9811044

1045+
No.
1046+
9821047
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
9831048

9841049
<!--
@@ -991,6 +1056,8 @@ This through this both in small and large cases, again with respect to the
9911056
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
9921057
-->
9931058

1059+
No.
1060+
9941061
### Troubleshooting
9951062

9961063
<!--
@@ -1006,6 +1073,12 @@ details). For now, we leave it here.
10061073

10071074
###### How does this feature react if the API server and/or etcd is unavailable?
10081075

1076+
During the downtime of API server and/or etcd:
1077+
1078+
- Running workloads that don't need to remove their scheduling gates function well.
1079+
- Running workloads that need to update their scheduling gates will stay in scheduling gated state
1080+
as API requests will be rejected.
1081+
10091082
###### What are other known failure modes?
10101083

10111084
<!--
@@ -1021,8 +1094,13 @@ For each of them, fill in the following information by copying the below templat
10211094
- Testing: Are there any tests for failure mode? If not, describe why.
10221095
-->
10231096

1097+
In a highly-available cluster, if there are skewed API Servers, some update requests
1098+
may get accepted and some may get rejected.
1099+
10241100
###### What steps should be taken if SLOs are not being met to determine the problem?
10251101

1102+
N/A.
1103+
10261104
## Implementation History
10271105

10281106
<!--
@@ -1036,7 +1114,8 @@ Major milestones might include:
10361114
- when the KEP was retired or superseded
10371115
-->
10381116

1039-
- 2022-09-16 - Initial Proposal
1117+
- 2022-09-16: Initial KEP
1118+
- 2023-01-14: Graduate the feature to Beta
10401119

10411120
## Drawbacks
10421121

@@ -1052,11 +1131,11 @@ not need to be as detailed as the proposal, but should include enough
10521131
information to express the idea and why it was not acceptable.
10531132
-->
10541133

1055-
- Define a boolean filed `.spec.schedulingPaused`. Its value is optionally initialized to `True` to
1134+
Define a boolean field `.spec.schedulingPaused`. Its value is optionally initialized to `True` to
10561135
indicate it's not scheduling-ready, and flipped to `False` (by a controller) afterwards to trigger
10571136
this Pod's scheduling cycle.
10581137

1059-
This approach is not chosen because it cannot support multiple independent controllers to control
1138+
This approach is not chosen because it cannot support multiple independent controllers to control
10601139
specific aspects of a Pod's scheduling readiness.
10611140

10621141
## Infrastructure Needed (Optional)

keps/sig-scheduling/3521-pod-scheduling-readiness/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -18,12 +18,12 @@ see-also:
1818
- "/keps/sig-scheduling/624-scheduling-framework"
1919

2020
# The target maturity stage in the current dev cycle for this KEP.
21-
stage: alpha
21+
stage: beta
2222

2323
# The most recent milestone for which work toward delivery of this KEP has been
2424
# done. This can be the current (upcoming) milestone, if it is being actively
2525
# worked on.
26-
latest-milestone: "v1.26"
26+
latest-milestone: "v1.27"
2727

2828
# The milestone at which this feature was, or is targeted to be, at each stage.
2929
milestone:
@@ -42,4 +42,4 @@ disable-supported: true
4242

4343
# The following PRR answers are required at beta release
4444
metrics:
45-
- TBD
45+
- scheduler_pending_pods{queue="gated"}
