
Commit 40ce94d

Merge pull request kubernetes#2011 from alculquicondor/default-spread
DefaultPodTopologySpread graduation to Beta
2 parents aec9b86 + 2d59bef commit 40ce94d

2 files changed (+223, -16 lines)

keps/sig-scheduling/1258-default-pod-topology-spread/README.md

Lines changed: 219 additions & 14 deletions
@@ -24,6 +24,14 @@
 - [Test Plan](#test-plan)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha (v1.19):](#alpha-v119)
+- [Beta (v1.20):](#beta-v120)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring Requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Alternatives](#alternatives)
 <!-- /toc -->
@@ -33,9 +41,9 @@
 - [x] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
 - [x] KEP approvers have set the KEP status to `implementable`
 - [x] Design details are appropriately documented
-- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
-- [ ] Graduation criteria is in place
-- [ ] "Implementation History" section is up-to-date for milestone
+- [x] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [x] Graduation criteria is in place
+- [x] "Implementation History" section is up-to-date for milestone
 - [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [x] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

@@ -95,10 +103,9 @@ As a workload author, I want to spread the workload in the cluster, but:

 ### Implementation Details/Notes/Constraints

-
 #### Feature gate

-Setting a default for `PodTopologySpread` will be guarded with the feature gate
+Setting a default for `PodTopologySpread` is guarded with the feature gate
 `DefaultPodTopologySpread`.

 #### Relationship with "SelectorSpread" plugin
@@ -138,14 +145,16 @@ Values are decoded from the `pluginConfig` slice in the kube-scheduler Component
 ```go
 // pkg/scheduler/apis/config/types_pluginargs.go
 type PodTopologySpreadArgs struct {
-  // DefaultConstraints defines topology spread constraints to be applied to pods
-  // that don't define any in `pod.spec.topologySpreadConstraints`. Pod selectors must
-  // be empty, as they are deduced from the resources that the pod belongs to
-  // (includes services, replication controllers, replica sets and stateful sets).
-  // If not specified, the scheduler applies the following default constraints:
-  // <default rules go here. See next section>
-  // +optional
-  DefaultConstraints []corev1.TopologySpreadConstraint
+  // DefaultConstraints defines topology spread constraints to be applied to pods
+  // that don't define any in `pod.spec.topologySpreadConstraints`. Pod selectors must
+  // be empty, as they are deduced from the pod's membership in
+  // Services, ReplicationControllers, ReplicaSets or StatefulSets.
+  // If empty, the default constraints prefer to spread Pods across Nodes and Zones.
+  DefaultConstraints []corev1.TopologySpreadConstraint
+  // DisableDefaultConstraints allows disabling DefaultConstraints. Defaults to false.
+  // When set to true, DefaultConstraints must be empty or nil.
+  // +optional
+  DisableDefaultConstraints bool
 }
 ```
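These arguments are set per profile in the scheduler's component config. For illustration, a minimal sketch of overriding the default constraints, assuming the `v1beta1` component-config API used elsewhere in this KEP (the constraint values are illustrative, not the built-in defaults):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            # Illustrative constraint; label selectors must be left empty because
            # they are deduced from the Pod's Service/ReplicationController/
            # ReplicaSet/StatefulSet membership, as described above.
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
```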

@@ -249,6 +258,10 @@ To ensure this feature to be rolled out in high quality. Following tests are man
 - **Integration Tests**: One integration test for the default rules and one for custom rules.
 - **Benchmark Tests**: A benchmark test that compare the default rules against `SelectorSpreadingPriority`.
 The performance should be as close as possible.
+[Beta] There should not be any significant degradation in scheduler performance in clusterloader benchmarks
+for vanilla workloads.
+- **E2E/Conformance Tests**: Test "Multi-AZ Clusters should spread the pods of a {replication controller, service} across zones" should pass.
+This test is currently broken in 5k-node clusters.

 ### Graduation Criteria

@@ -259,13 +272,205 @@ To ensure this feature to be rolled out in high quality. Following tests are man
 - [x] Score extension point implementation. Add support for `maxSkew`.
 - [x] Filter extension point implementation.
 - [x] Disabling `SelectorSpread` when the feature is enabled.
-- [x] Unit, Integration and benchmark test cases mentioned in the [Test Plan](#test-plan).
+- [x] Unit and benchmark test cases mentioned in the [Test Plan](#test-plan).
+
+#### Beta (v1.20):
+
+- [ ] Finalize implementation:
+  - [ ] Map `SelectorSpreadingPriority` to `PodTopologySpread` when using Policy API.
+  - [ ] Provide knob for disabling the k8s default constraints.
+- [ ] Integration tests.
+- [ ] Verify conformance tests passing.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [x] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name: `DefaultPodTopologySpread`
+    - Components depending on the feature gate: `kube-scheduler`
+  - [x] Other
+    - Describe the mechanism:
+
+      Explicitly disable default spreading constraints for the `PodTopologySpread` plugin in the kube-scheduler config (passed via the `--config` command-line flag):
+
+      ```yaml
+      apiVersion: kubescheduler.config.k8s.io/v1beta1
+      kind: KubeSchedulerConfiguration
+      profiles:
+        - pluginConfig:
+            - name: PodTopologySpread
+              args:
+                disableDefaultConstraints: true
+      ```
+
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+
+    Only kube-scheduler needs to be restarted.
+
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+    No.
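The feature gate itself is toggled via the kube-scheduler command line. A minimal sketch, assuming a kubeadm-style static Pod manifest (file path, image tag and the `--config` path are illustrative):

```yaml
# Illustrative excerpt of /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: k8s.gcr.io/kube-scheduler:v1.20.0              # illustrative tag
      command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml    # illustrative path
        - --feature-gates=DefaultPodTopologySpread=true     # set to false to disable
```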
+
+* **Does enabling the feature change any default behavior?**
+
+Yes. Users might experience more spreading of Pods across Nodes and Zones in certain topology distributions.
+In particular, this will be more noticeable in clusters with more than 100 nodes.
+
+The [default configuration](#default-constraints) was chosen to produce a behavior that closely resembles
+the `SelectorSpread` plugin.
+See [this PR description](https://github.com/kubernetes/kubernetes/pull/91793) for simulation data.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+the enablement)?**
+
+Yes. Once disabled, only scheduling of new Pods will be affected.
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+
+Only scheduling of new Pods is affected.
+
+* **Are there any tests for feature enablement/disablement?**
+
+There are unit tests in `pkg/scheduler/algorithmprovider/registry_test.go` that validate the list of default
+`kube-scheduler` plugins with the Feature Gate enabled and disabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+* **How can a rollout fail? Can it impact already running workloads?**
+
+Running workloads are not affected, since `kube-scheduler` only influences the placement of new Pods.
+
+* **What specific metrics should inform a rollback?**
+
+Primarily scheduling latency metrics, such as `framework_extension_point_duration_seconds`, `scheduling_algorithm_duration_seconds`
+and `e2e_scheduling_duration_seconds`; a significant increase in any of them should inform a rollback.
+
+Since spreading is affected, node utilization might change.
+Utilization metrics can be queried from the `/metrics/resource` endpoint exposed by kubelet.
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+
+N/A.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+fields of API types, flags, etc.?**
+
+TBD for GA.
+
+### Monitoring Requirements
+
+* **How can an operator determine if the feature is in use by workloads?**
+
+All Pods are affected, unless they have explicit spreading constraints (`.spec.topologySpreadConstraints`).
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+the health of the service?**
+
+- [x] Metrics
+  - Metric name: `framework_extension_point_duration_seconds` with label `extension_point` values `PreScore` and/or `Score`.
+  - [Optional] Aggregation method:
+  - Components exposing the metric: `kube-scheduler`.
+- [ ] Other (treat as last resort)
+  - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+
+For 100 nodes, with a 4-core master:
+
+- Latency for PreScore+Score less than 60ms at the 99th percentile.
+- Latency for PreScore+Score less than 15ms at the 95th percentile.
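As a sketch of tracking this objective, assuming Prometheus scrapes kube-scheduler and the metric's histogram buckets are retained (the 60ms threshold mirrors the target above; rule names and timings are illustrative):

```yaml
groups:
  - name: scheduler-spreading-slo          # illustrative rule group
    rules:
      - alert: SchedulerPreScoreScoreLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(framework_extension_point_duration_seconds_bucket{extension_point=~"PreScore|Score"}[5m]))
          ) > 0.060
        for: 15m
        annotations:
          summary: "p99 PreScore+Score latency above 60ms"
```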
+
+* **Are there any missing metrics that would be useful to have to improve observability
+of this feature?**
+
+N/A.
+
+### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?**
+
+N/A.
+
+
+### Scalability
+
+* **Will enabling / using this feature result in any new API calls?**
+
+No.
+
+* **Will enabling / using this feature result in introducing new API types?**
+
+No.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+provider?**
+
+No.
+
+* **Will enabling / using this feature result in increasing size or count of
+the existing API objects?**
+
+No.
+
+* **Will enabling / using this feature result in increasing time taken by any
+operations covered by [existing SLIs/SLOs]?**
+
+Scheduling time on clusters with more than 100 nodes might increase. Smaller clusters are unaffected.
+This is because `SelectorSpread` doesn't take into account all the Nodes in big clusters when calculating skew,
+resulting in partial spreading at this scale.
+In contrast, `PodTopologySpread` considers all Nodes when using topologies bigger than a Node, like a Zone.
+
+Before graduation, we will ensure with SIG Scalability that the latency increase is acceptable.
+
+* **Will enabling / using this feature result in non-negligible increase of
+resource usage (CPU, RAM, disk, IO, ...) in any components?**
+
+kube-scheduler might use more CPU to calculate Zone spreading in certain configurations.
+In synthetic benchmarks, the new spreading takes 1.5ms for PreScore/Score with 10k Pods in a 1k-Node cluster,
+using 16 threads. This is comparable to `SelectorSpread`.
+
+### Troubleshooting
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+
+kube-scheduler won't receive new Pods to schedule.
+The effect is no different than it would be without the feature.
+
+* **What are other known failure modes?**
+
+- Pod scheduling is slow
+  - Detection: Pod startup time is too high.
+  - Diagnostics: Use the `framework_extension_point_duration_seconds` scheduler metric with label `extension_point` values `PreScore` and/or `Score`.
+  - Mitigations: Disable the Feature Gate `DefaultPodTopologySpread` in kube-scheduler.
+  - Testing: There are performance dashboards.
+- Pods of a Service/ReplicaSet/ReplicationController/StatefulSet are not properly spread: spread is either too weak or too strong.
+  - Detection: Too many pods belonging to the same Service/ReplicaSet/ReplicationController/StatefulSet are scheduled on a few nodes or
+    are spread across too many nodes.
+  - Mitigations: Use [Pod Topology spreading](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints)
+    in your PodSpecs. Or modify the [default constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#cluster-level-default-constraints)
+    for the `PodTopologySpread` plugin to your preference.
+  - Diagnostics: N/A
+  - Testing: E2E tests ensure that Pods are evenly spread in a cluster with only one Service.
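The mitigation above relies on explicit per-Pod constraints, which take precedence over the cluster-level defaults. A minimal sketch (names, labels and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-0                 # illustrative Pod
  labels:
    app: web
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web            # spread is computed over Pods matching this selector
  containers:
    - name: web
      image: k8s.gcr.io/pause:3.2   # placeholder container image
```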
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
+If Pod startup latency is violating the SLO, there is a possibility that it's due to this feature.
+
+1. Determine if the scheduler is the culprit: Check for significant latency in `e2e_scheduling_duration_seconds`.
+1. The feature only affects scheduling algorithms, thus you can check for significant latency in `scheduling_algorithm_duration_seconds`.
+1. To check if this feature is the culprit, look for significant latency in `framework_extension_point_duration_seconds`,
+   using label `extension_point` with values `PreScore` and `Score`.
+1. Try disabling the Feature Gate `DefaultPodTopologySpread`.

 ## Implementation History

 - 2019-09-26: Initial KEP sent out for review.
 - 2020-01-20: KEP updated to make use of framework's PluginConfig.
 - 2020-05-04: Update completed tasks and target alpha for 1.19.
+- 2020-09-21: Add Beta graduation criteria and PRR.

 ## Alternatives

keps/sig-scheduling/1258-default-pod-topology-spread/kep.yaml

Lines changed: 4 additions & 2 deletions
@@ -11,10 +11,12 @@ reviewers:
 approvers:
   - "@ahg-g"
   - "@Huang-Wei"
+prr-approvers:
+  - "@wojtek-t"
 see-also:
   - "/keps/sig-scheduling/895-pod-topology-spread"
-stage: alpha
-latest-milestone: "v1.19"
+stage: beta
+latest-milestone: "v1.20"
 milestone:
   alpha: "v1.19"
   beta: "v1.20"
