Commit 159112d

Merge pull request kubernetes#2049 from adtac/apfprr
priority and fairness: add production readiness review
2 parents 194605f + 1019969 commit 159112d

2 files changed

+191
-30
lines changed


keps/sig-api-machinery/20190228-priority-and-fairness.md renamed to keps/sig-api-machinery/1040-priority-and-fairness/README.md

Lines changed: 137 additions & 30 deletions
@@ -1,27 +1,4 @@
----
-title: Priority and Fairness for API Server Requests
-authors:
-  - "@MikeSpreitzer"
-  - "@yue9944882"
-owning-sig: sig-api-machinery
-participating-sigs:
-  - wg-multitenancy
-reviewers:
-  - "@deads2k"
-  - "@lavalamp"
-approvers:
-  - "@deads2k"
-  - "@lavalamp"
-editor: TBD
-creation-date: 2019-02-28
-last-updated: 2019-02-28
-status: implementable
-see-also:
-replaces:
-superseded-by:
----
-
-# Priority and Fairness for API Server Requests
+# KEP-1040: Priority and Fairness for API Server Requests
 
 ## Table of Contents
 
@@ -76,6 +53,13 @@ superseded-by:
 - [Design Considerations](#design-considerations)
 - [Test Plan](#test-plan)
 - [Graduation Criteria](#graduation-criteria)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
@@ -91,8 +75,8 @@ For enhancements that make changes to code or processes/procedures in core Kuber
 
 Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released.
 
-- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
-- [ ] KEP approvers have set the KEP status to `implementable`
+- [x] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
+- [x] KEP approvers have set the KEP status to `implementable`
 - [ ] Design details are appropriately documented
 - [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
 - [ ] Graduation criteria is in place
@@ -119,7 +103,7 @@ https://speakerdeck.com/sttts/kubernetes-api-codebase-tour?slide=18 .
 
 ## Motivation
 
-Today the apiserver has a simple mechanism for protectimg itself
+Today the apiserver has a simple mechanism for protecting itself
 against CPU and memory overloads: max-in-flight limits for mutating
 and for readonly requests. Apart from the distinction between
 mutating and readonly, no other distinctions are made among requests;
@@ -268,7 +252,6 @@ yet but we think may be interesting to consider in the future.
 - Thread additional information along the paths needed to enable more
   precisely targeted avoidance of priority inversions.
 
-
 ## Proposal
 
 In short, this proposal is about generalizing the existing
@@ -406,7 +389,6 @@ with namespace then the bad behavior will be spread among all the
 queues of that schema's priority. Administrators need to make a good
 choice for how flows are distinguished.
 
-
 #### Queue Assignment Proof of Concept
 
 The following golang code shows a simple recursive technique to
@@ -551,7 +533,6 @@ func main() {
 }
 ```
 
-
 ### Resource Limits
 
 #### Primary CPU and Memory Protection
@@ -2025,6 +2006,132 @@ Beta:
 - Automatically manages versions of mandatory/suggested configuration
 - Discrimates paginated LIST requests
 
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+* **How can this feature be enabled / disabled in a live cluster?** To enable
+  priority and fairness, all of the following must be enabled:
+  - [x] Feature gate
+    - Feature gate name: APIPriorityAndFairness
+    - Components depending on the feature gate:
+      - kube-apiserver
+  - [x] Command-line flags
+    - `--enable-priority-and-fairness`, and
+    - `--runtime-config=flowcontrol.apiserver.k8s.io/v1alpha1=true`
+
+* **Does enabling the feature change any default behavior?** Yes, requests that
+  weren't rejected before could get rejected while requests that were rejected
+  previously may be allowed. Performance of kube-apiserver under heavy load
+  will likely be different too.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?** Yes.
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+  The feature will be restored.
+
+* **Are there any tests for feature enablement/disablement?** No. Manual tests
+  will be run before switching feature gate to beta.
+
+### Rollout, Upgrade and Rollback Planning
+
+* **How can a rollout fail? Can it impact already running workloads?** A
+  misconfiguration could cause apiserver requests to be rejected, which could
+  have widespread impact such as: (1) rejecting controller requests, thereby
+  bringing a lot of things to a halt, (2) dropping node heartbeats, which may
+  result in overloading other nodes, (3) rejecting kube-proxy requests to
+  apiserver, thereby breaking existing workloads, (4) dropping leader election
+  requests, resulting in HA failure, or any combination of the above.
+
+* **What specific metrics should inform a rollback?** An abnormal spike in the
+  `apiserver_flowcontrol_rejected_requests_total` metric should potentially be
+  viewed as a sign that kube-apiserver is rejecting requests, potentially
+  incorrectly. The `apiserver_flowcontrol_request_queue_length_after_enqueue`
+  metric getting too close to the configured queue length could be a sign of
+  insufficient queue size (or a system overload), which can be precursor to
+  rejected requests.
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+  No. Manual tests will be run before switching feature gate to beta.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+  fields of API types, flags, etc.?** Yes, `--max-requests-inflights` will be
+  deprecated in favor of APF.
+
+### Monitoring Requirements
+
+* **How can an operator determine if the feature is in use by workloads?**
+  If the `apiserver_flowcontrol_dispatched_requests_total` metric is non-zero,
+  this feature is in use. Note that this isn't a workload feature, but a
+  control plane one.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+  the health of the service?**
+  - [x] Metrics
+    - Metric name: `apiserver_flowcontrol_request_queue_length_after_enqueue`
+    - Components exposing the metric: kube-apiserver
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  No SLOs are proposed for the above SLI.
+
+* **Are there any missing metrics that would be useful to have to improve observability
+  of this feature?** No.
+
+### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?**
+  No.
+
+### Scalability
+
+* **Will enabling / using this feature result in any new API calls?** Yes.
+  Self-requests for new API objects will be introduced. In addition, the
+  request execution order may change, which could occasionally increase the
+  number of retries.
+
+* **Will enabling / using this feature result in introducing new API types?**
+  Yes, a new flowcontrol API group, configuration types, and status types are
+  introduced. See `k8s.io/api/flowcontrol/v1alpha1/types.go` for a full list.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+  provider?** No.
+
+* **Will enabling / using this feature result in increasing size or count of
+  the existing API objects?** No.
+
+* **Will enabling / using this feature result in increasing time taken by any
+  operations covered by [existing SLIs/SLOs]?** Yes, a non-negligible latency
+  is added to API calls to kube-apiserver. While [preliminary tests](https://github.com/tkashem/graceful/blob/master/priority-fairness/filter-latency/readme.md)
+  shows that the API server latency is still well within the existing SLOs,
+  more thorough testing needs to be performed.
+
+* **Will enabling / using this feature result in non-negligible increase of
+  resource usage (CPU, RAM, disk, IO, ...) in any components?** The proposed
+  flowcontrol logic in request handling in kube-apiserver will increase the CPU
+  and memory overheads involved in serving each request. Note that the resource
+  usage will be configurable and may require the operator to fine-tune some
+  parameters.
+
+### Troubleshooting
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+  The feature is itself within the API server. Etcd being unavailable would
+  likely cause kube-apiserver to fail at processing incoming requests.
+
+* **What are other known failure modes?** A misconfiguration could reject
+  requests incorrectly. See the rollout and monitoring sections for details on
+  which metrics to watch to detect such failures (see the `kep.yaml` file for
+  the full list of metrics). The following kube-apiserver log messages could
+  also indicate potential issues:
+  - "Unable to list PriorityLevelConfiguration objects"
+  - "Unable to list FlowSchema objects"
+
+* **What steps should be taken if SLOs are not being met to determine the
+  problem?** No SLOs are proposed.
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 
 ## Implementation History
 
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+title: Priority and Fairness for API Server Requests
+kep-number: 1040
+authors:
+  - "@MikeSpreitzer"
+  - "@yue9944882"
+owning-sig: sig-api-machinery
+participating-sigs:
+  - wg-multitenancy
+  - sig-scheduling
+status: implementable
+reviewers:
+  - "@deads2k"
+  - "@lavalamp"
+  - "@ahg-g"
+  - "@wojtek-t"
+approvers:
+  - "@deads2k"
+  - "@lavalamp"
+prr-approvers:
+  - "@wojtek-t"
+creation-date: 2019-02-28
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: beta
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.20"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.18"
+  beta: "v1.20"
+  stable: "v1.22"
+
+# The following PRR answers are required at alpha release.
+# List the feature gate name and the components for which it must be enabled.
+feature-gates:
+  - name: APIPriorityAndFairness
+    components:
+      - kube-apiserver
+disable-supported: true
+
+# The following PRR answers are required at beta release.
+metrics:
+  - apiserver_flowcontrol_rejected_requests_total
+  - apiserver_flowcontrol_dispatched_requests_total
+  - apiserver_flowcontrol_current_inqueue_requests
+  - apiserver_flowcontrol_request_queue_length_after_enqueue
+  - apiserver_flowcontrol_request_concurrency_limit
+  - apiserver_flowcontrol_current_executing_requests
+  - apiserver_flowcontrol_request_wait_duration_seconds
+  - apiserver_flowcontrol_request_execution_seconds
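
Reviewer note: the questionnaire says the feature is in use when `apiserver_flowcontrol_dispatched_requests_total` is non-zero. As a rough illustration of that check, the Go sketch below sums every sample of one metric in a Prometheus text-format scrape. It is not part of this commit; `sumMetric` is a hypothetical helper, the parsing is deliberately simplified (it ignores timestamps and escaped braces are handled only by taking the last `}` on the line), and the label names and values in the sample scrape are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in a Prometheus
// text-format scrape, across all label sets. (Hypothetical helper.)
func sumMetric(scrape, name string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		// A sample line is "<name>{labels} <value>" or "<name> <value>".
		rest := line[len(name):]
		switch {
		case strings.HasPrefix(rest, "{"):
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				return 0, fmt.Errorf("malformed sample line: %q", line)
			}
			rest = rest[end+1:]
		case strings.HasPrefix(rest, " "), strings.HasPrefix(rest, "\t"):
			// No labels; rest already holds the value.
		default:
			continue // a longer metric name that merely shares the prefix
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("sample line has no value: %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad value in %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	// Made-up scrape; label names and values are illustrative only.
	scrape := `# TYPE apiserver_flowcontrol_dispatched_requests_total counter
apiserver_flowcontrol_dispatched_requests_total{flowSchema="system-nodes",priorityLevel="system"} 40
apiserver_flowcontrol_dispatched_requests_total{flowSchema="global-default",priorityLevel="global-default"} 2
apiserver_flowcontrol_rejected_requests_total{flowSchema="global-default",priorityLevel="global-default"} 1
`
	dispatched, err := sumMetric(scrape, "apiserver_flowcontrol_dispatched_requests_total")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("dispatched requests:", dispatched, "feature in use:", dispatched > 0)
}
```

In practice the scrape text would come from the kube-apiserver metrics endpoint or from a Prometheus query; the string literal here only keeps the sketch self-contained.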
