Commit 8f92c3e

Merge pull request kubernetes#2237 from deads2k/promote-insecure-backend-to-ga
update insecure-backend-proxy feature to target GA. Add PRR
2 parents 10fffa8 + ccb4052 commit 8f92c3e

2 files changed: +148 -11 lines changed

keps/sig-api-machinery/1295-insecure-backend-proxy/README.md

Lines changed: 143 additions & 9 deletions
@@ -24,6 +24,13 @@ misbehaving self-hosted clusters.
2424
- [Graduation Criteria](#graduation-criteria)
2525
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
2626
- [Version Skew Strategy](#version-skew-strategy)
27+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
28+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
29+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
30+
- [Monitoring Requirements](#monitoring-requirements)
31+
- [Dependencies](#dependencies)
32+
- [Scalability](#scalability)
33+
- [Troubleshooting](#troubleshooting)
2734
- [Implementation History](#implementation-history)
2835
- [Drawbacks [optional]](#drawbacks-optional)
2936
- [Alternatives [optional]](#alternatives-optional)
@@ -155,6 +162,10 @@ The risk in doing that is greater than the additional benefit.
155162
### Test Plan
156163

157164
1. Positive and negative tests for this are fairly easy to write and the changes are narrow in scope.
165+
2. There will not be e2e tests written because the scenario under which this API is effective is only in a mis-configured
166+
cluster where a kubelet has not refreshed its serving certs. There is an existing positive and negative integration
167+
[test](https://github.com/kubernetes/kubernetes/blob/release-1.20/test/integration/apiserver/podlogs/podlogs_test.go#L141-L164)
168+
which the sig leads believe is sufficient.
158169

159170
### Graduation Criteria
160171

@@ -169,17 +180,140 @@ Because the change is isolated to non-persisted API contracts with the kube-apis
169180

170181
Because the change is isolated to non-persisted API contracts with the kube-apiserver, there are no skew or upgrade/downgrade considerations.
171182

172-
## Implementation History
183+
## Production Readiness Review Questionnaire
184+
185+
### Feature Enablement and Rollback
186+
187+
_This section must be completed when targeting alpha to a release._
188+
189+
* **How can this feature be enabled / disabled in a live cluster?**
190+
- [x] Feature gate (also fill in values in `kep.yaml`)
191+
- Feature gate name: AllowInsecureBackendProxy
192+
- Components depending on the feature gate: kube-apiserver
193+
194+
* **Does enabling the feature change any default behavior?**
195+
No, all default behavior remains the same with the feature gate on or off.
196+
197+
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
198+
the enablement)?**
199+
Yes, the feature can be disabled after enablement.
200+
Because no data is persisted via this API, there is no impact that lingers across kube-apiserver restarts.
201+
202+
* **What happens if we reenable the feature if it was previously rolled back?**
203+
Because no data is persisted via this API, there is no impact that lingers across kube-apiserver restarts.
204+
205+
* **Are there any tests for feature enablement/disablement?**
206+
Because no data is persisted via this API, there is no lingering memory in the system to check.
207+
208+
### Rollout, Upgrade and Rollback Planning
209+
210+
_This section must be completed when targeting beta graduation to a release._
211+
212+
* **How can a rollout fail? Can it impact already running workloads?**
213+
This is contained to a single binary, with no persisted data.
214+
The worst failure mode is when an HA cluster has some members with the feature off and some members with the feature on.
215+
In such a case, the user-observed behavior going through a load balancer is inconsistent until the cluster settles.
216+
217+
* **What specific metrics should inform a rollback?**
218+
If there is a notable increase in failed pods/logs calls, it may indicate that the new code is causing a problem.
219+
220+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
221+
Yes. This was explicitly tested in the OpenShift distro when the feature went to beta.
222+
During HA cluster upgrades, the client-observed behavior was inconsistent (as expected), but once all members had
223+
the feature gate consistent it was fine.
224+
Skew also worked correctly: new clients sending the additional option simply did not connect as they wished,
225+
failing in the safe direction.
226+
227+
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
228+
fields of API types, flags, etc.?**
229+
No.
230+
231+
### Monitoring Requirements
232+
233+
* **How can an operator determine if the feature is in use by workloads?**
234+
`pods_logs_insecure_backend_total` has a label `skip_tls_allowed` which will count how often this value is set by clients.
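The usage check described above can be sketched as a small parser over the Prometheus text format exposed by the kube-apiserver. This is a minimal sketch under stated assumptions: the sample scrape is made up, and the label key `usage` and the value `enforce_tls` are hypothetical placeholders. Only the metric name and `skip_tls_allowed` come from the text above. Matching on the metric name suffix is also an assumption, made in case the exported name carries a component prefix.

```python
import re
from collections import defaultdict

# Hypothetical scrape of the kube-apiserver /metrics endpoint. The label key
# "usage" and the value "enforce_tls" are assumptions for illustration only.
SAMPLE_METRICS = """\
# TYPE pods_logs_insecure_backend_total counter
pods_logs_insecure_backend_total{usage="enforce_tls"} 120
pods_logs_insecure_backend_total{usage="skip_tls_allowed"} 3
"""

# Simplified sample-line pattern: name, optional {labels}, value, optional timestamp.
# It does not handle "}" or escaped quotes inside label values.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def counter_by_label(text, metric, label):
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        # Suffix match, in case the exported name has a component prefix.
        if not match or not match.group("name").endswith(metric):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    print(counter_by_label(SAMPLE_METRICS, "pods_logs_insecure_backend_total", "usage"))
```

A nonzero count for `skip_tls_allowed` would suggest that clients are setting the option.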
235+
236+
* **What are the SLIs (Service Level Indicators) an operator can use to determine
237+
the health of the service?**
238+
- [ ] Metrics
239+
- Metric name:
240+
`pods_logs_insecure_backend_total` indicates usage.
241+
`pods_logs_backend_tls_failure_total` indicates how often usage of the option may have allowed a connection to be established.
242+
- Components exposing the metric: kube-apiserver
243+
244+
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
245+
pods/logs can suffer errors today based on user input because the kubelet cannot be verified.
246+
Because this is driven based on clients, different clusters may have different "reasonable" starting values.
247+
However, there should not be a marked increase in the failure rate of pods/logs.
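One way to make "a marked increase" concrete is to compare the current failure rate against a per-cluster baseline. The following is a minimal sketch only: the function names and the 5 percentage point tolerance are hypothetical, and each cluster would need to choose its own baseline window and threshold.

```python
def failure_rate(failed, total):
    """Fraction of pods/logs calls that failed; 0.0 when there were no calls."""
    return failed / total if total else 0.0


def marked_increase(baseline, current, tolerance=0.05):
    """Return True when the failure rate rose by more than `tolerance`.

    `baseline` and `current` are (failed, total) tuples taken over comparable
    time windows. The default tolerance is an arbitrary illustrative value.
    """
    return failure_rate(*current) - failure_rate(*baseline) > tolerance


if __name__ == "__main__":
    # Hypothetical counts: 2 of 100 calls failed before, 20 of 100 after.
    print(marked_increase((2, 100), (20, 100)))
```

Comparing against a baseline rather than a fixed number reflects the point above that different clusters may have different "reasonable" starting values.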
173248

174-
Major milestones in the life cycle of a KEP should be tracked in `Implementation History`.
175-
Major milestones might include
249+
* **Are there any missing metrics that would be useful to have to improve observability
250+
of this feature?**
251+
I don't think we need greater granularity here.
252+
253+
### Dependencies
254+
255+
* **Does this feature depend on any specific services running in the cluster?**
256+
No.
257+
This does not introduce any new calls from the kube-apiserver.
258+
259+
### Scalability
260+
261+
* **Will enabling / using this feature result in any new API calls?**
262+
No.
263+
It adds an option to an existing API call that would already have been called.
264+
265+
* **Will enabling / using this feature result in introducing new API types?**
266+
No.
267+
It adds a field to `PodLogOptions`, which is not a persisted API.
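To illustrate what "an option on an existing API call" looks like on the wire, here is a minimal sketch that builds the pod log request path. The query parameter name `insecureSkipTLSVerifyBackend` is not stated in this diff; it is used here on the assumption that it matches the serialized name of the new `PodLogOptions` field. The helper name and the sample namespace and pod are hypothetical.

```python
from urllib.parse import quote, urlencode


def pod_log_path(namespace, pod, insecure_skip_tls_verify_backend=False):
    """Build the request path for reading a pod's logs.

    The query parameter name is an assumption about how the PodLogOptions
    field is serialized; verify it against the API reference before use.
    """
    path = (
        f"/api/v1/namespaces/{quote(namespace, safe='')}"
        f"/pods/{quote(pod, safe='')}/log"
    )
    params = {}
    if insecure_skip_tls_verify_backend:
        params["insecureSkipTLSVerifyBackend"] = "true"
    return f"{path}?{urlencode(params)}" if params else path


if __name__ == "__main__":
    print(pod_log_path("default", "web-0"))
    print(pod_log_path("default", "web-0", insecure_skip_tls_verify_backend=True))
```

The request goes to the same endpoint either way, which is consistent with the answer above that no new API calls or types are introduced.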
268+
269+
* **Will enabling / using this feature result in any new calls to the cloud
270+
provider?**
271+
No.
272+
273+
* **Will enabling / using this feature result in increasing size or count of
274+
the existing API objects?**
275+
No.
276+
277+
* **Will enabling / using this feature result in increasing time taken by any
278+
operations covered by [existing SLIs/SLOs]?**
279+
No.
280+
281+
* **Will enabling / using this feature result in non-negligible increase of
282+
resource usage (CPU, RAM, disk, IO, ...) in any components?**
283+
No.
284+
285+
### Troubleshooting
286+
287+
The Troubleshooting section currently serves the `Playbook` role. We may consider
288+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
289+
details). For now, we leave it here.
290+
291+
_This section must be completed when targeting beta graduation to a release._
292+
293+
* **How does this feature react if the API server and/or etcd is unavailable?**
294+
No impact because this feature only affects the kube-apiserver behavior.
295+
296+
* **What are other known failure modes?**
297+
There are no known failure modes.
298+
299+
* **What steps should be taken if SLOs are not being met to determine the problem?**
300+
The usual steps used to debug a pods/logs failure.
301+
This varies somewhat, but generally you gather:
302+
1. the kube-apiserver logs
303+
2. the pods you cannot connect to
304+
3. the node API object for the node running that pod
305+
4. the kubelet log for that node
306+
5. the crio log for that node
307+
From there you can decide how far the request is getting and whether you need to investigate the network connections.
308+
This is a fairly deep and rare thing to investigate today.
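The gathering steps above can be sketched as a list of commands to run. This is an illustrative sketch rather than a playbook: it assumes `kubectl` access, that the kube-apiserver runs as a pod (the `kube-system` namespace default and the pod name are placeholders), and that the kubelet and CRI-O run as systemd units named `kubelet` and `crio` on the node. All names passed in the example are hypothetical.

```python
def debug_commands(namespace, pod, node, apiserver_pod,
                   apiserver_namespace="kube-system"):
    """Return argv lists for collecting the data listed in the steps above.

    The two journalctl commands are meant to be run on the node itself.
    """
    return [
        # 1. kube-apiserver logs (assumes the apiserver runs as a pod)
        ["kubectl", "logs", "-n", apiserver_namespace, apiserver_pod],
        # 2. the pod you cannot connect to
        ["kubectl", "get", "pod", "-n", namespace, pod, "-o", "yaml"],
        # 3. the node object for the node running that pod
        ["kubectl", "get", "node", node, "-o", "yaml"],
        # 4. kubelet log for that node (assumes a systemd unit named kubelet)
        ["journalctl", "-u", "kubelet", "--no-pager"],
        # 5. CRI-O log for that node (assumes a systemd unit named crio)
        ["journalctl", "-u", "crio", "--no-pager"],
    ]


if __name__ == "__main__":
    for argv in debug_commands("default", "web-0", "node-1", "kube-apiserver-cp-1"):
        print(" ".join(argv))
```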
309+
310+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
311+
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
312+
313+
## Implementation History
176314

177-
- the `Summary` and `Motivation` sections being merged signaling SIG acceptance
178-
- the `Proposal` section being merged signaling agreement on a proposed design
179-
- the date implementation started
180-
- the first Kubernetes release where an initial version of the KEP was available
181-
- the version of Kubernetes where the KEP graduated to general availability
182-
- when the KEP was retired or superseded
315+
Introduced as beta in 1.17.
316+
Moving to stable in 1.21.
183317

184318
## Drawbacks [optional]
185319

keps/sig-api-machinery/1295-insecure-backend-proxy/kep.yaml

Lines changed: 5 additions & 2 deletions
@@ -17,19 +17,22 @@ reviewers:
1717
approvers:
1818
- "@lavalamp"
1919
- "@mikedanese"
20+
prr-approvers:
21+
- "@johnbelamaric"
2022
see-also:
2123

2224
# The target maturity stage in the current dev cycle for this KEP.
23-
stage: beta
25+
stage: stable
2426

2527
# The most recent milestone for which work toward delivery of this KEP has been
2628
# done. This can be the current (upcoming) milestone, if it is being actively
2729
# worked on.
28-
latest-milestone: "v1.17"
30+
latest-milestone: "v1.21"
2931

3032
# The milestone at which this feature was, or is targeted to be, at each stage.
3133
milestone:
3234
beta: "v1.17"
35+
stable: "v1.21"
3336

3437
# The following PRR answers are required at alpha release
3538
# List the feature gate name and the components for which it must be enabled
