@@ -24,6 +24,13 @@ misbehaving self-hosted clusters.
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
+ - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+   - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+   - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+   - [Monitoring Requirements](#monitoring-requirements)
+   - [Dependencies](#dependencies)
+   - [Scalability](#scalability)
+   - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks [optional]](#drawbacks-optional)
- [Alternatives [optional]](#alternatives-optional)
@@ -155,6 +162,10 @@ The risk in doing that is greater than the additional benefit.
### Test Plan

1. Positive and negative tests for this are fairly easy to write and the changes are narrow in scope.
+ 2. There will not be e2e tests written, because the only scenario in which this API has an effect is a misconfigured
+    cluster where a kubelet has not refreshed its serving certs. There is an existing positive and negative integration
+    [test](https://github.com/kubernetes/kubernetes/blob/release-1.20/test/integration/apiserver/podlogs/podlogs_test.go#L141-L164)
+    which the SIG leads believe is sufficient.

### Graduation Criteria
@@ -169,17 +180,140 @@ Because the change is isolated to non-persisted API contracts with the kube-apis
Because the change is isolated to non-persisted API contracts with the kube-apiserver, there are no skew or upgrade/downgrade considerations.

- ## Implementation History
+ ## Production Readiness Review Questionnaire
+
+ ### Feature Enablement and Rollback
+
+ _This section must be completed when targeting alpha to a release._
+
+ * **How can this feature be enabled / disabled in a live cluster?**
+   - [x] Feature gate (also fill in values in `kep.yaml`)
+     - Feature gate name: AllowInsecureBackendProxy
+     - Components depending on the feature gate: kube-apiserver
+
+ * **Does enabling the feature change any default behavior?**
+   No, all default behavior remains the same with the feature gate on or off.
+
+ * **Can the feature be disabled once it has been enabled (i.e. can we roll back
+   the enablement)?**
+   Yes, the feature can be disabled after enablement.
+   Because no data is persisted via this API, there is no impact that lingers across kube-apiserver restarts.
+
+ * **What happens if we reenable the feature if it was previously rolled back?**
+   Because no data is persisted via this API, there is no impact that lingers across kube-apiserver restarts.
+
+ * **Are there any tests for feature enablement/disablement?**
+   Because no data is persisted via this API, there is no lingering state in the system to check.
+
+ ### Rollout, Upgrade and Rollback Planning
+
+ _This section must be completed when targeting beta graduation to a release._
+
+ * **How can a rollout fail? Can it impact already running workloads?**
+   This is contained to a single binary, with no persisted data.
+   The worst failure mode is when an HA cluster has some members with the feature off and some members with the feature on.
+   In such a case, the user-observed behavior going through a load balancer is inconsistent until the cluster settles.
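
The mixed-state behavior described in this answer can be sketched as follows. This is a minimal illustration under stated assumptions, not the kube-apiserver implementation: the `apiserver` type and the `skipBackendTLSVerify` method are hypothetical names, and the rule that a member honors the client's option only when its feature gate is on is an assumption based on the fail-safe behavior this KEP describes.

```go
package main

import "fmt"

// apiserver is a hypothetical stand-in for one kube-apiserver member of an HA cluster.
type apiserver struct {
	name        string
	gateEnabled bool // whether AllowInsecureBackendProxy is on for this member
}

// skipBackendTLSVerify reports whether this member would skip verification of
// the kubelet serving certificate for a request.
// Assumption: the client's option is honored only when the gate is enabled.
func (a apiserver) skipBackendTLSVerify(clientRequestedSkip bool) bool {
	return a.gateEnabled && clientRequestedSkip
}

func main() {
	members := []apiserver{
		{name: "apiserver-0", gateEnabled: true},
		{name: "apiserver-1", gateEnabled: false},
	}
	// The same client request can land on either member behind a load balancer,
	// so the observed behavior differs until every member has the same setting.
	for _, m := range members {
		fmt.Printf("%s skips verification: %v\n", m.name, m.skipBackendTLSVerify(true))
	}
}
```

In this sketch a member with the gate off never skips verification, which is the safe direction: the request may fail, but it is never served over an unverified connection.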
+
+ * **What specific metrics should inform a rollback?**
+   A notable increase in failed pods/logs calls may indicate that the new code is causing a problem.
+
+ * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+   Yes. This was explicitly tested in the OpenShift distro when the feature went to beta.
+   During HA cluster upgrades, the client-observed behavior was inconsistent (as expected), but once all members had
+   a consistent feature gate setting it was fine.
+   Skew also worked correctly: new clients sending the additional option simply did not connect as they wished, failing
+   in the safe direction.
+
+ * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+   fields of API types, flags, etc.?**
+   No.
+
+ ### Monitoring Requirements
+
+ * **How can an operator determine if the feature is in use by workloads?**
+   `pods_logs_insecure_backend_total` has a label `skip_tls_allowed` which counts how often this value is set by clients.
+
+ * **What are the SLIs (Service Level Indicators) an operator can use to determine
+   the health of the service?**
+   - [ ] Metrics
+     - Metric name:
+       `pods_logs_insecure_backend_total` indicates usage.
+       `pods_logs_backend_tls_failure_total` indicates how often usage of the option may have allowed a connection to be established.
+     - Components exposing the metric: kube-apiserver
+
+ * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+   pods/logs can suffer errors today based on user input because the kubelet cannot be verified.
+   Because this is driven by clients, different clusters may have different "reasonable" starting values.
+   However, there should not be a marked increase in the failure rate of pods/logs.
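
One way to make "a marked increase" concrete is to compare the failure rate against a per-cluster baseline. The sketch below is only an illustration: `failureRate`, `markedIncrease`, the sample counts, and the 5-point tolerance are all hypothetical, and each cluster would need to pick its own baseline and tolerance.

```go
package main

import "fmt"

// failureRate returns failed/total, or 0 when there were no requests.
func failureRate(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return failed / total
}

// markedIncrease reports whether the current failure rate exceeds the
// baseline by more than the tolerated absolute increase.
func markedIncrease(baseline, current, tolerance float64) bool {
	return current-baseline > tolerance
}

func main() {
	// Hypothetical counts of pods/logs requests before and after a rollout.
	baseline := failureRate(20, 1000)
	current := failureRate(90, 1000)

	// Hypothetical tolerance: flag anything more than 0.05 above the baseline.
	fmt.Println("consider rollback:", markedIncrease(baseline, current, 0.05))
}
```

Using a baseline instead of a fixed threshold reflects the point made in the answer: clusters start from different "reasonable" failure rates, so only the change is meaningful.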
- Major milestones in the life cycle of a KEP should be tracked in `Implementation History`.
- Major milestones might include
+ * **Are there any missing metrics that would be useful to have to improve observability
+   of this feature?**
+   I don't think we need greater granularity here.
+
+ ### Dependencies
+
+ * **Does this feature depend on any specific services running in the cluster?**
+   No.
+   This does not introduce any new calls from the kube-apiserver.
+
+ ### Scalability
+
+ * **Will enabling / using this feature result in any new API calls?**
+   No.
+   It adds an option to an existing API call that would already have been called.
+
+ * **Will enabling / using this feature result in introducing new API types?**
+   No.
+   It adds a field to `PodLogOptions`, which is not a persisted API.
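
To illustrate that the option rides on an existing call, the sketch below builds the request path for a pod log request with and without the option. It uses only the Go standard library. The helper name `podLogPath` is hypothetical, and the query parameter name `insecureSkipTLSVerifyBackend` is assumed to be the serialized name of the new `PodLogOptions` field; check the API reference for your Kubernetes version before relying on it.

```go
package main

import (
	"fmt"
	"net/url"
)

// podLogPath builds the path and query of a pod log request.
// Namespace and pod names are expected to be valid Kubernetes names,
// so they contain no characters that need special handling here.
func podLogPath(namespace, pod string, skipBackendTLSVerify bool) string {
	u := url.URL{
		Path: "/api/v1/namespaces/" + namespace + "/pods/" + pod + "/log",
	}
	q := url.Values{}
	if skipBackendTLSVerify {
		// Assumed query parameter name for the new option.
		q.Set("insecureSkipTLSVerifyBackend", "true")
	}
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(podLogPath("default", "web-0", false))
	fmt.Println(podLogPath("default", "web-0", true))
}
```

Both requests target the same endpoint; the only difference is the extra query parameter, so no new API call or API type is involved.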
+
+ * **Will enabling / using this feature result in any new calls to the cloud
+   provider?**
+   No.
+
+ * **Will enabling / using this feature result in increasing size or count of
+   the existing API objects?**
+   No.
+
+ * **Will enabling / using this feature result in increasing time taken by any
+   operations covered by [existing SLIs/SLOs]?**
+   No.
+
+ * **Will enabling / using this feature result in non-negligible increase of
+   resource usage (CPU, RAM, disk, IO, ...) in any components?**
+   No.
+
+ ### Troubleshooting
+
+ The Troubleshooting section currently serves the `Playbook` role. We may consider
+ splitting it into a dedicated `Playbook` document (potentially with some monitoring
+ details). For now, we leave it here.
+
+ _This section must be completed when targeting beta graduation to a release._
+
+ * **How does this feature react if the API server and/or etcd is unavailable?**
+   No impact because this feature only affects the kube-apiserver behavior.
+
+ * **What are other known failure modes?**
+   There are no known failure modes.
+
+ * **What steps should be taken if SLOs are not being met to determine the problem?**
+   The usual steps used to debug a pods/logs failure.
+   This varies somewhat, but generally you gather:
+   1. the kube-apiserver logs
+   2. the pods you cannot connect to
+   3. the node API running that pod
+   4. the kubelet log for that node
+   5. the CRI-O log for that node
+   From there you can decide how far the request is getting and whether you need to investigate the network connections.
+   This is a fairly deep and rare thing to investigate today.
+
+ [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+ [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+
+ ## Implementation History

- - the `Summary` and `Motivation` sections being merged signaling SIG acceptance
- - the `Proposal` section being merged signaling agreement on a proposed design
- - the date implementation started
- - the first Kubernetes release where an initial version of the KEP was available
- - the version of Kubernetes where the KEP graduated to general availability
- - when the KEP was retired or superseded
+ Introduced as beta in 1.17.
+ Moving to stable in 1.21.

## Drawbacks [optional]