- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
+ - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+   - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+   - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+   - [Monitoring Requirements](#monitoring-requirements)
+   - [Dependencies](#dependencies)
+   - [Scalability](#scalability)
+   - [Troubleshooting](#troubleshooting)
<!-- /toc -->

## Release Signoff Checklist
@@ -29,7 +36,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [X] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- - [ ] "Implementation History" section is up-to-date for milestone
+ - [X] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -70,7 +77,9 @@ Kubelet not respecting the probe timeout is a bug and should be fixed.
Changes to kubelet:
* Ensure kubelet handles timeout errors and registers them as failing probes.
* Add feature gate `ExecProbeTimeout` that is GA and on by default.
- * If the feature gate `ExecProbeTimeout` is disabled and an exec probe timeout is reached, add warning logs to inform users that exec probes are timing out.
+ * If the feature gate `ExecProbeTimeout` is disabled and an exec probe timeout is reached, add a warning event to inform users that exec probes are timing out.
+ * Introduce the [probe duration metric](https://github.com/kubernetes/kubernetes/issues/101035)
+   * metric dimension cardinality must be reviewed and approved by SIG Instrumentation
* Re-enable existing exec liveness probe e2e test.
* Add new exec readiness probe e2e test.
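
The gate-dependent behavior listed above can be illustrated with a small simulation. This is only a sketch, not kubelet code: `run_exec_probe`, its `gate_enabled` parameter, and the event text are hypothetical names, and the gate-off branch assumes the legacy behavior is to let the command run to completion and only warn.

```python
import subprocess
import sys
import time


def run_exec_probe(command, timeout_seconds, gate_enabled=True):
    """Simulate an exec probe; return (result, events).

    result is "success" or "failure". events collects the warnings that
    would be surfaced to the user (hypothetical wording).
    """
    events = []
    started = time.monotonic()
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            # Assumption: with the gate off the timeout is not enforced.
            timeout=timeout_seconds if gate_enabled else None,
        )
    except subprocess.TimeoutExpired:
        # Only reachable with the gate on: a timeout is a failing probe.
        return "failure", events
    elapsed = time.monotonic() - started
    if not gate_enabled and elapsed > timeout_seconds:
        # Gate off: the probe is not failed, but the user is warned.
        events.append(f"exec probe exceeded its {timeout_seconds}s timeout")
    result = "success" if completed.returncode == 0 else "failure"
    return result, events


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(0.5)"]
    print(run_exec_probe(slow, timeout_seconds=0.1, gate_enabled=True))
    print(run_exec_probe(slow, timeout_seconds=0.1, gate_enabled=False))
```

The same slow command fails the probe with the gate on and passes with a warning with the gate off, which is the difference operators need to look for before the gate is locked.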
@@ -85,12 +94,15 @@ E2E tests:

This is a bug fix so the feature gate will be GA and on by default from the start.

+ Documentation on the migration steps must be provided on the Kubernetes
+ documentation site, offering tips on detecting and updating affected workloads.
+
The feature flag should be kept available until there is sufficient evidence that users are not
affected by this bug fix, either directly (where they can adjust the timeouts in the pod definition) or
indirectly (where the timeout is not specified in third-party templates and products
that the end user cannot easily fix).
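
One way to spot manifests that rely on the default timeout is to scan them for exec probes with no explicit `timeoutSeconds`. The sketch below is illustrative only: `exec_probes_without_timeout` is a hypothetical helper, and it assumes the manifest has already been parsed into a dictionary (for example from a template's YAML). Objects read back from the API server already carry the defaulted `timeoutSeconds: 1`, so this check is meant for source manifests rather than live pods.

```python
PROBE_FIELDS = ("livenessProbe", "readinessProbe", "startupProbe")


def exec_probes_without_timeout(pod):
    """Return (container name, probe field) pairs for exec probes that
    do not set timeoutSeconds explicitly in the given pod manifest."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        for field in PROBE_FIELDS:
            probe = container.get(field)
            # Only exec probes are affected by this change.
            if probe and "exec" in probe and "timeoutSeconds" not in probe:
                findings.append((container.get("name", ""), field))
    return findings
```

Each finding is a candidate for an explicit, realistic `timeoutSeconds` value before the feature gate is locked.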

- Tentative timeline is to lock the feature flag to `true` in 1.22.
+ Tentative timeline is to lock the feature flag to `true` in 1.25.

### Upgrade / Downgrade Strategy

@@ -118,3 +130,128 @@ Some alternatives that were considered:

1. Increasing the default timeout for exec probes
2. Continuing to ignore the exec probe timeout
+
+ ## Production Readiness Review Questionnaire
+
+ ### Feature Enablement and Rollback
+
+ ###### How can this feature be enabled / disabled in a live cluster?
+
+ - [X] Feature gate (also fill in values in `kep.yaml`)
+   - Feature gate name: `ExecProbeTimeout`
+   - Components depending on the feature gate: kubelet
+
+ ###### Does enabling the feature change any default behavior?
+
+ Yes. Workloads that did not account for the timeout affecting probe behavior
+ will see their exec probes fail once the timeout is exceeded.
+
+ ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+ Yes, by setting the feature gate back to `false`.
+
+ ###### What happens if we reenable the feature if it was previously rolled back?
+
+ The behavior is restored immediately.
+
+ ###### Are there any tests for feature enablement/disablement?
+
+ N/A, trivial
+
+ ### Rollout, Upgrade and Rollback Planning
+
+ ###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+ Rollout and rollback are straightforward and are not expected to fail.
+
+ ###### What specific metrics should inform a rollback?
+
+ Pods entering `CrashLoopBackOff` because of exec probe timeout failures.
+
+ ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+ N/A, trivial
+
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+ No
+
+ ### Monitoring Requirements
+
+ The only mechanism currently implemented is warning logs in kubelet.
+ This KEP was updated to introduce warning events for cases where the timeout
+ is exceeded. By analyzing these events, an operator can verify that no workloads
+ are currently affected by this bug.
+
+ ###### How can an operator determine if the feature is in use by workloads?
+
+ Before migration, analyze events indicating that an exec probe exceeded its timeout.
+ Once the feature gate is enabled, there is no way to determine whether an exec probe
+ failure caused by an exceeded timeout was intentional.
+
+ ###### How can someone using this feature know that it is working for their instance?
+
+ Once the feature gate is enabled, there is no way to determine whether an exec probe
+ failure caused by an exceeded timeout was intentional.
+
+ ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+ SLO of the feature: exec probes must fail when the timeout is exceeded. This can be
+ checked by verifying that the probe duration metric does not significantly exceed
+ the timeout value.
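
As a rough illustration of that check, the sketch below flags probe durations that exceed the timeout by more than a tolerance. The function name, the `tolerance` parameter, and the use of raw duration samples are all assumptions made for the example: the probe duration metric is not implemented yet, and its final shape may differ.

```python
def probes_exceeding_timeout(durations_seconds, timeout_seconds, tolerance=0.1):
    """Return the durations that exceed the timeout by more than `tolerance`.

    tolerance is a fraction of the timeout (0.1 means 10% slack); it is a
    hypothetical stand-in for "not exceeding significantly".
    """
    limit = timeout_seconds * (1 + tolerance)
    return [d for d in durations_seconds if d > limit]
```

An empty result for a workload suggests the timeout is being enforced; any returned value points at a probe that ran well past its timeout.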
+
+ ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+ - [x] Metrics
+   - Metric name: `probe_duration_seconds`
+
+ ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+ The [probe duration metric](https://github.com/kubernetes/kubernetes/issues/101035)
+ has not been implemented yet.
+
+ ### Dependencies
+
+ ###### Does this feature depend on any specific services running in the cluster?
+
+ No
+
+ ### Scalability
+
+ ###### Will enabling / using this feature result in any new API calls?
+
+ No
+
+ ###### Will enabling / using this feature result in introducing new API types?
+
+ No
+
+ ###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+ No
+
+ ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+ No
+
+ ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+ No
+
+ ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+ No
+
+ ### Troubleshooting
+
+ `kubelet.log` may be used to troubleshoot any probe behavior.
+
+ ###### How does this feature react if the API server and/or etcd is unavailable?
+
+ ###### What are other known failure modes?
+
+ None
+
+ ###### What steps should be taken if SLOs are not being met to determine the problem?
+
+ None. It is a core functionality of kubelet.