Commit 4859d9c

Merge pull request kubernetes#3201 from SergeyKanzhelev/execProbetimeout
ExecProbeTimeout PRR and re-promoting to GA
2 parents b02f428 + a186367 commit 4859d9c

2 files changed

+143
-6
lines changed

keps/sig-node/1972-kubelet-exec-probe-timeouts/README.md

Lines changed: 140 additions & 3 deletions
@@ -16,6 +16,13 @@
1616
- [Implementation History](#implementation-history)
1717
- [Drawbacks](#drawbacks)
1818
- [Alternatives](#alternatives)
19+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
20+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
21+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
22+
- [Monitoring Requirements](#monitoring-requirements)
23+
- [Dependencies](#dependencies)
24+
- [Scalability](#scalability)
25+
- [Troubleshooting](#troubleshooting)
1926
<!-- /toc -->
2027

2128
## Release Signoff Checklist
@@ -29,7 +36,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
2936
- [X] (R) Graduation criteria is in place
3037
- [ ] (R) Production readiness review completed
3138
- [ ] Production readiness review approved
32-
- [ ] "Implementation History" section is up-to-date for milestone
39+
- [X] "Implementation History" section is up-to-date for milestone
3340
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
3441
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
3542

@@ -70,7 +77,9 @@ Kubelet not respecting the probe timeout is a bug and should be fixed.
7077
Changes to kubelet:
7178
* Ensure kubelet handles timeout errors and registers them as failing probes.
7279
* Add feature gate `ExecProbeTimeout` that is GA and on by default.
73-
* If the feature gate `ExecProbeTimeout` is disabled and an exec probe timeout is reached, add warning logs to inform users that exec probes are timing out.
80+
* If the feature gate `ExecProbeTimeout` is disabled and an exec probe timeout is reached, add a warning event to inform users that exec probes are timing out.
81+
* Introduce the [probe duration metric](https://github.com/kubernetes/kubernetes/issues/101035)
82+
* metric dimension cardinality must be reviewed and approved by SIG Instrumentation
7483
* Re-enable existing exec liveness probe e2e test.
7584
* Add new exec readiness probe e2e test.
7685

@@ -85,12 +94,15 @@ E2E tests:
8594

8695
This is a bug fix so the feature gate will be GA and on by default from the start.
8796

97+
Documentation on the migration steps must be provided on the Kubernetes
98+
documentation site, offering tips on detecting and updating affected workloads.
99+
88100
The feature flag should be kept available until we get sufficient evidence of people not being
89101
affected by this bug fix - either directly (adjusting the timeouts in pod definition), or
90102
indirectly, when the timeout is not specified in some third party templates and products
91103
that cannot be easily fixed by the end user.
92104

93-
Tentative timeline is to lock the feature flag to `true` in 1.22.
105+
Tentative timeline is to lock the feature flag to `true` in 1.25.
94106

95107
### Upgrade / Downgrade Strategy
96108

@@ -118,3 +130,128 @@ Some alternatives that were considered:
118130

119131
1. Increasing the default timeout for exec probes
120132
2. Continuing to ignore the exec probe timeout
133+
134+
## Production Readiness Review Questionnaire
135+
136+
### Feature Enablement and Rollback
137+
138+
###### How can this feature be enabled / disabled in a live cluster?
139+
140+
- [X] Feature gate (also fill in values in `kep.yaml`)
141+
- Feature gate name: `ExecProbeTimeout`
142+
- Components depending on the feature gate: kubelet
143+
144+
###### Does enabling the feature change any default behavior?
145+
146+
Yes. Workloads that did not account for the timeout affecting probe behavior
147+
will see their exec probes fail once the timeout is exceeded.
148+
149+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
150+
151+
Yes, by setting the feature gate back to `false`.
152+
153+
###### What happens if we reenable the feature if it was previously rolled back?
154+
155+
The behavior is restored immediately.
156+
157+
###### Are there any tests for feature enablement/disablement?
158+
159+
N/A, trivial
160+
161+
### Rollout, Upgrade and Rollback Planning
162+
163+
###### How can a rollout or rollback fail? Can it impact already running workloads?
164+
165+
Rollout and rollback are straightforward and are not expected to fail.
166+
167+
###### What specific metrics should inform a rollback?
168+
169+
Pods entering `CrashLoopBackOff` because of exec probe timeout failures.
170+
171+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
172+
173+
N/A, trivial
174+
175+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
176+
177+
No
178+
179+
### Monitoring Requirements
180+
181+
The only mechanism currently implemented is warning logs in kubelet.
182+
The KEP was updated to introduce warning events for cases when the timeout
183+
was exceeded. By analyzing these events, an operator can verify that no workloads are
184+
currently affected by this bug.
185+
186+
###### How can an operator determine if the feature is in use by workloads?
187+
188+
Before migration, analyze events indicating that an exec probe exceeded its timeout.
189+
Once the feature gate is enabled, there is no way to determine whether exec probe failures
190+
caused by an exceeded timeout were intentional.
191+
192+
###### How can someone using this feature know that it is working for their instance?
193+
194+
There is no way to tell: once the feature gate is enabled, it is not possible to determine
195+
whether exec probe failures caused by an exceeded timeout were intentional.
196+
197+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
198+
199+
SLO of the feature: exec probes must fail when the timeout is exceeded. This can be
200+
checked by verifying that the probe duration metric does not significantly exceed
201+
the timeout value.
202+
203+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
204+
205+
- [x] Metrics
206+
- Metric name: `probe_duration_seconds`
207+
208+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
209+
210+
The [probe duration metric](https://github.com/kubernetes/kubernetes/issues/101035)
211+
has not been implemented yet.
212+
213+
### Dependencies
214+
215+
###### Does this feature depend on any specific services running in the cluster?
216+
217+
No
218+
219+
### Scalability
220+
221+
###### Will enabling / using this feature result in any new API calls?
222+
223+
No
224+
225+
###### Will enabling / using this feature result in introducing new API types?
226+
227+
No
228+
229+
###### Will enabling / using this feature result in any new calls to the cloud provider?
230+
231+
No
232+
233+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
234+
235+
No
236+
237+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
238+
239+
No
240+
241+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
242+
243+
No
244+
245+
### Troubleshooting
246+
247+
`kubelet.log` can be used to troubleshoot all probe behavior.
248+
249+
###### How does this feature react if the API server and/or etcd is unavailable?
250+
251+
###### What are other known failure modes?
252+
253+
None
254+
255+
###### What steps should be taken if SLOs are not being met to determine the problem?
256+
257+
None. This is core kubelet functionality.

keps/sig-node/1972-kubelet-exec-probe-timeouts/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ authors:
55
- "@SergeyKanzhelev"
66
owning-sig: sig-node
77
participating-sigs:
8-
status: implemented
8+
status: implementable
99
creation-date: 2020-09-08
1010
reviewers:
1111
- "@dchen1107"
@@ -20,11 +20,11 @@ stage: stable
2020
# The most recent milestone for which work toward delivery of this KEP has been
2121
# done. This can be the current (upcoming) milestone, if it is being actively
2222
# worked on.
23-
latest-milestone: "v1.20"
23+
latest-milestone: "v1.24"
2424

2525
# The milestone at which this feature was, or is targeted to be, at each stage.
2626
milestone:
27-
stable: "v1.20"
27+
stable: "v1.24"
2828

2929
# The following PRR answers are required at alpha release
3030
# List the feature gate name and the components for which it must be enabled
