
Commit 0da867d

Merge pull request kubernetes#2638 from ehashman/update-2238
Update probe grace period KEP to beta
2 parents 89df46b + 4f9ca82 commit 0da867d

3 files changed: +75 -37 lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 2238
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"

keps/sig-node/2238-liveness-probe-grace-period/README.md

Lines changed: 70 additions & 33 deletions
@@ -8,15 +8,13 @@
88
- [Non-Goals](#non-goals)
99
- [Proposal](#proposal)
1010
- [Configuration example](#configuration-example)
11-
- [User Stories (Optional)](#user-stories-optional)
12-
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
1311
- [Risks and Mitigations](#risks-and-mitigations)
1412
- [Design Details](#design-details)
1513
- [Test Plan](#test-plan)
1614
- [Graduation Criteria](#graduation-criteria)
17-
- [Alpha](#alpha)
18-
- [Beta](#beta)
19-
- [Graduation](#graduation)
15+
- [Alpha (1.21)](#alpha-121)
16+
- [Beta (1.22)](#beta-122)
17+
- [Graduation (1.25)](#graduation-125)
2018
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
2119
- [Version Skew Strategy](#version-skew-strategy)
2220
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -29,7 +27,6 @@
2927
- [Implementation History](#implementation-history)
3028
- [Drawbacks](#drawbacks)
3129
- [Alternatives](#alternatives)
32-
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
3330
<!-- /toc -->
3431

3532
## Release Signoff Checklist
@@ -43,9 +40,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4340
- [X] (R) Graduation criteria is in place
4441
- [X] (R) Production readiness review completed
4542
- [X] (R) Production readiness review approved
46-
- [ ] "Implementation History" section is up-to-date for milestone
47-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
48-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
43+
- [X] "Implementation History" section is up-to-date for milestone
44+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
45+
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
4946

5047
[kubernetes.io]: https://kubernetes.io/
5148
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -140,12 +137,6 @@ spec:
140137
terminationGracePeriodSeconds: 60
141138
```
142139
143-
### User Stories (Optional)
144-
145-
N/A - bugfix
146-
147-
### Notes/Constraints/Caveats (Optional)
148-
149140
### Risks and Mitigations
150141
151142
This should be a low-risk API change as it is backwards-compatible. If the
@@ -192,21 +183,27 @@ quickly.
192183

193184
### Graduation Criteria
194185

195-
#### Alpha
186+
#### Alpha (1.21)
196187

197188
- New probe field, `terminationGracePeriodSeconds`, is implemented and
198189
available behind a feature flag.
199190
- Appropriate tests are written.
200191

201-
_Below graduation criteria are tentative._
192+
#### Beta (1.22)
202193

203-
#### Beta
194+
- Feature flag will default to off.
195+
- Remove feature gate from [kubelet](https://github.com/kubernetes/kubernetes/pull/99375#issuecomment-794680869).
196+
- Ensure that when feature gate is off in API server, probe-level
197+
`TerminationGracePeriodSeconds` is blanked out.
198+
- Add validation to ensure `terminationGracePeriodSeconds` is non-negative.
199+
- Feature flag is defaulted to on after kube-apiserver is +2 versions of the
200+
kubelet having the support (1.24).
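
The validation bullet above can be illustrated with a minimal Python sketch. This is not the Kubernetes implementation (which is written in Go); the function name `validate_probe_grace_period` and the error message are hypothetical, and the rule modelled is only the "non-negative" requirement stated in the beta criteria.

```python
from typing import List, Optional


def validate_probe_grace_period(value: Optional[int]) -> List[str]:
    """Return validation errors for a probe-level terminationGracePeriodSeconds.

    Hypothetical helper: models only the rule stated in the KEP text,
    namely that the value, when set, must be non-negative.
    """
    if value is None:
        # Field unset: nothing to validate, the pod-level value applies.
        return []
    if value < 0:
        return [f"terminationGracePeriodSeconds: must be non-negative, got {value}"]
    return []
```

An unset field or a value such as `60` yields no errors, while `-1` yields a single error entry.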
204201

205-
- Feature flag is defaulted on.
202+
_Below graduation criteria are tentative._
206203

207-
#### Graduation
204+
#### Graduation (1.25)
208205

209-
- Feature flag is removed, feature is graduated.
206+
- Feature flag is removed, feature is graduated (1.25).
210207

211208
### Upgrade / Downgrade Strategy
212209

@@ -245,7 +242,11 @@ enhancement:
245242
n-2 kubelet without this feature will default to the old behaviour, using the
246243
pod-level `terminationGracePeriodSeconds`.
247244

248-
Only when feature gate is enabled for all components will we use the new field.
245+
Feature gate will be removed from kubelet in 1.22, ensuring support for n-2
246+
version skew on a 1.24+ API server.
247+
248+
Only when feature gate is enabled for all components will we use the new field,
249+
so we delay defaulting the feature flag on until then.
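
The fallback described in this section can be sketched as a small selection function. This is an illustration in Python under stated assumptions, not kubelet code: the function name and the `kubelet_supports_field` parameter are hypothetical stand-ins for "this kubelet version understands the probe-level field".

```python
from typing import Optional


def effective_grace_period(
    pod_seconds: int,
    probe_seconds: Optional[int],
    kubelet_supports_field: bool,
) -> int:
    """Pick the grace period used when a liveness probe failure kills a container.

    Hypothetical model: a kubelet that supports the feature prefers the
    probe-level value when it is set; an older (n-2) kubelet ignores it and
    falls back to the pod-level value.
    """
    if kubelet_supports_field and probe_seconds is not None:
        return probe_seconds
    return pod_seconds
```

With a pod-level value of 3600 and a probe-level value of 60 (illustrative numbers), a supporting kubelet would use 60 and an older kubelet would use 3600.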
249250

250251
## Production Readiness Review Questionnaire
251252

@@ -299,7 +300,8 @@ _This section must be completed when targeting alpha to a release._
299300
Describe the consequences on existing workloads (e.g., if this is a runtime
300301
feature, can it break the existing applications?).
301302

302-
Yes: disable feature flag.
303+
Yes: disable feature flag. API server will blank out values if it is set in
304+
etcd while feature flag is disabled.
303305

304306
While feature flag is enabled, the feature can also be disabled by unsetting
305307
the field on the Probe specification, which will restore the default
@@ -329,17 +331,34 @@ _This section must be completed when targeting beta graduation to a release._
329331
Try to be as paranoid as possible - e.g., what if some components will restart
330332
mid-rollout?
331333

334+
This change is purely additive. Already-running workloads on a node will not
335+
have this field set.
336+
332337
* **What specific metrics should inform a rollback?**
333338

339+
Increases in error rates on shutdowns initiated by probes; unexpected
340+
shutdown times for pods with containers that have probes configured.
341+
334342
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
335343
Describe manual testing that was done and the outcomes.
336344
Longer term, we may want to require automated upgrade/rollback tests, but we
337345
are missing a bunch of machinery and tooling and can't do that now.
338346

347+
For the API server, rollback can be tested by:
348+
349+
- Turning on the feature flag
350+
- Creating a workload with the field
351+
- Disabling the feature flag
352+
353+
We will remove the feature flag from the kubelet for beta to avoid worrying
354+
about the n-2 skew case.
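
The three rollback steps above can be walked through with a toy model. The sketch below is Python, not Kubernetes code: `FakeAPIServer` is a hypothetical in-memory stand-in, and the blanking rule follows the KEP text (values are blanked out while the feature flag is disabled) rather than any confirmed implementation detail. The values 3600 and 60 are illustrative.

```python
import copy
from typing import Dict, Optional, Tuple

PROBE_FIELD = "terminationGracePeriodSeconds"


class FakeAPIServer:
    """Hypothetical in-memory stand-in for kube-apiserver."""

    def __init__(self, gate_enabled: bool) -> None:
        self.gate_enabled = gate_enabled
        self._etcd: Dict[str, dict] = {}

    def _blank_if_disabled(self, pod: dict) -> dict:
        # Work on a copy so stored objects are never mutated in place.
        pod = copy.deepcopy(pod)
        if not self.gate_enabled:
            for container in pod["spec"]["containers"]:
                probe = container.get("livenessProbe")
                if probe is not None:
                    probe.pop(PROBE_FIELD, None)
        return pod

    def create(self, name: str, pod: dict) -> None:
        self._etcd[name] = self._blank_if_disabled(pod)

    def get(self, name: str) -> dict:
        return self._blank_if_disabled(self._etcd[name])


def probe_grace_period(pod: dict) -> Optional[int]:
    """Read the probe-level field of the first container, if present."""
    return pod["spec"]["containers"][0]["livenessProbe"].get(PROBE_FIELD)


def run_rollback_check() -> Tuple[Optional[int], Optional[int]]:
    server = FakeAPIServer(gate_enabled=True)        # 1. turn on the feature flag
    pod = {
        "spec": {
            "terminationGracePeriodSeconds": 3600,   # pod-level value, untouched
            "containers": [
                {"name": "app", "livenessProbe": {PROBE_FIELD: 60}},
            ],
        }
    }
    server.create("demo", pod)                       # 2. create a workload with the field
    before = probe_grace_period(server.get("demo"))
    server.gate_enabled = False                      # 3. disable the feature flag
    after = probe_grace_period(server.get("demo"))
    return before, after
```

In this model, `run_rollback_check()` reads the probe-level value while the flag is on and finds it absent after the flag is turned off, while the pod-level field is left alone.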
355+
339356
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
340357
fields of API types, flags, etc.?**
341358
Even if applying deprecation policies, they may still surprise some users.
342359

360+
No.
361+
343362
### Monitoring Requirements
344363

345364
_This section must be completed when targeting beta graduation to a release._
@@ -349,10 +368,15 @@ _This section must be completed when targeting beta graduation to a release._
349368
checking if there are objects with field X set) may be a last resort. Avoid
350369
logs or events for this purpose.
351370

371+
On kube-api-server: `ProbeTerminationGracePeriod` is turned on.
372+
373+
On individual workloads: a probe-level `TerminationGracePeriodSeconds` value
374+
is set.
375+
352376
* **What are the SLIs (Service Level Indicators) an operator can use to determine
353377
the health of the service?**
354-
- [ ] Metrics
355-
- Metric name:
378+
- [X] Metrics
379+
- Metric name: existing cluster metrics for measuring container shutdown time and errors
356380
- [Optional] Aggregation method:
357381
- Components exposing the metric:
358382
- [ ] Other (treat as last resort)
@@ -367,11 +391,17 @@ the health of the service?**
367391
job creation time) for cron job <= 10%
368392
- 99,9% of /health requests per day finish with 200 code
369393

394+
A cluster administrator could set SLO targets for pod shutdown latency due to
395+
probe failure, as well as error rates on pod shutdown due to probe failure.
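
As a rough illustration of such a latency target, the Python sketch below checks whether a set of observed shutdown durations meets an objective. The function name, the input format, and the example thresholds are all assumptions; the KEP does not define a specific metric or target.

```python
from typing import Sequence


def shutdown_slo_met(
    durations_s: Sequence[float],
    target_s: float,
    objective: float,
) -> bool:
    """Check a hypothetical SLO on probe-triggered shutdown latency.

    durations_s: observed shutdown durations, in seconds.
    target_s:    latency threshold a shutdown should stay within.
    objective:   required fraction of shutdowns within the threshold (0..1).
    """
    if not durations_s:
        # No probe-triggered shutdowns observed: nothing violated the SLO.
        return True
    within = sum(1 for d in durations_s if d <= target_s)
    return within / len(durations_s) >= objective
```

For example, with durations `[10, 20, 70, 30]`, a 60-second target, and a 75% objective, three of four shutdowns fall within the target, so the check passes.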
396+
370397
* **Are there any missing metrics that would be useful to have to improve observability
371398
of this feature?**
372399
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
373400
implementation difficulties, etc.).
374401

402+
I believe this should be covered by existing metrics. Adding new metrics for
403+
this should not be in scope.
404+
375405
### Dependencies
376406

377407
_This section must be completed when targeting beta graduation to a release._
@@ -390,6 +420,7 @@ _This section must be completed when targeting beta graduation to a release._
390420
- Impact of its outage on the feature:
391421
- Impact of its degraded performance or high-error rates on the feature:
392422

423+
N/A.
393424

394425
### Scalability
395426

@@ -464,6 +495,9 @@ _This section must be completed when targeting beta graduation to a release._
464495

465496
* **How does this feature react if the API server and/or etcd is unavailable?**
466497

498+
Kubelet will already have cached/watched the pod spec. There will not be new
499+
failure modes for the kubelet responding to API server/etcd unavailability.
500+
467501
* **What are other known failure modes?**
468502
For each of them, fill in the following information by copying the below template:
469503
- [Failure mode brief description]
@@ -476,14 +510,25 @@ _This section must be completed when targeting beta graduation to a release._
476510
Not required until feature graduated to beta.
477511
- Testing: Are there any tests for failure mode? If not, describe why.
478512

513+
Cluster administrators can monitor this by checking kubelet logs, which note
514+
grace periods on pod termination. Application developers can monitor this by
515+
observing shutdown behaviour of their pods for probe failures and for normal
516+
shutdown events.
517+
479518
* **What steps should be taken if SLOs are not being met to determine the problem?**
480519

520+
Application developers have the option of unsetting the probe-level
521+
`TerminationGracePeriodSeconds` for their application. Cluster-wide, the
522+
feature gate could be disabled for the API server.
523+
481524
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
482525
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
483526

484527
## Implementation History
485528

486529
- 2021-01-07: Initial draft KEP
530+
- 2021-03-11: Alpha implementation
531+
([kubernetes/kubernetes#99375](https://github.com/kubernetes/kubernetes/pull/99375))
487532

488533
## Drawbacks
489534

@@ -532,11 +577,3 @@ Introduce a similar field, but at the pod-level, next to the existing field.
532577
- PRO: similar values are grouped together
533578
- CON: logically, liveness probes operate on a per-container basis, and thus we
534579
may need different thresholds for different containers
535-
536-
## Infrastructure Needed (Optional)
537-
538-
<!--
539-
Use this section if you need things from the project/SIG. Examples include a
540-
new subproject, repos requested, or GitHub details. Listing these here allows a
541-
SIG to get the process for these resources started right away.
542-
-->

keps/sig-node/2238-liveness-probe-grace-period/kep.yaml

Lines changed: 3 additions & 4 deletions
@@ -18,24 +18,23 @@ prr-approvers:
   - "@johnbelamaric"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.21"
+latest-milestone: "v1.22"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.21"
   beta: "v1.22"
-  stable: "v1.23"
+  stable: "v1.25"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates:
   - name: ProbeTerminationGracePeriod
     components:
       - kube-apiserver
-      - kubelet
 disable-supported: true
