- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Configuration example](#configuration-example)
- - [User Stories (Optional)](#user-stories-optional)
- - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- - [Alpha](#alpha)
- - [Beta](#beta)
- - [Graduation](#graduation)
+ - [Alpha (1.21)](#alpha-121)
+ - [Beta (1.22)](#beta-122)
+ - [Graduation (1.25)](#graduation-125)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist
@@ -43,9 +40,9 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] (R) Production readiness review approved
- - [ ] "Implementation History" section is up-to-date for milestone
- - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- - [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+ - [X] "Implementation History" section is up-to-date for milestone
+ - [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+ - [X] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -140,12 +137,6 @@ spec:
terminationGracePeriodSeconds: 60
` ` `

- ### User Stories (Optional)
-
- N/A - bugfix
-
- ### Notes/Constraints/Caveats (Optional)
-
### Risks and Mitigations

This should be a low-risk API change as it is backwards-compatible. If the
@@ -192,21 +183,27 @@ quickly.

### Graduation Criteria

- #### Alpha
+ #### Alpha (1.21)

- New probe field, `terminationGracePeriodSeconds`, is implemented and
  available behind a feature flag.
- Appropriate tests are written.

- _Below graduation criteria are tentative._
+ #### Beta (1.22)

- #### Beta
+ - Feature flag will default to off.
+ - Remove feature gate from [kubelet](https://github.com/kubernetes/kubernetes/pull/99375#issuecomment-794680869).
+ - Ensure that when the feature gate is off in the API server, probe-level
+   `TerminationGracePeriodSeconds` is blanked out.
+ - Add validation to ensure `terminationGracePeriodSeconds` is non-negative.
+ - Feature flag defaults to on once kube-apiserver is two minor versions ahead
+   of the kubelet release that has the support (1.24).
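The two API-server-side Beta criteria (blanking the field while the gate is off, and rejecting negative values) can be illustrated with a minimal Python sketch. This is not the Kubernetes implementation: the helper names `drop_disabled_fields` and `validate_probe` are hypothetical, pod specs are modelled as plain dicts, and the real API server may apply extra rules (for example, around objects that already carry the field).

```python
from copy import deepcopy

FIELD = "terminationGracePeriodSeconds"
# Standard probe keys on a container spec; the sketch strips the field
# from whichever of them happens to carry it.
PROBE_KEYS = ("livenessProbe", "readinessProbe", "startupProbe")


def validate_probe(probe):
    """Return a list of error strings for one probe dict (hypothetical helper)."""
    value = probe.get(FIELD)
    if value is not None and value < 0:
        return [f"{FIELD} must be non-negative, got {value}"]
    return []


def drop_disabled_fields(pod_spec, gate_enabled):
    """Return a copy of pod_spec with the probe-level field removed
    when the feature gate is off (hypothetical helper)."""
    spec = deepcopy(pod_spec)
    if gate_enabled:
        return spec
    for container in spec.get("containers", []):
        for key in PROBE_KEYS:
            probe = container.get(key)
            if probe is not None:
                probe.pop(FIELD, None)  # pod-level field is left untouched
    return spec
```

Note that only the probe-level value is removed; the pod-level `terminationGracePeriodSeconds` survives, so a workload falls back to the old behaviour rather than losing its grace period entirely.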

- - Feature flag is defaulted on.
+ _Below graduation criteria are tentative._

- #### Graduation
+ #### Graduation (1.25)

- - Feature flag is removed, feature is graduated.
+ - Feature flag is removed, feature is graduated (1.25).

### Upgrade / Downgrade Strategy

@@ -245,7 +242,11 @@ enhancement:
n-2 kubelet without this feature will default to the old behaviour, using the
pod-level `terminationGracePeriodSeconds`.

- Only when feature gate is enabled for all components will we use the new field.
+ The feature gate will be removed from the kubelet in 1.22, ensuring support
+ for n-2 version skew on a 1.24+ API server.
+
+ The new field is used only when the feature gate is enabled for all
+ components, so we delay defaulting the feature flag to on until then.
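The fallback described above can be sketched as a small Python function. It is an illustration only, not kubelet code: the function name and the `kubelet_supports_field` parameter are hypothetical stand-ins for "this kubelet version understands the probe-level field".

```python
def grace_period_on_probe_failure(pod_level_seconds, probe_level_seconds,
                                  kubelet_supports_field):
    """Pick the grace period for a termination triggered by a probe failure.

    A kubelet without the feature ignores the probe-level value and keeps
    using the pod-level one, which is the old behaviour.
    """
    if kubelet_supports_field and probe_level_seconds is not None:
        return probe_level_seconds
    return pod_level_seconds
```

For example, with a pod-level value of 3600 and a probe-level value of 60, a supporting kubelet would use 60, while an n-2 kubelet without the feature would still use 3600.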

## Production Readiness Review Questionnaire

@@ -299,7 +300,8 @@ _This section must be completed when targeting alpha to a release._
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).

- Yes: disable feature flag.
+ Yes: disable feature flag. The API server will blank out the field if a value
+ is set in etcd while the feature flag is disabled.

While feature flag is enabled, the feature can also be disabled by unsetting
the field on the Probe specification, which will restore the default
@@ -329,17 +331,34 @@ _This section must be completed when targeting beta graduation to a release._
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

+ This change is purely additive. Already-running workloads on a node will not
+ have this field set.
+
* **What specific metrics should inform a rollback?**

+ Increases in error rates on shutdowns initiated by probes; unexpected
+ shutdown times for pods with containers that have probes configured.
+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.

+ For the API server, rollback can be tested by:
+
+ - Turning on the feature flag
+ - Creating a workload with the field
+ - Disabling the feature flag
+
+ We will remove the feature flag from the kubelet for beta to avoid worrying
+ about the n-2 skew case.
+
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.

+ No.
+
### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._
@@ -349,10 +368,15 @@ _This section must be completed when targeting beta graduation to a release._
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.

+ On kube-apiserver: the `ProbeTerminationGracePeriod` feature gate is turned on.
+
+ On individual workloads: a probe-level `TerminationGracePeriodSeconds` value
+ is set.
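As an illustration of the per-workload check, the following Python sketch walks a pod manifest (already parsed into a dict) and reports every probe that sets the field. The function name `probes_using_field` is hypothetical, and the sketch only inspects `spec.containers`; it is one possible way an operator might script the check, not tooling provided by this KEP.

```python
FIELD = "terminationGracePeriodSeconds"
PROBE_KEYS = ("livenessProbe", "readinessProbe", "startupProbe")


def probes_using_field(pod):
    """Return (container name, probe key, value) for each probe that sets FIELD."""
    hits = []
    for container in pod.get("spec", {}).get("containers", []):
        for key in PROBE_KEYS:
            probe = container.get(key) or {}
            if probe.get(FIELD) is not None:
                hits.append((container.get("name"), key, probe[FIELD]))
    return hits
```

An empty result means the workload relies solely on the pod-level value and is unaffected by the feature gate.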
+
* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - [ ] Metrics
- - Metric name:
+ - [X] Metrics
+ - Metric name: existing cluster metrics for measuring container shutdown time and errors
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
@@ -367,11 +391,17 @@ the health of the service?**
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code

+ A cluster administrator could set SLO targets for pod shutdown latency due to
+ probe failure, as well as error rates on pod shutdown due to probe failure.
+
* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).

+ This should be covered by existing metrics. Adding new metrics for this is
+ out of scope.
+
### Dependencies

_This section must be completed when targeting beta graduation to a release._
@@ -390,6 +420,7 @@ _This section must be completed when targeting beta graduation to a release._
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:

+ N/A.

### Scalability

@@ -464,6 +495,9 @@ _This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

+ The kubelet will already have cached/watched the pod spec, so this feature
+ adds no new failure modes when the API server or etcd is unavailable.
+
* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
@@ -476,14 +510,25 @@ _This section must be completed when targeting beta graduation to a release._
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.

+ Cluster administrators can monitor this by checking kubelet logs, which note
+ grace periods on pod termination. Application developers can monitor this by
+ observing shutdown behaviour of their pods for probe failures and for normal
+ shutdown events.
+
* **What steps should be taken if SLOs are not being met to determine the problem?**

+ Application developers have the option of unsetting the probe-level
+ `TerminationGracePeriodSeconds` for their application. Cluster-wide, the
+ feature gate could be disabled for the API server.
+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

- 2021-01-07: Initial draft KEP
+ - 2021-03-11: Alpha implementation
+   ([kubernetes/kubernetes#99375](https://github.com/kubernetes/kubernetes/pull/99375))

## Drawbacks

@@ -532,11 +577,3 @@ Introduce a similar field, but at the pod-level, next to the existing field.
- PRO: similar values are grouped together
- CON: logically, liveness probes operate on a per-container basis, and thus we
  may need different thresholds for different containers
-
- ## Infrastructure Needed (Optional)
-
- <!--
- Use this section if you need things from the project/SIG. Examples include a
- new subproject, repos requested, or GitHub details. Listing these here allows a
- SIG to get the process for these resources started right away.
- -->