- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Configuration example](#configuration-example)
- - [User Stories (Optional)](#user-stories-optional)
- - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist
@@ -325,17 +322,34 @@ _This section must be completed when targeting beta graduation to a release._
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

+ This change is purely additive. Already-running workloads on a node will not
+ have this field set.
+
* **What specific metrics should inform a rollback?**

+ Increases in error rates on shutdowns initiated by probes; unexpected
+ shutdown times for pods with containers that have probes configured.
+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.

+ For the API server, rollback can be tested by:
+
+ - Turning on the feature flag
+ - Creating a workload with the field
+ - Disabling the feature flag
+
+ We will remove the feature flag from the kubelet for beta to avoid worrying
+ about the n-2 skew case.
+
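The rollback steps above can be sketched as a small simulation. This is a minimal illustration and not the actual API server code: the `Probe` type and the `dropDisabledProbeField` helper are hypothetical names, and the rule that a field already in use on the stored object is preserved while the gate is off is an assumption based on the usual Kubernetes pattern for gated fields.

```go
package main

import "fmt"

// Probe is a hypothetical, trimmed-down stand-in for the real API type.
type Probe struct {
	// TerminationGracePeriodSeconds is the probe-level override; nil means unset.
	TerminationGracePeriodSeconds *int64
}

// dropDisabledProbeField sketches how an API server might treat the field
// when the feature gate is off (assumed behavior, not taken from the KEP):
// the field is cleared on incoming objects unless the previously stored
// object already had it set.
func dropDisabledProbeField(gateEnabled bool, newProbe, oldProbe *Probe) {
	if gateEnabled || newProbe == nil {
		return
	}
	// Preserve the field when the stored object already uses it.
	if oldProbe != nil && oldProbe.TerminationGracePeriodSeconds != nil {
		return
	}
	newProbe.TerminationGracePeriodSeconds = nil
}

func main() {
	grace := int64(10)

	// Steps 1 and 2: gate on, workload created with the field.
	created := &Probe{TerminationGracePeriodSeconds: &grace}
	dropDisabledProbeField(true, created, nil)
	fmt.Println("gate on, create: field set =", created.TerminationGracePeriodSeconds != nil)

	// Step 3: gate off, the existing workload is updated.
	updated := &Probe{TerminationGracePeriodSeconds: &grace}
	dropDisabledProbeField(false, updated, created)
	fmt.Println("gate off, update: field set =", updated.TerminationGracePeriodSeconds != nil)

	// Gate off, a brand-new workload tries to set the field.
	fresh := &Probe{TerminationGracePeriodSeconds: &grace}
	dropDisabledProbeField(false, fresh, nil)
	fmt.Println("gate off, create: field set =", fresh.TerminationGracePeriodSeconds != nil)
}
```

Under these assumptions, a manual rollback test would check that the workload created in step 2 keeps its field after the gate is disabled, while newly created workloads do not.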
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.

+ No.
+
### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._
@@ -345,10 +359,15 @@ _This section must be completed when targeting beta graduation to a release._
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.

+ On kube-apiserver: the `ProbeTerminationGracePeriod` feature gate is turned on.
+
+ On individual workloads: a probe-level `TerminationGracePeriodSeconds` value
+ is set.
+
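The per-workload check described above could look roughly like the following. This is a sketch under stated assumptions: `Probe`, `Container`, and `PodSpec` are hypothetical, trimmed-down stand-ins for the real API types, `containersUsingProbeGracePeriod` is an illustrative helper name, and checking liveness and startup probes is an assumption about which probes can carry the field.

```go
package main

import "fmt"

// Probe, Container and PodSpec are hypothetical, trimmed-down stand-ins for
// the real Kubernetes API types; only the fields needed here are modeled.
type Probe struct {
	TerminationGracePeriodSeconds *int64
}

type Container struct {
	Name          string
	LivenessProbe *Probe
	StartupProbe  *Probe
}

type PodSpec struct {
	Containers []Container
}

// probeGracePeriodSet reports whether a probe carries the probe-level field.
func probeGracePeriodSet(p *Probe) bool {
	return p != nil && p.TerminationGracePeriodSeconds != nil
}

// containersUsingProbeGracePeriod returns the names of containers whose
// liveness or startup probe sets a probe-level grace period.
func containersUsingProbeGracePeriod(spec PodSpec) []string {
	var names []string
	for _, c := range spec.Containers {
		if probeGracePeriodSet(c.LivenessProbe) || probeGracePeriodSet(c.StartupProbe) {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	grace := int64(15)
	spec := PodSpec{Containers: []Container{
		{Name: "app", LivenessProbe: &Probe{TerminationGracePeriodSeconds: &grace}},
		{Name: "sidecar", LivenessProbe: &Probe{}},
	}}
	// Only "app" sets the probe-level field in this example spec.
	fmt.Println(containersUsingProbeGracePeriod(spec))
}
```

An operator could apply the same check to pod specs retrieved from the API server to see which workloads actually use the feature.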
* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - [ ] Metrics
- - Metric name:
+ - [X] Metrics
+ - Metric name: existing cluster metrics for measuring container shutdown time and errors
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
@@ -363,11 +382,17 @@ the health of the service?**
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code

+ A cluster administrator could set SLO targets for pod shutdown latency due to
+ probe failure, as well as error rates on pod shutdown due to probe failure.
+
* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).

+ I believe this should be covered by existing metrics. Adding new metrics for
+ this should not be in scope.
+
### Dependencies

_This section must be completed when targeting beta graduation to a release._
@@ -386,6 +411,7 @@ _This section must be completed when targeting beta graduation to a release._
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:

+ N/A.

### Scalability

@@ -460,6 +486,9 @@ _This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

+ Kubelet will already have cached/watched the pod spec. There will not be new
+ failure modes for the kubelet responding to API server/etcd unavailability.
+
* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
@@ -472,14 +501,25 @@ _This section must be completed when targeting beta graduation to a release._
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.

+ Cluster administrators can monitor this by checking kubelet logs, which note
+ grace periods on pod termination. Application developers can monitor this by
+ observing shutdown behaviour of their pods for probe failures and for normal
+ shutdown events.
+
* **What steps should be taken if SLOs are not being met to determine the problem?**

+ Application developers have the option of unsetting the pod-level
+ `TerminationGracePeriodSeconds` for their application. Cluster-wide, the
+ feature gate could be disabled for the API server.
+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

- 2021-01-07: Initial draft KEP
+ - 2021-03-11: Alpha implementation
+ ([kubernetes/kubernetes#99375](https://github.com/kubernetes/kubernetes/pull/99375))

## Drawbacks

@@ -528,11 +568,3 @@ Introduce a similar field, but at the pod-level, next to the existing field.
- PRO: similar values are grouped together
- CON: logically, liveness probes operate on a per-container basis, and thus we
may need different thresholds for different containers
-
- ## Infrastructure Needed (Optional)
-
- <!--
- Use this section if you need things from the project/SIG. Examples include a
- new subproject, repos requested, or GitHub details. Listing these here allows a
- SIG to get the process for these resources started right away.
- -->