Commit c0c0a59: Add PRR answers

1 parent 488232d commit c0c0a59

2 files changed, +47 -13 lines
Lines changed: 2 additions & 0 deletions

```diff
@@ -1,3 +1,5 @@
 kep-number: 2238
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"
```

keps/sig-node/2238-liveness-probe-grace-period/README.md

Lines changed: 45 additions & 13 deletions
```diff
@@ -8,8 +8,6 @@
 - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
   - [Configuration example](#configuration-example)
-  - [User Stories (Optional)](#user-stories-optional)
-  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Test Plan](#test-plan)
@@ -29,7 +27,6 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -325,17 +322,34 @@ _This section must be completed when targeting beta graduation to a release._
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?

+This change is purely additive. Already-running workloads on a node will not
+have this field set.
+
 * **What specific metrics should inform a rollback?**

+Increases in error rates on shutdowns initiated by probes; unexpected
+shutdown times for pods with containers that have probes configured.
+
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.

+For the API server, rollback can be tested by:
+
+- Turning on the feature flag
+- Creating a workload with the field
+- Disabling the feature flag
+
+We will remove the feature flag from the kubelet for beta to avoid worrying
+about the n-2 skew case.
+
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
 Even if applying deprecation policies, they may still surprise some users.

+No.
+
 ### Monitoring Requirements

 _This section must be completed when targeting beta graduation to a release._
@@ -345,10 +359,15 @@ _This section must be completed when targeting beta graduation to a release._
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.

+On kube-api-server: `ProbeTerminationGracePeriod` is turned on.
+
+On individual workloads: a probe-level `TerminationGracePeriodSeconds` value
+is set.
+
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
-  - [ ] Metrics
-    - Metric name:
+  - [X] Metrics
+    - Metric name: existing cluster metrics for measuring container shutdown time and errors
     - [Optional] Aggregation method:
     - Components exposing the metric:
   - [ ] Other (treat as last resort)
@@ -363,11 +382,17 @@ the health of the service?**
 job creation time) for cron job <= 10%
 - 99,9% of /health requests per day finish with 200 code

+A cluster administrator could set SLO targets for pod shutdown latency due to
+probe failure, as well as error rates on pod shutdown due to probe failure.
+
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).

+I believe this should be covered by existing metrics. Adding new metrics for
+this should not be in scope.
+
 ### Dependencies

 _This section must be completed when targeting beta graduation to a release._
@@ -386,6 +411,7 @@ _This section must be completed when targeting beta graduation to a release._
 - Impact of its outage on the feature:
 - Impact of its degraded performance or high-error rates on the feature:

+N/A.

 ### Scalability

@@ -460,6 +486,9 @@ _This section must be completed when targeting beta graduation to a release._

 * **How does this feature react if the API server and/or etcd is unavailable?**

+Kubelet will already have cached/watched the pod spec. There will not be new
+failure modes for the kubelet responding to API server/etcd unavailability.
+
 * **What are other known failure modes?**
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -472,14 +501,25 @@ _This section must be completed when targeting beta graduation to a release._
 Not required until feature graduated to beta.
 - Testing: Are there any tests for failure mode? If not, describe why.

+Cluster administrators can monitor this by checking kubelet logs, which note
+grace periods on pod termination. Application developers can monitor this by
+observing shutdown behaviour of their pods for probe failures and for normal
+shutdown events.
+
 * **What steps should be taken if SLOs are not being met to determine the problem?**

+Application developers have the option of unsetting the pod-level
+`TerminationGracePeriodSeconds` for their application. Cluster-wide, the
+feature gate could be disabled for the API server.
+
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

 ## Implementation History

 - 2021-01-07: Initial draft KEP
+- 2021-03-11: Alpha implementation
+  ([kubernetes/kubernetes#99375](https://github.com/kubernetes/kubernetes/pull/99375))

 ## Drawbacks

@@ -528,11 +568,3 @@ Introduce a similar field, but at the pod-level, next to the existing field.
 - PRO: similar values are grouped together
 - CON: logically, liveness probes operate on a per-container basis, and thus we
 may need different thresholds for different containers
-
-## Infrastructure Needed (Optional)
-
-<!--
-Use this section if you need things from the project/SIG. Examples include a
-new subproject, repos requested, or GitHub details. Listing these here allows a
-SIG to get the process for these resources started right away.
--->
```
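For context on the field this commit's PRR answers refer to ("Creating a workload with the field"), a minimal sketch of a Pod spec using the probe-level grace period that KEP 2238 introduces. Names and values here are illustrative, not taken from this commit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-grace-demo   # illustrative name
spec:
  # Pod-level grace period: generous, to allow slow cleanup on normal shutdown.
  terminationGracePeriodSeconds: 3600
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
      # Probe-level override (the field added by this KEP, gated by
      # ProbeTerminationGracePeriod): when the liveness probe fails, kill the
      # container after 60s instead of waiting the pod-level 3600s.
      terminationGracePeriodSeconds: 60
```

With the feature gate disabled on the API server, the probe-level field would be dropped on admission and only the pod-level value would apply, which is the rollback path the answers above describe.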
