Commit c0c0a59: Add PRR answers

1 parent 488232d commit c0c0a59

2 files changed, +47 -13 lines
Lines changed: 2 additions & 0 deletions

```diff
@@ -1,3 +1,5 @@
 kep-number: 2238
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"
```

keps/sig-node/2238-liveness-probe-grace-period/README.md

Lines changed: 45 additions & 13 deletions
```diff
@@ -8,8 +8,6 @@
 - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
   - [Configuration example](#configuration-example)
-  - [User Stories (Optional)](#user-stories-optional)
-  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Test Plan](#test-plan)
@@ -29,7 +27,6 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -325,17 +322,34 @@ _This section must be completed when targeting beta graduation to a release._
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?

+This change is purely additive. Already-running workloads on a node will not
+have this field set.
+
 * **What specific metrics should inform a rollback?**

+Increases in error rates on shutdowns initiated by probes; unexpected
+shutdown times for pods with containers that have probes configured.
+
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.

+For the API server, rollback can be tested by:
+
+- Turning on the feature flag
+- Creating a workload with the field
+- Disabling the feature flag
+
+We will remove the feature flag from the kubelet for beta to avoid worrying
+about the n-2 skew case.
+
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
 Even if applying deprecation policies, they may still surprise some users.

+No.
+
 ### Monitoring Requirements

 _This section must be completed when targeting beta graduation to a release._
@@ -345,10 +359,15 @@ _This section must be completed when targeting beta graduation to a release._
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.

+On kube-api-server: `ProbeTerminationGracePeriod` is turned on.
+
+On individual workloads: a probe-level `TerminationGracePeriodSeconds` value
+is set.
+
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
-  - [ ] Metrics
-    - Metric name:
+  - [X] Metrics
+    - Metric name: existing cluster metrics for measuring container shutdown time and errors
     - [Optional] Aggregation method:
     - Components exposing the metric:
   - [ ] Other (treat as last resort)
@@ -363,11 +382,17 @@ the health of the service?**
 job creation time) for cron job <= 10%
 - 99,9% of /health requests per day finish with 200 code

+A cluster administrator could set SLO targets for pod shutdown latency due to
+probe failure, as well as error rates on pod shutdown due to probe failure.
+
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).

+I believe this should be covered by existing metrics. Adding new metrics for
+this should not be in scope.
+
 ### Dependencies

 _This section must be completed when targeting beta graduation to a release._
@@ -386,6 +411,7 @@ _This section must be completed when targeting beta graduation to a release._
 - Impact of its outage on the feature:
 - Impact of its degraded performance or high-error rates on the feature:

+N/A.

 ### Scalability

@@ -460,6 +486,9 @@ _This section must be completed when targeting beta graduation to a release._

 * **How does this feature react if the API server and/or etcd is unavailable?**

+Kubelet will already have cached/watched the pod spec. There will not be new
+failure modes for the kubelet responding to API server/etcd unavailability.
+
 * **What are other known failure modes?**
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -472,14 +501,25 @@ _This section must be completed when targeting beta graduation to a release._
 Not required until feature graduated to beta.
 - Testing: Are there any tests for failure mode? If not, describe why.

+Cluster administrators can monitor this by checking kubelet logs, which note
+grace periods on pod termination. Application developers can monitor this by
+observing shutdown behaviour of their pods for probe failures and for normal
+shutdown events.
+
 * **What steps should be taken if SLOs are not being met to determine the problem?**

+Application developers have the option of unsetting the pod-level
+`TerminationGracePeriodSeconds` for their application. Cluster-wide, the
+feature gate could be disabled for the API server.
+
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

 ## Implementation History

 - 2021-01-07: Initial draft KEP
+- 2021-03-11: Alpha implementation
+  ([kubernetes/kubernetes#99375](https://github.com/kubernetes/kubernetes/pull/99375))

 ## Drawbacks

@@ -528,11 +568,3 @@ Introduce a similar field, but at the pod-level, next to the existing field.
 - PRO: similar values are grouped together
 - CON: logically, liveness probes operate on a per-container basis, and thus we
 may need different thresholds for different containers
-
-## Infrastructure Needed (Optional)
-
-<!--
-Use this section if you need things from the project/SIG. Examples include a
-new subproject, repos requested, or GitHub details. Listing these here allows a
-SIG to get the process for these resources started right away.
--->
```
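For context on the field this commit's PRR answers refer to ("Creating a workload with the field"), a minimal sketch of a Pod spec using the probe-level grace period that KEP 2238 introduces. Names and values here are illustrative, not taken from this commit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-grace-demo   # illustrative name
spec:
  # Pod-level grace period: generous, to allow slow cleanup on normal shutdown.
  terminationGracePeriodSeconds: 3600
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
      # Probe-level override (the field added by this KEP, gated by
      # ProbeTerminationGracePeriod): when the liveness probe fails, kill the
      # container after 60s instead of waiting the pod-level 3600s.
      terminationGracePeriodSeconds: 60
```

With the feature gate disabled on the API server, the probe-level field would be dropped on admission and only the pod-level value would apply, which is the rollback path the answers above describe.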
