Merge pull request kubernetes#2646 from adtac/suspend-beta

k8s-ci-robot · web-flow · commit 6ae9f537d58f · 2021-04-30T05:45:58.000-07:00
suspended jobs: graduate to beta
diff --git a/keps/prod-readiness/sig-apps/2232.yaml b/keps/prod-readiness/sig-apps/2232.yaml
@@ -1,3 +1,5 @@
 kep-number: 2232
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"
diff --git a/keps/sig-apps/2232-suspend-jobs/README.md b/keps/sig-apps/2232-suspend-jobs/README.md
@@ -81,7 +81,11 @@ SIG Architecture for cross-cutting KEPs).
   - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
   - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
   - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
@@ -280,6 +284,7 @@ Unit, integration, and end-to-end tests will be added to test that:
 ### Graduation Criteria
 
 #### Alpha -> Beta Graduation
+* Metrics with observability in to the Job controller available
 * Implemented feedback from alpha testers
 
 #### Beta -> GA Graduation
@@ -383,80 +388,91 @@ field.
   Jobs that have the flag set will be suspended, and new jobs or updates to existing
   ones to the field will be persisted.
 
-* **Are there any tests for feature enablement/disablement?** No.
-
-<!-- Uncomment when targeting beta graduation
+* **Are there any tests for feature enablement/disablement?** Yes. Integration
+  tests have exhaustive testing switching between different feature enablement
+  states whilst using the feature at the same time. Unit tests and end-to-end
+  tests test feature enablement too.
 
 ### Rollout, Upgrade and Rollback Planning
 
 _This section must be completed when targeting beta graduation to a release._
 
-* **How can a rollout fail? Can it impact already running workloads?**
-  Try to be as paranoid as possible - e.g., what if some components will restart
-   mid-rollout?
-
-* **What specific metrics should inform a rollback?**
-
-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+* **How can a rollout fail? Can it impact already running workloads?** Impact
+  to existing Jobs that previously didn't use this feature in alpha is
+  impossible. In workloads using the feature in an older version, suspended
+  Jobs may inadvertently be resumed (or Jobs may be inadvertently suspended) if
+  there are storage-related issues arising from components crashing
+  mid-rollout.
+
+* **What specific metrics should inform a rollback?** `job_sync_duration_seconds`
+  and `job_sync_total` should be observed. Unexpected spikes in the metric with
+  labels `result=error` and `action=pods_deleted` is potentially an indicator
+  that:
+    1. Job suspension is producing errors in the Job controller,
+    1. Jobs are getting suspended when they shouldn't be, or
+    1. Job sync latency is high when Job are suspended.
+  While the above list isn't exhaustive, they're signals in favour of rollbacks.
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path
+  tested?** <!-- I'll answer this after implementation.
   Describe manual testing that was done and the outcomes.
   Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling and can't do that now.
+  are missing a bunch of machinery and tooling and can't do that now. -->
 
-* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, 
-fields of API types, flags, etc.?**
-  Even if applying deprecation policies, they may still surprise some users.
+* **Is the rollout accompanied by any deprecations and/or removals of features,
+  APIs, fields of API types, flags, etc.?** No.
 
 ### Monitoring Requirements
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can an operator determine if the feature is in use by workloads?**
-  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-  checking if there are objects with field X set) may be a last resort. Avoid
-  logs or events for this purpose.
-
-* **What are the SLIs (Service Level Indicators) an operator can use to determine 
-the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
+  The `.spec.suspend` field is set to true by Jobs. The status conditions of a
+  Job can also be used to determine whether a Job is using the feature (look
+  for a condition of type "Suspended").
+
+* **What are the SLIs (Service Level Indicators) an operator can use to
+  determine the health of the service?**
+  - [x] Metrics
+    - Metric name: The metrics `job_sync_duration_seconds` and `job_sync_total`
+      get a new label named `action` to allow operators to filter Job sync
+      latency and error rate, respectively, by the action performed. There are
+      four mutually-exclusive values possible for this label:
+        - `reconciling` when the Job's pod creation/deletion expectations are
+          unsatisfied and the controller is waiting for issued Pod
+          creation/deletions to complete.
+        - `tracking` when the Job's pod creation/deletion expectations are
+          satisfied and the number of active Pods matches expectations (i.e. no
+          pod creation/deletions issued in this sync). This is expected to be
+          the action in most of the syncs.
+        - `pods_created` when the controller creates Pods. This can happen
+          when the number of active Pods is less than the wanted Job
+          parallelism.
+        - `pods_deleted` when the controller deletes Pods. This can happen if a
+          Job is suspended or if the number of active Pods is more than
+          parallelism.
+      Each sample of the two metrics will have exactly one of the above values
+      for the `action` label.
     - Components exposing the metric:
-  - [ ] Other (treat as last resort)
-    - Details:
-
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At a high level, this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99,9% of /health requests per day finish with 200 code
-
-* **Are there any missing metrics that would be useful to have to improve observability 
-of this feature?**
-  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-  implementation difficulties, etc.).
+      - kube-controller-manager
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above
+  SLIs?**
+  - per-day percentage of `job_sync_total` with labels `result=error` and
+    `action=pods_deleted` <= 1%
+  - 99% percentile over day for `job_sync_duration_seconds` with label
+    `action=pods_deleted` is <= 15s, assuming a client-side QPS limit of 50
+    calls per second
+
+* **Are there any missing metrics that would be useful to have to improve
+  observability of this feature?** No.
 
 ### Dependencies
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-
-  For each of these, fill in the following—thinking about running existing user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high-error rates on the feature:
-
--->
+  Feature is restricted to kube-apiserver and kube-controller-manager.
 
 ### Scalability
 
@@ -480,8 +496,6 @@ _This section must be completed when targeting beta graduation to a release._
 * **Will enabling / using this feature result in non-negligible increase of 
   resource usage (CPU, RAM, disk, IO, ...) in any components?** No.
 
-<!-- Uncomment when targeting beta graduation
-
 ### Troubleshooting
 
 The Troubleshooting section currently serves the `Playbook` role. We may consider
@@ -491,29 +505,33 @@ details). For now, we leave it here.
 _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
-
-* **What are other known failure modes?**
-  For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
-
-* **What steps should be taken if SLOs are not being met to determine the problem?**
-
--->
+  Updates to suspend or resume a Job will not work. The controller will not be
+  able to create or delete Pods. Events, logs, and status conditions for Jobs
+  will not be updated to reflect their suspended status.
+
+* **What are other known failure modes?** None. The API server, etcd, and the
+  controller manager are the only possible points of failure.
+
+* **What steps should be taken if SLOs are not being met to determine the
+  problem?**
+  - Verify that kube-apiserver and etcd are healthy. If not, the Job controller
+    cannot operate, so you must fix those problems first.
+  - Verify that `job_sync_total` is unexpectedly high for `result=error` and
+    `action=pods_deleted` in comparison to other actions.
+  - Verify that `job_sync_duration_seconds` is noticeably larger for
+    `action=pods_deleted` in comparison to the other actions.
+  - If control plane components are starved for CPU, which could be a potential
+    reason behind Job sync latency spikes, consider increasing the control
+    plane's resources.
 
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 
 ## Implementation History
 
 2021-02-01: Initial KEP merged, alpha targeted for 1.21
+2021-03-08: Implementation merged in 1.21 with feature gate disabled by default
+2021-04-22: KEP updated for beta graduation in 1.22
 
 ## Drawbacks
 
diff --git a/keps/sig-apps/2232-suspend-jobs/kep.yaml b/keps/sig-apps/2232-suspend-jobs/kep.yaml
@@ -17,12 +17,12 @@ prr-approvers:
   - "@wojtek-t"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.21"
+latest-milestone: "v1.22"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
@@ -40,4 +40,6 @@ feature-gates:
 disable-supported: true
 
 # The following PRR answers are required at beta release
-metrics: []
+metrics:
+  - job_sync_duration_seconds
+  - job_sync_total