
Commit 1fab6ea

Remarks to the Job update blogpost

Co-authored-by: Rey Lejano <[email protected]>
Co-authored-by: Maciej Szulik <[email protected]>
Co-authored-by: Tim Bannister <[email protected]>
Co-authored-by: Aldo Culquicondor <[email protected]>
Co-authored-by: Paola Cortés <[email protected]>

1 parent 35565df, commit 1fab6ea

1 file changed: 61 additions, 44 deletions

content/en/blog/_posts/2023-07-27-job-update-post.md renamed to content/en/blog/_posts/2023-08-21-job-update-post.md

@@ -1,42 +1,47 @@
 ---
 layout: blog
-title: "Kubernetes 1.28: New Job features"
-date: 2023-08-15
+title: "Kubernetes 1.28: Improved failure handling for Jobs"
+date: 2023-08-21
 slug: kubernetes-1-28-jobapi-update
 ---
 
 **Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)
 
 This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch
-users: [PodReplacementPolicy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)
-and [BackoffLimitPerIndex](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs).
+users: [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
+and [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index).
 
-## Pod Replacement Policy
+These features continue the effort started by the
+[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
+to improve the handling of Pod failures in a Job.
 
-### What problem does this solve?
+## Pod replacement policy {#pod-replacement-policy}
 
 By default, when a pod enters a terminating state (e.g. due to preemption or
-eviction), a replacement pod is created immediately, and both pods are running
-at the same time.
+eviction), Kubernetes immediately creates a replacement Pod. Therefore, both Pods are running
+at the same time. In API terms, a pod is considered terminating when it has a
+`deletionTimestamp` and is in the `Pending` or `Running` phase.
 
-This is problematic for some popular machine learning frameworks, such as
-TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one pod running at the same time,
+The scenario when two Pods are running at a given time is problematic for
+some popular machine learning frameworks, such as
+TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at the same time
 for a given index (see more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844)).
 
 Creating the replacement Pod before the previous one fully terminates can also
-cause problems in clusters with scarce resources or with tight budgets. These
-resources can be difficult to obtain so pods can take a long time to find
-resources and they may only be able to find nodes until the existing pods are
-fully terminated. Further, if cluster autoscaler is enabled, the replacement
-Pods might produce undesired scale ups.
+cause problems in clusters with scarce resources or with tight budgets, such as:
+* cluster resources can be difficult to obtain for Pods pending to be scheduled,
+  as Kubernetes might take a long time to find available nodes until the existing
+  Pods are fully terminated;
+* if the cluster autoscaler is enabled, the replacement Pods might produce undesired
+  scale-ups.
 
-### How can I use it
+### How can you use it? {#pod-replacement-policy-how-to-use}
 
-This is an alpha feature, which you can enable by enabling the `JobPodReplacementPolicy`
+This is an alpha feature, which you can enable by turning on the `JobPodReplacementPolicy`
 [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
 your cluster.
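
How you enable a feature gate depends on how your cluster is deployed; as a
minimal sketch (assuming direct control over the control plane flags, which a
managed cluster may not give you), it could look like:

```sh
# Sketch only: pass the gate to the components that need it.
kube-apiserver --feature-gates=JobPodReplacementPolicy=true ...
kube-controller-manager --feature-gates=JobPodReplacementPolicy=true ...
```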
 
-Once the feature is enabled you can use it by creating a new Job, which specifies
+Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a
 `podReplacementPolicy` field as shown here:
 
 ```yaml
@@ -49,6 +54,9 @@ spec:
 ...
 ```
 
+In that Job, the Pods would only be replaced once they reached the `Failed` phase,
+and not when they are terminating.
+
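Because the manifest itself is collapsed in this diff, here is a minimal,
self-contained sketch of a Job using the field (the name, image, and command
are illustrative, not taken from the post):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-replacement-demo     # illustrative name
spec:
  podReplacementPolicy: Failed   # create a replacement only once a Pod reaches the Failed phase
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36      # illustrative image
        command: ["sleep", "60"]
```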
 Additionally, you can inspect the `.status.terminating` field of a Job. The value
 of the field is the number of Pods owned by the Job that are currently terminating.
 
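The post follows with a `status` example, largely collapsed in this diff; as an
illustrative sketch (values assumed), the counter sits alongside the usual Job
status fields:

```yaml
status:
  active: 2
  terminating: 1   # a Pod with a deletionTimestamp that has not yet reached a terminal phase
```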
@@ -64,50 +72,49 @@ status:
 ```
 
 This can be particularly useful for external queueing controllers, such as
-[Kueue](https://github.com/kubernetes-sigs/kueue), that would calculate the
-quota and suspend the start of a new Job until the resources are reclaimed from
+[Kueue](https://github.com/kubernetes-sigs/kueue), that track quota
+from running Pods of a Job until the resources are reclaimed from
 the currently terminating Job.
 
-### How can I learn more?
-
-- Read the KEP: [PodReplacementPolicy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated)
-
-## Job Backoff Limit per Index
+Note that `podReplacementPolicy: Failed` is the default when using a custom
+[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
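As a sketch of that interaction (rule and values illustrative): a Job that sets a
custom `podFailurePolicy` behaves as if `podReplacementPolicy: Failed` were also
set, since that is the default in this combination:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-demo   # illustrative name
spec:
  podFailurePolicy:               # with this set, podReplacementPolicy defaults to Failed
    rules:
    - action: Ignore              # don't count Pods disrupted by the cluster
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36       # illustrative image
        command: ["sleep", "60"]
```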
 
-### What problem does this solve?
+## Backoff limit per index {#backoff-limit-per-index}
 
-By default, pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
+By default, Pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
 are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
 This means that if there is a consistently failing index, it is restarted
-repeatedly until it exhausts the limit. Once the limit is exceeded the entire
+repeatedly until it exhausts the limit. Once the limit is reached, the entire
 Job is marked failed and some indexes may never even be started.
 
-This is problematic for use cases where you want to handle pod failures for
+This is problematic for use cases where you want to handle Pod failures for
 every index independently. For example, if you use Indexed Jobs for running
 integration tests where each index corresponds to a testing suite. In that case,
 you may want to account for possible flaky tests allowing for 1 or 2 retries per
-suite. Additionally, there might be some buggy suites, making the corresponding
-indexes fail consistently. In that case you may prefer to terminate retries for
-that indexes, yet allowing other suites to complete.
+suite. There might be some buggy suites, making the corresponding
+indexes fail consistently. In that case you may prefer to limit retries for
+the buggy suites, while allowing other suites to complete.
 
 The feature allows you to:
-* complete execution of all indexes, despite some indexes failing,
+* complete execution of all indexes, despite some indexes failing.
 * better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.
 
-### How to use it?
+### How can you use it? {#backoff-limit-per-index-how-to-use}
 
-This is an alpha feature, which you can enable by enabling the
+This is an alpha feature, which you can enable by turning on the
 `JobBackoffLimitPerIndex`
 [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
 in your cluster.
 
-Once the feature is enabled, you can create an Indexed Job with the
+Once the feature is enabled in your cluster, you can create an Indexed Job with the
 `.spec.backoffLimitPerIndex` field specified.
 
 #### Example
 
 The following example demonstrates how to use this feature to make sure the
-Job executes all indexes, and the number of failures is controller per index.
+Job executes all indexes (provided there is no other reason for early Job
+termination, such as reaching the `activeDeadlineSeconds` timeout, or being
+manually deleted by the user), and the number of failures is controlled per index.
 
 ```yaml
 apiVersion: batch/v1
@@ -136,7 +143,7 @@ spec:
 time.sleep(1)
 ```
 
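The example manifest is mostly collapsed in this diff. A minimal, self-contained
sketch of such an Indexed Job (the failing-index script is illustrative, not the
post's exact one) might look like:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed    # backoffLimitPerIndex only applies to Indexed Jobs
  backoffLimitPerIndex: 1    # retry each failing index at most once
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: python:3.11   # illustrative image
        command:             # illustrative payload: index 1 always fails
        - python3
        - -c
        - |
          import os, sys
          sys.exit(1 if os.environ["JOB_COMPLETION_INDEX"] == "1" else 0)
```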
-Now, inspect the pods after the job is finished:
+Now, inspect the Pods after the job is finished:
 
 ```sh
 kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
@@ -157,13 +164,13 @@ job-backoff-limit-per-index-execute-all-6-tbkr8 0/1 Completed 0
 job-backoff-limit-per-index-execute-all-7-hxjsq 0/1 Completed 0 22s
 ```
 
-Additionally, let's take a look at the job status:
+Additionally, you can take a look at the status for that Job:
 
 ```sh
 kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
 ```
 
-Returns output similar to this:
+The output ends with a `status` similar to:
 
 ```yaml
 status:
@@ -185,19 +192,29 @@ then the buggy indexes would retry until the global `backoffLimit` was exceeded,
 and then the entire Job would be marked failed, before some of the higher
 indexes are started.
 
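For reference, the per-index outcome surfaces in that collapsed `status` block;
an illustrative sketch (values assumed, matching the one-failing-index example):

```yaml
status:
  completedIndexes: 0,2-7
  failedIndexes: "1"   # retries exhausted for index 1 only
  succeeded: 7
  failed: 2            # the original attempt plus one retry for index 1
```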
-### Getting Involved
+## How can you learn more?
+
+- Read the user-facing documentation for [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
+  [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
+  [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
+- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
+  [Backoff limit per index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
+  [Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).
+
+## Getting Involved
 
-These features were sponsored under the domain of SIG Apps. Batch is actively
+These features were sponsored by [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps). Batch use cases are actively
 being improved for Kubernetes users in the
 [batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
 Working groups are relatively short-lived initiatives focused on specific goals.
-In the case of Batch, the goal is to improve/support batch users and enhance the
+The goal of WG Batch is to improve the experience for batch workload users, offer support for
+batch processing use cases, and enhance the
 Job API for common use cases. If that interests you, please join the working
 group either by subscribing to our
 [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
 [Slack](https://kubernetes.slack.com/messages/wg-batch).
 
-### Acknowledgments
+## Acknowledgments
 
 As with any Kubernetes feature, multiple people contributed to getting this
 done, from testing and filing bugs to reviewing code.
