---
layout: blog
title: "Kubernetes 1.26: Job Tracking, to Support Massively Parallel Batch Workloads, Is Generally Available"
date: 2022-12-29
slug: "scalable-job-tracking-ga"
---

**Authors:** Aldo Culquicondor (Google)

The Kubernetes 1.26 release includes a stable implementation of the [Job](/docs/concepts/workloads/controllers/job/)
controller that can reliably track a large number of Jobs with high levels of
parallelism. [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps)
and [WG Batch](https://github.com/kubernetes/community/tree/master/wg-batch)
have worked on this foundational improvement since Kubernetes 1.22. After
multiple iterations and scale verifications, this is now the default
implementation of the Job controller.

Paired with the Indexed [completion mode](/docs/concepts/workloads/controllers/job/#completion-mode),
the Job controller can handle massively parallel batch Jobs, supporting up to
100k concurrent Pods.

The new implementation also made possible the development of [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy),
which is in beta in the 1.26 release.

## How do I use this feature?

To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and
create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the
ability to enable the `JobTrackingWithFinalizers` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/).

If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable
feature. For v1.25, it's behind that feature gate, and your cluster administrators may have
explicitly disabled it - for example, if you have a policy of not using
beta features.

Jobs created before the upgrade will still be tracked using the legacy behavior.
This is to avoid retroactively adding finalizers to running Pods, which might
introduce race conditions.

For maximum performance on large Jobs, the Kubernetes project recommends
using the [Indexed completion mode](/docs/concepts/workloads/controllers/job/#completion-mode).
In this mode, the control plane is able to track Job progress with fewer API
calls.
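
For example, here is a minimal sketch of creating a large Indexed Job with
client-go. The client setup, namespace, image, and sizing values are
illustrative assumptions; setting the completion mode to `Indexed` is the only
part specific to this recommendation.

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumed to be in the default location).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	completionMode := batchv1.IndexedCompletion
	parallelism := int32(1000)
	completions := int32(10000)

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "parallel-work", Namespace: "default"},
		Spec: batchv1.JobSpec{
			// Indexed mode lets the control plane track progress with fewer API calls.
			CompletionMode: &completionMode,
			Parallelism:    &parallelism,
			Completions:    &completions,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "registry.example.com/worker:latest", // hypothetical image
					}},
				},
			},
		},
	}

	// Each created Pod gets a completion index; the controller tracks progress per index.
	if _, err := clientset.BatchV1().Jobs(job.Namespace).Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```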

If you are a developer of operator(s) for batch, [HPC](https://en.wikipedia.org/wiki/High-performance_computing),
[AI](https://en.wikipedia.org/wiki/Artificial_intelligence), [ML](https://en.wikipedia.org/wiki/Machine_learning)
or related workloads, we encourage you to use the Job API to delegate accurate
progress tracking to Kubernetes. If there is something missing in the Job API
that forces you to manage plain Pods, the [Working Group Batch](https://github.com/kubernetes/community/tree/master/wg-batch)
welcomes your feedback and contributions.

### Deprecation notices

During the development of the feature, the control plane added the annotation
[`batch.kubernetes.io/job-tracking`](/docs/reference/labels-annotations-taints/#batch-kubernetes-io-job-tracking)
to the Jobs that were created when the feature was enabled.
This allowed a safe transition for older Jobs, but it was never meant to stay.

In the 1.26 release, we deprecated the annotation `batch.kubernetes.io/job-tracking`
and the control plane will stop adding it in Kubernetes 1.27.
Along with that change, we will remove the legacy Job tracking implementation.
As a result, the Job controller will track all Jobs using finalizers and it will
ignore Pods that don't have the aforementioned finalizer.

Before you upgrade your cluster to 1.27, we recommend that you verify that there
are no running Jobs that don't have the annotation, or that you wait for those Jobs
to complete.
Otherwise, you might observe the control plane recreating some Pods.
We expect that this shouldn't affect any users, as the feature has been enabled by
default since Kubernetes 1.25, giving old Jobs enough time to complete.
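
As a rough sketch of that check, the snippet below lists Jobs across all
namespaces and flags any that lack the annotation. It assumes a client-go
clientset built as in the earlier example; the function name is made up for
illustration.

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printLegacyTrackedJobs prints Jobs that lack the
// batch.kubernetes.io/job-tracking annotation, i.e. Jobs still tracked
// with the legacy implementation.
func printLegacyTrackedJobs(ctx context.Context, clientset kubernetes.Interface) error {
	jobs, err := clientset.BatchV1().Jobs(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, job := range jobs.Items {
		if _, ok := job.Annotations["batch.kubernetes.io/job-tracking"]; !ok {
			fmt.Printf("Job %s/%s is tracked with the legacy implementation\n", job.Namespace, job.Name)
		}
	}
	return nil
}
```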

## What problem does the new implementation solve?

Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet,
rely on the existence of Pods or other objects in the API to determine the
status of the workload and whether replacements are needed.
For example, if a Pod that belonged to a ReplicaSet terminates or ceases to
exist, the ReplicaSet controller needs to create a replacement Pod to satisfy
the desired number of replicas (`.spec.replicas`).

Since its inception, the Job controller also relied on the existence of Pods in
the API to track Job status. A Job has [completion](/docs/concepts/workloads/controllers/job/#completion-mode)
and [failure handling](/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures)
policies, requiring the end state of a finished Pod to determine whether to
create a replacement Pod or mark the Job as completed or failed. As a result,
the Job controller depended on Pods, even terminated ones, to remain in the API
in order to keep track of the status.

This dependency made the tracking of Job status unreliable, because Pods can be
deleted from the API for a number of reasons, including:
- The garbage collector removing orphan Pods when a Node goes down.
- The garbage collector removing terminated Pods when they reach a threshold.
- The Kubernetes scheduler preempting a Pod to accommodate higher priority Pods.
- The taint manager evicting a Pod that doesn't tolerate a `NoExecute` taint.
- External controllers, not included as part of Kubernetes, or humans deleting
  Pods.

### The new implementation

When a controller needs to take an action on objects before they are removed, it
should add a [finalizer](/docs/concepts/overview/working-with-objects/finalizers/)
to the objects that it manages.
A finalizer prevents the objects from being deleted from the API until the
finalizer is removed. Once the controller is done with the cleanup and
accounting for the deleted object, it can remove the finalizer from the object and the
control plane removes the object from the API.
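
Concretely, a finalizer is just a string entry in the object's
`metadata.finalizers` list. As a simplified sketch of the idea (not the actual
controller code, which also sets owner references and other metadata), a Pod
that must stay visible in the API until it is accounted for can be stamped with
the Job tracking finalizer, `batch.kubernetes.io/job-tracking`:

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newTrackedPod builds a Pod for a Job with the tracking finalizer set.
// The finalizer keeps the Pod in the API, even after deletion is requested,
// until a controller removes the finalizer.
func newTrackedPod(job *batchv1.Job) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: job.Name + "-",
			Namespace:    job.Namespace,
			// The entry in metadata.finalizers is what blocks removal from the API.
			Finalizers: []string{batchv1.JobTrackingFinalizer},
		},
		Spec: *job.Spec.Template.Spec.DeepCopy(),
	}
}
```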

This is what the new Job controller does: it adds a finalizer during Pod
creation and removes that finalizer only after the Pod has terminated and has been
accounted for in the Job status. However, it wasn't that simple.

The main challenge is that there are at least two objects involved: the Pod
and the Job. While the finalizer lives in the Pod object, the accounting lives
in the Job object. There is no mechanism to atomically remove the finalizer from
the Pod and update the counters in the Job status. Additionally, there could be
more than one terminated Pod at a given time.

To solve this problem, we implemented a three-stage approach, with each stage
translating to an API call (a simplified sketch follows the list):
1. For each terminated Pod, add the unique ID (UID) of the Pod into short-lived
   lists stored in the `.status` of the owning Job
   ([.status.uncountedTerminatedPods](/docs/reference/kubernetes-api/workload-resources/job-v1/#JobStatus)).
2. Remove the finalizer from the Pod(s).
3. Atomically do the following operations:
   - remove UIDs from the short-lived lists
   - increment the overall `succeeded` and `failed` counters in the `status` of
     the Job.
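
The following is a heavily simplified sketch of those three stages, assuming a
client-go clientset; the real controller batches these calls, resolves API
conflicts, and retries, and the function and helper names here are made up for
illustration.

```go
import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// trackTerminatedPods accounts for a batch of terminated Pods of a Job.
func trackTerminatedPods(ctx context.Context, c kubernetes.Interface, job *batchv1.Job, terminated []corev1.Pod) error {
	jobs := c.BatchV1().Jobs(job.Namespace)

	// Stage 1: record the UIDs of terminated Pods in the short-lived lists
	// under .status.uncountedTerminatedPods.
	if job.Status.UncountedTerminatedPods == nil {
		job.Status.UncountedTerminatedPods = &batchv1.UncountedTerminatedPods{}
	}
	uncounted := job.Status.UncountedTerminatedPods
	for _, pod := range terminated {
		if pod.Status.Phase == corev1.PodSucceeded {
			uncounted.Succeeded = append(uncounted.Succeeded, pod.UID)
		} else {
			uncounted.Failed = append(uncounted.Failed, pod.UID)
		}
	}
	job, err := jobs.UpdateStatus(ctx, job, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// Stage 2: remove the tracking finalizer from each Pod, so that the API
	// server can delete the Pod once it is no longer needed.
	for i := range terminated {
		pod := &terminated[i]
		pod.Finalizers = removeString(pod.Finalizers, batchv1.JobTrackingFinalizer)
		if _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}

	// Stage 3: in a single status update, move the UIDs out of the
	// short-lived lists and into the overall counters.
	job.Status.Succeeded += int32(len(job.Status.UncountedTerminatedPods.Succeeded))
	job.Status.Failed += int32(len(job.Status.UncountedTerminatedPods.Failed))
	job.Status.UncountedTerminatedPods = &batchv1.UncountedTerminatedPods{}
	_, err = jobs.UpdateStatus(ctx, job, metav1.UpdateOptions{})
	return err
}

// removeString returns list without any occurrences of s.
func removeString(list []string, s string) []string {
	var out []string
	for _, item := range list {
		if item != s {
			out = append(out, item)
		}
	}
	return out
}
```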

Additional complications come from the fact that the Job controller might
receive the results of the API changes in steps 1 and 2 out of order. We solved
this by adding an in-memory cache for removed finalizers.

Still, we faced some issues during the beta stage, leaving some Pods stuck
with finalizers in some conditions ([#108645](https://github.com/kubernetes/kubernetes/issues/108645),
[#109485](https://github.com/kubernetes/kubernetes/issues/109485), and
[#111646](https://github.com/kubernetes/kubernetes/pull/111646)). As a result,
we decided to switch that feature gate to be disabled by default for the 1.23
and 1.24 releases.

Once those issues were resolved, we re-enabled the feature for the 1.25 release. Since then, we
have received reports from our customers running tens of thousands of Pods at a
time in their clusters through the Job API. Seeing this success, we decided to
graduate the feature to stable in 1.26, as part of our long-term commitment to
make the Job API the best way to run large batch Jobs in a Kubernetes cluster.

To learn more about the feature, you can read the [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2307-job-tracking-without-lingering-pods).

## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google)
for helping me debug and brainstorm solutions for more than one race condition
and Maciej Szulik (Red Hat) for his conscientious reviews.