---
layout: blog
title: "Kubernetes 1.26: Job Tracking, to Support Massively Parallel Batch Workloads, Is Generally Available"
date: 2022-12-29
slug: "scalable-job-tracking-ga"
---

**Authors:** Aldo Culquicondor (Google)

The Kubernetes 1.26 release includes a stable implementation of the [Job](/docs/concepts/workloads/controllers/job/)
controller that can reliably track a large number of Jobs with high levels of
parallelism. [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps)
and [WG Batch](https://github.com/kubernetes/community/tree/master/wg-batch)
have worked on this foundational improvement since Kubernetes 1.22. After
multiple iterations and scale verifications, this is now the default
implementation of the Job controller.

Paired with the Indexed [completion mode](/docs/concepts/workloads/controllers/job/#completion-mode),
the Job controller can handle massively parallel batch Jobs, supporting up to
100k concurrent Pods.

The new implementation also made possible the development of [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy),
which is in beta in the 1.26 release.

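For example, a minimal sketch of a Job that uses a Pod failure policy could look like the following; the Job name, container name, and exit code 42 are invented for illustration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # hypothetical name
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never    # required when using a Pod failure policy
      containers:
      - name: main            # invented container name
        image: docker.io/library/busybox:1.36
        command: ["sh", "-c", "exit 0"]
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Fail the whole Job as soon as any Pod exits with code 42.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    # Don't count Pods disrupted by the system (for example, preemption
    # or a node drain) against the backoff limit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
```
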
## How do I use this feature?

To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and
create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the
ability to enable the `JobTrackingWithFinalizers` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/).

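If you test with [kind](https://kind.sigs.k8s.io/), for instance, a minimal sketch of a cluster configuration that turns the gate on could look like this:

```yaml
# Hypothetical kind cluster configuration; the feature gate name is the
# one mentioned above, everything else is standard kind boilerplate.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  JobTrackingWithFinalizers: true
```
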
If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable
feature. For v1.25, it's behind that feature gate, and your cluster administrators may have
explicitly disabled it - for example, if you have a policy of not using
beta features.

Jobs created before the upgrade will still be tracked using the legacy behavior.
This is to avoid retroactively adding finalizers to running Pods, which might
introduce race conditions.

For maximum performance on large Jobs, the Kubernetes project recommends
using the [Indexed completion mode](/docs/concepts/workloads/controllers/job/#completion-mode).
In this mode, the control plane is able to track Job progress with fewer API
calls.

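As a minimal sketch, an Indexed Job for a large batch workload could look like this; the name, image, and sizing below are invented for illustration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sharded-work         # hypothetical name
spec:
  completionMode: Indexed    # each Pod receives a unique completion index
  completions: 10000         # total number of indexes to complete
  parallelism: 500           # how many Pods run at once
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: docker.io/library/busybox:1.36
        # In Indexed mode, each Pod gets its index through the
        # JOB_COMPLETION_INDEX environment variable.
        command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```
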
If you are a developer of operator(s) for batch, [HPC](https://en.wikipedia.org/wiki/High-performance_computing),
[AI](https://en.wikipedia.org/wiki/Artificial_intelligence), [ML](https://en.wikipedia.org/wiki/Machine_learning)
or related workloads, we encourage you to use the Job API to delegate accurate
progress tracking to Kubernetes. If there is something missing in the Job API
that forces you to manage plain Pods, the [Working Group Batch](https://github.com/kubernetes/community/tree/master/wg-batch)
welcomes your feedback and contributions.

### Deprecation notices

During the development of the feature, the control plane added the annotation
[`batch.kubernetes.io/job-tracking`](/docs/reference/labels-annotations-taints/#batch-kubernetes-io-job-tracking)
to the Jobs that were created when the feature was enabled.
This allowed a safe transition for older Jobs, but it was never meant to stay.

In the 1.26 release, we deprecated the annotation `batch.kubernetes.io/job-tracking`
and the control plane will stop adding it in Kubernetes 1.27.
Along with that change, we will remove the legacy Job tracking implementation.
As a result, the Job controller will track all Jobs using finalizers and it will
ignore Pods that don't have the aforementioned finalizer.

Before you upgrade your cluster to 1.27, we recommend that you verify that there
are no running Jobs that don't have the annotation, or that you wait for those Jobs
to complete.
Otherwise, you might observe the control plane recreating some Pods.
We expect that this shouldn't affect any users, as the feature has been enabled by
default since Kubernetes 1.25, giving enough buffer for old Jobs to complete.

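For reference, a Job tracked with finalizers carries the annotation in its metadata; to the best of my understanding the value is an empty string:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job          # hypothetical name
  annotations:
    # Present on Jobs created while the feature was enabled;
    # absent on Jobs still tracked with the legacy behavior.
    batch.kubernetes.io/job-tracking: ""
```
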
## What problem does the new implementation solve?

Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet,
rely on the existence of Pods or other objects in the API to determine the
status of the workload and whether replacements are needed.
For example, if a Pod that belonged to a ReplicaSet terminates or ceases to
exist, the ReplicaSet controller needs to create a replacement Pod to satisfy
the desired number of replicas (`.spec.replicas`).

Since its inception, the Job controller also relied on the existence of Pods in
the API to track Job status. A Job has [completion](/docs/concepts/workloads/controllers/job/#completion-mode)
and [failure handling](/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures)
policies, requiring the end state of a finished Pod to determine whether to
create a replacement Pod or mark the Job as completed or failed. As a result,
the Job controller depended on Pods, even terminated ones, to remain in the API
in order to keep track of the status.

This dependency made the tracking of Job status unreliable, because Pods can be
deleted from the API for a number of reasons, including:
- The garbage collector removing orphan Pods when a Node goes down.
- The garbage collector removing terminated Pods when they reach a threshold.
- The Kubernetes scheduler preempting a Pod to accommodate higher priority Pods.
- The taint manager evicting a Pod that doesn't tolerate a `NoExecute` taint.
- External controllers, not included as part of Kubernetes, or humans deleting
  Pods.

### The new implementation

When a controller needs to take an action on objects before they are removed, it
should add a [finalizer](/docs/concepts/overview/working-with-objects/finalizers/)
to the objects that it manages.
A finalizer prevents an object from being deleted from the API until all
finalizers are removed. Once the controller is done with the cleanup and
accounting for the deleted object, it can remove the finalizer from the object and the
control plane removes the object from the API.

This is what the new Job controller is doing: adding a finalizer during Pod
creation, and removing the finalizer after the Pod has terminated and has been
accounted for in the Job status. However, it wasn't that simple.

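Concretely, a Pod created by the Job controller carries the tracking finalizer until its terminal state is accounted for. A trimmed sketch of such a Pod's metadata (the Pod name is invented):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sharded-work-0-k7vx2   # hypothetical Pod name
  finalizers:
  # Blocks deletion from the API until the Job controller has
  # recorded this Pod's terminal state in the Job status.
  - batch.kubernetes.io/job-tracking
```
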
The main challenge is that there are at least two objects involved: the Pod
and the Job. While the finalizer lives in the Pod object, the accounting lives
in the Job object. There is no mechanism to atomically remove the finalizer in
the Pod and update the counters in the Job status. Additionally, there could be
more than one terminated Pod at a given time.

To solve this problem, we implemented a three-staged approach, each stage translating
to an API call.
1. For each terminated Pod, add the unique ID (UID) of the Pod into short-lived
   lists stored in the `.status` of the owning Job
   ([.status.uncountedTerminatedPods](/docs/reference/kubernetes-api/workload-resources/job-v1/#JobStatus)),
   as sketched below.
2. Remove the finalizer from the Pod(s).
3. Atomically do the following operations:
   - remove UIDs from the short-lived lists
   - increment the overall `succeeded` and `failed` counters in the `status` of
     the Job.

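A trimmed sketch of what a Job's `.status` might look like mid-flight, with an invented UID and counters:

```yaml
status:
  # Stage 1: UIDs of terminated Pods that still hold their finalizer
  # and have not yet been added to the counters below.
  uncountedTerminatedPods:
    succeeded:
    - "b1f6d1e2-8c3a-4b5f-9d2e-7a1c3e5f7a9b"   # invented Pod UID
    failed: []
  # Stage 3: overall counters, incremented atomically with the removal
  # of the UIDs from the lists above.
  succeeded: 9500
  failed: 3
```
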
Additional complications come from the fact that the Job controller might
receive the results of the API changes in steps 1 and 2 out of order. We solved
this by adding an in-memory cache for removed finalizers.

Still, we faced some issues during the beta stage, leaving some Pods stuck
with finalizers in some conditions ([#108645](https://github.com/kubernetes/kubernetes/issues/108645),
[#109485](https://github.com/kubernetes/kubernetes/issues/109485), and
[#111646](https://github.com/kubernetes/kubernetes/pull/111646)). As a result,
we decided to switch that feature gate to be disabled by default for the 1.23
and 1.24 releases.

Once resolved, we re-enabled the feature for the 1.25 release. Since then, we
have received reports from our customers running tens of thousands of Pods at a
time in their clusters through the Job API. Seeing this success, we decided to
graduate the feature to stable in 1.26, as part of our long-term commitment to
make the Job API the best way to run large batch Jobs in a Kubernetes cluster.

To learn more about the feature, you can read the [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2307-job-tracking-without-lingering-pods).

## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google)
for helping me debug and brainstorm solutions for more than one race condition
and Maciej Szulik (Red Hat) for his conscientious reviews.
