- <!--
- **Note:** When your KEP is complete, all of these comment blocks should be removed.
-
- To get started with this template:
-
- - [ ] **Pick a hosting SIG.**
-   Make sure that the problem space is something the SIG is interested in taking
-   up. KEPs should not be checked in without a sponsoring SIG.
- - [ ] **Create an issue in kubernetes/enhancements**
-   When filing an enhancement tracking issue, please make sure to complete all
-   fields in that template. One of the fields asks for a link to the KEP. You
-   can leave that blank until this KEP is filed, and then go back to the
-   enhancement and add the link.
- - [ ] **Make a copy of this template directory.**
-   Copy this template into the owning SIG's directory and name it
-   `NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
-   leading-zero padding) assigned to your enhancement above.
- - [ ] **Fill out as much of the kep.yaml file as you can.**
-   At minimum, you should fill in the "Title", "Authors", "Owning-sig",
-   "Status", and date-related fields.
- - [ ] **Fill out this file as best you can.**
-   At minimum, you should fill in the "Summary" and "Motivation" sections.
-   These should be easy if you've preflighted the idea of the KEP with the
-   appropriate SIG(s).
- - [ ] **Create a PR for this KEP.**
-   Assign it to people in the SIG who are sponsoring this process.
- - [ ] **Merge early and iterate.**
-   Avoid getting hung up on specific details and instead aim to get the goals of
-   the KEP clarified and merged quickly. The best way to do this is to just
-   start with the high-level sections and fill out details incrementally in
-   subsequent PRs.
-
- Just because a KEP is merged does not mean it is complete or approved. Any KEP
- marked as `provisional` is a working document and subject to change. You can
- denote sections that are under active debate as follows:
-
- ```
- <<[UNRESOLVED optional short context or usernames ]>>
- Stuff that is being argued.
- <<[/UNRESOLVED]>>
- ```
-
- When editing KEPs, aim for tightly-scoped, single-topic PRs to keep discussions
- focused. If you disagree with what is already in a document, open a new PR
- with suggested changes.
-
- One KEP corresponds to one "feature" or "enhancement" for its whole lifecycle.
- You do not need a new KEP to move from beta to GA, for example. If
- new details emerge that belong in the KEP, edit the KEP. Once a feature has become
- "implemented", major changes should get new KEPs.
-
- The canonical place for the latest set of instructions (and the likely source
- of this file) is [here](/keps/NNNN-kep-template/README.md).
-
- **Note:** Any PRs to move a KEP to `implementable`, or significant changes once
- it is marked `implementable`, must be approved by each of the KEP approvers.
- If none of those approvers are still appropriate, then changes to that list
- should be approved by the remaining approvers and/or the owning SIG (or
- SIG Architecture for cross-cutting KEPs).
- -->
  # KEP-2307: Job tracking without lingering Pods

  <!-- toc -->
@@ -75,6 +15,7 @@ SIG Architecture for cross-cutting KEPs).
  - [Design Details](#design-details)
    - [API changes](#api-changes)
    - [Algorithm](#algorithm)
+     - [Simplified algorithm for Indexed Jobs](#simplified-algorithm-for-indexed-jobs)
    - [Deleted Pods](#deleted-pods)
    - [Deleted Jobs](#deleted-jobs)
    - [Pod adoption](#pod-adoption)
@@ -238,11 +179,17 @@ could be stopped at any point and executed again from the first step without
  losing information. Generally, all the steps happen in a single Job sync
  cycle.

+ 0. The Job controller adds the `batch.kubernetes.io/job-completion` finalizer
+    to the Job.
  1. The Job controller calculates the number of succeeded Pods as the sum of:
     - `.status.succeeded`,
     - the size of `job.status.uncountedTerminatedPods.succeeded` and
     - the number of finished Pods that are not in `job.status.uncountedTerminatedPods.succeeded`
       and have a finalizer.
+
+    The Job controller calculates the number of failed Pods similarly, and the
+    number of active Pods as Pods that don't have a Failed or Succeeded condition
+    and have a finalizer.

     This number informs the creation of missing Pods to reach `.spec.completions`.
     The controller creates Pods for a Job with the finalizer
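
The counting in step 1 can be illustrated with a small Go sketch. It is not the Job controller's code: `Pod`, `JobStatus`, `succeededCount`, and `activeCount` are hypothetical, pared-down stand-ins for the real API types, and the phase strings are assumptions made for the example.

```go
package main

// Pod is a hypothetical, pared-down stand-in for a v1.Pod.
type Pod struct {
	UID          string
	Phase        string // "Succeeded", "Failed", or any other value for unfinished Pods
	HasFinalizer bool   // true while batch.kubernetes.io/job-completion is present
}

// JobStatus is a hypothetical stand-in for the Job status fields used in step 1.
type JobStatus struct {
	Succeeded          int      // .status.succeeded
	UncountedSucceeded []string // UIDs in .status.uncountedTerminatedPods.succeeded
}

// succeededCount adds up the three terms listed in step 1.
func succeededCount(status JobStatus, pods []Pod) int {
	uncounted := make(map[string]bool, len(status.UncountedSucceeded))
	for _, uid := range status.UncountedSucceeded {
		uncounted[uid] = true
	}
	total := status.Succeeded + len(status.UncountedSucceeded)
	for _, p := range pods {
		// A succeeded Pod that still has the finalizer and is not yet listed
		// as uncounted has not been accounted for anywhere else.
		if p.Phase == "Succeeded" && p.HasFinalizer && !uncounted[p.UID] {
			total++
		}
	}
	return total
}

// activeCount counts Pods that are neither Succeeded nor Failed and still
// have the finalizer.
func activeCount(pods []Pod) int {
	n := 0
	for _, p := range pods {
		if p.Phase != "Succeeded" && p.Phase != "Failed" && p.HasFinalizer {
			n++
		}
	}
	return n
}
```

A Pod without the finalizer is skipped in both functions because, under this scheme, removing the finalizer is what marks the Pod as already counted.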
@@ -262,6 +209,9 @@ cycle.
     The counts increment `.status.failed` and `.status.succeeded`, and the
     counted Pods are cleared from the `.status.uncountedTerminatedPods` lists.
     The controller sends a status update.
+ 5. The Job controller removes the `batch.kubernetes.io/job-completion` finalizer
+    from the Job if it has completed (succeeded or failed) and no Job Pods have
+    finalizers.

  Steps 2 to 4 might deal with a potentially big number of Pods. Thus, status
  updates can potentially stress the kube-apiserver. For this reason, the Job
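
The condition in step 5 can be written as a small predicate. The Go sketch below is only an illustration: `hasJobFinalizer` and `canRemoveJobFinalizer` are hypothetical helpers, and passing each Pod's finalizer list as a slice of strings is an assumption made to keep the example free of API types.

```go
package main

// jobFinalizer is the finalizer name used by this KEP for Jobs and their Pods.
const jobFinalizer = "batch.kubernetes.io/job-completion"

// hasJobFinalizer reports whether the tracking finalizer is in the list.
func hasJobFinalizer(finalizers []string) bool {
	for _, f := range finalizers {
		if f == jobFinalizer {
			return true
		}
	}
	return false
}

// canRemoveJobFinalizer mirrors step 5: the Job must have completed
// (succeeded or failed) and none of its Pods may still carry the finalizer.
// podFinalizers holds one finalizer list per Pod (hypothetical input shape).
func canRemoveJobFinalizer(jobFinished bool, podFinalizers [][]string) bool {
	if !jobFinished {
		return false
	}
	for _, finalizers := range podFinalizers {
		if hasJobFinalizer(finalizers) {
			return false
		}
	}
	return true
}
```

Checking the Pods before releasing the Job keeps the Job object around until every Pod has been accounted for.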
@@ -280,20 +230,41 @@ Steps 2 to 4 might be skipped in the scenario where a status update happened
  too recently and the number of uncounted Pods is a small percentage of
  parallelism.

+ Note that the `.status.uncountedTerminatedPods` struct makes it possible to
+ uniquely identify finished Pods to avoid overcounting.
+
+ #### Simplified algorithm for Indexed Jobs
+
+ Pods in Indexed Jobs have a unique identifier: the completion index. Even if
+ more than one Pod gets created for the same index, only one of them counts
+ towards completions. The completed indexes are available in
+ `.status.completedIndexes` in a compressed format.
+
+ When tracking Indexed Jobs, the Job controller can use
+ `.status.completedIndexes` in place of
+ `.status.uncountedTerminatedPods.succeeded` in step 2 and completely skip step 4
+ if there are no failed terminated Pods in the same sync cycle. This saves one
+ API call for a Job status update.
+
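
To show how `.status.completedIndexes` can stand in for a list of counted Pods, here is a minimal Go sketch that expands the compressed string into a set of indexes. It assumes the text format documented for Indexed Jobs (comma-separated decimal indexes, with runs written as hyphenated ranges such as `1,3-5,7`); `parseCompletedIndexes` is a hypothetical helper, not a function from the Job controller.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCompletedIndexes expands a compressed index list such as "1,3-5,7"
// into a set. The input format is an assumption taken from the Indexed Job
// API documentation, not from this KEP.
func parseCompletedIndexes(s string) (map[int]bool, error) {
	done := make(map[int]bool)
	if s == "" {
		return done, nil
	}
	for _, part := range strings.Split(s, ",") {
		first, last, isRange := strings.Cut(part, "-")
		lo, err := strconv.Atoi(first)
		if err != nil {
			return nil, fmt.Errorf("invalid index %q: %w", part, err)
		}
		hi := lo
		if isRange {
			if hi, err = strconv.Atoi(last); err != nil {
				return nil, fmt.Errorf("invalid range %q: %w", part, err)
			}
		}
		// Mark every index in the inclusive range as completed.
		for i := lo; i <= hi; i++ {
			done[i] = true
		}
	}
	return done, nil
}
```

With such a set, a succeeded Pod whose completion index is already present does not need to be counted again, which is what makes duplicate Pods for the same index harmless.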
  ### Deleted Pods

  In the case where a user or another controller removes a Pod, which sets a
  deletion timestamp, the Job controller treats it the same as any other Pod.
- That is, once it reaches Failed status, the controller accounts for the Pod and
- then removes the finalizer.
-
+ Since deleted Pods with finalizers inevitably get marked as Failed, the
+ Job controller already counts them as such and removes their finalizers.
  This is different from the legacy tracking, where the Job controller does not
  account for deleted Pods. This is a limitation that this KEP also wants to
  solve.

- However, if the Job controller deletes the Pod (when parallelism is decreased,
- for example), the controller removes the finalizer before deleting it. Thus,
- these deletions don't count towards the failures.
+ One edge case is when there is a Node failure. If the Node is down long enough,
+ its Pods become orphaned, and the garbage collector deletes them. Some of these
+ deleted Pods might not have finished, but the algorithm described above treats
+ them as failed.
+
+ On the other hand, if the Job controller deletes the Pod (when the user
+ decreases parallelism or suspends the Job, for example), the controller removes
+ the finalizer before deleting it. Thus, these deletions don't count towards the
+ failures.

  ### Deleted Jobs

@@ -332,11 +303,11 @@ the owner reference.
  - Implementation:
    - Job tracking without lingering Pods
    - Removal of finalizer when feature gate is disabled.
+   - Support for [Indexed Jobs](https://git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job)
  - Tests: unit, integration, E2E

  #### Alpha -> Beta Graduation

- - Support for [Indexed Jobs](https://git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job)
  - Processing 5000 Pods per minute across any number of Jobs, with Pod creation
    having higher priority than status updates. This might depend on
    [Priority and Fairness](https://git.k8s.io/enhancements/keps/sig-api-machinery/1040-priority-and-fairness).
@@ -353,7 +324,7 @@ the owner reference.

  ### Upgrade / Downgrade Strategy

- When the feature `JobTrackingWithoutLingeringPods` is enabled for the first
+ When the feature `JobTrackingWithFinalizers` is enabled for the first
  time, the cluster can have Jobs whose Pods don't have the
  `batch.kubernetes.io/job-completion` finalizer. It would be hard to add the
  finalizer to all Pods while preventing race conditions.
@@ -363,9 +334,8 @@ was created after the feature was enabled. If this field is nil, the Job
  controller tracks Pods using the legacy tracking.

  The kube-apiserver sets `.status.uncountedTerminatedPods` to an empty struct
- when the feature gate `JobTrackingWithoutLingeringPods` is enabled, at Job
- creation. In alpha, apiserver leaves `.status.uncountedTerminatedPods = nil`
- for [Indexed Jobs](https://git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job)
+ when the feature gate `JobTrackingWithFinalizers` is enabled, at Job
+ creation.

  When the feature is disabled after being enabled for some time, the next time
  the Job controller syncs a Job:
@@ -384,7 +354,7 @@ _This section must be completed when targeting alpha to a release._

  * **How can this feature be enabled / disabled in a live cluster?**
    - [x] Feature gate (also fill in values in `kep.yaml`)
-     - Feature gate name: JobTrackingWithoutLingeringPods
+     - Feature gate name: JobTrackingWithFinalizers
      - Components depending on the feature gate:
        - kube-apiserver
        - kube-controller-manager
@@ -506,6 +476,9 @@ previous answers based on experience in the field._
    - estimated throughput: one per Pod created by the Job controller, when the
      Pod finishes or is removed.
    - originating component: kube-controller-manager
+ - PATCH Jobs, to add and remove finalizers.
+   - estimated throughput: two calls for each Job created.
+   - originating component: kube-controller-manager
  - PUT Job status, to keep track of uncounted Pods.
    - estimated throughput: at least one per Job sync. The Job controller
      throttles additional calls to one every few seconds (precise throughput TBD
@@ -526,6 +499,8 @@ the existing API objects?**

  - Pod
    - Estimated increase: new finalizer of 33 bytes.
+ - Job
+   - Estimated increase: new finalizer of 33 bytes.
  - Job status
    - Estimated increase: new array temporarily containing terminated Pod UIDs.
      The Job controller caps the size of the array to less than 20 KB.