keps/sig-apps/2307-job-tracking-without-lingering-pods/README.md
Lines changed: 20 additions & 19 deletions
@@ -30,7 +30,7 @@
30
30
-[Feature Enablement and Rollback](#feature-enablement-and-rollback)
31
31
-[How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
32
32
-[Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
33
-
-[Can the feature be disabled once it has been enabled (i.e. can we roll back](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back)
33
+
-[Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
34
34
-[What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
35
35
-[Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
36
36
-[Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
@@ -331,16 +331,16 @@ the owner reference.
331
331
- Removal of finalizer when feature gate is disabled.
332
332
- Support for [Indexed Jobs](https://git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job)
333
333
- Tests: unit, integration.
334
+
334
335
#### Alpha -> Beta Graduation
335
336
336
-
- Processing 5000 Pods per minute across any number of Jobs, with Pod creation
337
-
having higher priority than status updates, using [Priority and Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control).
338
-
We rename the existing workload-high priority to workload-medium and, when
339
-
the feature gate JobTrackingWithFinalizers is enabled, we add a workload-high
340
-
priority that matches the job-controller ServiceAccount for Pod creation.
337
+
- Pod processing throughput per minute (mix of creating and counting finished Pods),
338
+
assuming an average Job .spec.parallelism=10.
339
+
- Up to 2500 Pods (~3000 queries) for a 50 QPS client limit for the job controller.
340
+
- Up to 5000 (~6000 queries) Pods for a 100 QPS client limit for the job controller.
341
341
- Ensure that tracking Jobs with big number of Pods doesn't cause starvation of
342
342
smaller jobs.
343
-
- Metrics for latency and errors
343
+
- Metrics for latency, counting updates and errors.
344
344
- Job E2E tests are in Testgrid with the feature enabled and linked in KEP
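
The approximate query counts in the throughput targets match what a sustained client-side QPS limit allows in one minute. The sketch below only reproduces that arithmetic. The function name `queriesPerMinute` is hypothetical, and the KEP text does not say how the budget is split between Pod creation and status updates, so the sketch makes no claim about that split.

```go
package main

import "fmt"

// queriesPerMinute returns the maximum number of API calls a client can
// issue in one minute under a sustained client-side QPS limit.
// Hypothetical helper for illustration; it is not part of the job controller.
func queriesPerMinute(qps int) int {
	return qps * 60
}

func main() {
	// The two client limits named in the graduation criteria.
	for _, qps := range []int{50, 100} {
		fmt.Printf("QPS limit %d: at most %d queries per minute\n", qps, queriesPerMinute(qps))
	}
}
```

Under this reading, the 2500 and 5000 Pod targets leave roughly a sixth of the per-minute query budget for calls other than one query per Pod.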
345
345
346
346
#### Beta -> GA Graduation
@@ -416,8 +416,7 @@ No implications to node runtime.
416
416
- Pods removed by the user or other controllers count towards failures or
417
417
completions.
418
418
419
-
#### Can the feature be disabled once it has been enabled (i.e. can we roll back
420
-
the enablement)?
419
+
#### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
421
420
422
421
Yes.
423
422
The job controller removes finalizers in this case.
@@ -436,8 +435,6 @@ No implications to node runtime.
436
435
### Rollout, Upgrade and Rollback Planning
437
436
438
437
#### How can a rollout fail? Can it impact already running workloads?
439
-
Try to be as paranoid as possible - e.g., what if some components will restart
440
-
mid-rollout?
441
438
442
439
The change doesn't affect running Pods. If the component restarts
443
440
mid-rollout into an older version, the Job controller switches to tracking
@@ -453,8 +450,10 @@ No implications to node runtime.
453
450
#### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
454
451
455
452
Integration tests cover feature gate disablement.
456
-
A manual upgrade->downgrade->upgrade flow will be executed to ensure that a
457
-
running Job falls back to tracking without finalizers.
453
+
454
+
A manual upgrade->downgrade->upgrade flow will be executed prior to graduation
455
+
to ensure that a running Job falls back to tracking without finalizers. The
456
+
KEP will be updated with the findings of the test.
458
457
459
458
#### Is the rollout accompanied by any deprecations and/or removals of features, APIs,
460
459
fields of API types, flags, etc.?
@@ -465,10 +464,12 @@ fields of API types, flags, etc.?
465
464
466
465
#### How can an operator determine if the feature is in use by workloads?
467
466
468
-
There is no metric provided.
469
-
Administrators can check for the existence of Job objects with the annotation
470
-
`batch.kubernetes.io/job-completion` or Pods with the finalizer
471
-
`batch.kubernetes.io/job-completion`.
467
+
- The metric `job_pod_finished` (with a label result=failed/completed)
468
+
increments when the job controller removes a Pod out of
469
+
`.status.uncountedTerminatedPods` to increase the failed/completed counters.
470
+
- Administrators can check for the existence of Job objects with the annotation
471
+
`batch.kubernetes.io/job-completion` or Pods with the finalizer
472
+
`batch.kubernetes.io/job-completion`.
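
The finalizer check described above can be sketched as a simple scan over Pod metadata. This is a minimal illustration, not the controller's code: the `pod` struct is a hypothetical stand-in for the real Pod type, the helper names are invented, and the finalizer string is taken verbatim from the text above.

```go
package main

import "fmt"

// Finalizer name as written in the KEP text.
const jobCompletionFinalizer = "batch.kubernetes.io/job-completion"

// pod is a hypothetical, minimal stand-in for a Pod's metadata.
type pod struct {
	Name       string
	Finalizers []string
}

// hasFinalizer reports whether the Pod carries the given finalizer.
func hasFinalizer(p pod, finalizer string) bool {
	for _, f := range p.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

// trackedPodNames returns the names of Pods that carry the job completion
// finalizer, which suggests the feature is in use for their Job.
func trackedPodNames(pods []pod) []string {
	var names []string
	for _, p := range pods {
		if hasFinalizer(p, jobCompletionFinalizer) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []pod{
		{Name: "job-a-0", Finalizers: []string{jobCompletionFinalizer}},
		{Name: "legacy-job-0"},
	}
	fmt.Println(trackedPodNames(pods))
}
```

In a real cluster the Pod list would come from the API server; the in-memory slice here only keeps the sketch self-contained.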
472
473
473
474
###### How can someone using this feature know that it is working for their instance?
474
475
@@ -482,8 +483,8 @@ fields of API types, flags, etc.?
482
483
483
484
#### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
484
485
485
-
- 99% percentile over day for Job syncs is <= 15s, assuming a client-side QPS
486
-
limit of 50 calls per second.
486
+
- 99th percentile over a day for Job syncs is <= 15s for a client-side 50 QPS
487
+
limit.
487
488
488
489
#### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?