 - [Exponential backoff delay with in-memory tracking](#exponential-backoff-delay-with-in-memory-tracking)
 - [Alternative ways to support high number of completions](#alternative-ways-to-support-high-number-of-completions)
 - [Keep failedIndexes field as a bitmap](#keep-failedindexes-field-as-a-bitmap)
 - [Keep the list of failed indexes in a dedicated API object](#keep-the-list-of-failed-indexes-in-a-dedicated-api-object)
@@ -72,15 +72,15 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Design details are appropriately documented
 - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
-- [] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [x] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
 - [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 The following scenarios will be covered with integration tests:
-- enabling, disabling and re-enabling of the `JobBackoffLimitPerIndex` feature gate
-- handling of the `.spec.backoffLimitPerIndex` when the `FailIndex` action is used,
-- handling of the `.spec.backoffLimitPerIndex` when `.spec.maxFailedIndexes` isn't set,
-- handling of the `.spec.backoffLimitPerIndex` when `.spec.maxFailedIndexes` is set,
-- handling of the `.spec.backoffLimit` when `.spec.backoffLimitPerIndex` is set,
-- handling of the expotential backoff delay per index when `.spec.backoffLimitPerIndex` is set.
+- enabling, disabling and re-enabling of the `JobBackoffLimitPerIndex` feature gate ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1030))
+- handling of the `.spec.backoffLimitPerIndex` when the `FailIndex` action is used ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1888)),
+- handling of the `.spec.backoffLimitPerIndex` when `.spec.maxFailedIndexes` isn't set ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1688)),
+- handling of the `.spec.backoffLimitPerIndex` when `.spec.maxFailedIndexes` is set ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1846)),
+- handling of the `.spec.backoffLimit` when `.spec.backoffLimitPerIndex` is set ([code](https://github.com/kubernetes/kubernetes/blob/master/test/integration/job/job_test.go#L1744)),
+- handling of the exponential backoff delay per index when `.spec.backoffLimitPerIndex` is set ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1120)).
+
+The [k8s-triage] page for the [BackoffLimitPerIndex integration tests](https://storage.googleapis.com/k8s-triage/index.html?job=integration&test=BackoffLimitPerIndex).

 More integration tests might be added to ensure good code coverage based on the
 actual implementation.
@@ -686,9 +688,11 @@ We expect no non-infra related flakes in the last month as a GA graduation crite

 The following scenario is covered with e2e tests for Beta:

-- Job should execute all indexes despite some failing when using backoffLimitPerIndex
-- Job should terminate job execution when the number of failed indexes exceeds maxFailedIndexes
-- Job should mark indexes as failed when the FailIndex action is matched in podFailurePolicy
+- Job should execute all indexes despite some failing when using backoffLimitPerIndex ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/e2e/apps/job.go#L602))
+- Job should terminate job execution when the number of failed indexes exceeds maxFailedIndexes ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/e2e/apps/job.go#L635))
+- Job should mark indexes as failed when the FailIndex action is matched in podFailurePolicy ([code](https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/e2e/apps/job.go#L670))
+
+The [k8s-triage] page for the [BackoffLimitPerIndex e2e tests](https://storage.googleapis.com/k8s-triage/index.html?job=e2e&test=should%20mark%20indexes%20as%20failed%20when%20the%20FailIndex%20action%20is%20matched%20in%20podFailurePolicy%7Cshould%20terminate%20job%20execution%20when%20the%20number%20of%20failed%20indexes%20exceeds%20maxFailedIndexes%7Cshould%20execute%20all%20indexes%20despite%20some%20failing%20when%20using%20backoffLimitPerIndex).

 ### Graduation Criteria

@@ -757,7 +761,7 @@ in back-to-back releases.
 #### Alpha

 - the feature implemented behind the `JobBackoffLimitPerIndex` feature flag
-- change the logic of computing the expotential backoff delay (see [here](#expotential-backoff-delay-issue))
+- change the logic of computing the exponential backoff delay (see [here](#exponential-backoff-delay-issue))
 - user-facing documentation, including the warning for setting completions > 10^5
 - The `JobBackoffLimitPerIndex` feature flag disabled by default
 - Tests: unit and integration
@@ -781,7 +785,6 @@ in back-to-back releases.
 to use `FailIndex`
 - Graduate e2e tests as conformance tests
 - Lock the `JobBackoffLimitPerIndex` feature gate
-- Declare deprecation of the `JobBackoffLimitPerIndex` feature gate in documentation

 ### Upgrade / Downgrade Strategy

@@ -1390,6 +1393,15 @@ Major milestones might include:
 - 2023-07-18: Merge the Job Controller PR [Support BackoffLimitPerIndex in Jobs](https://github.com/kubernetes/kubernetes/pull/118009)
 - 2023-08-04: Merge user-facing docs PR [Docs update for Job's backoff limit per index (alpha in 1.28)](https://github.com/kubernetes/website/pull/41921)
 - 2023-08-06: Merge KEP update reflecting decisions during the implementation phase [Update for KEP3850 "Backoff Limit Per Index"](https://github.com/kubernetes/enhancements/pull/4123)
+- 2023-10-02: [Update KEP-3850 "Backoff Limit Per Index" for Beta](https://github.com/kubernetes/enhancements/pull/4228)
+- 2023-10-20: [Introduce the job_finished_indexes_total metric](https://github.com/kubernetes/kubernetes/pull/121292)
+- 2023-10-23: [Graduate BackoffLimitPerIndex to Beta](https://github.com/kubernetes/kubernetes/pull/121356)
+- 2023-10-24: [Indicate Job Backoff Limit Per Index reason consts are beta](https://github.com/kubernetes/kubernetes/pull/121471)
+- 2023-10-25: [Backoff limit per index e2e test](https://github.com/kubernetes/kubernetes/pull/121368)
+- 2023-11-02: [Add remaining e2e tests for Job BackoffLimitPerIndex based on KEP](https://github.com/kubernetes/kubernetes/pull/121633)
+- 2023-11-02: [Benchmark job with backoff limit per index](https://github.com/kubernetes/kubernetes/pull/121393)
+- 2023-11-02: [Update KEP3850 "BackoffLimitPerIndex for Indexed Jobs"](https://github.com/kubernetes/enhancements/pull/4321)
+- 2025-02-07: [KEP3850: graduate Backoff Limit Per Index for Job to stable](https://github.com/kubernetes/enhancements/pull/5154)

 ## Drawbacks

@@ -1556,9 +1568,9 @@ not need to be as detailed as the proposal, but should include enough
 information to express the idea and why it was not acceptable.
 -->

-### Global expotential backoff delay
+### Global exponential backoff delay

-We could also consider leaving the expotential backoff delay as global and
+We could also consider leaving the exponential backoff delay as global and
 be enabled by a dedicated API field in the future KEP, say `backoffDelayPerIndex`.

 **Reasons for deferring / rejecting**
@@ -1568,9 +1580,9 @@ Thus, failures or successes in one index should not influence backoff delays
 for another index. We are leaving the decision to the community feeback and
 discussions though.

-### Expotential backoff delay with in-memory tracking
+### Exponential backoff delay with in-memory tracking

-Instead of modifying the definition of pod's finish time (see [Expotential backoff delay issue](#expotential-backoff-delay-issue))
+Instead of modifying the definition of pod's finish time (see [Exponential backoff delay issue](#exponential-backoff-delay-issue))
 we could keep track of the "failure time" for failed pods in-memory.