@@ -77,7 +78,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
@@ -723,14 +724,12 @@ in back-to-back releases.
 #### Beta

 - Address reviews and bug reports from Alpha users
-- Propose and implement metrics
+- Implement the `job_finished_indexes_total` metric
 - E2e tests are in Testgrid and linked in KEP
+- Move the [new reason declarations](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89) from Job controller to the API package
 - Evaluate performance of Job controller for jobs using backoff limit per index
 with benchmarks at the integration or e2e level (discussion pointers from Alpha
 review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
-- Reevaluate ideas of not using `.status.uncountedTerminatedPods` for keeping track
-in the `.status.Failed` field. The idea is to prevent `backoffLimit` for setting.
@@ -758,6 +757,9 @@ A downgrade to a version which does not support this feature should not require
 any additional configuration changes. Jobs which specified
 `.spec.backoffLimitPerIndex` (to make use of this feature) will be
 handled in a default way, ie. using the `.spec.backoffLimit`.
+However, since the `.spec.backoffLimit` defaults to the max int32 value
+(see [here](#job-api)), it might require manually setting `.spec.backoffLimit`
+to ensure failed pods are not retried indefinitely.

 <!--
 If applicable, how will the component be upgraded and downgraded? Make sure
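As an editorial illustration only (not part of the diff above), here is a minimal sketch of a Job manifest that opts into the feature and also sets `.spec.backoffLimit` explicitly, so that a later downgrade to a version without the feature does not leave failed pods retrying almost indefinitely; the Job name and image are hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-backoff-example    # hypothetical name
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed            # backoff limit per index applies to Indexed Jobs
  backoffLimitPerIndex: 1            # allow at most one retry per index
  maxFailedIndexes: 5                # fail the whole Job once more than 5 indexes fail
  backoffLimit: 6                    # explicit global limit; if left unset it defaults
                                     # to max int32 when backoffLimitPerIndex is set
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/worker:latest   # hypothetical image
```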
@@ -878,7 +880,8 @@ The Job controller starts to handle pod failures according to the specified

 ###### Are there any tests for feature enablement/disablement?

-No. The tests will be added in Alpha.
+Yes, there is an [integration test](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763)
+which tests the following path: enablement -> disablement -> re-enablement.

 <!--
 The e2e framework does not currently support enabling or disabling feature
@@ -901,7 +904,16 @@ This section must be completed when targeting beta to a release.

 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-The change is opt-in, it doesn't impact already running workloads.
+This change does not affect how a rollout or rollback can fail.
+
+The change is opt-in, so a rollout doesn't impact already running pods.
+
+A rollback might affect how pod failures are handled, since they will
+be counted only against `.spec.backoffLimit`, which is defaulted to the max int32
+value when `.spec.backoffLimitPerIndex` is used (see [here](#job-api)).
+Thus, similarly to the downgrade case (see [here](#downgrade)),
+it might be required to manually set `.spec.backoffLimit` to ensure failed pods
+are not retried indefinitely.

 <!--
 Try to be as paranoid as possible - e.g., what if some components will restart
@@ -934,7 +946,97 @@ that might indicate a serious problem?

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-It will be tested manually prior to beta launch.
+The upgrade->downgrade->upgrade testing was done manually using the `alpha`
+version in 1.28 with the following steps:
+
+1. Start the cluster with the `JobBackoffLimitPerIndex` feature gate enabled:
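The remaining manual steps are not included in this excerpt. Purely as an illustration of what step 1 could look like (not necessarily the procedure the KEP authors used), a local test cluster with the feature gate enabled can be created with, for example, a kind configuration such as:

```yaml
# kind-config.yaml (illustrative): start a cluster with the alpha gate enabled
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  JobBackoffLimitPerIndex: true
nodes:
- role: control-plane
- role: worker
```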
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No. This feature does not introduce any resource-exhaustive operations.
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
 ### Troubleshooting

 <!--
@@ -1182,8 +1302,12 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+No change from existing behavior of the Job controller.
+
 ###### What are other known failure modes?

+None.
+
 <!--
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -1199,6 +1323,8 @@ For each of them, fill in the following information by copying the below templat

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+N/A.
+
 ## Implementation History

 <!--
@@ -1219,6 +1345,8 @@ Major milestones might include:
 - 2023-07-13: The implementation PR [Support BackoffLimitPerIndex in Jobs #118009](https://github.com/kubernetes/kubernetes/pull/118009) under review
 - 2023-07-18: Merge the API PR [Extend the Job API for BackoffLimitPerIndex](https://github.com/kubernetes/kubernetes/pull/119294)
 - 2023-07-18: Merge the Job Controller PR [Support BackoffLimitPerIndex in Jobs](https://github.com/kubernetes/kubernetes/pull/118009)
+- 2023-08-04: Merge user-facing docs PR [Docs update for Job's backoff limit per index (alpha in 1.28)](https://github.com/kubernetes/website/pull/41921)
+- 2023-08-06: Merge KEP update reflecting decisions during the implementation phase [Update for KEP3850 "Backoff Limit Per Index"](https://github.com/kubernetes/enhancements/pull/4123)

 ## Drawbacks

@@ -1457,6 +1585,26 @@ when a user sets `maxFailedIndexes` as 10^6 the Job may complete if the indexes
 and consecutive, but the Job may also fail if the size of the object exceeds the
 limits due to non-consecutive indexes failing.

+### Skip uncountedTerminatedPods when backoffLimitPerIndex is used
+
+It's been proposed (see [link](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848))
+that when backoffLimitPerIndex is used, we could skip the interim step of
+recording terminated pods in `.status.uncountedTerminatedPods`.
+
+**Reasons for deferring / rejecting**
+
+First, if we stop using `.status.uncountedTerminatedPods`, it means that
+`.status.failed` can no longer track the number of failed pods. Thus, it would
+require a change of semantics to denote just the number of failed indexes. This
+has downsides:
+- two different semantics of the field, depending on the used feature
+- lost information about some failed pods within an index (some users may care
+to investigate succeeded indexes with at least one failed pod)
+
+Second, it would only optimize the unhappy path, where there are failures. Also,
+the saving is only 1 request per 500 failed pods, which does not seem essential.
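To make the trade-off above concrete, here is an illustrative fragment of a Job status (hypothetical values) showing the fields involved: `.status.failed` counts failed pods across all indexes, `.status.failedIndexes` lists the indexes that exhausted their per-index backoff (consecutive indexes compress into ranges, which is why consecutive vs. scattered failures matter for object size), and `.status.uncountedTerminatedPods` stages pod UIDs before they are reflected in the counters:

```yaml
status:
  failed: 8                    # failed pods across all indexes
  failedIndexes: "0-2,5"       # indexes that exhausted backoffLimitPerIndex;
                               # consecutive indexes are compressed into ranges
  uncountedTerminatedPods:
    failed:
    - 6c6e63c5-1234-4abc-9def-000000000001   # hypothetical pod UID, staged before
                                             # being added to .status.failed
```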