Commit f3466c7 (1 parent: 1e574a8)

Update KEP-3850 "Backoff Limit Per Index" for Beta (#4228)

* Update KEP for Beta for "Backoff Limit Per Index"
* Remarks
* Update keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md
* Simplify the upgrade->downgrade->upgrade section

Co-authored-by: Aldo Culquicondor <[email protected]>

File tree: 3 files changed, +160 -10 lines
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 3850
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

Lines changed: 157 additions & 9 deletions
@@ -58,6 +58,7 @@
 - [Keep failedIndexes field as a bitmap](#keep-failedindexes-field-as-a-bitmap)
 - [Keep the list of failed indexes in a dedicated API object](#keep-the-list-of-failed-indexes-in-a-dedicated-api-object)
 - [Implicit limit on the number of failed indexes](#implicit-limit-on-the-number-of-failed-indexes)
+- [Skip uncountedTerminatedPods when backoffLimitPerIndex is used](#skip-uncountedterminatedpods-when-backofflimitperindex-is-used)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

@@ -77,7 +78,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
@@ -723,14 +724,12 @@ in back-to-back releases.
 #### Beta

 - Address reviews and bug reports from Alpha users
-- Propose and implement metrics
+- Implement the `job_finished_indexes_total` metric
 - E2e tests are in Testgrid and linked in KEP
+- Move the [new reason declarations](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89) from Job controller to the API package
 - Evaluate performance of Job controller for jobs using backoff limit per index
   with benchmarks at the integration or e2e level (discussion pointers from Alpha
   review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
-- Reevaluate ideas of not using `.status.uncountedTerminatedPods` for keeping track
-  in the `.status.Failed` field. The idea is to prevent `backoffLimit` for setting.
-  Discussion [link](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848).
 - The feature flag enabled by default

 #### GA
@@ -758,6 +757,9 @@ A downgrade to a version which does not support this feature should not require
 any additional configuration changes. Jobs which specified
 `.spec.backoffLimitPerIndex` (to make use of this feature) will be
 handled in a default way, ie. using the `.spec.backoffLimit`.
+However, since `.spec.backoffLimit` defaults to the max int32 value
+(see [here](#job-api)), it might require manually setting `.spec.backoffLimit`
+to ensure failed pods are not retried indefinitely.
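As an illustration of that mitigation, here is a minimal Job manifest (a sketch with a hypothetical name, not part of this commit) that sets an explicit `.spec.backoffLimit` alongside the per-index limit, so retries stay bounded even if the feature is disabled on downgrade:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-capped   # hypothetical example name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1
  backoffLimit: 6          # explicit cap; without it the field defaults to max int32
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36.1
        command: ["sleep", "1800"]
```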

 <!--
 If applicable, how will the component be upgraded and downgraded? Make sure
@@ -878,7 +880,8 @@ The Job controller starts to handle pod failures according to the specified

 ###### Are there any tests for feature enablement/disablement?

-No. The tests will be added in Alpha.
+Yes, there is an [integration test](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763)
+which tests the following path: enablement -> disablement -> re-enablement.

 <!--
 The e2e framework does not currently support enabling or disabling feature
@@ -901,7 +904,16 @@ This section must be completed when targeting beta to a release.

 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-The change is opt-in, it doesn't impact already running workloads.
+This change does not affect how a rollout or rollback can fail.
+
+The change is opt-in, thus a rollout doesn't impact already running pods.
+
+A rollback might affect how pod failures are handled, since they will
+be counted only against `.spec.backoffLimit`, which is defaulted to the max
+int32 value when using `.spec.backoffLimitPerIndex` (see [here](#job-api)).
+Thus, similarly to a downgrade (see [here](#downgrade)),
+it might be required to manually set `.spec.backoffLimit` to ensure failed pods
+are not retried indefinitely.

 <!--
 Try to be as paranoid as possible - e.g., what if some components will restart
@@ -934,7 +946,97 @@ that might indicate a serious problem?

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-It will be tested manually prior to beta launch.
+The upgrade->downgrade->upgrade path was tested manually using the `alpha`
+version in 1.28, with the following steps:
+
+1. Start the cluster with the `JobBackoffLimitPerIndex` feature gate enabled:
+
+   ```sh
+   kind create cluster --name per-index --image kindest/node:v1.28.0 --config config.yaml
+   ```
+
+   using `config.yaml`:
+
+   ```yaml
+   kind: Cluster
+   apiVersion: kind.x-k8s.io/v1alpha4
+   featureGates:
+     "JobBackoffLimitPerIndex": true
+   nodes:
+   - role: control-plane
+   - role: worker
+   ```
+
+   Then, create the job using `.spec.backoffLimitPerIndex=1`:
+
+   ```sh
+   kubectl create -f job.yaml
+   ```
+
+   using `job.yaml`:
+
+   ```yaml
+   apiVersion: batch/v1
+   kind: Job
+   metadata:
+     name: job-longrun
+   spec:
+     parallelism: 3
+     completions: 3
+     completionMode: Indexed
+     backoffLimitPerIndex: 1
+     template:
+       spec:
+         restartPolicy: Never
+         containers:
+         - name: sleep
+           image: busybox:1.36.1
+           command: ["sleep"]
+           args: ["1800"] # 30min
+           imagePullPolicy: IfNotPresent
+   ```
+
+   Wait for the pods to be running, then delete the 0-indexed pod:
+
+   ```sh
+   kubectl delete pods -l job-name=job-longrun -l batch.kubernetes.io/job-completion-index=0 --grace-period=1
+   ```
+
+   Wait for the replacement pod to be created and repeat the deletion.
+
+   Check the job status and confirm `.status.failedIndexes="0"`:
+
+   ```sh
+   kubectl get jobs -ljob-name=job-longrun -oyaml
+   ```
+
+   Also, notice that `.status.active=2`, because the pod for a failed index is
+   not re-created.
+
+2. Simulate a downgrade by disabling the feature for the API server and
+   control plane.
+
+   Then, verify that 3 pods are running again and that `.status.failedIndexes`
+   is gone:
+
+   ```sh
+   kubectl get jobs -ljob-name=job-longrun -oyaml
+   ```
+
+   This produces output similar to:
+
+   ```yaml
+   ...
+   status:
+     active: 3
+     failed: 2
+     ready: 2
+   ```
+
+3. Simulate an upgrade by re-enabling the feature for the API server and
+   control plane.
+
+   Then, delete the 1-indexed pod:
+
+   ```sh
+   kubectl delete pods -l job-name=job-longrun -l batch.kubernetes.io/job-completion-index=1 --grace-period=1
+   ```
+
+   Wait for the replacement pod to be created and repeat the deletion.
+   Check the job status and confirm `.status.failedIndexes="1"`:
+
+   ```sh
+   kubectl get jobs -ljob-name=job-longrun -oyaml
+   ```
+
+   Also, notice that `.status.active=2`, because the pod for a failed index is
+   not re-created.
+
+This demonstrates that the feature is working again for the job.
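The per-index accounting exercised by these steps can be sketched in a few lines. This is an illustrative model only, not the Job controller's actual code, and it renders the failed-indexes string in a simplified form (no interval compression):

```python
def failed_indexes(pod_failures, backoff_limit_per_index):
    """Illustrative model: an index is marked failed once its number of
    pod failures exceeds backoffLimitPerIndex."""
    failed = [i for i, f in sorted(pod_failures.items()) if f > backoff_limit_per_index]
    # Simplified rendering of .status.failedIndexes (the real field compresses intervals)
    return ",".join(str(i) for i in failed)

# Step 1 above: backoffLimitPerIndex=1, index-0 pod deleted twice
print(failed_indexes({0: 2, 1: 0, 2: 0}, 1))  # prints: 0
```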

 <!--
 Describe manual testing that was done and the outcomes.
@@ -992,6 +1094,8 @@ Recall that end users cannot usually observe component logs or access metrics.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

+This feature does not propose SLOs.
+
 <!--
 This is your opportunity to define what "normal" quality of service looks like
 for a feature.
@@ -1017,11 +1121,13 @@ Pick one more of these and delete the rest.
 - Metric name:
   - `job_sync_duration_seconds` (existing): can be used to see how much the
     feature enablement increases the time spent in the sync job
+  - `job_finished_indexes_total` (new): can be used to determine if the indexes
+    are marked failed.
 - Components exposing the metric: kube-controller-manager

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-For Beta we will consider introduction of a new metric `job_finished_indexes_total`
+For Beta we will introduce a new metric `job_finished_indexes_total`
 with labels `status=(failed|succeeded)`, and `backoffLimit=(perIndex|global)`.
 It will count the number of failed and succeeded indexes across jobs using
 `backoffLimitPerIndex`, or regular Indexed Jobs (using only `.spec.backoffLimit`).
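For illustration, the `status` label split allows deriving a failed-index ratio from the two counter series; a minimal sketch of that arithmetic (illustrative only, not part of the KEP):

```python
def failed_index_ratio(failed_total, succeeded_total):
    """Fraction of finished indexes that failed, computed from the two
    status label series of the proposed job_finished_indexes_total metric."""
    finished = failed_total + succeeded_total
    return failed_total / finished if finished else 0.0

# e.g. 5 failed and 15 succeeded indexes observed across jobs
print(failed_index_ratio(5, 15))  # prints: 0.25
```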
@@ -1167,6 +1273,20 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->

+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No. This feature does not introduce any resource-exhaustive operations.
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
### Troubleshooting

 <!--
@@ -1182,8 +1302,12 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+No change from existing behavior of the Job controller.
+
 ###### What are other known failure modes?

+None.
+
 <!--
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -1199,6 +1323,8 @@ For each of them, fill in the following information by copying the below templat

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+N/A.
+
 ## Implementation History

 <!--
@@ -1219,6 +1345,8 @@ Major milestones might include:
 - 2023-07-13: The implementation PR [Support BackoffLimitPerIndex in Jobs #118009](https://github.com/kubernetes/kubernetes/pull/118009) under review
 - 2023-07-18: Merge the API PR [Extend the Job API for BackoffLimitPerIndex](https://github.com/kubernetes/kubernetes/pull/119294)
 - 2023-07-18: Merge the Job Controller PR [Support BackoffLimitPerIndex in Jobs](https://github.com/kubernetes/kubernetes/pull/118009)
+- 2023-08-04: Merge user-facing docs PR [Docs update for Job's backoff limit per index (alpha in 1.28)](https://github.com/kubernetes/website/pull/41921)
+- 2023-08-06: Merge KEP update reflecting decisions during the implementation phase [Update for KEP3850 "Backoff Limit Per Index"](https://github.com/kubernetes/enhancements/pull/4123)

 ## Drawbacks

@@ -1457,6 +1585,26 @@ when a user sets `maxFailedIndexes` as 10^6 the Job may complete if the indexes
 and consecutive, but the Job may also fail if the size of the object exceeds the
 limits due to non-consecutive indexes failing.

+### Skip uncountedTerminatedPods when backoffLimitPerIndex is used
+
+It's been proposed (see [link](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848))
+that, when backoffLimitPerIndex is used, we could skip the interim step of
+recording terminated pods in `.status.uncountedTerminatedPods`.
+
+**Reasons for deferring / rejecting**
+
+First, if we stop using `.status.uncountedTerminatedPods`, then
+`.status.failed` can no longer track the number of failed pods. Thus, it would
+require a change of semantics to denote just the number of failed indexes. This
+has downsides:
+- two different semantics of the field, depending on the feature used
+- lost information about some failed pods within an index (some users may care
+  to investigate succeeded indexes with at least one failed pod)
+
+Second, it would only optimize the unhappy path, where there are failures. Also,
+the saving is only 1 request per 500 failed pods, which does not seem essential.
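A back-of-the-envelope check of that last claim (illustrative arithmetic only; the 500-pod batch size is taken from the text above):

```python
import math

def interim_status_updates(failed_pods, batch_size=500):
    """Rough count of interim status updates spent recording failed pods in
    uncountedTerminatedPods, batched up to batch_size pods per update."""
    return math.ceil(failed_pods / batch_size)

# Even 10,000 failed pods would only save ~20 requests if the step were skipped
print(interim_status_updates(10_000))  # prints: 20
```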

 ## Infrastructure Needed (Optional)

 <!--
keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.28"
+latest-milestone: "v1.29"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
