
Commit 8ce4cde

Remarks
Signed-off-by: Michal Wozniak <[email protected]>
1 parent 439c237 commit 8ce4cde

File tree

1 file changed (+24, -8 lines)
  • keps/sig-apps/3329-retriable-and-non-retriable-failures


keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md

Lines changed: 24 additions & 8 deletions
````diff
@@ -885,7 +885,9 @@ the pod is actually in the terminal phase (`Failed`), to ensure their state is
 not modified while Job controller matches them against the pod failure policy.
 
 However, there are scenarios in which a pod gets stuck in a non-terminal phase,
-but is doomed to be failed, as it is terminating (has `deletionTimestamp` set).
+but is doomed to be failed, as it is terminating (has `deletionTimestamp` set, also
+known as the `DELETING` state, see:
+[The API Object Lifecycle](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/object-lifecycle.md)).
 In order to workaround this issue, Job controller, when pod failure policy is
 disabled, considers any terminating pod that is in a non-terminal phase as failed.
 Note that, it is important that when Job controller considers such pods as failed
````
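The added lines above cross-reference the `DELETING` state for pods that are terminating while still in a non-terminal phase. As an illustration, a minimal, hedged sketch of how such a pod can be observed from the command line; the `job-name=my-job` selector is a placeholder, not something defined in the KEP:

```bash
# List a Job's pods with their phase and deletionTimestamp side by side.
# A pod in the DELETING state has a deletionTimestamp set while its phase
# may still be Pending or Running. The label value "my-job" is a placeholder.
kubectl get pods -l job-name=my-job \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DELETION:.metadata.deletionTimestamp
```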
````diff
@@ -974,7 +976,7 @@ spec:
     rules: []
   backoffLimit: 0
 ```
-2. delete the pod with `k delete pods -l job-name=invalid-image`
+2. delete the pod with `kubectl delete pods -l job-name=invalid-image`
 
 The relevant fields of the pod:
 
````
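For the `invalid-image` scenario above, a hedged sketch of how the phase transition could be watched while the delete from step 2 runs; with the behaviour proposed in this KEP the pod is expected to move from `Pending` directly to `Failed`:

```bash
# Watch the invalid-image Job's pod while it is being deleted; -w streams
# updates so the phase transition is visible as it happens.
kubectl get pods -l job-name=invalid-image -w \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DELETION:.metadata.deletionTimestamp
```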

````diff
@@ -1047,7 +1049,7 @@ spec:
     rules: []
   backoffLimit: 0
 ```
-2. delete the pod with `k delete pods -l job-name=invalid-configmap-ref`
+2. delete the pod with `kubectl delete pods -l job-name=invalid-configmap-ref`
 
 The relevant fields of the pod:
 
````
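Similarly, for the `invalid-configmap-ref` scenario, a hedged sketch of how one might confirm why the pod is stuck before deleting it; a missing ConfigMap reference typically surfaces as a container waiting reason such as `CreateContainerConfigError`:

```bash
# Print the waiting reason of the first container of the Job's pod.
kubectl get pods -l job-name=invalid-configmap-ref \
  -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}{"\n"}'
```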

````diff
@@ -1099,12 +1101,12 @@ spec:
       - name: huge-image
         image: sagemathinc/cocalc # this is around 20GB
         command: ["bash"]
-        args: ["-c", 'echo "Hello world"']
+        args: ["-c", 'sleep 60 && echo "Hello world"']
   podFailurePolicy:
     rules: []
   backoffLimit: 0
 ```
-2. delete the pod with `k delete pods -l job-name=huge-image`
+2. delete the pod with `kubectl delete pods -l job-name=huge-image`
 
 The relevant fields of the pod:
 
````
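The `sleep 60` added above keeps the container alive for a while after the roughly 20GB image finally pulls. A hedged sketch of how one might check, before running step 2, that the pod is still pulling the image and has not yet reached `Running`:

```bash
# Recent events for the huge-image Job's pod: a Pulling event without a
# matching Pulled event indicates the image pull is still in progress.
kubectl get events --field-selector involvedObject.kind=Pod --sort-by=.lastTimestamp \
  | grep huge-image
```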

````diff
@@ -1131,9 +1133,10 @@ The relevant fields of the pod:
 
 Here, the pod is not stuck, however it transitions to `Running` and fails
 soon after, making the interim transition to `Running` unnecessary. Also, there
-is a race condition, in some situations the running pod may complete with the
-`Succeeded` status before its containers are killed and in transitions in the
-`Failed` phase. This is already problematic for the Job controller, which might
+is a race condition, if the container succeeds before the graceful period for
+pod termination (if not for the `sleep 60` in the example above) the running pod may complete with the
+`Succeeded` status before its containers are killed (and it transitions in the
+`Failed` phase). This is already problematic for the Job controller, which might
 count the pod as failed, despite the pod eventually succeeding. With the proposed
 change, in the scenario, the pod transitions directly from the `Pending` phase
 to `Failed`.
````
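The reworded paragraph above describes the race between the container succeeding and the pod being terminated. A hedged sketch of how the outcome could be inspected on the Job object; the Job is assumed to be named `huge-image`, matching the `job-name` label used in the steps:

```bash
# Inspect how the Job controller accounted for the terminated pod. Under the
# race described above, the counters may not match the pod's actual outcome
# (e.g. the pod is counted as failed even though it eventually succeeded).
kubectl get job huge-image -o jsonpath='failed={.status.failed} succeeded={.status.succeeded}{"\n"}'
```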
````diff
@@ -2209,6 +2212,19 @@ Think through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No. This feature does not introduce any resource exhaustive operations.
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
 ### Troubleshooting
 
 <!--
````
