@@ -25,7 +25,15 @@ at the same time. In API terms, a pod is considered terminating when it has a
 The scenario when two Pods are running at a given time is problematic for
 some popular machine learning frameworks, such as
 TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at the same time,
-for a given index (see more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844)).
+for a given index.
+TensorFlow gives the following error if two Pods are running for a given index.
+
+```
+/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
+```
+
+See more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844).
+
 
 Creating the replacement Pod before the previous one fully terminates can also
 cause problems in clusters with scarce resources or with tight budgets, such as:
@@ -61,14 +69,11 @@ Additionally, you can inspect the `.status.terminating` field of a Job. The valu
 of the field is the number of Pods owned by the Job that are currently terminating.
 
 ```shell
-kubectl get jobs/myjob -o yaml
+kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
 ```
 
-```yaml
-apiVersion: batch/v1
-kind: Job
-status:
-  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
+```
+3 # three Pods are terminating and have not yet reached the Failed phase
 ```
 
 This can be particularly useful for external queueing controllers, such as
@@ -130,7 +135,9 @@ spec:
     spec:
       restartPolicy: Never
       containers:
-      - name: example
+      - name: example # this example container returns an error, and fails,
+                      # when it is run as the second or third index in any Job
+                      # (even after a retry)
         image: python
         command:
         - python3
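
For readers who want to consume the `.status.terminating` value programmatically, for example from a queueing controller, here is a minimal sketch. It is not part of this change. It assumes the Job has already been fetched as JSON (for instance with `kubectl get jobs/myjob -o json`). The helper name `terminating_pods` and the sample object are hypothetical, and treating an absent field as zero is an assumption made for illustration.

```python
import json


def terminating_pods(job: dict) -> int:
    """Return the Job's .status.terminating count.

    Assumption: if the field (or the whole status) is absent,
    report zero terminating Pods.
    """
    status = job.get("status") or {}
    return int(status.get("terminating") or 0)


# Hypothetical Job object, trimmed to the fields this sketch reads.
sample = json.loads(
    '{"apiVersion": "batch/v1", "kind": "Job", "status": {"terminating": 3}}'
)

print(terminating_pods(sample))
```

Reading the field from the parsed object, rather than from the `kubectl` text output, keeps the caller independent of the chosen output format.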