Commit 5130290

update with sftim comments
1 parent 1fab6ea commit 5130290

1 file changed: +15 -8 lines changed

content/en/blog/_posts/2023-08-21-job-update-post.md

Lines changed: 15 additions & 8 deletions
@@ -25,7 +25,15 @@ at the same time. In API terms, a pod is considered terminating when it has a
 The scenario when two Pods are running at a given time is problematic for
 some popular machine learning frameworks, such as
 TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at the same time,
-for a given index (see more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844)).
+for a given index.
+Tensorflow gives the following error if two pods are running for a given index.
+
+```
+/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
+```
+
+See more details in the ([issue](https://github.com/kubernetes/kubernetes/issues/115844)).
+
 
 Creating the replacement Pod before the previous one fully terminates can also
 cause problems in clusters with scarce resources or with tight budgets, such as:
@@ -61,14 +69,11 @@ Additionally, you can inspect the `.status.terminating` field of a Job. The valu
 of the field is the number of Pods owned by the Job that are currently terminating.
 
 ```shell
-kubectl get jobs/myjob -o yaml
+kubectl get jobs/myjob -o=jsonpath='{.items[*].status.terminating}'
 ```
 
-```yaml
-apiVersion: batch/v1
-kind: Job
-status:
-  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
+```
+3 # three Pods are terminating and have not yet reached the Failed phase
 ```
 
 This can be particularly useful for external queueing controllers, such as
@@ -130,7 +135,9 @@ spec:
     spec:
       restartPolicy: Never
       containers:
-      - name: example
+      - name: example # this example container returns an error, and fails,
+                      # when it is run as the second or third index in any Job
+                      # (even after a retry)
         image: python
         command:
         - python3
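
The `.status.terminating` lookup shown in the second hunk can also be done programmatically. The following is a minimal sketch, assuming the Job has already been fetched as JSON (for example with `kubectl get jobs/myjob -o json`). The helper name `terminating_pods` and the sample payload are hypothetical, and treating a missing field as zero is an assumption rather than documented behavior.

```python
import json


def terminating_pods(job_json: str) -> int:
    """Return .status.terminating from a Job object serialized as JSON.

    Assumption: a Job without the field is treated as having zero
    terminating Pods.
    """
    job = json.loads(job_json)
    status = job.get("status") or {}
    return status.get("terminating", 0)


# Hypothetical payload mirroring the YAML removed in the diff above.
sample = json.dumps({
    "apiVersion": "batch/v1",
    "kind": "Job",
    "status": {"terminating": 3},
})

print(terminating_pods(sample))
```

With the sample payload this prints `3`, matching the count shown in the diff.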
