@@ -230,7 +230,7 @@ thousands of nodes requires usage of pod restart policies in order
to account for infrastructure failures.

Currently, the Kubernetes Job API offers a way to account for infrastructure
- failures by setting `.backoffLimit > 0`. However, this mechanism intructs the
+ failures by setting `.backoffLimit > 0`. However, this mechanism instructs the
job controller to restart all failed pods - regardless of the root cause
of the failures. Thus, in some scenarios this leads to unnecessary
restarts of many pods, resulting in a waste of time and computational
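For context, a minimal sketch of a Job that relies only on this existing mechanism is shown below; the Job name, container name, image and command are illustrative placeholders, not values taken from the KEP.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                # placeholder name
spec:
  backoffLimit: 6                  # restart failed pods up to 6 times, regardless of the failure cause
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main                 # placeholder container name
        image: example.com/training:latest   # placeholder image
        command: ["python", "train.py"]      # placeholder command
```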
@@ -354,7 +354,7 @@ As a machine learning researcher, I run jobs comprising thousands
of long-running pods on a cluster comprising thousands of nodes. The jobs often
run at night or over the weekend without any human monitoring. In order to account
for random infrastructure failures we define `.backoffLimit: 6` for the job.
- However, a signifficant portion of the failures happen due to bugs in code.
+ However, a significant portion of the failures happen due to bugs in code.
Moreover, the failures may happen late during the program execution time. In
such a case, restarting the pod results in wasting a lot of computational time.

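The behaviour this user story asks for could be expressed with the API proposed in this KEP roughly as follows; the container name, image and the `42` exit code are illustrative assumptions, not values mandated by the proposal.

```yaml
apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 6                  # still tolerate random infrastructure failures
  podFailurePolicy:
    rules:
    - action: FailJob              # fail the whole Job fast when the failure looks like a software bug
      onExitCodes:
        containerName: main        # illustrative container name
        operator: In
        values: [42]               # illustrative "bug" exit code
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: example.com/training:latest   # placeholder image
```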
@@ -729,7 +729,7 @@ in different messages for pods.
- Reproduction: Run kube-controller-manager with disabled taint-manager (with the
flag `--enable-taint-manager=false`). Then, run a job with a long-running pod and
disconnect the node
- - Comments: handled by node lifcycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
+ - Comments: handled by node lifecycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
However, the pod phase remains `Running`.
- Pod status:
- status: Unknown
@@ -762,7 +762,7 @@ In Alpha, there is no support for Pod conditions for failures or disruptions ini

For Beta we introduce handling of Pod failures initiated by Kubelet by adding
the pod disruption condition (introduced in Alpha) in case of disruptions
- initiated by kubetlet (see [Design details](#design-details)).
+ initiated by Kubelet (see [Design details](#design-details)).

Kubelet can also evict a pod in some scenarios which are not covered with
adding a pod failure condition:
@@ -863,7 +863,7 @@ dies) between appending a pod condition and deleting the pod.
In particular, the scheduler can possibly decide to preempt
a different pod the next time (or none). This would leave a pod with a
condition that it was preempted, when it actually wasn't. This in turn
- could lead to inproper handling of the pod by the job controller.
+ could lead to improper handling of the pod by the job controller.

As a solution, we implement a worker, part of the disruption
controller, which clears the pod condition added if `DeletionTimestamp` is
@@ -1218,7 +1218,7 @@ the pod failure does not match any of the specified rules, then default
handling of failed pods applies.

If we limit this feature to use `onExitCodes` only when `restartPolicy=Never`
- (see: [limitting this feature](#limitting-this-feature)), then the rules using
+ (see: [limiting this feature](#limitting-this-feature)), then the rules using
`onExitCodes` are evaluated only against the exit codes in the `state` field
(under `terminated.exitCode`) of `pod.status.containerStatuses` and
`pod.status.initContainerStatuses`. We may also need to check for the exit codes
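As a concrete illustration of the fields referenced above, the `onExitCodes` rules would be matched against a failed pod status of roughly this shape; the container name and exit code are illustrative.

```yaml
status:
  phase: Failed
  containerStatuses:
  - name: main                     # illustrative container name
    state:
      terminated:
        exitCode: 42               # the value compared against onExitCodes.values
        reason: Error
```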
@@ -1279,9 +1279,9 @@ the following scenarios will be covered with unit tests:
- handling of a pod failure, in accordance with the specified `spec.podFailurePolicy`,
when the failure is associated with
- a failed container with non-zero exit code,
- - a dedicated Pod condition indicating termmination originated by a kubernetes component
+ - a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the `DisruptionTarget` by Kubelet in case of:
- - eviciton due to graceful node shutdown
+ - eviction due to graceful node shutdown
- eviction due to node pressure
<!--
Additionally, for Alpha try to enumerate the core package you will be touching
@@ -1313,7 +1313,7 @@ The following scenarios will be covered with integration tests:
- pod failure is caused by a failed container with a non-zero exit code

More integration tests might be added to ensure good code coverage based on the
- actual implemention.
+ actual implementation.

<!--
This question should be filled when targeting a release.
@@ -1453,7 +1453,7 @@ N/A
An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying `spec.podFailurePolicy`. The
- only noticeable difference in behaviour, without specifying `spec.podFailurePolicy`,
+ only noticeable difference in behavior, without specifying `spec.podFailurePolicy`,
is that Pods terminated by kubernetes components will have an additional
condition appended to `status.conditions`.

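For illustration, such an appended condition would look roughly like the sketch below; the `reason` and `message` values are assumptions that depend on which component terminated the Pod, only the `DisruptionTarget` condition type itself comes from this KEP.

```yaml
status:
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: DeletionByTaintManager                             # assumed reason; varies by component
    message: "Taint manager: deleting due to NoExecute taint"  # illustrative message
```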
@@ -1668,7 +1668,7 @@ Manual test performed to simulate the upgrade->downgrade->upgrade scenario:
- Scenario 2:
- Create a job with long-running containers and `backoffLimit=0`.
- Verify that the job continues after the node is uncordoned
- 1. Disable the feature gates. Verify that the above scenarios result in default behaviour:
+ 1. Disable the feature gates. Verify that the above scenarios result in default behavior:
- In scenario 1: the job restarts pods that failed with exit code `42`
- In scenario 2: the job is failed due to exceeding the `backoffLimit` as the failed pod failed during the draining
1. Re-enable the feature gates
@@ -1967,7 +1967,7 @@ technics apply):
is an increase of the Job controller processing time.
- Inspect the Job controller's `job_pods_finished_total` metric
to check if the numbers of pod failures handled by specific actions (counted
- by the `failure_policy_action` label) agree with the expetations.
+ by the `failure_policy_action` label) agree with the expectations.
For example, if a user configures job failure policy with `Ignore` action for
the `DisruptionTarget` condition, then a node drain is expected to increase
the metric for `failure_policy_action=Ignore`.
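For reference, the configuration mentioned in that example (an `Ignore` action tied to the `DisruptionTarget` condition) is a rule of roughly the following shape; this is a fragment of `spec.podFailurePolicy`, not a complete Job manifest.

```yaml
podFailurePolicy:
  rules:
  - action: Ignore                 # pod failures caused by disruptions do not count against backoffLimit
    onPodConditions:
    - type: DisruptionTarget
      status: "True"
```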
@@ -1977,7 +1977,7 @@ technics apply):

- 2022-06-23: Initial KEP merged
- 2022-07-12: Preparatory PR "Refactor gc_controller to do not use the deletePod stub" merged
- - 2022-07-14: Preparatory PR "efactor taint_manager to do not use getPod and getNode stubs" merged
+ - 2022-07-14: Preparatory PR "Refactor taint_manager to do not use getPod and getNode stubs" merged
- 2022-07-20: Preparatory PR "Add integration test for podgc" merged
- 2022-07-28: KEP updates merged
- 2022-08-01: Additional KEP updates merged
@@ -1986,7 +1986,7 @@ technics apply):
- 2022-08-04: PR "Support handling of pod failures with respect to the configured rules" merged
- 2022-09-09: Bugfix PR for test "Fix the TestRoundTripTypes by adding default to the fuzzer" merged
- 2022-09-26: Prepared PR for KEP Beta update. Summary of the changes:
- - propsal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
+ - proposal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
- DisruptionTarget for evictions due to graceful node shutdown, admission errors, node pressure or Pod admission errors
- ResourceExhausted for evictions due to OOM killer and exceeding Pod's ephemeral-storage limits
- extended the review of pod eviction scenarios by kubelet-initiated pod evictions: