Commit 525634f

Merge pull request kubernetes#3769 from mimowo/pod-failure-policy-testgrid
Testgrid links to e2e tests for "KEP-3329: Retriable and non-retriable Pod failures for Jobs"
2 parents c65b026 + 17d8ed0 commit 525634f

File tree

1 file changed (+30 -16 lines)
  • keps/sig-apps/3329-retriable-and-non-retriable-failures


keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md

Lines changed: 30 additions & 16 deletions
@@ -230,7 +230,7 @@ thousands of nodes requires usage of pod restart policies in order
 to account for infrastructure failures.
 
 Currently, kubernetes Job API offers a way to account for infrastructure
-failures by setting `.backoffLimit > 0`. However, this mechanism intructs the
+failures by setting `.backoffLimit > 0`. However, this mechanism instructs the
 job controller to restart all failed pods - regardless of the root cause
 of the failures. Thus, in some scenarios this leads to unnecessary
 restarts of many pods, resulting in a waste of time and computational
@@ -354,7 +354,7 @@ As a machine learning researcher, I run jobs comprising thousands
 of long-running pods on a cluster comprising thousands of nodes. The jobs often
 run at night or over weekend without any human monitoring. In order to account
 for random infrastructure failures we define `.backoffLimit: 6` for the job.
-However, a signifficant portion of the failures happen due to bugs in code.
+However, a significant portion of the failures happen due to bugs in code.
 Moreover, the failures may happen late during the program execution time. In
 such case, restarting such a pod results in wasting a lot of computational time.
 
@@ -729,7 +729,7 @@ in different messages for pods.
 - Reproduction: Run kube-controller-manager with disabled taint-manager (with the
   flag `--enable-taint-manager=false`). Then, run a job with a long-running pod and
   disconnect the node
-- Comments: handled by node lifcycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
+- Comments: handled by node lifecycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
   However, the pod phase remains `Running`.
 - Pod status:
   - status: Unknown
@@ -762,7 +762,7 @@ In Alpha, there is no support for Pod conditions for failures or disruptions ini
 
 For Beta we introduce handling of Pod failures initiated by Kubelet by adding
 the pod disruption condition (introduced in Alpha) in case of disruptions
-initiated by kubetlet (see [Design details](#design-details)).
+initiated by Kubelet (see [Design details](#design-details)).
 
 Kubelet can also evict a pod in some scenarios which are not covered with
 adding a pod failure condition:
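For context on the hunk above: the pod disruption condition that Kubelet (and other components) append surfaces in the pod status roughly as follows. This is an illustrative sketch, not taken from the commit; the exact `reason` and `message` strings depend on the disruption source.

```yaml
# Illustrative pod status after a kubelet-initiated disruption
# (values such as reason/message/timestamps are examples):
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: TerminationByKubelet   # e.g. graceful node shutdown or node-pressure eviction
    message: Pod was terminated in response to imminent node shutdown.
    lastTransitionTime: "2022-09-26T10:00:00Z"
```

A Job's `podFailurePolicy` rules can then match on this condition type rather than on exit codes.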
@@ -863,7 +863,7 @@ dies) between appending a pod condition and deleting the pod.
 In particular, scheduler can possibly decide to preempt
 a different pod the next time (or none). This would leave a pod with a
 condition that it was preempted, when it actually wasn't. This in turn
-could lead to inproper handling of the pod by the job controller.
+could lead to improper handling of the pod by the job controller.
 
 As a solution we implement a worker, part of the disruption
 controller, which clears the pod condition added if `DeletionTimestamp` is
@@ -1218,7 +1218,7 @@ the pod failure does not match any of the specified rules, then default
 handling of failed pods applies.
 
 If we limit this feature to use `onExitCodes` only when `restartPolicy=Never`
-(see: [limitting this feature](#limitting-this-feature)), then the rules using
+(see: [limiting this feature](#limitting-this-feature)), then the rules using
 `onExitCodes` are evaluated only against the exit codes in the `state` field
 (under `terminated.exitCode`) of `pod.status.containerStatuses` and
 `pod.status.initContainerStatuses`. We may also need to check for the exit codes
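The hunk above touches the section describing `onExitCodes` rules. For context, a Job using both an `onExitCodes` rule and a `DisruptionTarget` rule looks roughly like this; a minimal sketch, with illustrative names, image, and exit-code values:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # illustrative name
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never  # required for onExitCodes rules per the discussion above
      containers:
      - name: main
        image: registry.example.com/app:latest   # illustrative image
  podFailurePolicy:
    rules:
    # Fail the Job early on a non-retriable exit code (e.g. a bug in code).
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    # Do not count infrastructure disruptions towards backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
        status: "True"
```

Rules are evaluated in order; the first matching rule determines the action, and failures matching no rule fall back to the default `backoffLimit` counting.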
@@ -1279,9 +1279,9 @@ the following scenarios will be covered with unit tests:
 - handling of a pod failure, in accordance with the specified `spec.podFailurePolicy`,
   when the failure is associated with
   - a failed container with non-zero exit code,
-  - a dedicated Pod condition indicating termmination originated by a kubernetes component
+  - a dedicated Pod condition indicating termination originated by a kubernetes component
 - adding of the `DisruptionTarget` by Kubelet in case of:
-  - eviciton due to graceful node shutdown
+  - eviction due to graceful node shutdown
   - eviction due to node pressure
 <!--
 Additionally, for Alpha try to enumerate the core package you will be touching
@@ -1313,7 +1313,7 @@ The following scenarios will be covered with integration tests:
 - pod failure is caused by a failed container with a non-zero exit code
 
 More integration tests might be added to ensure good code coverage based on the
-actual implemention.
+actual implementation.
 
 <!--
 This question should be filled when targeting a release.
@@ -1338,8 +1338,22 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
 - <test>: <link to test coverage
 
 -->
-The following scenario will be covered with e2e tests:
-- early job termination when a container fails with a non-retriable exit code
+The following scenario are covered with e2e tests:
+- [sig-apps#gce](https://testgrid.k8s.io/sig-apps#gce):
+  - Job Using a pod failure policy to not count some failures towards the backoffLimit Ignore DisruptionTarget condition
+  - Job Using a pod failure policy to not count some failures towards the backoffLimit Ignore exit code 137
+  - Job should allow to use the pod failure policy on exit code to fail the job early
+  - Job should allow to use the pod failure policy to not count the failure towards the backoffLimit
+- [sig-scheduling#gce-serial](https://testgrid.k8s.io/sig-scheduling#gce-serial):
+  - SchedulerPreemption [Serial] validates pod disruption condition is added to the preempted pod
+- [sig-release-master-informing#gce-cos-master-serial](https://testgrid.k8s.io/sig-release-master-informing#gce-cos-master-serial):
+  - NoExecuteTaintManager Single Pod [Serial] pods evicted from tainted nodes have pod disruption condition
+
+The following scenarios are covered with node e2e tests
+([sig-node-presubmits#pr-kubelet-gce-e2e-pod-disruption-conditions](https://testgrid.k8s.io/sig-node-presubmits#pr-kubelet-gce-e2e-pod-disruption-conditions) and
+[sig-node-presubmits#pr-node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-presubmits#pr-node-kubelet-serial-containerd)):
+- GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown] [NodeFeature:GracefulNodeShutdownBasedOnPodPriority] graceful node shutdown when PodDisruptionConditions are enabled [NodeFeature:PodDisruptionConditions] should add the DisruptionTarget pod failure condition to the evicted pods
+- PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive][NodeFeature:Eviction] when we run containers that should cause PIDPressure; PodDisruptionConditions enabled [NodeFeature:PodDisruptionConditions] should eventually evict all of the correct pods
 
 More e2e test scenarios might be considered during implementation if practical.
 
@@ -1439,7 +1453,7 @@ N/A
 An upgrade to a version which supports this feature should not require any
 additional configuration changes. In order to use this feature after an upgrade
 users will need to configure their Jobs by specifying `spec.podFailurePolicy`. The
-only noticeable difference in behaviour, without specifying `spec.podFailurePolicy`,
+only noticeable difference in behavior, without specifying `spec.podFailurePolicy`,
 is that Pods terminated by kubernetes components will have an additional
 condition appended to `status.conditions`.
 
@@ -1654,7 +1668,7 @@ Manual test performed to simulate the upgrade->downgrade->upgrade scenario:
   - Scenario 2:
     - Create a job with a long running containers and `backoffLimit=0`.
     - Verify that the job continues after the node in uncordoned
-1. Disable the feature gates. Verify that the above scenarios result in default behaviour:
+1. Disable the feature gates. Verify that the above scenarios result in default behavior:
   - In scenario 1: the job restarts pods failed with exit code `42`
   - In scenario 2: the job is failed due to exceeding the `backoffLimit` as the failed pod failed during the draining
 1. Re-enable the feature gates
@@ -1953,7 +1967,7 @@ technics apply):
   is an increase of the Job controller processing time.
 - Inspect the Job controller's `job_pods_finished_total` metric for the
   to check if the numbers of pod failures handled by specific actions (counted
-  by the `failure_policy_action` label) agree with the expetations.
+  by the `failure_policy_action` label) agree with the expectations.
   For example, if a user configures job failure policy with `Ignore` action for
   the `DisruptionTarget` condition, then a node drain is expected to increase
   the metric for `failure_policy_action=Ignore`.
@@ -1963,7 +1977,7 @@ technics apply):
 
 - 2022-06-23: Initial KEP merged
 - 2022-07-12: Preparatory PR "Refactor gc_controller to do not use the deletePod stub" merged
-- 2022-07-14: Preparatory PR "efactor taint_manager to do not use getPod and getNode stubs" merged
+- 2022-07-14: Preparatory PR "Refactor taint_manager to do not use getPod and getNode stubs" merged
 - 2022-07-20: Preparatory PR "Add integration test for podgc" merged
 - 2022-07-28: KEP updates merged
 - 2022-08-01: Additional KEP updates merged
@@ -1972,7 +1986,7 @@ technics apply):
 - 2022-08-04: PR "Support handling of pod failures with respect to the configured rules" merged
 - 2022-09-09: Bugfix PR for test "Fix the TestRoundTripTypes by adding default to the fuzzer" merged
 - 2022-09-26: Prepared PR for KEP Beta update. Summary of the changes:
-  - propsal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
+  - proposal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
     - DisruptionTarget for evictions due graceful node shutdown, admission errors, node pressure or Pod admission errors
     - ResourceExhausted for evictions due to OOM killer and exceeding Pod's ephemeral-storage limits
   - extended the review of pod eviction scenarios by kubelet-initiated pod evictions:
