Commit 8fdd98a

Graduate Job Pod Failure Policy to stable (kubernetes#4661)
* Graduate Job Pod Failure Policy to stable
* Review remarks
1 parent c7c9de8 commit 8fdd98a

3 files changed

+31
-10
lines changed

keps/prod-readiness/sig-apps/3329.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@ alpha:
   approver: "@johnbelamaric"
 beta:
   approver: "@johnbelamaric"
+stable:
+  approver: "@johnbelamaric"

keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md

Lines changed: 25 additions & 6 deletions
@@ -802,6 +802,10 @@ in terms of retriability and evolving Pod condition types
 to do not add any pod condition in this case. It should be re-considered in the
 future if there is a good motivating use-case.

+The reported issue which could be addressed by the new condition for exceeding
+the active deadline timeout:
+[Pod Failure Policy Edge Case: Job Retries When Pod Finishes Successfully](https://github.com/kubernetes/kubernetes/issues/115688).
+
 ##### Admission failures

 In some scenarios a pod admission failure could result in a successful pod restart on another
@@ -1628,19 +1632,31 @@ The core packages (with their unit test coverage) which are going to be modified
 - `k8s.io/kubernetes/pkg/controller/job`: `13 June 2022` - `88%` <!--(handling of failed pods with regards to the configured podFailurePolicy)-->
 - `k8s.io/kubernetes/pkg/apis/batch/validation`: `13 June 2022` - `94.4%` <!--(validation of the job configuration with regards to the podFailurePolicy)-->
 - `k8s.io/kubernetes/pkg/apis/batch/v1`: `13 June 2022` - `83.6%` <!--(extension of JobSpec)-->
+- `k8s.io/kubernetes/pkg/controller/podgc`: `4 June 2024` - `81.0%` <!--(pod deletion by PodGC)-->
+- `k8s.io/kubernetes/pkg/controller/tainteviction`: `4 June 2024` - `81.8%` <!--(pod eviction by taints)-->
+- `k8s.io/kubernetes/pkg/registry/core/pod/storage`: `4 June 2024` - `78.8%` <!--(pod eviction by API)-->
+- `k8s.io/kubernetes/pkg/controller/disruption`: `4 June 2024` - `79.3%` <!--(cleanup of stale DisruptionTarget conditions)-->
+- `k8s.io/kubernetes/pkg/scheduler/framework/preemption`: `4 June 2024` - `30.1%` <!--(pod preemption by kube-scheduler)-->

 The kubelet packages (with their unit test coverage) which are going to be modified during implementation:
 - `k8s.io/kubernetes/pkg/kubelet/nodeshutdown`: `13 Sep 2022` - `74.9%` <!--(handling of nodeshutdown)-->
 - `k8s.io/kubernetes/pkg/kubelet/eviction`: `13 Sep 2022` - `67.7%` <!--(handling of node-pressure eviction)-->
+- `k8s.io/kubernetes/pkg/kubelet/preemption`: `4 June 2024` - `73.7%` <!--(handling of preemption for a critical pod)-->

 ##### Integration tests

 The following scenarios will be covered with integration tests:
-- enabling, disabling and re-enabling of the feature gate
+- enabling, disabling and re-enabling of the feature gate [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/job/job_test.go#L257)
 - pod failure is triggered by a delete API request along with appending a
 Pod condition indicating termination originated by a kubernetes component
 (we aim to cover all such scenarios)
-- pod failure is caused by a failed container with a non-zero exit code
+* PreemptionByKubeScheduler [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/scheduler/preemption/preemption_test.go#L212-L237)
+* DeletionByTaintManager [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/node/lifecycle_test.go#L48)
+* EvictionByEvictionAPI [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/evictions/evictions_test.go#L347)
+* DeletionByPodGC [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/podgc/podgc_test.go#L41) and [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/podgc/podgc_test.go#L171)
+
+- pod failure is caused by a failed container with a non-zero exit code [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/job/job_test.go#L357-L372)
+- cleanup of a stale DisruptionTarget condition [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/disruption/disruption_test.go#L638)

 More integration tests might be added to ensure good code coverage based on the
 actual implementation.
@@ -1684,6 +1700,7 @@ The following scenarios are covered with node e2e tests
 [sig-node-presubmits#pr-node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-presubmits#pr-node-kubelet-serial-containerd)):
 - GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown] [NodeFeature:GracefulNodeShutdownBasedOnPodPriority] graceful node shutdown when PodDisruptionConditions are enabled [NodeFeature:PodDisruptionConditions] should add the DisruptionTarget pod failure condition to the evicted pods
 - PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive][NodeFeature:Eviction] when we run containers that should cause PIDPressure; PodDisruptionConditions enabled [NodeFeature:PodDisruptionConditions] should eventually evict all of the correct pods
+- CriticalPod [Serial] [Disruptive] [NodeFeature:CriticalPod] when we need to admit a critical pod should add DisruptionTarget condition to the preempted pod [NodeFeature:PodDisruptionConditions]

 More e2e test scenarios might be considered during implementation if practical.

@@ -1753,18 +1770,22 @@ Third iteration (1.28):
 Also, backport this fix to 1.26 and 1.27 release branches, and update the user-facing documentation to reflect this change.
 - Avoid creation of replacement Pods for terminating Pods until they reach
 the terminal phase. Update user-facing documentation.
-Might be considered for backport to 1.27.
+It was back-ported to [1.27](https://github.com/kubernetes/kubernetes/pull/118219).

 Fourth iteration (1.29):
 - Fix the [Pod Garbage collector fails to clean up PODs from nodes that are not running anymore](https://github.com/kubernetes/kubernetes/issues/118261).
 by withdrawing from SSA in the k8s controllers which were adding the `DisruptionTarget` condition.
 We will reconsider returning to SSA if the issue is fixed, but we consider the
 transition as a technical detail, not impacting the API, which can be done
 independently of the KEP graduation cycles.
+The fix was back-ported to [1.28](https://github.com/kubernetes/kubernetes/pull/121379), [1.27](https://github.com/kubernetes/kubernetes/pull/118219), and [1.26](https://github.com/kubernetes/kubernetes/pull/121381).

 #### GA

 - Address reviews and bug reports from Beta users
+- Improved tests coverage:
+* unit test for preemption by kube-scheduler, if feasible
+* integration test for re-enabling of the feature gate
 - Write a blog post about the feature
 - Graduate e2e tests as conformance tests
 - Lock the `PodDisruptionConditions` and `JobPodFailurePolicy` feature-gates
@@ -1784,10 +1805,8 @@ in back-to-back releases.

 #### Deprecation

-In GA+1 release:
-- Modify the code to ignore the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates
-
 In GA+2 release:
+- Modify the code to ignore the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates
 - Remove the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates

 ### Upgrade / Downgrade Strategy

keps/sig-apps/3329-retriable-and-non-retriable-failures/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ owning-sig: sig-apps
 participating-sigs:
   - sig-scheduling
   - sig-node
-status: implementable
+status: implemented
 creation-date: 2022-06-07
 reviewers:
   - "@liggitt"
@@ -22,18 +22,18 @@ see-also:
   - "/keps/sig-apps/3939-allow-replacement-when-fully-terminated"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.28"
+latest-milestone: "v1.31"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.25"
   beta: "v1.26"
-  stable: "v1.30"
+  stable: "v1.31"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
0 commit comments

Comments
 (0)