keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md (25 additions, 6 deletions)
@@ -802,6 +802,10 @@ in terms of retriability and evolving Pod condition types
to not add any pod condition in this case. It should be reconsidered in the
future if there is a good motivating use case.

A reported issue that could be addressed by a new condition for exceeding
the active deadline timeout:
[Pod Failure Policy Edge Case: Job Retries When Pod Finishes Successfully](https://github.com/kubernetes/kubernetes/issues/115688).
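For illustration only (this is not part of the KEP's API changes), a minimal sketch of how a condition-based rule is expressed with the `k8s.io/api/batch/v1` Go types. Since the KEP defines no dedicated condition for exceeding the active deadline, the existing `DisruptionTarget` condition stands in for it here:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative only: ignore failures that carry the DisruptionTarget
	// condition, so they do not count against the Job's backoffLimit.
	policy := batchv1.PodFailurePolicy{
		Rules: []batchv1.PodFailurePolicyRule{{
			Action: batchv1.PodFailurePolicyActionIgnore,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{{
				Type:   corev1.DisruptionTarget,
				Status: corev1.ConditionTrue,
			}},
		}},
	}
	fmt.Printf("%+v\n", policy)
}
```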
##### Admission failures

In some scenarios a pod admission failure could result in a successful pod restart on another
@@ -1628,19 +1632,31 @@ The core packages (with their unit test coverage) which are going to be modified
- `k8s.io/kubernetes/pkg/controller/job`: `13 June 2022` - `88%` <!--(handling of failed pods with regards to the configured podFailurePolicy)-->
- `k8s.io/kubernetes/pkg/apis/batch/validation`: `13 June 2022` - `94.4%` <!--(validation of the job configuration with regards to the podFailurePolicy)-->
- `k8s.io/kubernetes/pkg/apis/batch/v1`: `13 June 2022` - `83.6%` <!--(extension of JobSpec)-->
- `k8s.io/kubernetes/pkg/controller/podgc`: `4 June 2024` - `81.0%` <!--(pod deletion by PodGC)-->
- `k8s.io/kubernetes/pkg/controller/tainteviction`: `4 June 2024` - `81.8%` <!--(pod eviction by taints)-->
- `k8s.io/kubernetes/pkg/registry/core/pod/storage`: `4 June 2024` - `78.8%` <!--(pod eviction by API)-->
- `k8s.io/kubernetes/pkg/controller/disruption`: `4 June 2024` - `79.3%` <!--(cleanup of stale DisruptionTarget conditions)-->
- `k8s.io/kubernetes/pkg/scheduler/framework/preemption`: `4 June 2024` - `30.1%` <!--(pod preemption by kube-scheduler)-->

The kubelet packages (with their unit test coverage) which are going to be modified during implementation:
- `k8s.io/kubernetes/pkg/kubelet/nodeshutdown`: `13 Sep 2022` - `74.9%` <!--(handling of nodeshutdown)-->
- `k8s.io/kubernetes/pkg/kubelet/eviction`: `13 Sep 2022` - `67.7%` <!--(handling of node-pressure eviction)-->
- `k8s.io/kubernetes/pkg/kubelet/preemption`: `4 June 2024` - `73.7%` <!--(handling of preemption for a critical pod)-->
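For orientation, a hedged sketch (not the actual controller or kubelet code) of the `DisruptionTarget` condition that the components listed above attach to a pod's status before terminating it. The helper and the message are made up for this example; reason strings such as `DeletionByPodGC` follow the KEP's naming:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// disruptionCondition is a hypothetical helper; each component builds an
// equivalent condition with its own reason (e.g. DeletionByPodGC,
// PreemptionByScheduler) before terminating the pod.
func disruptionCondition(reason, message string) corev1.PodCondition {
	return corev1.PodCondition{
		Type:               corev1.DisruptionTarget,
		Status:             corev1.ConditionTrue,
		Reason:             reason,
		Message:            message,
		LastTransitionTime: metav1.Now(),
	}
}

func main() {
	fmt.Printf("%+v\n", disruptionCondition("DeletionByPodGC", "PodGC: node no longer exists"))
}
```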
##### Integration tests
The following scenarios will be covered with integration tests:
- enabling, disabling and re-enabling of the feature gate [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/job/job_test.go#L257)
- pod failure is triggered by a delete API request along with appending a
  Pod condition indicating termination originated by a kubernetes component
  (we aim to cover all such scenarios)
  * DeletionByPodGC [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/podgc/podgc_test.go#L41) and [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/podgc/podgc_test.go#L171)
- pod failure is caused by a failed container with a non-zero exit code [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/job/job_test.go#L357-L372) (see the sketch after this list)
- cleanup of a stale DisruptionTarget condition [link](https://github.com/kubernetes/kubernetes/blob/ff5b5f9b2c15c1bef2a7449295f0a6e8fa0bfb59/test/integration/disruption/disruption_test.go#L638)

More integration tests might be added to ensure good code coverage based on the
actual implementation.
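As referenced in the exit-code scenario above, a minimal sketch of the other rule type, again with the `k8s.io/api/batch/v1` types; the exit code `42` is an arbitrary stand-in for an application-defined non-retriable error:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
)

func main() {
	// Illustrative only: fail the whole Job, without further retries, when
	// any container exits with code 42.
	rule := batchv1.PodFailurePolicyRule{
		Action: batchv1.PodFailurePolicyActionFailJob,
		OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
			Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
			Values:   []int32{42},
		},
	}
	fmt.Printf("%+v\n", rule)
}
```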
@@ -1684,6 +1700,7 @@ The following scenarios are covered with node e2e tests
- GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown] [NodeFeature:GracefulNodeShutdownBasedOnPodPriority] graceful node shutdown when PodDisruptionConditions are enabled [NodeFeature:PodDisruptionConditions] should add the DisruptionTarget pod failure condition to the evicted pods
- PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive][NodeFeature:Eviction] when we run containers that should cause PIDPressure; PodDisruptionConditions enabled [NodeFeature:PodDisruptionConditions] should eventually evict all of the correct pods
- CriticalPod [Serial] [Disruptive] [NodeFeature:CriticalPod] when we need to admit a critical pod should add DisruptionTarget condition to the preempted pod [NodeFeature:PodDisruptionConditions]

More e2e test scenarios might be considered during implementation if practical.
@@ -1753,18 +1770,22 @@ Third iteration (1.28):
  Also, backport this fix to 1.26 and 1.27 release branches, and update the
  user-facing documentation to reflect this change.
- Avoid creation of replacement Pods for terminating Pods until they reach
  the terminal phase. Update user-facing documentation.
  This was backported to [1.27](https://github.com/kubernetes/kubernetes/pull/118219).

Fourth iteration (1.29):
- Fix the [Pod Garbage collector fails to clean up PODs from nodes that are not running anymore](https://github.com/kubernetes/kubernetes/issues/118261)
  issue by withdrawing from SSA (server-side apply) in the k8s controllers which were adding the `DisruptionTarget` condition.
  We will reconsider returning to SSA if the issue is fixed, but we consider the
  transition a technical detail, not impacting the API, which can be done
  independently of the KEP graduation cycles.
  The fix was backported to [1.28](https://github.com/kubernetes/kubernetes/pull/121379), [1.27](https://github.com/kubernetes/kubernetes/pull/118219), and [1.26](https://github.com/kubernetes/kubernetes/pull/121381).
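A simplified sketch of what withdrawing from SSA amounts to, assuming a plain client-go clientset: the condition is written through an ordinary read-modify-`UpdateStatus` sequence rather than a server-side apply request. The real controllers go through shared helpers and patch calls, so this is only an approximation:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// addDisruptionTarget adds (or refreshes) the DisruptionTarget condition
// via a regular status update instead of server-side apply.
func addDisruptionTarget(ctx context.Context, cs kubernetes.Interface, ns, name, reason string) error {
	pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	cond := corev1.PodCondition{
		Type:               corev1.DisruptionTarget,
		Status:             corev1.ConditionTrue,
		Reason:             reason,
		LastTransitionTime: metav1.Now(),
	}
	replaced := false
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == cond.Type {
			pod.Status.Conditions[i] = cond
			replaced = true
			break
		}
	}
	if !replaced {
		pod.Status.Conditions = append(pod.Status.Conditions, cond)
	}
	_, err = cs.CoreV1().Pods(ns).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```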
#### GA
- Address reviews and bug reports from Beta users
- Improve test coverage:
  * unit test for preemption by kube-scheduler, if feasible
  * integration test for re-enabling of the feature gate
- Write a blog post about the feature
- Graduate e2e tests as conformance tests
- Lock the `PodDisruptionConditions` and `JobPodFailurePolicy` feature-gates
@@ -1784,10 +1805,8 @@ in back-to-back releases.
#### Deprecation

In GA+2 release:
- Modify the code to ignore the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates
- Remove the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates
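As a hedged illustration of the gate lifecycle above, a sketch using `k8s.io/component-base/featuregate` (the real registrations live in `pkg/features`; the standalone gate below exists only for this example). Once a spec is marked `LockToDefault`, attempts to turn the gate off via `--feature-gates` are rejected; the GA+2 steps then remove the gate checks and the registrations entirely:

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

const (
	PodDisruptionConditions featuregate.Feature = "PodDisruptionConditions"
	JobPodFailurePolicy     featuregate.Feature = "JobPodFailurePolicy"
)

func main() {
	gates := featuregate.NewFeatureGate()
	// Locked-to-default GA gates: enabled, and no longer possible to disable.
	if err := gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		PodDisruptionConditions: {Default: true, PreRelease: featuregate.GA, LockToDefault: true},
		JobPodFailurePolicy:     {Default: true, PreRelease: featuregate.GA, LockToDefault: true},
	}); err != nil {
		panic(err)
	}
	fmt.Println(gates.Enabled(PodDisruptionConditions), gates.Enabled(JobPodFailurePolicy))
}
```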