@@ -169,7 +170,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
+- [x] "Implementation History" section is up-to-date for milestone
 - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
@@ -674,7 +675,8 @@ The Pod status (which includes the `conditions` field and the container exit
 codes) could be lost if the failed pod is garbage collected.
 
 Losing Pod's status before it is interpreted by Job Controller can be prevented
-by using the feature of [job tracking with finalizers](https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/).
+by using the feature of [job tracking with finalizers](https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/)
+(for design details, see [Interim FailureTarget condition](#interim-failuretarget-condition)).
 
 #### Evolving condition types
 
@@ -739,13 +741,33 @@ pod delete requests are issued to modify the code to also append a meaningful
 condition with dedicated `Type`, `Reason` and `Message` fields based on the
 invocation context.
 
+### Interim FailureTarget condition
+
+There is a risk of losing the Pod status information due to PodGC, which could
+prevent the Job Controller from reacting to a pod failure according to the
+configured pod failure policy rules (see also: [Garbage collected pods](#garbage-collected-pods)).
+
+To make sure all pods are checked against the rules, we require the
+feature of [job tracking with finalizers](https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/)
+to be enabled.
+
+Additionally, before we actually remove the finalizers from the pods
+(allowing them to be deleted by PodGC), we record the determined job failure
+message (if any rule with `JobFail` matched) in an interim job condition, called
+`FailureTarget`. Once the pod finalizers are removed, we update the job status
+with the final `Failed` job condition. This strategy eliminates a possible
+race condition in which we could lose the information about the job failure if
+the Job Controller crashed between removing the pod finalizers and updating the
+final `Failed` condition in the job status.
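
Editorial note (not part of the diff): the ordering described in the added section can be sketched as a small, self-contained Go program. This is only an illustration under stated assumptions, not the Job Controller's actual code. The types `Job`, `Pod` and `Condition`, the function `syncFailedJob`, and the `crashAfterFinalizers` flag are hypothetical stand-ins and are not Kubernetes API types. The sketch also keeps the `FailureTarget` condition in the list after `Failed` is added, which is a simplification of the sketch rather than something the text specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for illustration only.
// These are NOT the real Kubernetes API types.
type Condition struct {
	Type    string
	Message string
}

type Pod struct {
	Name         string
	HasFinalizer bool
}

type Job struct {
	Conditions []Condition
	Pods       []*Pod
}

// condition returns the first condition of the given type, if present.
func (j *Job) condition(t string) (Condition, bool) {
	for _, c := range j.Conditions {
		if c.Type == t {
			return c, true
		}
	}
	return Condition{}, false
}

// syncFailedJob sketches the three-step ordering. failureMsg is the message
// determined from a matched JobFail rule ("" if the pod status is no longer
// available). crashAfterFinalizers simulates a controller crash between
// step 2 and step 3.
func syncFailedJob(job *Job, failureMsg string, crashAfterFinalizers bool) error {
	// Step 1: record the interim FailureTarget condition before touching pods.
	if _, ok := job.condition("FailureTarget"); !ok && failureMsg != "" {
		job.Conditions = append(job.Conditions,
			Condition{Type: "FailureTarget", Message: failureMsg})
	}
	target, ok := job.condition("FailureTarget")
	if !ok {
		return nil // no failure recorded; nothing to finalize
	}

	// Step 2: remove the pod finalizers, which allows PodGC to delete the pods.
	for _, p := range job.Pods {
		p.HasFinalizer = false
	}
	if crashAfterFinalizers {
		return errors.New("simulated controller crash")
	}

	// Step 3: add the final Failed condition, reusing the recorded message.
	if _, done := job.condition("Failed"); !done {
		job.Conditions = append(job.Conditions,
			Condition{Type: "Failed", Message: target.Message})
	}
	return nil
}

func main() {
	job := &Job{Pods: []*Pod{{Name: "pod-0", HasFinalizer: true}}}

	// The first sync crashes after the finalizers are removed.
	_ = syncFailedJob(job, "exit code 42 matched a JobFail rule", true)

	// After the restart the pod has been garbage collected, so the controller
	// can no longer derive the message from the pod status.
	job.Pods = nil
	_ = syncFailedJob(job, "", false)

	failed, _ := job.condition("Failed")
	fmt.Println(failed.Message)
}
```

In this sketch the second sync still produces a `Failed` condition with the original message, because the message was stored in `FailureTarget` before the finalizers were removed.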
+
 ### JobSpec API
 
 We extend the Job API in order to allow to apply different actions depending
 on the conditions associated with the pod failure.
 
```golang
// PodFailurePolicyAction specifies how a Pod failure is handled.