Commit b6c64c3

tenzen-y, soltysh, atiratree, mimowo, rytswd authored
BlogPost: Job's Success Policy Goes GA (#49998)
* Add Job SuccessPolicy GA graduation blog Signed-off-by: Yuki Iwai <[email protected]>
* Apply suggestions from code review Co-authored-by: Maciej Szulik <[email protected]>
* Update content/en/blog/_posts/2025-04-23-jobs-successpolicy-goes-ga.md Co-authored-by: Maciej Szulik <[email protected]>
* Describe what is leader-followers pattern Signed-off-by: Yuki Iwai <[email protected]>
* Add mentioning for Indexd Job Signed-off-by: Yuki Iwai <[email protected]>
* Apply suggestions from code review Co-authored-by: Maciej Szulik <[email protected]> Co-authored-by: Filip Křepinský <[email protected]>
* Apply suggestions Signed-off-by: Yuki Iwai <[email protected]>
* Describe leader-follower in advance Signed-off-by: Yuki Iwai <[email protected]>
* Make representation more flexible Signed-off-by: Yuki Iwai <[email protected]>
* Replace example succeededIndexes with 0 Signed-off-by: Yuki Iwai <[email protected]>
* Add what is SuccessCriteriaMet condition Signed-off-by: Yuki Iwai <[email protected]>
* Update content/en/blog/_posts/2025-04-23-jobs-successpolicy-goes-ga.md Co-authored-by: Michał Woźniak <[email protected]>
* Replace workers with followers Signed-off-by: Yuki Iwai <[email protected]>
* Apply suggestions from code review Co-authored-by: Filip Křepinský <[email protected]>
* Fix complete criteria Signed-off-by: Yuki Iwai <[email protected]>
* Mention the Complete condition when all terminating processes is finished Signed-off-by: Yuki Iwai <[email protected]>
* Update content/en/blog/_posts/2025-04-23-jobs-successpolicy-goes-ga.md Co-authored-by: Ryota <[email protected]>
* Apply suggestions from code review Co-authored-by: Maciej Szulik <[email protected]> Co-authored-by: Filip Křepinský <[email protected]>
* Clarify SuccessCriteriaMet process Signed-off-by: Yuki Iwai <[email protected]>
* Apply suggestions from code review Co-authored-by: Tim Bannister <[email protected]>
* Remove 'we' Signed-off-by: Yuki Iwai <[email protected]>
* Update content/en/blog/_posts/2025-04-23-jobs-successpolicy-goes-ga.md Co-authored-by: Tim Bannister <[email protected]>
* Update content/en/blog/_posts/2025-04-23-jobs-successpolicy-goes-ga.md Co-authored-by: Tim Bannister <[email protected]> Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Co-authored-by: Maciej Szulik <[email protected]>
Co-authored-by: Filip Křepinský <[email protected]>
Co-authored-by: Michał Woźniak <[email protected]>
Co-authored-by: Ryota <[email protected]>
Co-authored-by: Tim Bannister <[email protected]>
1 parent a2a70dd commit b6c64c3

1 file changed

+84
-0
lines changed

@@ -0,0 +1,84 @@
---
layout: blog
title: "Kubernetes 1.33: Job's SuccessPolicy Goes GA"
date: 2025-04-23
draft: true
slug: kubernetes-1-33-jobs-success-policy-goes-ga
authors: >
  [Yuki Iwai](https://github.com/tenzen-y) (CyberAgent, Inc)
---

On behalf of the Kubernetes project, I'm pleased to announce that Job _success policy_ has graduated to General Availability (GA) as part of the v1.33 release.

## About Job's Success Policy

In batch workloads, you might want to use leader-follower patterns like [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface),
in which the leader controls the execution, including the followers' lifecycle.

In this case, you might want to mark the Job as succeeded
even if some of the indexes failed. Unfortunately, a leader-follower Kubernetes Job that doesn't use a success policy would, in most cases, require **all** Pods to finish successfully
for that Job to reach an overall succeeded state.

For Kubernetes Jobs, the API allows you to specify early exit criteria using the `.spec.successPolicy`
field (you can only use the `.spec.successPolicy` field for an [indexed Job](/docs/concepts/workloads/controllers/job/#completion-mode)).
The field describes a set of rules; each rule can list the indexes that must succeed, define a minimum number of indexes that must succeed, or both.

This newly stable field is especially valuable for scientific simulation, AI/ML and High-Performance Computing (HPC) batch workloads.
Users in these areas often run numerous experiments and may only need a specific number to complete successfully, rather than requiring all of them to succeed.
In the leader-follower case, the status of the leader index is the only relevant Job exit criterion, and the outcomes for individual follower Pods are handled
only indirectly via the status of the leader index.
Moreover, followers do not know when they can terminate themselves.

After a Job meets any rule of its success policy, the Job is marked as succeeded, and all of its Pods are terminated, including the running ones.

## How it works

The following excerpt from a Job manifest, using `.successPolicy.rules[0].succeededCount`, shows an example of
using a custom success policy:

```yaml
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededCount: 1
```

Here, the Job is marked as succeeded as soon as any one index succeeds, regardless of its index number.
Additionally, you can restrict which indexes count toward `succeededCount` by setting `.successPolicy.rules[0].succeededIndexes`,
as shown below:

```yaml
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededIndexes: "0" # index of the leader Pod (quoted because the field is a string)
    succeededCount: 1
```

This example shows that the Job will be marked as succeeded once a Pod with a specific index (Pod index 0) has succeeded.

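
The way a rule is evaluated can be sketched in a few lines of Python. This is a simplified model, not the Job controller's actual code: the function names `parse_indexes`, `rule_met`, and `policy_met` are hypothetical, and the sketch assumes that a rule with only `succeededIndexes` requires every listed index to succeed and that index lists use the comma and range syntax (for example `0,2-3`).

```python
from typing import Dict, List, Optional, Set


def parse_indexes(spec: str) -> Set[int]:
    """Parse an index list such as "0,2-3" into the set {0, 2, 3}."""
    result: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            result.update(range(int(first), int(last) + 1))
        else:
            result.add(int(part))
    return result


def rule_met(succeeded: Set[int],
             succeeded_indexes: Optional[str] = None,
             succeeded_count: Optional[int] = None) -> bool:
    """Return True if the set of succeeded indexes satisfies one rule."""
    if succeeded_indexes is None and succeeded_count is None:
        raise ValueError("a rule needs succeededIndexes, succeededCount, or both")
    if succeeded_indexes is None:
        # Only a count: any indexes may contribute.
        return len(succeeded) >= succeeded_count
    wanted = parse_indexes(succeeded_indexes)
    matched = succeeded & wanted
    if succeeded_count is None:
        # Only a list: every listed index must have succeeded.
        return matched == wanted
    # Both: enough of the listed indexes must have succeeded.
    return len(matched) >= succeeded_count


def policy_met(succeeded: Set[int], rules: List[Dict]) -> bool:
    """A policy is met as soon as any one of its rules is met."""
    return any(
        rule_met(succeeded, r.get("succeededIndexes"), r.get("succeededCount"))
        for r in rules
    )


# The leader-follower rule from the excerpt above.
leader_rule = [{"succeededIndexes": "0", "succeededCount": 1}]
print(policy_met({3, 5}, leader_rule))  # only followers have succeeded
print(policy_met({0}, leader_rule))     # the leader has succeeded
```

With the leader rule, the first call returns `False` because only follower indexes have succeeded, and the second returns `True` once index 0 has succeeded.
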
Once the Job either meets one of the `successPolicy` rules, or achieves its `Complete` criteria based on `.spec.completions`,
the Job controller within kube-controller-manager adds the `SuccessCriteriaMet` condition to the Job status.
After that, the Job controller initiates cleanup and termination of Pods for Jobs with the `SuccessCriteriaMet` condition.
Eventually, the Job obtains the `Complete` condition once the Job controller has finished cleanup and termination.

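
One way to observe this sequence from a client is to inspect the Job's `.status.conditions`. The sketch below assumes the Job has already been fetched as JSON (for example with `kubectl get job <name> -o json`) and loaded into a Python dict; the function name `success_stage` and the sample data are hypothetical.

```python
def success_stage(job: dict) -> str:
    """Classify a Job by the success-related conditions in its status."""
    conditions = job.get("status", {}).get("conditions", [])
    # Keep only the condition types whose status is "True".
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    if "Complete" in true_types:
        return "complete"
    if "SuccessCriteriaMet" in true_types:
        return "success criteria met, Pods still terminating"
    return "success criteria not met"


# Hypothetical status of a Job whose success policy has just been met.
sample = {
    "status": {
        "conditions": [
            {"type": "SuccessCriteriaMet", "status": "True"},
        ]
    }
}
print(success_stage(sample))
```

Checking `Complete` first reflects the order described above: `SuccessCriteriaMet` appears first, and `Complete` is added only after cleanup has finished.
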
## Learn more

- Read the documentation for
  [success policy](/docs/concepts/workloads/controllers/job/#success-policy).
- Read the KEP for the [Job success/completion policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3998-job-success-completion-policy).

## Get involved

This work was led by the Kubernetes
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch)
in close collaboration with the
[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) community.

If you are interested in working on new features in this space, I recommend
subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch)
channel and attending the regular community meetings.
