Commit cfc66c0

Author: Adhityaa Chandrasekar
blog: introducing suspended jobs
Signed-off-by: Adhityaa Chandrasekar <[email protected]>
1 parent b41a02d commit cfc66c0

1 file changed: +110 -0 lines changed

@@ -0,0 +1,110 @@
---
title: "Introducing Suspended Jobs"
date: 2021-04-12
slug: introducing-suspended-jobs
layout: blog
---

**Author:** Adhityaa Chandrasekar (Google)

[Jobs](/docs/concepts/workloads/controllers/job/) are a crucial part of the
Kubernetes API. While other kinds of workloads such as [Deployments](/docs/concepts/workloads/controllers/deployment/),
[ReplicaSets](/docs/concepts/workloads/controllers/replicaset/),
[StatefulSets](/docs/concepts/workloads/controllers/statefulset/), and
[DaemonSets](/docs/concepts/workloads/controllers/daemonset/)
solve use-cases that require Pods to run forever, Jobs are useful when Pods need
to run to completion. Commonly used in parallel batch processing, Jobs can be
used in a variety of applications ranging from video rendering and database
maintenance to sending bulk emails and scientific computing.

While the amount of parallelism and the conditions for Job completion are
configurable, the Kubernetes API has until now lacked the ability to suspend and
resume Jobs. This is often desired when cluster resources are limited and a
higher-priority Job needs to execute in the place of another Job. Deleting the
lower-priority Job is a poor workaround, as Pod completion history and other
metrics associated with the Job will be lost.

With the recent Kubernetes 1.21 release, you can suspend a Job by
updating its spec. The feature is currently in **alpha** and requires you to
enable the `SuspendJob` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/)
and the [controller manager](/docs/reference/command-line-tools-reference/kube-controller-manager/)
in order to use it.

## API changes

A new boolean field `suspend` is introduced in the Job spec API. Let's say I
create the following Job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  suspend: true
  parallelism: 2
  completions: 10
  template:
    spec:
      containers:
      - name: my-container
        image: busybox
        command: ["sleep", "5"]
      restartPolicy: Never
```

Jobs are not suspended by default, so I'm explicitly setting the `suspend` field
to `true` in the above Job spec. In this example, the Job controller will
refrain from creating Pods until I'm ready to start the Job, which I can do by
updating the field to `false`.
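
Resuming is a one-field update to the Job spec. As a minimal sketch using only the Python standard library, the snippet below builds a patch body that touches nothing but `spec.suspend` and prints an equivalent `kubectl patch` invocation. The helper names `suspend_patch` and `kubectl_patch_command` are made up for illustration, and the command is only printed, never executed.

```python
import json
import shlex


def suspend_patch(suspend: bool) -> dict:
    """Return a patch body that only touches spec.suspend."""
    return {"spec": {"suspend": suspend}}


def kubectl_patch_command(job_name: str, suspend: bool) -> str:
    """Build (but do not run) a kubectl command that applies the patch."""
    body = json.dumps(suspend_patch(suspend), separators=(",", ":"))
    args = ["kubectl", "patch", f"job/{job_name}", "--type=strategic", "--patch", body]
    # shlex.quote wraps the JSON body in single quotes so a shell passes it intact.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Resume the example Job (hypothetical usage; assumes the Job is named my-job).
    print(kubectl_patch_command("my-job", suspend=False))
```

Passing `suspend=True` instead produces the command that suspends the Job again.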

As another example, consider a Job that was created with the `suspend` field
omitted. The Job controller will happily create Pods to work towards Job
completion. However, if I explicitly set the field to `true` with a Job update
before the Job completes, the Job controller will terminate all active Pods and
wait indefinitely for the flag to be flipped back to `false`.
Pod termination is done by sending a SIGTERM signal to all active Pods; the
[graceful termination period](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
defined in the Pod spec will be honoured. Pods terminated this way will not be
counted as failures by the Job controller.

It is important to understand that Pods that succeeded or failed before the
suspension will continue to exist after you suspend a Job. That is, they will
count towards Job completion once you resume it. You can verify this by looking
at the Job's status before and after suspension.
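
To make the bookkeeping concrete, here is a toy model of the behaviour described above. It is a sketch under heavy simplification, not the real Job controller: `ToyJob`, `sync`, and `finish_active` are hypothetical names, Pods are reduced to counters, and every Pod that finishes is assumed to succeed.

```python
from dataclasses import dataclass


@dataclass
class ToyJob:
    """A toy stand-in for a Job; not the real Kubernetes types."""
    parallelism: int
    completions: int
    suspend: bool = False
    active: int = 0
    succeeded: int = 0
    failed: int = 0


def sync(job: ToyJob) -> None:
    """One simplified reconcile pass of a pretend Job controller."""
    if job.suspend:
        # Suspension terminates active Pods without counting them as failures.
        job.active = 0
        return
    remaining = job.completions - job.succeeded
    job.active = min(job.parallelism, max(remaining, 0))


def finish_active(job: ToyJob) -> None:
    """Pretend every active Pod ran to completion successfully."""
    job.succeeded += job.active
    job.active = 0


if __name__ == "__main__":
    job = ToyJob(parallelism=2, completions=10)
    sync(job)
    finish_active(job)   # the first two Pods succeed
    sync(job)            # two more Pods start
    job.suspend = True
    sync(job)            # active Pods are terminated, history is kept
    print(job.active, job.succeeded, job.failed)
    job.suspend = False
    sync(job)            # the Job picks up where it left off
    print(job.active, job.succeeded, job.failed)
```

In this model the succeeded count survives the suspension and the failed count stays at zero, which mirrors what you would expect to see in the real Job's status.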

Read the [documentation](/docs/concepts/workloads/controllers/job#suspending-a-job)
for a full overview of this new feature.

## Where is this useful?

Let's say I'm the operator of a large cluster. I have many users submitting Jobs
to the cluster, but not all Jobs are created equal: some Jobs are more
important than others. Cluster resources aren't infinite either, so all users
must share resources. If all Jobs are created in the suspended state and placed
in a pending queue, I can achieve priority-based Job scheduling by resuming Jobs
in the right order.
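
One way such a pending queue could decide what to resume is sketched below. This is only an illustration: the function `jobs_to_resume`, the `(priority, name, parallelism)` tuple layout, and the idea of counting capacity in "slots" are all assumptions of this sketch, not part of the Kubernetes API.

```python
import heapq
from typing import List, Tuple


def jobs_to_resume(pending: List[Tuple[int, str, int]], free_slots: int) -> List[str]:
    """Pick suspended Jobs to resume, highest priority first.

    Each entry is (priority, name, parallelism); a larger priority wins.
    Selection stops at the first Job that does not fit, so a large
    high-priority Job is not starved by smaller low-priority ones.
    """
    # heapq is a min-heap, so negate the priority to pop the largest first.
    heap = [(-priority, name, parallelism) for priority, name, parallelism in pending]
    heapq.heapify(heap)
    chosen = []
    while heap:
        _, name, parallelism = heap[0]
        if parallelism > free_slots:
            break
        heapq.heappop(heap)
        free_slots -= parallelism
        chosen.append(name)
    return chosen


if __name__ == "__main__":
    queue = [(1, "nightly-report", 2), (10, "fraud-model", 4), (5, "video-render", 3)]
    print(jobs_to_resume(queue, free_slots=8))
```

Each name returned would then have its `suspend` field set to `false`; the rest stay suspended until capacity frees up.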

As another motivating use case, consider a cloud provider where compute
resources are cheaper at night than in the morning. If I have a long-running Job
that takes multiple days to complete, being able to suspend the Job in the
morning and then resume it in the evening every day can reduce costs.
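
The daily decision could be as small as the helper below, which returns the value I would want in `spec.suspend` for a given wall-clock time. The function name `should_suspend` and the 20:00 to 06:00 cheap window are assumptions picked for the example; real pricing windows depend on the provider.

```python
from datetime import time


def should_suspend(
    now: time,
    cheap_start: time = time(20, 0),
    cheap_end: time = time(6, 0),
) -> bool:
    """Return the desired spec.suspend value for the given wall-clock time.

    The cheap window may wrap past midnight (the default is 20:00 to 06:00).
    """
    if cheap_start <= cheap_end:
        in_cheap_window = cheap_start <= now < cheap_end
    else:
        # Window wraps around midnight, e.g. 20:00 -> 06:00.
        in_cheap_window = now >= cheap_start or now < cheap_end
    return not in_cheap_window


if __name__ == "__main__":
    for hour in (3, 9, 23):
        print(hour, should_suspend(time(hour, 0)))
```

A scheduled task could evaluate this periodically and update the Job only when the desired value differs from the current one.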

Since this field is a part of the Job spec, CronJobs automatically get this
feature too.

## References and next steps

If you're interested in a deeper dive into the rationale behind this feature and
the decisions we have taken, consider reading the [enhancement proposal](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2232-suspend-jobs).
There's more detail on suspending and resuming Jobs in the documentation for [Job](/docs/concepts/workloads/controllers/job#suspending-a-job).

As previously mentioned, this feature is currently in alpha and is available
only if you explicitly opt in through the `SuspendJob` feature gate. If this is
a feature you're interested in, please consider testing suspended Jobs in your
cluster and providing feedback. You can discuss this enhancement [on GitHub](https://github.com/kubernetes/enhancements/issues/2232).
The SIG Apps community also [meets regularly](https://github.com/kubernetes/community/tree/master/sig-apps#meetings)
and can be reached through [Slack or the mailing list](https://github.com/kubernetes/community/tree/master/sig-apps#contact).
Barring any unexpected changes to the API, we intend to graduate the feature to
beta in Kubernetes 1.22, so that the feature becomes available by default.
