| 1 | +--- |
| 2 | +title: "Introducing Suspended Jobs" |
| 3 | +date: 2021-04-12 |
| 4 | +slug: introducing-suspended-jobs |
| 5 | +layout: blog |
| 6 | +--- |
| 7 | + |
| 8 | +**Author:** Adhityaa Chandrasekar (Google) |
| 9 | + |
| 10 | +[Jobs](/docs/concepts/workloads/controllers/job/) are a crucial part of |
| 11 | +the Kubernetes API. While other kinds of workloads such as [Deployments](/docs/concepts/workloads/controllers/deployment/), |
| 12 | +[ReplicaSets](/docs/concepts/workloads/controllers/replicaset/), |
| 13 | +[StatefulSets](/docs/concepts/workloads/controllers/statefulset/), and |
| 14 | +[DaemonSets](/docs/concepts/workloads/controllers/daemonset/) |
| 15 | +solve use-cases that require Pods to run forever, Jobs are useful when Pods need |
| 16 | +to run to completion. Commonly used in parallel batch processing, Jobs can be |
| 17 | +used in a variety of applications ranging from video rendering and database |
| 18 | +maintenance to sending bulk emails and scientific computing. |
| 19 | + |
| 20 | +While the amount of parallelism and the conditions for Job completion are |
| 21 | +configurable, the Kubernetes API lacked the ability to suspend and resume Jobs. |
| 22 | +This is often desired when cluster resources are limited and a higher priority |
| 23 | +Job needs to execute in the place of another Job. Deleting the lower priority |
| 24 | +Job is a poor workaround as Pod completion history and other metrics associated |
| 25 | +with the Job will be lost. |
| 26 | + |
| 27 | +With the recent Kubernetes 1.21 release, you can now suspend a Job by |
| 28 | +updating its spec. The feature is currently in **alpha** and requires you to |
| 29 | +enable the `SuspendJob` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) |
| 30 | +on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/) |
| 31 | +and the [controller manager](/docs/reference/command-line-tools-reference/kube-controller-manager/) |
| 32 | +in order to use it. |
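As one way to turn the gate on, here is a minimal sketch that assumes a cluster bootstrapped with kubeadm. The use of kubeadm and of this configuration file is an assumption for illustration; on other setups the equivalent is passing `--feature-gates=SuspendJob=true` to both components.

```yaml
# Sketch of a kubeadm ClusterConfiguration (assumes kubeadm; not required by the feature).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "SuspendJob=true"   # enable the alpha gate on the API server
controllerManager:
  extraArgs:
    feature-gates: "SuspendJob=true"   # and on the controller manager
```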
| 33 | + |
| 34 | +## API changes |
| 35 | + |
| 36 | +A new boolean field `suspend` is introduced in the Job spec API. Let's say I |
| 37 | +create the following Job: |
| 38 | + |
| 39 | +```yaml |
| 40 | +apiVersion: batch/v1 |
| 41 | +kind: Job |
| 42 | +metadata: |
| 43 | +  name: my-job |
| 44 | +spec: |
| 45 | +  suspend: true |
| 46 | +  parallelism: 2 |
| 47 | +  completions: 10 |
| 48 | +  template: |
| 49 | +    spec: |
| 50 | +      containers: |
| 51 | +      - name: my-container |
| 52 | +        image: busybox |
| 53 | +        command: ["sleep", "5"] |
| 54 | +      restartPolicy: Never |
| 55 | +``` |
| 56 | + |
| 57 | +Jobs are not suspended by default, so I'm explicitly setting the `suspend` field |
| 58 | +to true in the above Job spec. With this setting, the Job controller will |
| 59 | +refrain from creating Pods until I'm ready to start the Job, which I can do by |
| 60 | +updating the field to false. |
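To make the resume step concrete: the change amounts to flipping a single field. The fragment below is a sketch of the only part of the spec that needs to change for `my-job`; how you submit the update (for example, by editing the Job or applying a patch) is up to you.

```yaml
# Partial Job spec: the one field that changes when resuming my-job.
spec:
  suspend: false
```

Every other field of the Job, including the Pod template, stays as it was when the Job was created.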
| 61 | + |
| 62 | +As another example, consider a Job that was created with the `suspend` field |
| 63 | +omitted. The Job controller will happily create Pods to work towards Job |
| 64 | +completion. However, before the Job completes, if I explicitly set the field to |
| 65 | +true with a Job update, the Job controller will terminate all active Pods |
| 66 | +and will wait indefinitely for the flag to be flipped back to false. |
| 67 | +Pod termination is done by sending a SIGTERM signal to all active Pods; the |
| 68 | +[graceful termination period](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination) |
| 69 | +defined in the Pod spec will be honoured. Pods terminated this way will not be |
| 70 | +counted as failures by the Job controller. |
| 71 | + |
| 72 | +It is important to understand that succeeded and failed Pods from the past will |
| 73 | +continue to exist after you suspend a Job. That is, they will count towards |
| 74 | +Job completion once you resume it. You can verify this by looking at the Job's |
| 75 | +status before and after suspension. |
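As a rough illustration of what to look for, the status of a suspended Job might resemble the sketch below. The counts and timestamp are made up for this example, and the `Suspended` condition shown here reflects my understanding of the Job API rather than output captured from a real cluster.

```yaml
# Illustrative status of a suspended Job; all values are invented for the example.
status:
  succeeded: 4                 # Pods that finished before suspension still count
  conditions:
  - type: Suspended
    status: "True"
    reason: JobSuspended
    lastTransitionTime: "2021-04-12T10:00:00Z"
```

After the Job is resumed, the `succeeded` count should carry over and keep growing towards `completions`.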
| 76 | + |
| 77 | +Read the [documentation](/docs/concepts/workloads/controllers/job#suspending-a-job) |
| 78 | +for a full overview of this new feature. |
| 79 | + |
| 80 | +## Where is this useful? |
| 81 | + |
| 82 | +Let's say I'm the operator of a large cluster. I have many users submitting Jobs |
| 83 | +to the cluster, but not all Jobs are created equal: some Jobs are more |
| 84 | +important than others. Cluster resources aren't infinite either, so all users |
| 85 | +must share resources. If all Jobs are created in the suspended state and placed |
| 86 | +in a pending queue, I can achieve priority-based Job scheduling by resuming Jobs |
| 87 | +in the right order. |
| 88 | + |
| 89 | +As another motivating use-case, consider a cloud provider where compute |
| 90 | +resources are cheaper at night than in the morning. If I have a long-running Job |
| 91 | +that takes multiple days to complete, being able to suspend the Job in the |
| 92 | +morning and then resume it in the evening every day can reduce costs. |
| 93 | + |
| 94 | +Since this field is part of the Job spec, CronJobs get this feature for free |
| 95 | +too. |
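For instance, a CronJob could create each of its Jobs in the suspended state by setting the field inside its Job template. This is a minimal sketch: the name and schedule are hypothetical, and it assumes the `batch/v1` CronJob API available in Kubernetes 1.21 (older clusters would use `batch/v1beta1`).

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob              # hypothetical name
spec:
  schedule: "0 20 * * *"        # illustrative: every day at 20:00
  jobTemplate:
    spec:
      suspend: true             # each Job created from this template starts suspended
      template:
        spec:
          containers:
          - name: my-container
            image: busybox
            command: ["sleep", "5"]
          restartPolicy: Never
```

Note that this is distinct from the CronJob's own top-level `spec.suspend` field, which stops new Jobs from being scheduled at all.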
| 96 | + |
| 97 | +## References and next steps |
| 98 | + |
| 99 | +If you're interested in a deeper dive into the rationale behind this feature and |
| 100 | +the decisions we have taken, consider reading the [enhancement proposal](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2232-suspend-jobs). |
| 101 | +There's more detail on suspending and resuming Jobs in the documentation for [Job](/docs/concepts/workloads/controllers/job#suspending-a-job). |
| 102 | + |
| 103 | +As previously mentioned, this feature is currently in alpha and is available |
| 104 | +only if you explicitly opt in through the `SuspendJob` feature gate. If this is |
| 105 | +a feature you're interested in, please consider testing suspended Jobs in your |
| 106 | +cluster and providing feedback. You can discuss this enhancement [on GitHub](https://github.com/kubernetes/enhancements/issues/2232). |
| 107 | +The SIG Apps community also [meets regularly](https://github.com/kubernetes/community/tree/master/sig-apps#meetings) |
| 108 | +and can be reached through [Slack or the mailing list](https://github.com/kubernetes/community/tree/master/sig-apps#contact). |
| 109 | +Barring any unexpected changes to the API, we intend to graduate the feature to |
| 110 | +beta in Kubernetes 1.22, so that the feature becomes available by default. |