| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: 'Kubernetes 1.21: CronJob Reaches GA' |
| 4 | +date: 2021-04-09 |
| 5 | +slug: kubernetes-release-1.21-cronjob-ga |
| 6 | +--- |
| 7 | + |
| 8 | + **Authors:** Alay Patel (Red Hat) and Maciej Szulik (Red Hat)
| 9 | + |
| 10 | +In Kubernetes v1.21, the |
| 11 | +[CronJob](/docs/concepts/workloads/controllers/cron-jobs/) resource |
| 12 | +reached general availability (GA). We've also substantially improved the |
| 13 | +performance of CronJobs since Kubernetes v1.19, by implementing a new |
| 14 | +controller. |
| 15 | + |
| 16 | +In Kubernetes v1.20 we launched a revised v2 controller for CronJobs, |
| 17 | +initially as an alpha feature. Kubernetes 1.21 uses the newer controller by |
| 18 | +default, and the CronJob resource itself is now GA (group version: `batch/v1`). |
| 19 | + |
| 20 | +In this article, we'll take you through the driving forces behind this new |
| 21 | +development, give you a brief description of controller design for core |
| 22 | +Kubernetes, and we'll outline what you will gain from this improved controller. |
| 23 | + |
| 24 | +The driving force behind promoting the API was Kubernetes' policy choice to |
| 25 | +[ensure APIs move beyond beta](/blog/2020/08/21/moving-forward-from-beta/). |
| 26 | +That policy aims to prevent APIs from being stuck in a “permanent beta” state. |
| 27 | +Over the years the old CronJob controller implementation had received healthy |
| 28 | +feedback from the community, with reports of several widely recognized |
| 29 | +[issues](https://github.com/kubernetes/kubernetes/issues/82659). |
| 30 | + |
| 31 | +If the beta API for CronJob were to be supported as GA, the existing controller
| 32 | +code would need substantial rework. Instead, the SIG Apps community decided |
| 33 | +to introduce a new controller and gradually replace the old one. |
| 34 | + |
| 35 | +## How do controllers work? |
| 36 | + |
| 37 | +Kubernetes [controllers](/docs/concepts/architecture/controller/) are control |
| 38 | +loops that watch the state of resource(s) in your cluster, then make or |
| 39 | +request changes where needed. Each controller tries to move part of the |
| 40 | +current cluster state closer to the desired state. |
| 41 | + |
| 42 | +The v1 CronJob controller works by performing a periodic poll and sweep of all |
| 43 | +the CronJob objects in your cluster, in order to act on them. It is a single |
| 44 | +worker implementation that gets all CronJobs every 10 seconds, iterates over |
| 45 | +each one of them, and syncs them to their desired state. This was the default |
| 46 | +way of doing things almost 5 years ago when the controller was initially |
| 47 | +written. In hindsight, we can certainly say that such an approach can |
| 48 | +overload the API server at scale. |
| 49 | + |
| 50 | +These days, every core controller in Kubernetes must follow the guidelines
| 51 | +described in [Writing Controllers](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md#readme). |
| 52 | +Among many details, that document prescribes using |
| 53 | +[shared informers](https://www.cncf.io/blog/2019/10/15/extend-kubernetes-via-a-shared-informer/) |
| 54 | +to “receive notifications of adds, updates, and deletes for a particular |
| 55 | +resource”. Upon any such event, the related objects are placed in a queue.
| 56 | +Workers pull items from the queue and process them one at a time. This |
| 57 | +approach ensures consistency and scalability. |
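
The event-to-queue-to-worker flow described above can be sketched with the Go standard library alone. This is a minimal illustration, not the Kubernetes implementation: `keyQueue`, `runWorkers`, and `syncCounts` are hypothetical names, the events are a plain slice rather than real informer notifications, and the sketch omits the tracking of in-flight keys that a production work queue needs.

```go
package main

import (
	"fmt"
	"sync"
)

// keyQueue is a hypothetical, simplified stand-in for a controller work queue.
// It stores object keys (for example "namespace/name") and drops keys that are
// already waiting, so a burst of events for one object is handled once.
type keyQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	order   []string
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *keyQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *keyQueue) Get() (key string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// runWorkers starts n workers that drain the queue, calling reconcile once per
// key. It returns when the queue is empty and every worker has exited.
func runWorkers(q *keyQueue, n int, reconcile func(key string)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				key, ok := q.Get()
				if !ok {
					return
				}
				reconcile(key)
			}
		}()
	}
	wg.Wait()
}

// syncCounts replays a list of simulated events and reports how many times
// each key was reconciled.
func syncCounts(events []string, workers int) map[string]int {
	q := newKeyQueue()
	for _, key := range events {
		q.Add(key) // what an add/update/delete event handler would do
	}
	var mu sync.Mutex
	counts := make(map[string]int)
	runWorkers(q, workers, func(key string) {
		mu.Lock()
		defer mu.Unlock()
		counts[key]++
	})
	return counts
}

func main() {
	events := []string{"default/backup", "default/report", "default/backup"}
	counts := syncCounts(events, 2)
	fmt.Println(len(counts), counts["default/backup"], counts["default/report"]) // 2 1 1
}
```

Queuing keys rather than whole objects means a worker always reconciles against the latest known state, and several events for the same object collapse into a single sync.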
| 58 | + |
| 59 | +The picture below shows the flow of information from the Kubernetes API server,
| 60 | +through shared informers and a queue, to the main part of a controller: a
| 61 | +reconciliation loop that is responsible for performing the core functionality.
| 62 | + |
| 63 | + |
| 64 | + |
| 65 | +The CronJob controller V2 uses a queue that implements the DelayingInterface to |
| 66 | +handle the scheduling aspect. This queue allows processing an element after a |
| 67 | +specific time interval. Every time there is a change in a CronJob or its related |
| 68 | +Jobs, the key that represents the CronJob is pushed to the queue. The main |
| 69 | +handler pops the key, processes the CronJob, and after completion |
| 70 | +pushes the key back into the queue for the next scheduled time interval. This
| 71 | +design is more performant from the outset, as it no longer requires a linear
| 72 | +scan of all the CronJobs. On top of that, this controller can be scaled by |
| 73 | +increasing the number of workers processing the CronJobs in parallel. |
| 74 | + |
| 75 | +## Performance impact of the new controller {#performance-impact} |
| 76 | + |
| 77 | +To compare the performance of the two controllers, we set up a single-node
| 78 | +Kubernetes cluster on a VM instance with 128 GiB RAM and 64 vCPUs.
| 79 | +Initially, we created a sample workload of 20 CronJobs scheduled to run
| 80 | +every minute and 2100 CronJobs scheduled to run every 20 hours. Over the
| 81 | +next few minutes we added further batches of 1000 CronJobs, each scheduled to
| 82 | +run every 20 hours, until we reached a total of 5120 CronJobs.
| 83 | + |
| 84 | + |
| 85 | + |
| 86 | +We observed that for every 1000 CronJobs added, the old controller used |
| 87 | +around 90 to 120 seconds more wall-clock time to schedule 20 Jobs every cycle. |
| 88 | +That is, at 5120 CronJobs, the old controller took approximately 9 minutes |
| 89 | +to create 20 Jobs. Hence, during each cycle, about 8 schedules were missed. |
| 90 | +The new controller, implemented with the architectural change explained above,
| 91 | +created 20 Jobs without any delay, even when we created an additional batch |
| 92 | +of 1000 CronJobs reaching a total of 6120. |
| 93 | + |
| 94 | +As a closing remark, the new controller exposes a histogram metric |
| 95 | +`cronjob_controller_cronjob_job_creation_skew_duration_seconds` which helps |
| 96 | +monitor the time difference between when a CronJob is meant to run and when |
| 97 | +the actual Job is created. |
| 98 | + |
| 99 | +Hopefully the above description is a sufficient argument to follow the |
| 100 | +guidelines and standards set in the Kubernetes project, even for your own |
| 101 | +controllers. As mentioned before, the new controller is on by default starting |
| 102 | +from Kubernetes v1.21; if you want to check it out in the previous release (1.20), |
| 103 | +you can enable the `CronJobControllerV2` |
| 104 | +[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) |
| 105 | +for the kube-controller-manager: `--feature-gates="CronJobControllerV2=true"`.