|
| 1 | +# KEP-4026: Add job creation timestamp to job annotations |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 5 | +- [Summary](#summary) |
| 6 | +- [Motivation](#motivation) |
| 7 | + - [Goals](#goals) |
| 8 | + - [Non-Goals](#non-goals) |
| 9 | +- [Proposal](#proposal) |
| 10 | + - [User Stories (Optional)](#user-stories-optional) |
| 11 | + - [Story 1](#story-1) |
| 12 | + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) |
| 13 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 14 | +- [Design Details](#design-details) |
| 15 | + - [Test Plan](#test-plan) |
| 16 | + - [Prerequisite testing updates](#prerequisite-testing-updates) |
| 17 | + - [Unit tests](#unit-tests) |
| 18 | + - [Integration tests](#integration-tests) |
| 19 | + - [e2e tests](#e2e-tests) |
| 20 | + - [Graduation Criteria](#graduation-criteria) |
| 21 | + - [Beta](#beta) |
| 22 | + - [GA](#ga) |
| 23 | + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 24 | + - [Version Skew Strategy](#version-skew-strategy) |
| 25 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 26 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 27 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 28 | + - [Monitoring Requirements](#monitoring-requirements) |
| 29 | + - [Dependencies](#dependencies) |
| 30 | + - [Scalability](#scalability) |
| 31 | + - [Troubleshooting](#troubleshooting) |
| 32 | +- [Implementation History](#implementation-history) |
| 33 | +- [Drawbacks](#drawbacks) |
| 34 | +- [Alternatives](#alternatives) |
| 35 | +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) |
| 36 | +<!-- /toc --> |
| 37 | + |
| 38 | +## Release Signoff Checklist |
| 39 | + |
| 40 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 41 | + |
| 42 | +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 43 | +- [X] (R) KEP approvers have approved the KEP status as `implementable` |
| 44 | +- [X] (R) Design details are appropriately documented |
| 45 | +- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) |
| 46 | + - [ ] e2e Tests for all Beta API Operations (endpoints) |
| 47 | + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
| 48 | + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free |
| 49 | +- [X] (R) Graduation criteria is in place |
| 50 | + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
| 51 | +- [ ] (R) Production readiness review completed |
| 52 | +- [ ] (R) Production readiness review approved |
| 53 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 54 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 55 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 56 | + |
| 57 | +[kubernetes.io]: https://kubernetes.io/ |
| 58 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 59 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 60 | +[kubernetes/website]: https://git.k8s.io/website |
| 61 | + |
| 62 | +## Summary |
| 63 | + |
| 64 | +Currently, there is no supported way to get the original/expected scheduled timestamp for a job created from a CronJob. This KEP proposes to set the original scheduled time as an annotation in the job metadata. |
| 65 | + |
| 66 | +## Motivation |
| 67 | + |
| 68 | +### Goals |
| 69 | + |
| 70 | +- Set job scheduled timestamp as an annotation on the job. |
| 71 | +- Adding the annotation should not be disruptive to existing workloads. |
| 72 | + |
| 73 | +### Non-Goals |
| 74 | + |
| 75 | +## Proposal |
| 76 | + |
| 77 | +At a high level, the proposal is to modify the CronJob controller to set the job scheduled timestamp as a job annotation. The details of this are outlined in the Design Details section below. |
| 78 | + |
| 79 | +Job scheduled timestamp annotation: `batch.kubernetes.io/cronjob-scheduled-timestamp` |
| 80 | + |
| 81 | +### User Stories (Optional) |
| 82 | + |
| 83 | +#### Story 1 |
| 84 | + |
| 85 | +As a user, I would like to read from a job the timestamp at which it was originally scheduled to run. |
| 86 | + |
| 87 | +### Notes/Constraints/Caveats (Optional) |
| 88 | + |
| 89 | +### Risks and Mitigations |
| 90 | + |
| 91 | +The CronJob controller always works on the assumption that changes apply only to jobs created after the change. Therefore, the annotation will be injected only into Jobs newly created from CronJobs while the feature is on. This plays well with downgrades and doesn't introduce unnecessary complexity. |
| 92 | + |
| 93 | +## Design Details |
| 94 | + |
| 95 | +The CronJob controller will only need a minor update to the [getJobFromTemplate2](https://github.com/kubernetes/kubernetes/blob/7024beeeeb1f2e4cde93805a137cd7ad92fec466/pkg/controller/cronjob/utils.go#L188) function, to add the job scheduled timestamp as the job annotation `batch.kubernetes.io/cronjob-scheduled-timestamp`. The scheduled timestamp is represented in `RFC3339`. |
| 96 | + |
| 97 | +For the scheduled timestamp's timezone, the initial thought was to use `UTC`, since it is the primary timezone and causes the least confusion. However, since the `CronJob` object has a `spec.timeZone`, it is better to use the same timezone within the same object. If `spec.timeZone` is not set or is `nil`, the annotation will use the `UTC` timezone as a default. |
| 98 | + |
| 99 | +### Test Plan |
| 100 | + |
| 101 | +- [X] I/we understand the owners of the involved components may require updates to |
| 102 | +existing tests to make this code solid enough prior to committing the changes necessary |
| 103 | +to implement this enhancement. |
| 104 | + |
| 105 | +##### Prerequisite testing updates |
| 106 | + |
| 107 | + |
| 108 | +##### Unit tests |
| 109 | + |
| 110 | +- `k8s.io/kubernetes/pkg/controller/cronjob`: `05/22/2023` - `96.2%` |
| 111 | + |
| 112 | +##### Integration tests |
| 113 | + |
| 114 | +- Unit tests will ensure the new annotation is correctly added to jobs. |
| 115 | +- The integration test should ensure the annotation is present when the feature is on and missing when off. It will also verify that the annotation is only added to newly created jobs, not to existing workloads. |
| 116 | + |
| 117 | +##### e2e tests |
| 118 | + |
| 119 | +E2E tests will not provide any additional coverage that isn't already covered by unit + integration tests, since we are simply adding an annotation, so no e2e tests will be necessary for this change. |
| 120 | + |
| 121 | +### Graduation Criteria |
| 122 | + |
| 123 | +The feature will be released directly in Beta state. There is no benefit in having an alpha release: we are simply adding a new annotation, so there is very little risk. |
| 124 | + |
| 125 | +#### Beta |
| 126 | + |
| 127 | +- Feature implemented behind the `CronJobsScheduledAnnotation` feature gate. |
| 128 | +- Unit and integration tests passing. |
| 129 | + |
| 130 | +#### GA |
| 131 | + |
| 132 | +Fix any potentially reported bugs. |
| 133 | + |
| 134 | +### Upgrade / Downgrade Strategy |
| 135 | + |
| 136 | +No changes required to existing cluster to use this feature. |
| 137 | + |
| 138 | +### Version Skew Strategy |
| 139 | + |
| 140 | +N/A. This feature doesn't require coordination between control plane components; |
| 141 | +the changes to each controller are self-contained. |
| 142 | + |
| 143 | +## Production Readiness Review Questionnaire |
| 144 | + |
| 145 | + |
| 146 | +### Feature Enablement and Rollback |
| 147 | + |
| 148 | + |
| 149 | +###### How can this feature be enabled / disabled in a live cluster? |
| 150 | + |
| 151 | + |
| 152 | +- [X] Feature gate (also fill in values in `kep.yaml`) |
| 153 | + - Feature gate name: `CronJobsScheduledAnnotation` |
| 154 | + - Components depending on the feature gate: `kube-controller-manager` |
| 155 | +- [ ] Other |
| 156 | + - Describe the mechanism: N/A. |
| 157 | + - Will enabling / disabling the feature require downtime of the control |
| 158 | + plane? No |
| 159 | + - Will enabling / disabling the feature require downtime or re-provisioning of a node? No |
| 160 | + |
| 161 | +###### Does enabling the feature change any default behavior? |
| 162 | + |
| 163 | +Jobs newly created by the CronJob controller will contain a new annotation, `batch.kubernetes.io/cronjob-scheduled-timestamp`. |
| 164 | + |
| 165 | +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? |
| 166 | + |
| 167 | +Yes. If the feature gate is disabled, the CronJob controller will not add the |
| 168 | +scheduled timestamp as an annotation. |
| 169 | + |
| 170 | +###### What happens if we reenable the feature if it was previously rolled back? |
| 171 | + |
| 172 | +The CronJob controller will begin adding the scheduled timestamp as an annotation to jobs created while the feature is enabled, and existing jobs will be unaffected. |
| 173 | + |
| 174 | +###### Are there any tests for feature enablement/disablement? |
| 175 | + |
| 176 | +Given the feature results in adding an annotation only to newly created objects, those tests won't really be different from the actual feature tests. |
| 177 | + |
| 178 | +### Rollout, Upgrade and Rollback Planning |
| 179 | + |
| 180 | +###### How can a rollout or rollback fail? Can it impact already running workloads? |
| 181 | + |
| 182 | +This change does not introduce any new way for a rollout or rollback to fail. It also will not impact already running workloads. |
| 183 | + |
| 184 | +###### What specific metrics should inform a rollback? |
| 185 | + |
| 186 | +- Users can monitor the CronJob metrics `job_creation_skew_duration_seconds`, `cronjob_controller_rate_limiter_use`, and `cronjob_job_creation_skew`. |
| 187 | + |
| 188 | +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
| 189 | + |
| 190 | +The feature will be tested manually prior to beta launch. |
| 191 | + |
| 192 | +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? |
| 193 | + |
| 194 | +No. |
| 195 | + |
| 196 | +### Monitoring Requirements |
| 197 | + |
| 198 | + |
| 199 | +###### How can an operator determine if the feature is in use by workloads? |
| 200 | + |
| 201 | +Checking a sample of jobs created by CronJobs for the annotation `batch.kubernetes.io/cronjob-scheduled-timestamp` is sufficient to determine whether the feature is in use, particularly on small clusters. For monitoring purposes, we can rely on pre-existing metrics that track both the cronjob queue and the job creation skew, which should provide sufficient signal about whether the controller is working as expected. |
| 202 | + |
| 203 | +###### How can someone using this feature know that it is working for their instance? |
| 204 | + |
| 205 | +- [ ] Events |
| 206 | + - Event Reason: |
| 207 | +- [X] API .metadata |
| 208 | + - Condition name: |
| 209 | + - Other field: |
| 210 | + - `.metadata.annotations['batch.kubernetes.io/cronjob-scheduled-timestamp']` |
| 211 | + |
| 212 | +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? |
| 213 | + |
| 214 | +- 99th percentile over a day for Job syncs is <= 15s for a client-side 50 QPS limit. |
| 215 | + |
| 216 | +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? |
| 217 | + |
| 218 | +- [X] Metrics |
| 219 | + - Metric name: cronjob_job_creation_skew |
| 220 | + - Components exposing the metric: kube-controller-manager |
| 221 | + - Metric name: job_creation_skew_duration_seconds |
| 222 | + - Components exposing the metric: kube-controller-manager |
| 223 | + |
| 224 | +###### Are there any missing metrics that would be useful to have to improve observability of this feature? |
| 225 | + |
| 226 | +No. |
| 227 | + |
| 228 | +### Dependencies |
| 229 | + |
| 230 | +###### Does this feature depend on any specific services running in the cluster? |
| 231 | + |
| 232 | +No. |
| 233 | + |
| 234 | +### Scalability |
| 235 | + |
| 236 | +###### Will enabling / using this feature result in any new API calls? |
| 237 | + |
| 238 | +No. |
| 239 | + |
| 240 | +###### Will enabling / using this feature result in introducing new API types? |
| 241 | + |
| 242 | +No. |
| 243 | + |
| 244 | +###### Will enabling / using this feature result in any new calls to the cloud provider? |
| 245 | + |
| 246 | +No. |
| 247 | + |
| 248 | +###### Will enabling / using this feature result in increasing size or count of the existing API objects? |
| 249 | + |
| 250 | +Yes, each job created by the CronJob controller will have an additional annotation containing an `RFC3339` timestamp, which together with the annotation name adds ~70B per job object. |
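The ~70B estimate can be checked with a short calculation. The sketch below counts only the raw bytes of the annotation key and value and ignores any serialization or storage overhead; the helper name `annotationSize` and the sample timestamps are illustrative.

```go
package main

import "fmt"

// Annotation key taken from the proposal.
const scheduledTimestampAnnotation = "batch.kubernetes.io/cronjob-scheduled-timestamp"

// annotationSize is a hypothetical helper that returns the raw byte count
// of an annotation's key plus its value.
func annotationSize(key, value string) int {
	return len(key) + len(value)
}

func main() {
	// UTC timestamps end in "Z"; zoned timestamps carry a numeric offset.
	fmt.Println(annotationSize(scheduledTimestampAnnotation, "2023-06-06T12:00:00Z"))      // 67
	fmt.Println(annotationSize(scheduledTimestampAnnotation, "2023-06-06T14:00:00+02:00")) // 72
}
```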
| 251 | + |
| 252 | +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? |
| 253 | + |
| 254 | +No. |
| 255 | + |
| 256 | +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? |
| 257 | + |
| 258 | +No. |
| 259 | + |
| 260 | +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? |
| 261 | + |
| 262 | +No. |
| 263 | + |
| 264 | +### Troubleshooting |
| 265 | + |
| 266 | +###### How does this feature react if the API server and/or etcd is unavailable? |
| 267 | + |
| 268 | +No change compared to existing failure modes. |
| 269 | + |
| 270 | +###### What are other known failure modes? |
| 271 | + |
| 272 | +N/A |
| 273 | + |
| 274 | +###### What steps should be taken if SLOs are not being met to determine the problem? |
| 275 | + |
| 276 | +## Implementation History |
| 277 | + |
| 278 | +- 2023-06-06: KEP published |
| 279 | + |
| 280 | +## Drawbacks |
| 281 | + |
| 282 | +## Alternatives |
| 283 | + |
| 284 | +- Add label instead of annotation |
| 285 | + - Labels are unnecessary because the timestamp is purely informational: it won't be used to search for objects or to match selectors. |
| 286 | + |
| 287 | +- Add a status field |
| 288 | + - The object already has the `CreationTimestamp` field, but that field holds the time at which the job was actually created. The point of the new annotation is to pass the original/expected scheduled timestamp information. |
| 289 | + |
| 290 | +## Infrastructure Needed (Optional) |