Commit ff9a1de

cronjob timezone to beta
1 parent 737efb1 commit ff9a1de

3 files changed: +70 -166 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 3140
 alpha:
   approver: deads2k
+beta:
+  approver: deads2k

keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md

Lines changed: 66 additions & 164 deletions
@@ -13,6 +13,10 @@
 - [CronJob API](#cronjob-api)
 - [CronJob controller](#cronjob-controller)
 - [Test Plan](#test-plan)
+- [Prerequisite testing updates](#prerequisite-testing-updates)
+- [Unit tests](#unit-tests)
+- [Integration tests](#integration-tests)
+- [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
 - [Beta](#beta)
@@ -39,17 +43,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] e2e Tests for all Beta API Operations (endpoints)
+- [x] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [x] (R) Graduation criteria is in place
+- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 <!--
 **Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -159,14 +163,29 @@ In all other cases the controller will maintain the current behavior.

 ### Test Plan

-Unit and integration tests covering the time zone mechanics of CronJob, including:
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.

-- defaulting
-- validation
-- creating CronJob
-- updating CronJob
+##### Prerequisite testing updates

-Additionally, all of tests will be performed with feature gate enabled and disabled.
+1. Add tests ensuring that case insensitive location loading is properly handled.
+See [beta requirements](#beta) for more details.
+2. Add at least integration and optionally e2e covering TimeZone usage.
+
+##### Unit tests
+
+- `k8s.io/kubernetes/pkg/apis/batch/validation`: `2022-06-09` - `94.4%`
+- `k8s.io/kubernetes/pkg/controller/cronjob`: `2022-06-09` - `50.8%`
+- `k8s.io/kubernetes/pkg/registry/batch/cronjob`: `2022-06-09` - `61.8%`
+
+##### Integration tests
+
+None.
+
+##### e2e tests
+
+None.

 ### Graduation Criteria

@@ -182,8 +201,6 @@ Additionally, all of tests will be performed with feature gate enabled and disab
 - Test skipped on MacOS (https://github.com/kubernetes/kubernetes/pull/109218)
 - Golang issue (https://github.com/golang/go/issues/21512)

-More TBD
-
 #### GA

 TBD
@@ -251,7 +268,6 @@ This feature has no node runtime implications.

 ###### How can this feature be enabled / disabled in a live cluster?

-
 - [x] Feature gate (also fill in values in `kep.yaml`)
 - Feature gate name: CronJobTimeZone
 - Components depending on the feature gate: kube-apiserver, kube-controller-manager
@@ -279,151 +295,62 @@ Yes, both units and integration tests for enablement, disablement and transition

 ### Rollout, Upgrade and Rollback Planning

-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-<!--
-Try to be as paranoid as possible - e.g., what if some components will restart
-mid-rollout?
-
-Be sure to consider highly-available clusters, where, for example,
-feature flags will be enabled on some API servers and not others during the
-rollout. Similarly, consider large clusters and how enablement/disablement
-will rollout across nodes.
--->
-
 An upgrade flow can be vulnerable to the enable, disable, enable if you have
 a lease that is acquired by a new kube-controller-manager, then an old
 kube-controller-manager, then a new kube-controller-manager.

 ###### What specific metrics should inform a rollback?

-<!--
-What signals should users be paying attention to when the feature is young
-that might indicate a serious problem?
--->
+Increased `cronjob_job_creation_skew` which tracks how much a job creation
+is delayed compared to requested time slot.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-<!--
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
--->
+Upgrade->downgrade->upgrade path was manually tested. No issues were found during tests.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

-<!--
-Even if applying deprecation policies, they may still surprise some users.
--->
+No.

 ### Monitoring Requirements

-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### How can an operator determine if the feature is in use by workloads?

-<!--
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
--->
+There's no explicit metric for TimeZone but operator should monitor `cronjob_job_creation_skew`,
+ensuring the job creation skew is not increasing.

 ###### How can someone using this feature know that it is working for their instance?

-<!--
-For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
-for each individual pod.
-Pick one more of these and delete the rest.
-Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
-and operation of this feature.
-Recall that end users cannot usually observe component logs or access metrics.
--->
-
-- [ ] Events
-- Event Reason:
-- [ ] API .status
-- Condition name:
-- Other field:
-- [ ] Other (treat as last resort)
-- Details:
+- [x] Events
+- Event Reason: `UnknownTimeZone` when specified TimeZone is not correct

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-<!--
-This is your opportunity to define what "normal" quality of service looks like
-for a feature.
-
-It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-job creation time) for cron job <= 10%
-- 99.9% of /health requests per day finish with 200 code
-
-These goals will help you determine what you need to measure (SLIs) in the next
-question.
--->
+99th percentile of cron_job_creation_skew <= 5 seconds per cluster-day.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-<!--
-Pick one more of these and delete the rest.
--->
-
 - [x] Metrics
 - Metric name: `cronjob_controller_rate_limiter_use`
 - Components exposing the metric: `kube-controller-manager`
-- [ ] Other (treat as last resort)
-- Details:
+- Metric name: `cron_job_creation_skew`
+- Components exposing the metric: `kube-controller-manager`
+

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+No.

 ### Dependencies

-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### Does this feature depend on any specific services running in the cluster?

-<!--
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
-- Usage description:
-- Impact of its outage on the feature:
-- Impact of its degraded performance or high-error rates on the feature:
--->
+None.

 ### Scalability

-<!--
-For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them.
-
-For beta, this section is required: reviewers must answer these questions.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
 ###### Will enabling / using this feature result in any new API calls?

 No new API calls are expected.
@@ -455,67 +382,42 @@ We're not using it, yet.

 ### Troubleshooting

-<!--
-This section must be completed when targeting beta to a release.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
-
 ###### How does this feature react if the API server and/or etcd is unavailable?

 ###### What are other known failure modes?

-<!--
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-- Detection: How can it be detected via metrics? Stated another way:
-how can an operator troubleshoot without logging into a master or worker node?
-- Mitigations: What can be done to stop the bleeding, especially for already
-running user workloads?
-- Diagnostics: What are the useful log messages and their required logging
-levels that could help debug the issue?
-Not required until feature graduated to beta.
-- Testing: Are there any tests for failure mode? If not, describe why.
--->
+- [Incorrect TimeZone]
+- Detection: `UnknownTimeZone` events being reported for a CronJob.
+- Mitigations: Fix the TimeZone or suspend a CronJob.
+- Diagnostics: Logs containing `TimeZone` phrase.
+- Testing: A set of unit tests is ensuring that invalid TimeZone is properly
+handled both in the apiserver and in the controller itself, reporting to
+user the problem.
+

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+If possible increase the log level for kube-controller-manager and check cronjob's
+controller logs looking for warnings and errors which might point where the problem
+lies.
+
 ## Implementation History

-<!--
-Major milestones in the lifecycle of a KEP should be tracked in this section.
-Major milestones might include:
-- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
-- the `Proposal` section being merged, signaling agreement on a proposed design
-- the date implementation started
-- the first Kubernetes release where an initial version of the KEP was available
-- the version of Kubernetes where the KEP graduated to general availability
-- when the KEP was retired or superseded
--->
+- *2022-01-14* - Initial KEP draft
+- *2022-06-09* - Updated KEP for beta promotion.

 ## Drawbacks

-<!--
-Why should this KEP _not_ be implemented?
--->
+Using TimeZone might be simpler for users working with a cluster in different
+TimeZones, but adds additional complexity to the code and to the operator
+who will need to re-calculate when an actual CronJob will be creating a Job
+when `.spec.timeZone` is set.

 ## Alternatives

 Another approach was to specify time zone as an offset to UTC, but using the
 name instead seems more user friendly.

-<!--
-What other approaches did you consider, and why did you rule them out? These do
-not need to be as detailed as the proposal, but should include enough
-information to express the idea and why it was not acceptable.
--->
-
 ## Infrastructure Needed (Optional)

-<!--
-Use this section if you need things from the project/SIG. Examples include a
-new subproject, repos requested, or GitHub details. Listing these here allows a
-SIG to get the process for these resources started right away.
--->
+None.
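
The SLO added in this README change (99th percentile of job creation skew at or below 5 seconds per cluster-day) can be checked offline with a few lines of code. The following is a minimal sketch and not part of the KEP: it assumes the skew samples for one cluster-day have already been exported as plain seconds, and the names `percentile` and `meets_skew_slo` are hypothetical. It uses the nearest-rank percentile definition, which is only one of several reasonable choices.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_skew_slo(skews_seconds, threshold_seconds=5.0):
    """True when the 99th percentile of the skew samples is within the threshold."""
    return percentile(skews_seconds, 99) <= threshold_seconds


# Hypothetical cluster-day: 1000 Job creations, 10 of them delayed by 6 seconds.
mostly_on_time = [0.4] * 990 + [6.0] * 10
# Same day with 20 delayed creations, enough to push the 99th percentile over.
too_many_late = [0.4] * 980 + [6.0] * 20

print(meets_skew_slo(mostly_on_time))  # True
print(meets_skew_slo(too_many_late))   # False
```

In practice this evaluation would more likely run in whatever monitoring system scrapes `kube-controller-manager`, rather than over raw samples.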

keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -18,12 +18,12 @@ see-also:
 replaces:

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.24"
+latest-milestone: "v1.25"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
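
The Drawbacks text in the README diff above notes that an operator has to re-calculate when a CronJob will actually create a Job once `.spec.timeZone` is set. A minimal Python sketch of that arithmetic follows. It is only an illustration and not how the controller computes schedules. The zone `Europe/Warsaw`, the 09:00 schedule, and the helper `to_utc` are assumptions chosen for the example, and `zoneinfo` needs the IANA time zone database to be available on the machine.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(year, month, day, hour, minute, tz_name):
    """Interpret a wall-clock time in the named zone and convert it to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


# The same 09:00 local schedule lands on different UTC hours across the year
# because of daylight saving time in the chosen zone.
summer = to_utc(2022, 6, 9, 9, 0, "Europe/Warsaw")
winter = to_utc(2022, 1, 14, 9, 0, "Europe/Warsaw")

print(summer.isoformat())  # 2022-06-09T07:00:00+00:00
print(winter.isoformat())  # 2022-01-14T08:00:00+00:00
```

The one-hour difference between the two results is the kind of shift an operator has to keep in mind when comparing Job creation times against logs or dashboards that display UTC.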
