13
13
- [CronJob API](#cronjob-api)
14
14
- [CronJob controller](#cronjob-controller)
15
15
- [Test Plan](#test-plan)
16
+ - [Prerequisite testing updates](#prerequisite-testing-updates)
17
+ - [Unit tests](#unit-tests)
18
+ - [Integration tests](#integration-tests)
19
+ - [e2e tests](#e2e-tests)
16
20
- [Graduation Criteria](#graduation-criteria)
17
21
- [Alpha](#alpha)
18
22
- [Beta](#beta)
@@ -39,17 +43,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
39
43
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [ kubernetes/enhancements] (not the initial KEP PR)
40
44
- [x] (R) KEP approvers have approved the KEP status as ` implementable `
41
45
- [x] (R) Design details are appropriately documented
42
- - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
43
- - [ ] e2e Tests for all Beta API Operations (endpoints)
44
- - [ ] (R) Ensure GA e2e tests for meet requirements for [ Conformance Tests] ( https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md )
45
- - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
46
- - [ ] (R) Graduation criteria is in place
47
- - [ ] (R) [ all GA Endpoints] ( https://github.com/kubernetes/community/pull/1806 ) must be hit by [ Conformance Tests] ( https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md )
46
+ - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
47
+ - [x] e2e Tests for all Beta API Operations (endpoints)
48
+ - [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
49
+ - [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
50
+ - [x] (R) Graduation criteria is in place
51
+ - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
48
52
- [x] (R) Production readiness review completed
49
53
- [x] (R) Production readiness review approved
50
54
- [x] "Implementation History" section is up-to-date for milestone
51
- - [ ] User-facing documentation has been created in [ kubernetes/website] , for publication to [ kubernetes.io]
52
- - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
55
+ - [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
56
+ - [x] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
53
57
54
58
<!--
55
59
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -159,14 +163,29 @@ In all other cases the controller will maintain the current behavior.
159
163
160
164
### Test Plan
161
165
162
- Unit and integration tests covering the time zone mechanics of CronJob, including:
166
+ [x] I/we understand the owners of the involved components may require updates to
167
+ existing tests to make this code solid enough prior to committing the changes necessary
168
+ to implement this enhancement.
163
169
164
- - defaulting
165
- - validation
166
- - creating CronJob
167
- - updating CronJob
170
+ ##### Prerequisite testing updates
168
171
169
- Additionally, all of tests will be performed with feature gate enabled and disabled.
172
+ 1. Add tests ensuring that case-insensitive location loading is properly handled.
173
+ See [beta requirements](#beta) for more details.
174
+ 2. Add at least integration tests, and optionally e2e tests, covering TimeZone usage.
175
+
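The prerequisite tests above could exercise a check along the following lines. This is a minimal Go sketch, assuming zone names are resolved with `time.LoadLocation`; `validateTimeZone` is a hypothetical helper, not the actual Kubernetes validation code, and rejecting the empty string and `Local` is an assumption of this sketch. Whether a differently cased name such as `europe/warsaw` loads may depend on the platform, which is why that case deserves dedicated tests.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
	_ "time/tzdata" // embed the IANA database so the sketch does not depend on the host
)

// validateTimeZone is a hypothetical helper that reports whether name can be
// used as an explicit time zone for a schedule.
func validateTimeZone(name string) error {
	if name == "" {
		return errors.New("time zone must not be empty")
	}
	// time.LoadLocation maps "Local" to the host's zone, which would make the
	// schedule depend on where the controller runs, so reject it in any casing.
	if strings.EqualFold(name, "Local") {
		return errors.New(`time zone must not be "Local"`)
	}
	if _, err := time.LoadLocation(name); err != nil {
		return fmt.Errorf("unknown time zone %q: %w", name, err)
	}
	return nil
}
```

A table-driven unit test can then feed valid names, invalid names, and case variants through the helper and compare the result per platform.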
176
+ ##### Unit tests
177
+
178
+ - `k8s.io/kubernetes/pkg/apis/batch/validation`: `2022-06-09` - `94.4%`
179
+ - `k8s.io/kubernetes/pkg/controller/cronjob`: `2022-06-09` - `50.8%`
180
+ - `k8s.io/kubernetes/pkg/registry/batch/cronjob`: `2022-06-09` - `61.8%`
181
+
182
+ ##### Integration tests
183
+
184
+ None.
185
+
186
+ ##### e2e tests
187
+
188
+ None.
170
189
171
190
### Graduation Criteria
172
191
@@ -182,8 +201,6 @@ Additionally, all of tests will be performed with feature gate enabled and disab
182
201
- Test skipped on MacOS (https://github.com/kubernetes/kubernetes/pull/109218 )
183
202
- Golang issue (https://github.com/golang/go/issues/21512 )
184
203
185
- More TBD
186
-
187
204
#### GA
188
205
189
206
TBD
@@ -251,7 +268,6 @@ This feature has no node runtime implications.
251
268
252
269
###### How can this feature be enabled / disabled in a live cluster?
253
270
254
-
255
271
- [x] Feature gate (also fill in values in ` kep.yaml ` )
256
272
- Feature gate name: CronJobTimeZone
257
273
- Components depending on the feature gate: kube-apiserver, kube-controller-manager
@@ -279,151 +295,80 @@ Yes, both units and integration tests for enablement, disablement and transition
279
295
280
296
### Rollout, Upgrade and Rollback Planning
281
297
282
- <!--
283
- This section must be completed when targeting beta to a release.
284
- -->
285
-
286
298
###### How can a rollout or rollback fail? Can it impact already running workloads?
287
299
288
- <!--
289
- Try to be as paranoid as possible - e.g., what if some components will restart
290
- mid-rollout?
291
-
292
- Be sure to consider highly-available clusters, where, for example,
293
- feature flags will be enabled on some API servers and not others during the
294
- rollout. Similarly, consider large clusters and how enablement/disablement
295
- will rollout across nodes.
296
- -->
297
-
298
300
An upgrade flow can be vulnerable to an enable, disable, enable sequence if
299
301
the leader lease is acquired by a new kube-controller-manager, then an old
300
302
kube-controller-manager, then a new kube-controller-manager again.
301
303
302
304
###### What specific metrics should inform a rollback?
303
305
304
- <!--
305
- What signals should users be paying attention to when the feature is young
306
- that might indicate a serious problem?
307
- -->
306
+ An increase in `cronjob_job_creation_skew`, which tracks how far a Job's creation
307
+ is delayed relative to its requested time slot.
308
308
309
309
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
310
310
311
- <!--
312
- Describe manual testing that was done and the outcomes.
313
- Longer term, we may want to require automated upgrade/rollback tests, but we
314
- are missing a bunch of machinery and tooling and can't do that now.
315
- -->
311
+ The upgrade->downgrade->upgrade path was manually tested, and no issues were found.
316
312
317
313
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
318
314
319
- <!--
320
- Even if applying deprecation policies, they may still surprise some users.
321
- -->
315
+ No.
322
316
323
317
### Monitoring Requirements
324
318
325
- <!--
326
- This section must be completed when targeting beta to a release.
327
- -->
328
-
329
319
###### How can an operator determine if the feature is in use by workloads?
330
320
331
- <!--
332
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
333
- checking if there are objects with field X set) may be a last resort. Avoid
334
- logs or events for this purpose.
335
- -->
321
+ There is no explicit metric for TimeZone, but operators should monitor `cronjob_job_creation_skew`,
322
+ ensuring the job creation skew is not increasing.
336
323
337
324
###### How can someone using this feature know that it is working for their instance?
338
325
339
- <!--
340
- For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
341
- for each individual pod.
342
- Pick one more of these and delete the rest.
343
- Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
344
- and operation of this feature.
345
- Recall that end users cannot usually observe component logs or access metrics.
346
- -->
347
-
348
- - [ ] Events
349
- - Event Reason:
350
- - [ ] API .status
351
- - Condition name:
352
- - Other field:
353
- - [ ] Other (treat as last resort)
354
- - Details:
326
+ - [x] Events
327
+ - Event Reason: `UnknownTimeZone`, emitted when the specified TimeZone is not valid
355
328
356
329
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
357
330
358
- <!--
359
- This is your opportunity to define what "normal" quality of service looks like
360
- for a feature.
361
-
362
- It's impossible to provide comprehensive guidance, but at the very
363
- high level (needs more precise definitions) those may be things like:
364
- - per-day percentage of API calls finishing with 5XX errors <= 1%
365
- - 99% percentile over day of absolute value from (job creation time minus expected
366
- job creation time) for cron job <= 10%
367
- - 99.9% of /health requests per day finish with 200 code
368
-
369
- These goals will help you determine what you need to measure (SLIs) in the next
370
- question.
371
- -->
331
+ The 99th percentile of `cronjob_job_creation_skew` over a day is <= 15s.
372
332
373
333
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
374
334
375
- <!--
376
- Pick one more of these and delete the rest.
377
- -->
378
-
379
335
- [x] Metrics
380
336
- Metric name: ` cronjob_controller_rate_limiter_use `
381
337
- Components exposing the metric: ` kube-controller-manager `
382
- - [ ] Other (treat as last resort)
383
- - Details:
338
+ - Metric name: `cronjob_job_creation_skew`
339
+ - Components exposing the metric: `kube-controller-manager`
340
+
384
341
385
342
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
386
343
387
- <!--
388
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
389
- implementation difficulties, etc.).
390
- -->
344
+ No.
391
345
392
346
### Dependencies
393
347
394
- <!--
395
- This section must be completed when targeting beta to a release.
396
- -->
397
-
398
348
###### Does this feature depend on any specific services running in the cluster?
399
349
400
- <!--
401
- Think about both cluster-level services (e.g. metrics-server) as well
402
- as node-level agents (e.g. specific version of CRI). Focus on external or
403
- optional services that are needed. For example, if this feature depends on
404
- a cloud provider API, or upon an external software-defined storage or network
405
- control plane.
406
-
407
- For each of these, fill in the following—thinking about running existing user workloads
408
- and creating new ones, as well as about cluster-level services (e.g. DNS):
409
- - [Dependency name]
410
- - Usage description:
411
- - Impact of its outage on the feature:
412
- - Impact of its degraded performance or high-error rates on the feature:
413
- -->
350
+ CronJob's TimeZone support relies on an external TimeZone package; if one is missing,
351
+ Go's internal package will be used instead.
352
+
353
+ - kube-controller-manager and kube-apiserver
354
+ - Usage description:
355
+ Both kube-controller-manager and kube-apiserver need to have the `CronJobTimeZone`
356
+ feature gate turned on for this feature to fully work.
357
+ - Impact of its outage on the feature:
358
+ CronJob's TimeZone functionality will not work.
359
+ - Impact of its degraded performance or high-error rates on the feature:
360
+ Delays in creating new Jobs.
361
+
362
+ - TimeZone package
363
+ - Usage description: CronJob's TimeZone support relies on an external TimeZone package;
364
+ if one is missing, Go's internal package will be used instead.
365
+ - Impact of its outage on the feature:
366
+ TimeZone functionality will not work.
367
+ - Impact of its degraded performance or high-error rates on the feature:
368
+ Delays in creating new Jobs.
414
369
415
370
### Scalability
416
371
417
- <!--
418
- For alpha, this section is encouraged: reviewers should consider these questions
419
- and attempt to answer them.
420
-
421
- For beta, this section is required: reviewers must answer these questions.
422
-
423
- For GA, this section is required: approvers should be able to confirm the
424
- previous answers based on experience in the field.
425
- -->
426
-
427
372
###### Will enabling / using this feature result in any new API calls?
428
373
429
374
No new API calls are expected.
@@ -455,67 +400,48 @@ We're not using it, yet.
455
400
456
401
### Troubleshooting
457
402
458
- <!--
459
- This section must be completed when targeting beta to a release.
460
-
461
- The Troubleshooting section currently serves the `Playbook` role. We may consider
462
- splitting it into a dedicated `Playbook` document (potentially with some monitoring
463
- details). For now, we leave it here.
464
- -->
465
-
466
403
###### How does this feature react if the API server and/or etcd is unavailable?
467
404
468
405
###### What are other known failure modes?
469
406
470
- <!--
471
- For each of them, fill in the following information by copying the below template:
472
- - [Failure mode brief description]
473
- - Detection: How can it be detected via metrics? Stated another way:
474
- how can an operator troubleshoot without logging into a master or worker node?
475
- - Mitigations: What can be done to stop the bleeding, especially for already
476
- running user workloads?
477
- - Diagnostics: What are the useful log messages and their required logging
478
- levels that could help debug the issue?
479
- Not required until feature graduated to beta.
480
- - Testing: Are there any tests for failure mode? If not, describe why.
481
- -->
407
+ - Incorrect TimeZone
408
+ - Detection: `UnknownTimeZone` events being reported for a CronJob.
409
+ - Mitigations: Fix the TimeZone or suspend the CronJob.
410
+ - Diagnostics: Logs containing the phrase `TimeZone`.
411
+ - Testing: A set of unit tests ensures that an invalid TimeZone is properly
412
+ handled both in the apiserver and in the controller itself, reporting the
413
+ problem to the user.
414
+ - Job creation problems
415
+ - Detection: the `cronjob_job_creation_skew` metric exceeds the expected 15s (99th percentile over a day).
416
+ - Mitigations: Disable the `CronJobTimeZone` feature gate.
417
+ - Diagnostics: Check the logs from the CronJob controller.
418
+ - Testing: A set of unit tests ensures that an invalid TimeZone is properly
419
+ handled both in the apiserver and in the controller itself, reporting the
420
+ problem to the user.
482
421
483
422
###### What steps should be taken if SLOs are not being met to determine the problem?
484
423
424
+ If possible, increase the log level for kube-controller-manager and check the CronJob
425
+ controller's logs for warnings and errors that might point to where the problem
426
+ lies.
427
+
485
428
## Implementation History
486
429
487
- <!--
488
- Major milestones in the lifecycle of a KEP should be tracked in this section.
489
- Major milestones might include:
490
- - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
491
- - the `Proposal` section being merged, signaling agreement on a proposed design
492
- - the date implementation started
493
- - the first Kubernetes release where an initial version of the KEP was available
494
- - the version of Kubernetes where the KEP graduated to general availability
495
- - when the KEP was retired or superseded
496
- -->
430
+ - *2022-01-14* - Initial KEP draft
431
+ - *2022-06-09* - Updated KEP for beta promotion.
497
432
498
433
## Drawbacks
499
434
500
- <!--
501
- Why should this KEP _not_ be implemented?
502
- -->
435
+ Using TimeZone might be simpler for users working with clusters in different
436
+ time zones, but it adds complexity to the code and to the work of operators,
437
+ who will need to recalculate when a CronJob will actually create a Job
438
+ when `.spec.timeZone` is set.
503
439
504
440
## Alternatives
505
441
506
442
Another approach was to specify time zone as an offset to UTC, but using the
507
443
name instead seems more user friendly.
508
444
509
- <!--
510
- What other approaches did you consider, and why did you rule them out? These do
511
- not need to be as detailed as the proposal, but should include enough
512
- information to express the idea and why it was not acceptable.
513
- -->
514
-
515
445
## Infrastructure Needed (Optional)
516
446
517
- <!--
518
- Use this section if you need things from the project/SIG. Examples include a
519
- new subproject, repos requested, or GitHub details. Listing these here allows a
520
- SIG to get the process for these resources started right away.
521
- -->
447
+ None.