@@ -44,13 +44,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
44
44
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
45
45
- [x] (R) KEP approvers have approved the KEP status as `implementable`
46
46
- [x] (R) Design details are appropriately documented
47
- - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
47
+ - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
48
+ - [x] e2e Tests for all Beta API Operations (endpoints)
49
+ - [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
50
+ - [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
48
51
- [x] (R) Graduation criteria is in place
52
+ - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
49
53
- [x] (R) Production readiness review completed
50
54
- [x] (R) Production readiness review approved
51
55
- [x] "Implementation History" section is up-to-date for milestone
52
56
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
53
- - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
57
+ - [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
54
58
55
59
[kubernetes.io]: https://kubernetes.io/
56
60
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -459,8 +463,7 @@ gate enabled and disabled.
459
463
460
464
#### Beta -> GA Graduation
461
465
462
- - E2E test graduates to conformance
463
- - 2 examples of operators using Indexed Jobs.
466
+ - [E2E](https://testgrid.k8s.io/sig-apps#gce&include-filter-by-regex=Indexed) test graduates to conformance
464
467
465
468
### Upgrade / Downgrade Strategy
466
469
@@ -496,80 +499,99 @@ This feature has no node runtime implications.
496
499
497
500
### Feature Enablement and Rollback
498
501
499
- _This section must be completed when targeting alpha to a release._
502
+ <!--
503
+ This section must be completed when targeting alpha to a release.
504
+ -->
505
+
506
+ ###### How can this feature be enabled / disabled in a live cluster?
500
507
501
- * **How can this feature be enabled / disabled in a live cluster?**
502
508
503
509
- [x] Feature gate (also fill in values in `kep.yaml`)
504
510
- Feature gate name: IndexedJob
505
511
- Components depending on the feature gate:
506
512
- kube-apiserver
507
513
- kube-controller-manager
508
514
509
- * **Does enabling the feature change any default behavior?**
515
+ ###### Does enabling the feature change any default behavior?
510
516
511
517
No. Jobs need to opt-in with `.spec.completionMode=Indexed`.
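As an illustration of the opt-in, the following is a minimal sketch that builds a Job manifest with `.spec.completionMode=Indexed` as a Python dictionary and prints it as JSON. The helper name `indexed_job_manifest`, the Job name, the image, and the command are hypothetical and chosen only for the example; only the `completionMode` field comes from the text above.

```python
import json


def indexed_job_manifest(name, completions, parallelism, image, command):
    """Build a batch/v1 Job manifest that opts in to Indexed completion mode.

    All argument values are illustrative; only spec.completionMode is
    taken from the KEP text.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Opt-in: without this field the Job keeps the default behavior.
            "completionMode": "Indexed",
            "completions": completions,
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "worker", "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = indexed_job_manifest(
        "indexed-demo", 5, 2, "busybox", ["sh", "-c", "echo hello"]
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted to a cluster; a Job created without `completionMode` is unaffected by the feature.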
512
518
513
- * **Can the feature be disabled once it has been enabled (i.e. can we roll back
514
- the enablement)?**
519
+ ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
515
520
516
521
Yes. Using the feature gate is the recommended way.
517
522
518
- * **What happens if we reenable the feature if it was previously rolled back?**
523
+ ###### What happens if we reenable the feature if it was previously rolled back?
519
524
520
525
The Job controller starts managing Indexed Jobs again.
521
526
More details covered in [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy).
522
527
523
- * **Are there any tests for feature enablement/disablement?**
528
+ ###### Are there any tests for feature enablement/disablement?
524
529
525
530
Yes, unit and integration tests for the feature enabled, disabled and
526
531
transitions.
527
532
528
533
### Rollout, Upgrade and Rollback Planning
529
534
530
- _This section must be completed when targeting beta graduation to a release._
535
+ <!--
536
+ This section must be completed when targeting beta to a release.
537
+ -->
531
538
532
- * **How can a rollout fail? Can it impact already running workloads?**
539
+ ###### How can a rollout or rollback fail? Can it impact already running workloads?
533
540
534
541
If the new kube-controller-manager crashes, it's possible that an older
535
542
version of it would pick it up. In 1.21, when the IndexedJob feature is
536
543
disabled (default), the controller would not sync Indexed Jobs; that is, the
537
544
controller doesn't create or delete Pods and doesn't update Job status.
538
545
Running Pods are not affected.
539
546
540
- * **What specific metrics should inform a rollback?**
547
+ ###### What specific metrics should inform a rollback?
541
548
542
549
- job_sync_duration_seconds shows significantly more latency for label
543
550
completion_mode=Indexed Jobs than completion_mode=NonIndexed.
544
551
- job_sync_total shows more errors for completion_mode=Indexed than
545
552
completion_mode=NonIndexed.
546
553
- job_finished_total shows that Jobs with completion_mode=Indexed don't finish.
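The comparison between completion modes described above can be sketched as follows. This is a minimal illustration, assuming the `job_sync_total` samples have already been scraped and parsed into `(labels, value)` pairs; the function name and the non-error label value `success` are hypothetical. Because counters are cumulative, a real check would more likely use the increase over a time window than the raw values.

```python
from collections import defaultdict


def error_percentage_by_mode(samples):
    """Return the share of result=error syncs per completion_mode, in percent.

    samples: iterable of (labels, value) pairs for job_sync_total, where
    labels is a dict containing "completion_mode" and "result".
    """
    totals = defaultdict(float)
    errors = defaultdict(float)
    for labels, value in samples:
        mode = labels.get("completion_mode", "unknown")
        totals[mode] += value
        if labels.get("result") == "error":
            errors[mode] += value
    # Skip modes with no recorded syncs to avoid dividing by zero.
    return {
        mode: 100.0 * errors[mode] / totals[mode]
        for mode in totals
        if totals[mode] > 0
    }


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    scraped = [
        ({"completion_mode": "Indexed", "result": "error"}, 5),
        ({"completion_mode": "Indexed", "result": "success"}, 95),
        ({"completion_mode": "NonIndexed", "result": "success"}, 200),
    ]
    print(error_percentage_by_mode(scraped))
```

A noticeably higher percentage for `Indexed` than for `NonIndexed` would be the signal the list above describes.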
547
554
548
- * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
555
+ ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
549
556
550
- Manual test:
557
+ Manual test performed:
551
558
552
559
1. Deploy k8s 1.21 cluster
553
560
1. Upgrade to 1.22
554
561
1. Create an Indexed Job with a large number of completions and pods that run for ~10min.
555
562
1. Downgrade to 1.21. Verify that no new pods are created for the Indexed Job.
556
563
1. Upgrade to 1.22. Verify that new pods are created for Indexed Job.
557
564
558
- * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
559
- fields of API types, flags, etc.?**
565
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
560
566
561
567
No
562
568
563
569
### Monitoring Requirements
564
570
565
- _This section must be completed when targeting beta graduation to a release._
571
+ <!--
572
+ This section must be completed when targeting beta to a release.
573
+ -->
566
574
567
- * **How can an operator determine if the feature is in use by workloads?**
575
+ ###### How can an operator determine if the feature is in use by workloads?
568
576
569
577
- job_sync_total has values for the label completion_mode=Indexed.
570
578
571
- * **What are the SLIs (Service Level Indicators) an operator can use to determine
572
- the health of the service?**
579
+ ###### How can someone using this feature know that it is working for their instance?
580
+
581
+ - [x] Events
582
+ - Event Reason: SuccessfulCreate
583
+ - The message includes the pod name.
584
+ - [x] API .status
585
+ - Condition name:
586
+ - Other field: `completedIndexes` will not be empty as pods terminate.
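To inspect `completedIndexes` programmatically, a small parser can help. The sketch below assumes the field is a string of comma-separated decimal indexes in which consecutive runs are compressed as `first-last` (for example `1,3-5,7`); that format is not stated in this section and should be checked against the Job API reference. The function name is hypothetical.

```python
def parse_completed_indexes(text):
    """Expand a string such as "1,3-5,7" into a list of integer indexes.

    Assumes comma-separated entries, each either a single index or an
    inclusive "first-last" range.
    """
    indexes = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            # An empty field means no index has completed yet.
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(part))
    return indexes


if __name__ == "__main__":
    print(parse_completed_indexes("1,3-5,7"))
```

An empty result for a Job whose pods have been running for a while would suggest that no index has completed yet.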
587
+
588
+ ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
589
+
590
+ - per-day percentage of job_sync_total with label result=error <= 1%
591
+ - 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
592
+ a client-side QPS limit of 50 calls per second.
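The two objectives above could be checked along these lines. This is a rough sketch, assuming `job_sync_duration_seconds` is exposed as a cumulative histogram and that one day of bucket counts has already been collected; the quantile estimate uses linear interpolation within a bucket, in the spirit of Prometheus' `histogram_quantile`, and the function names and sample values are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    whose last upper_bound is math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the overflow bucket; report the
                # highest finite bound as a lower estimate.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


def meets_slo(error_percentage, p99_seconds):
    """Apply the thresholds stated above: errors <= 1% and p99 <= 15s."""
    return error_percentage <= 1.0 and p99_seconds <= 15.0


if __name__ == "__main__":
    # Hypothetical one-day bucket counts, for illustration only.
    day = [(1.0, 90), (5.0, 98), (15.0, 100), (math.inf, 100)]
    p99 = histogram_quantile(0.99, day)
    print(p99, meets_slo(0.5, p99))
```

The estimate is only as precise as the bucket boundaries allow, so a result close to the 15s threshold deserves a closer look at the raw data.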
593
+
594
+ ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
573
595
574
596
- [x] Metrics
575
597
- Metric name (all new):
@@ -579,51 +601,46 @@ the health of the service?**
579
601
result=failed/succeeded
580
602
- Components exposing the metric: kube-controller-manager
581
603
582
- * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
604
+ ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
583
605
584
- - per-day percentage of job_sync_total with label result=error <= 1%
585
- - 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
586
- a client-side QPS limit of 50 calls per second.
587
-
588
- * **Are there any missing metrics that would be useful to have to improve observability
589
- of this feature?**
590
-
591
- N/A
606
+ No
592
607
593
608
### Dependencies
594
609
595
- _This section must be completed when targeting beta graduation to a release._
610
+ <!--
611
+ This section must be completed when targeting beta to a release.
612
+ -->
596
613
597
- * **Does this feature depend on any specific services running in the cluster?**
614
+ ###### Does this feature depend on any specific services running in the cluster?
598
615
599
- Feature only involves kube-apiserver and kube-controller-manager.
616
+ No, the feature only involves kube-apiserver and kube-controller-manager.
600
617
601
618
602
619
### Scalability
603
620
604
- _For alpha, this section is encouraged: reviewers should consider these questions
605
- and attempt to answer them._
621
+ <!--
622
+ For alpha, this section is encouraged: reviewers should consider these questions
623
+ and attempt to answer them.
606
624
607
- _For beta, this section is required: reviewers must answer these questions._
625
+ For beta, this section is required: reviewers must answer these questions.
608
626
609
- _For GA, this section is required: approvers should be able to confirm the
610
- previous answers based on experience in the field._
627
+ For GA, this section is required: approvers should be able to confirm the
628
+ previous answers based on experience in the field.
629
+ -->
611
630
612
- * **Will enabling / using this feature result in any new API calls?**
631
+ ###### Will enabling / using this feature result in any new API calls?
613
632
614
633
No.
615
634
616
- * **Will enabling / using this feature result in introducing new API types?**
635
+ ###### Will enabling / using this feature result in introducing new API types?
617
636
618
637
No.
619
638
620
- * **Will enabling / using this feature result in any new calls to the cloud
621
- provider?**
639
+ ###### Will enabling / using this feature result in any new calls to the cloud provider?
622
640
623
641
No.
624
642
625
- * **Will enabling / using this feature result in increasing size or count of
626
- the existing API objects?**
643
+ ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
627
644
628
645
Yes.
629
646
@@ -639,36 +656,36 @@ the existing API objects?**
639
656
- Estimated increase in size: new annotation of about 50 bytes and hostname
640
657
which includes the index.
641
658
642
- * **Will enabling / using this feature result in increasing time taken by any
643
- operations covered by [existing SLIs/SLOs]?**
659
+ ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
644
660
645
661
No.
646
662
647
- * **Will enabling / using this feature result in non-negligible increase of
648
- resource usage (CPU, RAM, disk, IO, ...) in any components?**
663
+ ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
649
664
650
665
Additional CPU and memory increase in the controller-manager is negligible
651
666
and restricted to Jobs using the new completion mode.
652
667
653
668
### Troubleshooting
654
669
670
+ <!--
671
+ This section must be completed when targeting beta to a release.
672
+
655
673
The Troubleshooting section currently serves the `Playbook` role. We may consider
656
674
splitting it into a dedicated `Playbook` document (potentially with some monitoring
657
675
details). For now, we leave it here.
676
+ -->
658
677
659
- _This section must be completed when targeting beta graduation to a release._
660
-
661
- * **How does this feature react if the API server and/or etcd is unavailable?**
678
+ ###### How does this feature react if the API server and/or etcd is unavailable?
662
679
663
680
The job controller can't create or delete pods nor update job status.
664
681
The metric job_sync_total increases for label result=error.
665
682
Existing pods continue to run.
666
683
667
- * **What are other known failure modes?**
684
+ ###### What are other known failure modes?
668
685
669
686
None.
670
687
671
- * **What steps should be taken if SLOs are not being met to determine the problem?**
688
+ ###### What steps should be taken if SLOs are not being met to determine the problem?
672
689
673
690
1. Check job_sync_total with label result=error. See if it varies for
674
691
different completion modes.
@@ -686,6 +703,7 @@ _This section must be completed when targeting beta graduation to a release._
686
703
completed.
687
704
* 2021-03-09: Feature implemented under feature gate disabled by default.
688
705
* 2021-04-09: KEP updated for graduation to beta.
706
+ * 2022-01-06: KEP updated for graduation to stable.
689
707
690
708
## Drawbacks
691
709