@@ -44,13 +44,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+ - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+   - [x] e2e Tests for all Beta API Operations (endpoints)
+   - [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+   - [x] (R) Minimum Two Week Window for GA e2e tests to prove flake-free
- [x] (R) Graduation criteria is in place
+   - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- - [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+ - [x] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -459,8 +463,9 @@ gate enabled and disabled.

#### Beta -> GA Graduation

- - E2E test graduates to conformance
- - 2 examples of operators using Indexed Jobs.
+ - [E2E](https://testgrid.k8s.io/sig-apps#gce&include-filter-by-regex=Indexed) test graduates to conformance
+ - Scalability tests for Jobs of varying sizes, up to 500 parallelism, that keep
+   track of metric `job_sync_duration_seconds`.

### Upgrade / Downgrade Strategy

@@ -496,80 +501,88 @@ This feature has no node runtime implications.

### Feature Enablement and Rollback

- _This section must be completed when targeting alpha to a release._
+ ###### How can this feature be enabled / disabled in a live cluster?

- * **How can this feature be enabled / disabled in a live cluster?**

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: IndexedJob
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager

- * **Does enabling the feature change any default behavior?**
+ ###### Does enabling the feature change any default behavior?

No. Jobs need to opt-in with `.spec.completionMode=Indexed`.
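
To illustrate the opt-in, here is a minimal sketch (not part of the KEP text) of a Job manifest built as a Python dictionary. The Job name, image, command, and counts are hypothetical placeholders; `batch/v1`, `completionMode`, and the `JOB_COMPLETION_INDEX` environment variable are the upstream names as I understand them and should be checked against the Job API reference.

```python
import json


def indexed_job_manifest(name: str, completions: int, parallelism: int) -> dict:
    """Build a batch/v1 Job manifest that opts in to Indexed completion mode."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,
            "parallelism": parallelism,
            # The opt-in: without this field the Job keeps the default mode.
            "completionMode": "Indexed",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",      # hypothetical container name
                            "image": "busybox",    # hypothetical image
                            "command": ["sh", "-c", "echo index=$JOB_COMPLETION_INDEX"],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON, which kubectl also accepts.
    print(json.dumps(indexed_job_manifest("sample-indexed-job", 5, 2), indent=2))
```

A Job that omits `completionMode` is unaffected by the feature gate, which is why enabling the feature changes no default behavior.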

- * **Can the feature be disabled once it has been enabled (i.e. can we roll back
-   the enablement)?**
+ ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Using the feature gate is the recommended way.

- * **What happens if we reenable the feature if it was previously rolled back?**
+ ###### What happens if we reenable the feature if it was previously rolled back?

The Job controller starts managing Indexed Jobs again.
More details are covered in [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy).

- * **Are there any tests for feature enablement/disablement?**
+ ###### Are there any tests for feature enablement/disablement?

Yes, unit and integration tests cover the feature enabled, disabled, and
transitions between the two.

### Rollout, Upgrade and Rollback Planning

- _This section must be completed when targeting beta graduation to a release._
-
- * **How can a rollout fail? Can it impact already running workloads?**
+ ###### How can a rollout or rollback fail? Can it impact already running workloads?

If the new kube-controller-manager crashes, it's possible that an older
version of it would pick it up. In 1.21, when the IndexedJob feature is
disabled (default), the controller would not sync Indexed Jobs, that is: the
controller doesn't create or delete Pods and doesn't update Job status.
Running Pods are not affected.

- * **What specific metrics should inform a rollback?**
+ ###### What specific metrics should inform a rollback?

- job_sync_duration_seconds shows significantly more latency for label
  completion_mode=Indexed Jobs than completion_mode=NonIndexed.
- job_sync_total shows more errors for completion_mode=Indexed than
  completion_mode=NonIndexed.
- job_finished_total shows that Jobs with completion_mode=Indexed don't finish.

- * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+ ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

- Manual test:
+ Manual test performed:

1. Deploy k8s 1.21 cluster
1. Upgrade to 1.22
1. Create Indexed Job with big number of completions and pods that run for ~10min.
1. Downgrade to 1.21. Verify that no new pods are created for the Indexed Job.
1. Upgrade to 1.22. Verify that new pods are created for Indexed Job.

- * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
-   fields of API types, flags, etc.?**
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

### Monitoring Requirements

- _This section must be completed when targeting beta graduation to a release._
-
- * **How can an operator determine if the feature is in use by workloads?**
+ ###### How can an operator determine if the feature is in use by workloads?

- job_sync_total has values for the label completion_mode=Indexed.

- * **What are the SLIs (Service Level Indicators) an operator can use to determine
-   the health of the service?**
+ ###### How can someone using this feature know that it is working for their instance?
+
+ - [x] Events
+   - Event Reason: SuccessfulCreate
+     - The message includes the pod name.
+ - [x] API .status
+   - Condition name:
+   - Other field: `completedIndexes` will not be empty as pods terminate.
+
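
A user checking `completedIndexes` may want the individual indexes rather than the raw string. The sketch below is not part of the KEP text; it assumes the field uses the compressed text form described in the upstream Job API (comma-separated integers, with runs written as `first-last`, for example `1,3-5,7`). The function name is hypothetical.

```python
from typing import List


def parse_completed_indexes(value: str) -> List[int]:
    """Expand a compressed index string such as "1,3-5,7" into a list of ints."""
    indexes: List[int] = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # an empty field means no index has completed yet
        if "-" in part:
            # A run of consecutive indexes, written as "first-last" (inclusive).
            first, last = part.split("-", 1)
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(part))
    return indexes


if __name__ == "__main__":
    print(parse_completed_indexes("1,3-5,7"))
```

An empty result while pods have already terminated successfully would suggest the feature is not working for that Job.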
+ ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+ - per-day percentage of job_sync_total with label result=error <= 1%
+ - 99th percentile over day for job_sync_duration_seconds is <= 15s, assuming
+   a client-side QPS limit of 50 calls per second. Note that this is the
+   expected SLO for NonIndexed jobs as well.
+
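
The two SLOs can be expressed as a simple check. This is a sketch under stated assumptions, not part of the KEP text: it works on raw per-sync duration samples and plain counters collected over one day, whereas the real `job_sync_duration_seconds` metric is a histogram that would normally be queried through a monitoring system. All function and constant names are hypothetical.

```python
from typing import Sequence

MAX_ERROR_RATIO = 0.01   # job_sync_total with result=error <= 1% per day
MAX_P99_SECONDS = 15.0   # 99th percentile of job_sync_duration_seconds <= 15s


def error_ratio(error_syncs: int, total_syncs: int) -> float:
    """Fraction of syncs that ended with result=error."""
    return error_syncs / total_syncs if total_syncs else 0.0


def p99(durations: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the given samples."""
    if not durations:
        raise ValueError("no duration samples")
    ordered = sorted(durations)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(error_syncs: int, total_syncs: int, durations: Sequence[float]) -> bool:
    """True when both the error-rate and the latency objective hold."""
    return (
        error_ratio(error_syncs, total_syncs) <= MAX_ERROR_RATIO
        and p99(durations) <= MAX_P99_SECONDS
    )
```

Running the same check separately for `completion_mode=Indexed` and `completion_mode=NonIndexed` samples gives the comparison that the rollback metrics call for.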
+ ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
  - Metric name (all new):
@@ -579,51 +592,42 @@ the health of the service?**
      result=failed/succeeded
  - Components exposing the metric: kube-controller-manager

- * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+ ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- - per-day percentage of job_sync_total with label result=error <= 1%
- - 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
-   a client-side QPS limit of 50 calls per second.
-
- * **Are there any missing metrics that would be useful to have to improve observability
-   of this feature?**
-
- N/A
+ No

### Dependencies

- _This section must be completed when targeting beta graduation to a release._
-
- * **Does this feature depend on any specific services running in the cluster?**
+ ###### Does this feature depend on any specific services running in the cluster?

- Feature only involves kube-apiserver and kube-controller-manager.
+ No, the feature only involves kube-apiserver and kube-controller-manager.


### Scalability

- _For alpha, this section is encouraged: reviewers should consider these questions
- and attempt to answer them._
+ <!--
+ For alpha, this section is encouraged: reviewers should consider these questions
+ and attempt to answer them.

- _For beta, this section is required: reviewers must answer these questions._
+ For beta, this section is required: reviewers must answer these questions.

- _For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field._
+ For GA, this section is required: approvers should be able to confirm the
+ previous answers based on experience in the field.
+ -->

- * **Will enabling / using this feature result in any new API calls?**
+ ###### Will enabling / using this feature result in any new API calls?

No.

- * **Will enabling / using this feature result in introducing new API types?**
+ ###### Will enabling / using this feature result in introducing new API types?

No.

- * **Will enabling / using this feature result in any new calls to the cloud
-   provider?**
+ ###### Will enabling / using this feature result in any new calls to the cloud provider?

No.

- * **Will enabling / using this feature result in increasing size or count of
-   the existing API objects?**
+ ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes.

@@ -639,36 +643,28 @@ the existing API objects?**
- Estimated increase in size: new annotation of about 50 bytes and hostname
which includes the index.
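
As a rough sanity check of the size estimate, the sketch below (not part of the KEP text) computes the bytes added per Pod by the index annotation and shows a hostname that embeds the index. The annotation key `batch.kubernetes.io/job-completion-index` and the `<job-name>-<index>` hostname pattern are the upstream conventions as I understand them; treat both as assumptions to verify. The helper names are hypothetical.

```python
# Assumed upstream annotation key that carries the completion index on each Pod.
COMPLETION_INDEX_ANNOTATION = "batch.kubernetes.io/job-completion-index"


def annotation_size_bytes(index: int) -> int:
    """Bytes used by the annotation key plus its decimal index value."""
    return len(COMPLETION_INDEX_ANNOTATION.encode()) + len(str(index).encode())


def pod_hostname(job_name: str, index: int) -> str:
    """Hostname that includes the completion index (assumed pattern)."""
    return f"{job_name}-{index}"


if __name__ == "__main__":
    for i in (0, 499, 99999):
        print(i, annotation_size_bytes(i), pod_hostname("sample-job", i))
```

The key and value alone come to a little over 40 bytes, which is in line with the "about 50 bytes" estimate once serialization overhead is included.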

- * **Will enabling / using this feature result in increasing time taken by any
-   operations covered by [existing SLIs/SLOs]?**
+ ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

- * **Will enabling / using this feature result in non-negligible increase of
-   resource usage (CPU, RAM, disk, IO, ...) in any components?**
+ ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Additional CPU and memory increase in the controller-manager is negligible
and restricted to Jobs using the new completion mode.

### Troubleshooting

- The Troubleshooting section currently serves the `Playbook` role. We may consider
- splitting it into a dedicated `Playbook` document (potentially with some monitoring
- details). For now, we leave it here.
-
- _This section must be completed when targeting beta graduation to a release._
-
- * **How does this feature react if the API server and/or etcd is unavailable?**
+ ###### How does this feature react if the API server and/or etcd is unavailable?

The job controller can't create or delete pods or update job status.
The metric job_sync_total increases for label result=error.
Existing pods continue to run.

- * **What are other known failure modes?**
+ ###### What are other known failure modes?

None.

- * **What steps should be taken if SLOs are not being met to determine the problem?**
+ ###### What steps should be taken if SLOs are not being met to determine the problem?

1. Check job_sync_total with label result=error. See if it varies for
   different completion modes.
@@ -686,6 +682,7 @@ _This section must be completed when targeting beta graduation to a release._
  completed.
* 2021-03-09: Feature implemented under feature gate disabled by default.
* 2021-04-09: KEP updated for graduation to beta.
+ * 2022-01-06: KEP updated for graduation to stable.

## Drawbacks
