Commit 68dadd2

Graduate Indexed Job to stable
1 parent d452a4b commit 68dadd2

3 files changed: +77 -57 lines changed

keps/prod-readiness/sig-apps/2214.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@ alpha:
   approver: "@wojtek-t"
 beta:
   approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"

keps/sig-apps/2214-indexed-job/README.md

Lines changed: 72 additions & 54 deletions
@@ -44,13 +44,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [x] e2e Tests for all Beta API Operations (endpoints)
+  - [x] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [x] (R) Graduation criteria is in place
+  - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
 - [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -459,8 +463,7 @@ gate enabled and disabled.

 #### Beta -> GA Graduation

-- E2E test graduates to conformance
-- 2 examples of operators using Indexed Jobs.
+- [E2E](https://testgrid.k8s.io/sig-apps#gce&include-filter-by-regex=Indexed) test graduates to conformance

 ### Upgrade / Downgrade Strategy

@@ -496,80 +499,99 @@ This feature has no node runtime implications.

 ### Feature Enablement and Rollback

-_This section must be completed when targeting alpha to a release._
+<!--
+This section must be completed when targeting alpha to a release.
+-->
+
+###### How can this feature be enabled / disabled in a live cluster?

-* **How can this feature be enabled / disabled in a live cluster?**

 - [x] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: IndexedJob
   - Components depending on the feature gate:
     - kube-apiserver
     - kube-controller-manager

-* **Does enabling the feature change any default behavior?**
+###### Does enabling the feature change any default behavior?

 No. Jobs need to opt-in with `.spec.completionMode=Indexed`.

-* **Can the feature be disabled once it has been enabled (i.e. can we roll back
-the enablement)?**
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

 Yes. Using the feature gate is the recommended way.

-* **What happens if we reenable the feature if it was previously rolled back?**
+###### What happens if we reenable the feature if it was previously rolled back?

 The Job controller starts managing Indexed Jobs again.
 More details covered in [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy).

-* **Are there any tests for feature enablement/disablement?**
+###### Are there any tests for feature enablement/disablement?

 Yes, unit and integration test for the feature enabled, disabled and
 transitions.

 ### Rollout, Upgrade and Rollback Planning

-_This section must be completed when targeting beta graduation to a release._
+<!--
+This section must be completed when targeting beta to a release.
+-->

-* **How can a rollout fail? Can it impact already running workloads?**
+###### How can a rollout or rollback fail? Can it impact already running workloads?

 If the new kube-controller-manager crashes, it's possible that an older
 version of it would pick it up. In 1.21, when the IndexedJob feature is
 disabled (default), the controller would not sync Indexed Jobs, that is: the
 controller doesn't create or delete Pods and doesn't update Job status.
 Running Pods are not affected.

-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?

 - job_sync_duration_seconds shows significantly more latency for label
 completion_mode=Indexed Jobs than completion_mode=NonIndexed.
 - job_sync_total shows more errors for completion_mode=Indexed than
 completion_mode=NonIndexed.
 - job_finished_total shows that Jobs with completion_mode=Indexed don't finish.
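
Reviewer note (not part of the commit): the rollback signals listed above compare Indexed against NonIndexed Jobs. A minimal Python sketch of that comparison for `job_sync_total` follows. The input shape (a dict keyed by `(completion_mode, result)`), the helper names, and the `margin` threshold are all hypothetical; only the `completion_mode` values and `result=error` come from the KEP text.

```python
from typing import Dict, Tuple

# Hypothetical input shape: (completion_mode, result) -> job_sync_total count.
SyncCounts = Dict[Tuple[str, str], float]


def error_ratio(counts: SyncCounts, mode: str) -> float:
    """Fraction of syncs for `mode` labelled result=error (0.0 if no syncs)."""
    total = sum(v for (m, _), v in counts.items() if m == mode)
    errors = counts.get((mode, "error"), 0.0)
    return errors / total if total else 0.0


def indexed_looks_worse(counts: SyncCounts, margin: float = 0.05) -> bool:
    """True if Indexed syncs fail noticeably more often than NonIndexed ones.

    `margin` is an illustrative threshold, not a value taken from the KEP.
    """
    indexed = error_ratio(counts, "Indexed")
    non_indexed = error_ratio(counts, "NonIndexed")
    return indexed > non_indexed + margin
```

In practice these counts would come from whatever monitoring system scrapes kube-controller-manager; the sketch only shows the per-mode comparison itself.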

-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-Manual test:
+Manual test performed:

 1. Deploy k8s 1.21 cluster
 1. Upgrade to 1.22
 1. Create Indexed Job with big number of completions and pods that run for ~10min.
 1. Downgrade to 1.21. Verify that no new pods are created for the Indexed Job.
 1. Upgrade to 1.22. Verify that new pods are created for Indexed Job.

-* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
-fields of API types, flags, etc.?**
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 No

 ### Monitoring Requirements

-_This section must be completed when targeting beta graduation to a release._
+<!--
+This section must be completed when targeting beta to a release.
+-->

-* **How can an operator determine if the feature is in use by workloads?**
+###### How can an operator determine if the feature is in use by workloads?

 - job_sync_total has values for the label completion_mode=Indexed.

-* **What are the SLIs (Service Level Indicators) an operator can use to determine
-the health of the service?**
+###### How can someone using this feature know that it is working for their instance?
+
+- [x] Events
+  - Event Reason: SuccessfulCreate
+  - The message includes the pod name.
+- [x] API .status
+  - Condition name:
+  - Other field: `completedIndexes` will not be empty as pods terminate.
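
Reviewer note (not part of the commit): to check `completedIndexes` programmatically, the string has to be expanded into individual indexes. The sketch below assumes the compressed text format described in the Job API documentation (comma-separated numbers, with consecutive runs written as `first-last`, for example `1,3-5,7`); this diff does not state the format. The function name is hypothetical.

```python
from typing import List


def parse_completed_indexes(value: str) -> List[int]:
    """Expand a string such as "1,3-5,7" into [1, 3, 4, 5, 7].

    Assumes comma-separated entries, each either a single index or an
    inclusive "first-last" range. An empty string yields an empty list.
    """
    indexes: List[int] = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(part))
    return indexes
```

An empty result for a Job whose pods have already terminated successfully would suggest the feature is not working as described above.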
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+- per-day percentage of job_sync_total with label result=error <= 1%
+- 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
+a client-side QPS limit of 50 calls per second.
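
Reviewer note (not part of the commit): the two SLOs above can be read as a simple pass/fail check over one day of data. The sketch below is an illustration under stated assumptions: it takes raw per-sync durations and applies a nearest-rank percentile, whereas a real deployment would more likely estimate the percentile from the `job_sync_duration_seconds` histogram buckets. Both function names and the input shape are hypothetical; the 1% and 15s thresholds are the ones quoted in the KEP.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slos(error_syncs: int, total_syncs: int,
               sync_durations_s: Sequence[float]) -> bool:
    """Check the two daily SLOs: error percentage <= 1% and p99 duration <= 15s."""
    error_pct = 100.0 * error_syncs / total_syncs if total_syncs else 0.0
    p99 = nearest_rank_percentile(sync_durations_s, 99)
    return error_pct <= 1.0 and p99 <= 15.0
```

The 15s bound in the KEP assumes a client-side QPS limit of 50; the sketch does not model that limit.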
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 - [x] Metrics
   - Metric name (all new):
@@ -579,51 +601,46 @@ the health of the service?**
 result=failed/succeeded
   - Components exposing the metric: kube-controller-manager

-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-- per-day percentage of job_sync_total with label result=error <= 1%
-- 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
-a client-side QPS limit of 50 calls per second.
-
-* **Are there any missing metrics that would be useful to have to improve observability
-of this feature?**
-
-N/A
+No

 ### Dependencies

-_This section must be completed when targeting beta graduation to a release._
+<!--
+This section must be completed when targeting beta to a release.
+-->

-* **Does this feature depend on any specific services running in the cluster?**
+###### Does this feature depend on any specific services running in the cluster?

-Feature only involves kube-apiserver and kube-controller-manager.
+No, the feature only involves kube-apiserver and kube-controller-manager.


 ### Scalability

-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
+<!--
+For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them.

-_For beta, this section is required: reviewers must answer these questions._
+For beta, this section is required: reviewers must answer these questions.

-_For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field._
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->

-* **Will enabling / using this feature result in any new API calls?**
+###### Will enabling / using this feature result in any new API calls?

 No.

-* **Will enabling / using this feature result in introducing new API types?**
+###### Will enabling / using this feature result in introducing new API types?

 No.

-* **Will enabling / using this feature result in any new calls to the cloud
-provider?**
+###### Will enabling / using this feature result in any new calls to the cloud provider?

 No.

-* **Will enabling / using this feature result in increasing size or count of
-the existing API objects?**
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?

 Yes.

@@ -639,36 +656,36 @@ the existing API objects?**
 - Estimated increase in size: new annotation of about 50 bytes and hostname
 which includes the index.

-* **Will enabling / using this feature result in increasing time taken by any
-operations covered by [existing SLIs/SLOs]?**
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

 No.

-* **Will enabling / using this feature result in non-negligible increase of
-resource usage (CPU, RAM, disk, IO, ...) in any components?**
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

 Additional CPU and memory increase in the controller-manager is negligible
 and restricted to Jobs using the new completion mode.

 ### Troubleshooting

+<!--
+This section must be completed when targeting beta to a release.
+
 The Troubleshooting section currently serves the `Playbook` role. We may consider
 splitting it into a dedicated `Playbook` document (potentially with some monitoring
 details). For now, we leave it here.
+-->

-_This section must be completed when targeting beta graduation to a release._
-
-* **How does this feature react if the API server and/or etcd is unavailable?**
+###### How does this feature react if the API server and/or etcd is unavailable?

 The job controller can't create or delete pods nor update job status.
 The metric job_sync_total increases for label result=error.
 Existing pods continue to run.

-* **What are other known failure modes?**
+###### What are other known failure modes?

 None.

-* **What steps should be taken if SLOs are not being met to determine the problem?**
+###### What steps should be taken if SLOs are not being met to determine the problem?

 1. Check job_sync_total with label result=error. See if it varies for
 different completion modes.
@@ -686,6 +703,7 @@ _This section must be completed when targeting beta graduation to a release._
 completed.
 * 2021-03-09: Feature implemented under feature gate disabled by default.
 * 2021-04-09: KEP updated for graduation to beta.
+* 2022-01-06: KEP updated for graduation to stable.

 ## Drawbacks

keps/sig-apps/2214-indexed-job/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -14,18 +14,18 @@ approvers:
   - "@janetkuo"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.22"
+latest-milestone: "v1.24"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.21"
   beta: "v1.22"
-  stable: "v1.23"
+  stable: "v1.24"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
