
Commit 66ea524

Merge pull request kubernetes#3113 from alculquicondor/indexed-job
Graduate Indexed Job to stable
2 parents d31a23f + f6fbf42 commit 66ea524

3 files changed, +63 -64 lines changed

keps/prod-readiness/sig-apps/2214.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@ alpha:
   approver: "@wojtek-t"
 beta:
   approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"

keps/sig-apps/2214-indexed-job/README.md

Lines changed: 58 additions & 61 deletions
@@ -44,13 +44,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] e2e Tests for all Beta API Operations (endpoints)
+- [x] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [x] (R) Graduation criteria is in place
+- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
 - [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -459,8 +463,9 @@ gate enabled and disabled.

 #### Beta -> GA Graduation

-- E2E test graduates to conformance
-- 2 examples of operators using Indexed Jobs.
+- [E2E](https://testgrid.k8s.io/sig-apps#gce&include-filter-by-regex=Indexed) test graduates to conformance
+- Scalability tests for Jobs of varying sizes, up to 500 parallelism, that keep
+track of metric `job_sync_duration_seconds`.

 ### Upgrade / Downgrade Strategy

@@ -496,80 +501,88 @@ This feature has no node runtime implications.

 ### Feature Enablement and Rollback

-_This section must be completed when targeting alpha to a release._
+###### How can this feature be enabled / disabled in a live cluster?

-* **How can this feature be enabled / disabled in a live cluster?**

 - [x] Feature gate (also fill in values in `kep.yaml`)
 - Feature gate name: IndexedJob
 - Components depending on the feature gate:
 - kube-apiserver
 - kube-controller-manager

-* **Does enabling the feature change any default behavior?**
+###### Does enabling the feature change any default behavior?

 No. Jobs need to opt-in with `.spec.completionMode=Indexed`.

-* **Can the feature be disabled once it has been enabled (i.e. can we roll back
-the enablement)?**
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

 Yes. Using the feature gate is the recommended way.

-* **What happens if we reenable the feature if it was previously rolled back?**
+###### What happens if we reenable the feature if it was previously rolled back?

 The Job controller starts managing Indexed Jobs again.
 More details covered in [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy).

-* **Are there any tests for feature enablement/disablement?**
+###### Are there any tests for feature enablement/disablement?

 Yes, unit and integration test for the feature enabled, disabled and
 transitions.

 ### Rollout, Upgrade and Rollback Planning

-_This section must be completed when targeting beta graduation to a release._
-
-* **How can a rollout fail? Can it impact already running workloads?**
+###### How can a rollout or rollback fail? Can it impact already running workloads?

 If the new kube-controller-manager crashes, it's possible that an older
 version of it would pick it up. In 1.21, when the IndexedJob feature is
 disabled (default), the controller would not sync Indexed Jobs, that is: the
 controller doesn't create or delete Pods and doesn't update Job status.
 Running Pods are not affected.

-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?

 - job_sync_duration_seconds shows significantly more latency for label
 completion_mode=Indexed Jobs than completion_mode=NonIndexed.
 - job_sync_total shows more errors for completion_mode=Indexed than
 completion_mode=NonIndexed.
 - job_finished_total shows that Jobs with completion_mode=Indexed don't finish.

-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-Manual test:
+Manual test performed:

 1. Deploy k8s 1.21 cluster
 1. Upgrade to 1.22
 1. Create Indexed Job with big number of completions and pods that run for ~10min.
 1. Downgrade to 1.21. Verify that no new pods are created for the Indexed Job.
 1. Upgrade to 1.22. Verify that new pods are created for Indexed Job.

-* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
-fields of API types, flags, etc.?**
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 No

 ### Monitoring Requirements

-_This section must be completed when targeting beta graduation to a release._
-
-* **How can an operator determine if the feature is in use by workloads?**
+###### How can an operator determine if the feature is in use by workloads?

 - job_sync_total has values for the label completion_mode=Indexed.

-* **What are the SLIs (Service Level Indicators) an operator can use to determine
-the health of the service?**
+###### How can someone using this feature know that it is working for their instance?
+
+- [x] Events
+- Event Reason: SuccessfulCreate
+- The message includes the pod name.
+- [x] API .status
+- Condition name:
+- Other field: `completedIndexes` will not be empty as pods terminate.
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+- per-day percentage of job_sync_total with label result=error <= 1%
+- 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
+a client-side QPS limit of 50 calls per second. Note that this is the
+expected SLO for NonIndexed jobs as well.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 - [x] Metrics
 - Metric name (all new):
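
Review note: the two SLOs added in this hunk (at most 1% of `job_sync_total` with `result=error` per day, and a daily 99th percentile of `job_sync_duration_seconds` of at most 15s) could be checked offline along the following lines. This is only a sketch under stated assumptions: the day's sync counts are assumed to be already aggregated per `result` label, durations are assumed to be available as raw samples (the real metric is a histogram, so a production check would work from bucket data), and the function names, the `success` label value, and the nearest-rank percentile method are illustrative rather than taken from the KEP.

```python
def error_ratio(sync_counts):
    """Fraction of syncs that ended in error.

    sync_counts maps a `result` label value to its daily count,
    e.g. {"success": 990, "error": 10} (label values are assumed).
    """
    total = sum(sync_counts.values())
    return sync_counts.get("error", 0) / total if total else 0.0


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile of a non-empty list of durations (seconds)."""
    ordered = sorted(samples)
    # Ceiling division: the smallest rank covering at least pct% of samples.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def meets_slo(sync_counts, durations, max_error_ratio=0.01, max_p99_seconds=15.0):
    """True when both thresholds quoted in the KEP text hold for the day."""
    return (
        error_ratio(sync_counts) <= max_error_ratio
        and percentile_nearest_rank(durations, 99) <= max_p99_seconds
    )
```

For example, `meets_slo({"success": 990, "error": 10}, [0.2] * 99 + [20.0])` passes because a single slow sync out of 100 does not move the 99th percentile, while two slow syncs out of 100 would fail the latency threshold.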
@@ -579,51 +592,42 @@ the health of the service?**
 result=failed/succeeded
 - Components exposing the metric: kube-controller-manager

-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-- per-day percentage of job_sync_total with label result=error <= 1%
-- 99% percentile over day for job_sync_duration_seconds is <= 15s, assuming
-a client-side QPS limit of 50 calls per second.
-
-* **Are there any missing metrics that would be useful to have to improve observability
-of this feature?**
-
-N/A
+No

 ### Dependencies

-_This section must be completed when targeting beta graduation to a release._
-
-* **Does this feature depend on any specific services running in the cluster?**
+###### Does this feature depend on any specific services running in the cluster?

-Feature only involves kube-apiserver and kube-controller-manager.
+No, the feature only involves kube-apiserver and kube-controller-manager.


 ### Scalability

-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
+<!--
+For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them.

-_For beta, this section is required: reviewers must answer these questions._
+For beta, this section is required: reviewers must answer these questions.

-_For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field._
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->

-* **Will enabling / using this feature result in any new API calls?**
+###### Will enabling / using this feature result in any new API calls?

 No.

-* **Will enabling / using this feature result in introducing new API types?**
+###### Will enabling / using this feature result in introducing new API types?

 No.

-* **Will enabling / using this feature result in any new calls to the cloud
-provider?**
+###### Will enabling / using this feature result in any new calls to the cloud provider?

 No.

-* **Will enabling / using this feature result in increasing size or count of
-the existing API objects?**
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?

 Yes.

@@ -639,36 +643,28 @@ the existing API objects?**
 - Estimated increase in size: new annotation of about 50 bytes and hostname
 which includes the index.

-* **Will enabling / using this feature result in increasing time taken by any
-operations covered by [existing SLIs/SLOs]?**
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

 No.

-* **Will enabling / using this feature result in non-negligible increase of
-resource usage (CPU, RAM, disk, IO, ...) in any components?**
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

 Additional CPU and memory increase in the controller-manager is negligible
 and restricted to Jobs using the new completion mode.

 ### Troubleshooting

-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
-
-_This section must be completed when targeting beta graduation to a release._
-
-* **How does this feature react if the API server and/or etcd is unavailable?**
+###### How does this feature react if the API server and/or etcd is unavailable?

 The job controller can't create or delete pods nor update job status.
 The metric job_sync_total increases for label result=error.
 Existing pods continue to run.

-* **What are other known failure modes?**
+###### What are other known failure modes?

 None.

-* **What steps should be taken if SLOs are not being met to determine the problem?**
+###### What steps should be taken if SLOs are not being met to determine the problem?

 1. Check job_sync_total with label result=error. See if it varies for
 different completion modes.
@@ -686,6 +682,7 @@ _This section must be completed when targeting beta graduation to a release._
 completed.
 * 2021-03-09: Feature implemented under feature gate disabled by default.
 * 2021-04-09: KEP updated for graduation to beta.
+* 2022-01-06: KEP updated for graduation to stable.

 ## Drawbacks
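
Review note: the README changes above tell users to look at `completedIndexes` in the Job's `.status` to confirm the feature is working. A small sketch of how a client might interpret that field follows. It assumes the value is a comma-separated string of decimal indexes in which runs of consecutive indexes are compressed as `first-last` (for example `1,3-5,7`); that format is not stated in this diff, and the helper names are hypothetical.

```python
def parse_completed_indexes(value):
    """Expand a string such as "1,3-5,7" into [1, 3, 4, 5, 7].

    Assumes comma-separated decimal indexes, with consecutive runs
    written as "first-last". An empty string yields an empty list.
    """
    indexes = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(part))
    return indexes


def missing_indexes(value, completions):
    """Indexes in [0, completions) that have not been reported as completed."""
    done = set(parse_completed_indexes(value))
    return [i for i in range(completions) if i not in done]
```

With `completions` set to 6 and a status value of `0-2,4`, `missing_indexes` returns `[3, 5]`, which is the set of indexes still waiting on a successful pod.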

keps/sig-apps/2214-indexed-job/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -14,18 +14,18 @@ approvers:
   - "@janetkuo"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.22"
+latest-milestone: "v1.24"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.21"
   beta: "v1.22"
-  stable: "v1.23"
+  stable: "v1.24"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
