|
1 |
| -# Publishing kubernetes packages <!-- omit in toc --> |
| 1 | +# KEP-1731: Publishing Kubernetes packages on community infrastructure <!-- omit in toc --> |
2 | 2 |
|
3 | 3 | <!-- toc -->
|
4 |
| -- [Release Signoff Checklist](#release-signoff-checklist) |
5 | 4 | - [Summary](#summary)
|
6 | 5 | - [Motivation](#motivation)
|
7 | 6 | - [Goals](#goals)
|
8 | 7 | - [Non-Goals](#non-goals)
|
9 | 8 | - [Proposal](#proposal)
|
10 | 9 | - [User Stories](#user-stories)
|
11 | 10 | - [User Roles](#user-roles)
|
12 |
| - - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) |
| 11 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 12 | +- [Design Details](#design-details) |
13 | 13 | - [Using OBS instead of manually building and hosting packages](#using-obs-instead-of-manually-building-and-hosting-packages)
|
14 | 14 | - [How Open Build Service works?](#how-open-build-service-works)
|
15 | 15 | - [Packages, Operating Systems, and Architectures in Scope](#packages-operating-systems-and-architectures-in-scope)
|
|
21 | 21 | - [Integrating OBS with our current release pipeline](#integrating-obs-with-our-current-release-pipeline)
|
22 | 22 | - [Authentication to OBS and User Management](#authentication-to-obs-and-user-management)
|
23 | 23 | - [How are packages used?](#how-are-packages-used)
|
24 |
| - - [Risks and Mitigations](#risks-and-mitigations) |
25 |
| -- [Design Details](#design-details) |
26 | 24 | - [Test Plan](#test-plan)
|
27 | 25 | - [Graduation Criteria](#graduation-criteria)
|
28 | 26 | - [Alpha](#alpha)
|
29 | 27 | - [Alpha -> Beta Graduation](#alpha---beta-graduation)
|
30 | 28 | - [Beta -> GA Graduation](#beta---ga-graduation)
|
31 | 29 | - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
|
32 | 30 | - [Version Skew Strategy](#version-skew-strategy)
|
| 31 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 32 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 33 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 34 | + - [Monitoring Requirements](#monitoring-requirements) |
| 35 | + - [Dependencies](#dependencies) |
| 36 | + - [Scalability](#scalability) |
| 37 | + - [Troubleshooting](#troubleshooting) |
33 | 38 | - [Implementation History](#implementation-history)
|
34 |
| -- [Drawbacks [optional]](#drawbacks-optional) |
35 |
| -- [Alternatives [optional]](#alternatives-optional) |
| 39 | +- [Drawbacks](#drawbacks) |
| 40 | +- [Alternatives](#alternatives) |
| 41 | +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) |
36 | 42 | <!-- /toc -->
|
37 | 43 |
|
38 |
| -## Release Signoff Checklist |
39 |
| - |
40 |
| -- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR) |
41 |
| -- [ ] KEP approvers have set the KEP status to `implementable` |
42 |
| -- [ ] Design details are appropriately documented |
43 |
| -- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
44 |
| -- [ ] Graduation criteria is in place |
| 44 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 45 | + |
| 46 | +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 47 | +- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
| 48 | +- [ ] (R) Design details are appropriately documented |
| 49 | +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) |
| 50 | + - [ ] e2e Tests for all Beta API Operations (endpoints) |
| 51 | + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
| 52 | + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free |
| 53 | +- [ ] (R) Graduation criteria is in place |
| 54 | + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
| 55 | +- [ ] (R) Production readiness review completed |
| 56 | +- [ ] (R) Production readiness review approved |
45 | 57 | - [ ] "Implementation History" section is up-to-date for milestone
|
46 | 58 | - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
|
47 |
| -- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 59 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 60 | + |
| 61 | +<!-- |
| 62 | +**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone. |
| 63 | +--> |
| 64 | + |
| 65 | +[kubernetes.io]: https://kubernetes.io/ |
| 66 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 67 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 68 | +[kubernetes/website]: https://git.k8s.io/website |
48 | 69 |
|
49 | 70 | ## Summary
|
50 | 71 |
|
@@ -162,7 +183,14 @@ Scenario: [...]
|
162 | 183 | ```
|
163 | 184 | -->
|
164 | 185 |
|
165 |
| -### Implementation Details/Notes/Constraints |
| 186 | +### Risks and Mitigations |
| 187 | + |
| 188 | +- _Risk_: The OBS installation provided by openSUSE is unable to serve the load generated by the Kubernetes project |
| 189 | + _Mitigation_: We can host our own mirrors and take some load from openSUSE (e.g. on Equinix Metal) |
| 190 | +- _Risk_: Building all the packages for all the distributions and their versions takes too long to be done nightly or while cutting the release |
| 191 | + _Mitigation_: We do not deliver nightly packages or wait for packages to be published in the release pipeline. |
| 192 | + |
| 193 | +## Design Details |
166 | 194 |
|
167 | 195 | Packages will be built and published using [Open Build Service (OBS)][obs]. openSUSE will sponsor the Kubernetes
|
168 | 196 | project by giving us access to the [OBS instance hosted by openSUSE][obs-build].
|
@@ -446,17 +474,12 @@ are other manual migration steps needed (e.g. changing the GPG key), we don't co
|
446 | 474 |
|
447 | 475 | Different architectures will be published into the same repos; it is up to the package managers to pull and install the correct package for the target platform.
|
448 | 476 |
|
449 |
| -### Risks and Mitigations |
450 |
| - |
451 |
| -- _Risk_: The OBS installation provided by openSUSE is unable to serve the load generated by the Kubernetes project |
452 |
| - _Mitigation_: We can host our own mirrors and take some load from openSUSE (e.g. on Equinix Metal) |
453 |
| -- _Risk_: Building all the packages for all the distributions and their versions takes too long to be done nightly or while cutting the release |
454 |
| - _Mitigation_: We do not deliver nightly packages or wait for packages to be published in the release pipeline. |
455 |
| - |
456 |
| -## Design Details |
457 |
| - |
458 | 477 | ### Test Plan
|
459 | 478 |
|
| 479 | +[x] We understand the owners of the involved components may require updates to |
| 480 | +existing tests to make this code solid enough prior to committing the changes necessary |
| 481 | +to implement this enhancement. |
| 482 | + |
460 | 483 | There should be post-publish tests, which can be run as part of or after the release process
|
461 | 484 |
|
462 | 485 | - pull packages from the official mirrors
|
@@ -523,23 +546,245 @@ N/A
|
523 | 546 |
|
524 | 547 | N/A
|
525 | 548 |
|
| 549 | +## Production Readiness Review Questionnaire |
| 550 | + |
| 551 | +### Feature Enablement and Rollback |
| 552 | + |
| 553 | +It's up to the user which package repository (OBS or Google) they want to use. |
| 554 | +In case OBS doesn't work for them, they can reconfigure their systems to use |
| 555 | +the Google package repository. |
| 556 | + |
| 557 | +###### How can this feature be enabled / disabled in a live cluster? |
| 558 | + |
| 559 | +N/A. This is configured on the operating system (i.e. package manager) level. |
| 560 | + |
| 561 | +###### Does enabling the feature change any default behavior? |
| 562 | + |
| 563 | +Not anticipated. We're trying to match the existing spec files as best as we |
| 564 | +can. |
| 565 | + |
| 566 | +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? |
| 567 | + |
| 568 | +Yes. Users can roll back to the Google package repository. |
| 569 | + |
| 570 | +###### What happens if we reenable the feature if it was previously rolled back? |
| 571 | + |
| 572 | +There are no side effects anticipated. |
| 573 | + |
| 574 | +###### Are there any tests for feature enablement/disablement? |
| 575 | + |
| 576 | +N/A |
| 577 | + |
| 578 | +### Rollout, Upgrade and Rollback Planning |
| 579 | + |
| 580 | +<!-- |
| 581 | +This section must be completed when targeting beta to a release. |
| 582 | +--> |
| 583 | + |
| 584 | +###### How can a rollout or rollback fail? Can it impact already running workloads? |
| 585 | + |
| 586 | +N/A |
| 587 | + |
| 588 | +###### What specific metrics should inform a rollback? |
| 589 | + |
| 590 | +Installation and upgrade failures, for example a package upgrade that |
| 591 | +cannot complete because of an error. |
| 592 | + |
| 593 | +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
| 594 | + |
| 595 | +<!-- |
| 596 | +Describe manual testing that was done and the outcomes. |
| 597 | +Longer term, we may want to require automated upgrade/rollback tests, but we |
| 598 | +are missing a bunch of machinery and tooling and can't do that now. |
| 599 | +--> |
| 600 | + |
| 601 | +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? |
| 602 | + |
| 603 | +No. |
| 604 | + |
| 605 | +### Monitoring Requirements |
| 606 | + |
| 607 | +<!-- |
| 608 | +This section must be completed when targeting beta to a release. |
| 609 | +
|
| 610 | +For GA, this section is required: approvers should be able to confirm the |
| 611 | +previous answers based on experience in the field. |
| 612 | +--> |
| 613 | + |
| 614 | +###### How can an operator determine if the feature is in use by workloads? |
| 615 | + |
| 616 | +We'll ask openSUSE to provide us with metrics on the repository usage. We don't |
| 617 | +have any metrics for the Google repository and there's no way that we can |
| 618 | +get those metrics. |
| 619 | + |
| 620 | +###### How can someone using this feature know that it is working for their instance? |
| 621 | + |
| 622 | +The Kubernetes packages install successfully, and the Node comes up and reports "Ready". |
| 623 | + |
| 624 | +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? |
| 625 | + |
| 626 | +<!-- |
| 627 | +This is your opportunity to define what "normal" quality of service looks like |
| 628 | +for a feature. |
| 629 | +
|
| 630 | +It's impossible to provide comprehensive guidance, but at the very |
| 631 | +high level (needs more precise definitions) those may be things like: |
| 632 | + - per-day percentage of API calls finishing with 5XX errors <= 1% |
| 633 | + - 99th percentile over day of absolute value from (job creation time minus expected |
| 634 | + job creation time) for cron job <= 10% |
| 635 | + - 99.9% of /health requests per day finish with 200 code |
| 636 | +
|
| 637 | +These goals will help you determine what you need to measure (SLIs) in the next |
| 638 | +question. |
| 639 | +--> |
| 640 | + |
| 641 | +TBD |
| 642 | + |
| 643 | +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? |
| 644 | + |
| 645 | +<!-- |
| 646 | +Pick one or more of these and delete the rest. |
| 647 | +--> |
| 648 | + |
| 649 | +- [ ] Metrics |
| 650 | + - Metric name: |
| 651 | + - [Optional] Aggregation method: |
| 652 | + - Components exposing the metric: |
| 653 | +- [ ] Other (treat as last resort) |
| 654 | + - Details: |
| 655 | + |
| 656 | +TBD |
| 657 | + |
| 658 | +###### Are there any missing metrics that would be useful to have to improve observability of this feature? |
| 659 | + |
| 660 | +<!-- |
| 661 | +Describe the metrics themselves and the reasons why they weren't added (e.g., cost, |
| 662 | +implementation difficulties, etc.). |
| 663 | +--> |
| 664 | + |
| 665 | +TBD |
| 666 | + |
| 667 | +### Dependencies |
| 668 | + |
| 669 | +<!-- |
| 670 | +This section must be completed when targeting beta to a release. |
| 671 | +--> |
| 672 | + |
| 673 | +###### Does this feature depend on any specific services running in the cluster? |
| 674 | + |
| 675 | +N/A -- this is not a core Kubernetes feature. |
| 676 | + |
| 677 | +### Scalability |
| 678 | + |
| 679 | +###### Will enabling / using this feature result in any new API calls? |
| 680 | + |
| 681 | +No -- this is not a core Kubernetes feature. |
| 682 | + |
| 683 | +###### Will enabling / using this feature result in introducing new API types? |
| 684 | + |
| 685 | +No -- this is not a core Kubernetes feature. |
| 686 | + |
| 687 | +###### Will enabling / using this feature result in any new calls to the cloud provider? |
| 688 | + |
| 689 | +No -- this is not a core Kubernetes feature. |
| 690 | + |
| 691 | +###### Will enabling / using this feature result in increasing size or count of the existing API objects? |
| 692 | + |
| 693 | +No -- this is not a core Kubernetes feature. |
| 694 | + |
| 695 | +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? |
| 696 | + |
| 697 | +No -- this is not a core Kubernetes feature. |
| 698 | + |
| 699 | +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? |
| 700 | + |
| 701 | +No. |
| 702 | + |
| 703 | +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? |
| 704 | + |
| 705 | +No. |
| 706 | + |
| 707 | +### Troubleshooting |
| 708 | + |
| 709 | +<!-- |
| 710 | +This section must be completed when targeting beta to a release. |
| 711 | +
|
| 712 | +For GA, this section is required: approvers should be able to confirm the |
| 713 | +previous answers based on experience in the field. |
| 714 | +
|
| 715 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 716 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 717 | +details). For now, we leave it here. |
| 718 | +--> |
| 719 | + |
| 720 | +###### How does this feature react if the API server and/or etcd is unavailable? |
| 721 | + |
| 722 | +This isn't relevant -- this is not a core Kubernetes feature. |
| 723 | + |
| 724 | +###### What are other known failure modes? |
| 725 | + |
| 726 | +<!-- |
| 727 | +For each of them, fill in the following information by copying the below template: |
| 728 | + - [Failure mode brief description] |
| 729 | + - Detection: How can it be detected via metrics? Stated another way: |
| 730 | + how can an operator troubleshoot without logging into a master or worker node? |
| 731 | + - Mitigations: What can be done to stop the bleeding, especially for already |
| 732 | + running user workloads? |
| 733 | + - Diagnostics: What are the useful log messages and their required logging |
| 734 | + levels that could help debug the issue? |
| 735 | + Not required until feature graduated to beta. |
| 736 | + - Testing: Are there any tests for failure mode? If not, describe why. |
| 737 | +--> |
| 738 | + |
| 739 | +- Open Build Service (OBS) is down or in a degraded mode |
| 740 | + - Detection: relevant tests are failing, we're getting alerts from users, or |
| 741 | + the OBS team has alerted us to such an issue |
| 742 | + - Mitigations: Such an issue wouldn't affect already provisioned nodes. Users |
| 743 | + wouldn't be able to provision new nodes. |
| 744 | + - Diagnostics: APT and Yum error messages. |
| 745 | + - Testing: No. We can't predict the ways in which OBS might fail. |
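The "Detection" item above could be backed by a small periodic probe. The following is a minimal sketch, not part of the KEP: the function names, the status-to-state mapping, and the example URL are all hypothetical, and a real check would also need to verify repository metadata and signatures.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET on url, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, etc.
        return 0


def classify(status: int) -> str:
    """Map a status code to a coarse health state (assumed mapping)."""
    if status == 200:
        return "healthy"
    if status == 0 or status >= 500:
        return "down"
    return "degraded"


def check_repo(url: str, fetch: Callable[[str], int] = http_status) -> str:
    """Probe a repository URL; fetch is injectable so tests need no network."""
    return classify(fetch(url))
```

A caller would pass the URL of a repository metadata file, for example `check_repo("https://repo.example.invalid/Release")` (a placeholder, not a real OBS address), and raise an alert on any result other than `"healthy"`.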
| 746 | + |
| 747 | +###### What steps should be taken if SLOs are not being met to determine the problem? |
| 748 | + |
526 | 749 | ## Implementation History
|
527 | 750 |
|
528 | 751 | <!--
|
529 |
| -- the `Summary` and `Motivation` sections being merged signaling SIG acceptance |
530 |
| -- the `Proposal` section being merged signaling agreement on a proposed design |
| 752 | +Major milestones in the lifecycle of a KEP should be tracked in this section. |
| 753 | +Major milestones might include: |
| 754 | +- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance |
| 755 | +- the `Proposal` section being merged, signaling agreement on a proposed design |
531 | 756 | - the date implementation started
|
532 | 757 | - the first Kubernetes release where an initial version of the KEP was available
|
533 | 758 | - the version of Kubernetes where the KEP graduated to general availability
|
534 | 759 | - when the KEP was retired or superseded
|
535 | 760 | -->
|
536 | 761 |
|
537 |
| -TBA |
| 762 | +N/A |
| 763 | + |
| 764 | +## Drawbacks |
| 765 | + |
| 766 | +<!-- |
| 767 | +Why should this KEP _not_ be implemented? |
| 768 | +--> |
| 769 | + |
| 770 | +N/A |
| 771 | + |
| 772 | +## Alternatives |
538 | 773 |
|
539 |
| -## Drawbacks [optional] |
| 774 | +<!-- |
| 775 | +What other approaches did you consider, and why did you rule them out? These do |
| 776 | +not need to be as detailed as the proposal, but should include enough |
| 777 | +information to express the idea and why it was not acceptable. |
| 778 | +--> |
540 | 779 |
|
541 | 780 | N/A
|
542 | 781 |
|
543 |
| -## Alternatives [optional] |
| 782 | +## Infrastructure Needed (Optional) |
| 783 | + |
| 784 | +<!-- |
| 785 | +Use this section if you need things from the project/SIG. Examples include a |
| 786 | +new subproject, repos requested, or GitHub details. Listing these here allows a |
| 787 | +SIG to get the process for these resources started right away. |
| 788 | +--> |
544 | 789 |
|
545 | 790 | N/A
|