Commit 6991a6b

Merge pull request kubernetes#3917 from lauralorenz/clusterid-beta-prr-review
KEP-2149: Adding best effort PRR and scalability Qs required for beta
2 parents 5b258a9 + 915c890 commit 6991a6b

2 files changed

+52
-145
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
11
kep-number: 2149
22
alpha:
33
approver: "@wojtek-t"
4+
beta:
5+
approver: "@wojtek-t"

keps/sig-multicluster/2149-clusterid/README.md

Lines changed: 50 additions & 145 deletions
@@ -526,6 +526,12 @@ when drafting this test plan.
526526
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
527527
-->
528528

529+
This KEP proposes an out-of-tree CRD that is not expected to integrate with any of the Kubernetes CI infrastructure. In addition, it explicitly provides only the CRD definition and generated clients for use by third-party implementers, and does not provide a controller or any other binary with business logic to test. For these reasons, we only expect to provide unit tests for a dummy controller to confirm that the generated CRD can be installed and the generated clients can be instantiated. Today those tests are available [here](https://github.com/kubernetes-sigs/about-api/blob/master/clusterproperty/controllers/suite_test.go).
530+
531+
However, similar to other out-of-tree CRDs that serve third-party implementers, such as Gateway API and MCS API, there is rationale for the project to provide conformance tests for implementers to use to confirm they adhere to the restrictions set forth in this KEP that are not otherwise enforced by the CRD definition; in this case, the constraints defined on the well-known properties `clusterset.k8s.io` and `cluster.clusterset.k8s.io`. Providing these tests is not considered a blocking graduation requirement for the maturity level of this API.
532+
533+
These tests will be provided in such a way that implementers can expose one or more clusters that have the About API CRD installed in them, and run a series of tests that confirms any well-known properties stored in those clusters' `ClusterProperty` objects conform to the constraints in [Well known properties](#well-known-properties).
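As a rough illustration of what one such conformance check might look like, the Python sketch below validates well-known property values that have already been read from `ClusterProperty` objects. The rules shown (an RFC 1123 DNS label for `cluster.clusterset.k8s.io`, a non-empty string for `clusterset.k8s.io`) and all function names are assumptions made for illustration; the authoritative constraints are the ones in the Well known properties section of the KEP.

```python
import re
from typing import Callable, Dict, List

# ASSUMPTION: the rules below are illustrative placeholders, not the KEP's
# normative constraints. Consult "Well known properties" for the real ones.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_dns_label(value: str) -> bool:
    """True if value is a lowercase RFC 1123 DNS label of at most 63 characters."""
    return len(value) <= 63 and _DNS_LABEL.fullmatch(value) is not None


def is_non_empty(value: str) -> bool:
    """True if value contains at least one non-whitespace character."""
    return bool(value.strip())


# Hypothetical mapping from well-known property name to its validator.
RULES: Dict[str, Callable[[str], bool]] = {
    "cluster.clusterset.k8s.io": is_dns_label,
    "clusterset.k8s.io": is_non_empty,
}


def check_properties(properties: Dict[str, str]) -> List[str]:
    """Return one message per well-known property whose value breaks its rule.

    Properties that are absent, or that are not well-known, are ignored.
    """
    violations = []
    for name, validator in RULES.items():
        if name in properties and not validator(properties[name]):
            violations.append(f"{name}: invalid value {properties[name]!r}")
    return violations
```

A real conformance suite would first list the `ClusterProperty` objects from each exposed cluster and then feed their names and values into a check of this shape.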
534+
529535
### Graduation Criteria
530536

531537
#### Alpha -> Beta Graduation
@@ -538,89 +544,17 @@ when drafting this test plan.
538544

539545
- At least one headless implementation using clusterID for MCS DNS
540546

541-
<!--
542-
**Note:** *Not required until targeted at a release.*
543-
544-
Define graduation milestones.
545-
546-
These may be defined in terms of API maturity, or as something else. The KEP
547-
should keep this high-level with a focus on what signals will be looked at to
548-
determine graduation.
549-
550-
Consider the following in developing the graduation criteria for this enhancement:
551-
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
552-
- [Deprecation policy][deprecation-policy]
553-
554-
Clearly define what graduation means by either linking to the [API doc
555-
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
556-
or by redefining what graduation means.
557-
558-
In general we try to use the same stages (alpha, beta, GA), regardless of how the
559-
functionality is accessed.
560-
561-
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
562-
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
563-
564-
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
565-
566-
#### Alpha -> Beta Graduation
567-
568-
- Gather feedback from developers and surveys
569-
- Complete features A, B, C
570-
- Tests are in Testgrid and linked in KEP
571-
572-
#### Beta -> GA Graduation
573-
574-
- N examples of real-world usage
575-
- N installs
576-
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
577-
- Allowing time for feedback
578-
579-
**Note:** Generally we also wait at least two releases between beta and
580-
GA/stable, because there's no opportunity for user feedback, or even bug reports,
581-
in back-to-back releases.
582-
583-
#### Removing a Deprecated Flag
584-
585-
- Announce deprecation and support policy of the existing flag
586-
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
587-
- Address feedback on usage/changed behavior, provided on GitHub issues
588-
- Deprecate the flag
589-
590-
**For non-optional features moving to GA, the graduation criteria must include
591-
[conformance tests].**
592-
593-
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
594-
-->
595-
596547
### Upgrade / Downgrade Strategy
597548

598-
<!--
599-
If applicable, how will the component be upgraded and downgraded? Make sure
600-
this is in the test plan.
601-
602-
Consider the following in developing an upgrade/downgrade strategy for this
603-
enhancement:
604-
- What changes (in invocations, configurations, API use, etc.) is an existing
605-
cluster required to make on upgrade, in order to maintain previous behavior?
606-
- What changes (in invocations, configurations, API use, etc.) is an existing
607-
cluster required to make on upgrade, in order to make use of the enhancement?
608-
-->
549+
Any changes to the API definition will follow the official Kubernetes API groups and versioning guidance [here](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-groups-and-versioning) and [here](https://kubernetes.io/docs/reference/using-api/#api-versioning). In short, the API will be provided in order through `v1alphaX`, `v1betaX`, to `v1`, where compatibility will be preserved from `v1beta1` and onwards; clients will be expected to eventually migrate to the `v1` implementation of the API as the prior versions are deprecated.
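The version progression described above can be sketched as a small ordering function. This is an illustrative helper, not part of the About API or of Kubernetes itself; the names `graduation_key` and `is_compatibility_preserved` are hypothetical.

```python
import re
from typing import Tuple

_VERSION = re.compile(r"v(\d+)(?:(alpha|beta)(\d+))?")
# None stands for a stable version such as "v1", which has no stage suffix.
_STAGE_RANK = {"alpha": 0, "beta": 1, None: 2}


def graduation_key(version: str) -> Tuple[int, int, int]:
    """Sort key that orders versions as v1alphaX < v1betaX < v1."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a Kubernetes-style API version: {version!r}")
    major, stage, iteration = match.groups()
    return (int(major), _STAGE_RANK[stage], int(iteration or 0))


def is_compatibility_preserved(version: str) -> bool:
    """Per the text above, compatibility is preserved from v1beta1 onwards."""
    return graduation_key(version) >= graduation_key("v1beta1")
```

For example, sorting `["v1", "v1beta1", "v1alpha2", "v1alpha1"]` with `key=graduation_key` yields the graduation order, and only the alpha versions fall outside the compatibility guarantee.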
609550

610551
### Version Skew Strategy
611552

612-
<!--
613-
If applicable, how will the component handle version skew with other
614-
components? What are the guarantees? Make sure this is in the test plan.
615-
616-
Consider the following in developing a version skew strategy for this
617-
enhancement:
618-
- Does this enhancement involve coordinating behavior in the control plane and
619-
in the kubelet? How does an n-2 kubelet without this feature available behave
620-
when this feature is used?
621-
- Will any other components on the node change? For example, changes to CSI,
622-
CRI or CNI may require updating that component before the kubelet.
623-
-->
553+
As a CRD, this API is dependent on any changes in the version and compatibility of the CRD feature itself on which it is built. As the CRD system is in `v1` as of Kubernetes 1.14, and the Kubernetes versioning guarantees `v1` APIs to be maintained through the Kubernetes major release, and as the About API does not depend on any new features of the CRD system since then, there is no expected coordination required with any core Kubernetes components until and unless Kubernetes proceeds to version 2.X.
554+
555+
This CRD *is* a direct dependency of the MCS API and any mcs-controller implementation as defined by that KEP. As discussed later in the PRR, it is expected that the mcs-controller (or any other controller taking this CRD as its dependency) would manage the lifecycle of this CRD, including any version skew.
556+
557+
As also mentioned below, we are aware that other features (in or out of tree) may want to use this CRD (as debated in the "To CRD or Not to CRD" section, above), but we believe it is in the scope of those future features to assess the impact of this CRD's version strategy on their component's version skew and their feature's stability if they do.
624558

625559
## Production Readiness Review Questionnaire
626560

@@ -710,69 +644,50 @@ _This section must be completed when targeting alpha to a release._
710644
_This section must be completed when targeting beta graduation to a release._
711645

712646
* **How can a rollout fail? Can it impact already running workloads?**
713-
Try to be as paranoid as possible - e.g., what if some components will restart
714-
mid-rollout?
647+
648+
CRDs themselves are Kubernetes objects, and can fail to be applied if the schema definition is corrupt or incompatible with the CustomResourceDefinition schema. Unit tests and manual tests continuously confirm that the built CRD YAML produced by this project is valid against the stable `v1 CustomResourceDefinition`. (It could also fail if the CRD is applied to a version of Kubernetes that does not have the CRD system (<1.14), or if the API server is unreachable, but these are both considered catastrophic failures out of scope of this KEP.)
649+
650+
Ultimately, the failure of a rollout of any CRD has the potential to disrupt all features or workloads that depend on it. Watches in controllers will fail to receive updates as the client would fail to find the CRD; a concrete known example for this CRD, the CoreDNS multicluster DNS plugin, would fail to program new DNS records and CoreDNS will answer SERVFAIL to any request made for a Kubernetes record that has not yet been synchronized. Features or workloads that depend on this CRD should plan to manage the lifecycle of this CRD or to provide transparent failure modes if the CRD is not present.
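One way a dependent component might provide a transparent failure mode is to check API discovery before starting its watches. The sketch below assumes the caller has already fetched the `APIResourceList` discovery document for the CRD's group and version (for example from a path such as `/apis/about.k8s.io/v1alpha1`, which is an assumption here) and passes `None` when discovery returned 404. The function names and the "ready"/"degraded" modes are hypothetical.

```python
from typing import Any, Dict, Optional


def resource_is_served(api_resource_list: Optional[Dict[str, Any]],
                       kind: str = "ClusterProperty") -> bool:
    """True if the discovery document lists at least one resource of this kind."""
    if api_resource_list is None:  # discovery returned 404: group/version absent
        return False
    resources = api_resource_list.get("resources", [])
    return any(resource.get("kind") == kind for resource in resources)


def startup_mode(api_resource_list: Optional[Dict[str, Any]]) -> str:
    """Pick a mode instead of crashing when the CRD is missing.

    "ready" means watches can be started; "degraded" means the component
    should report the missing CRD clearly and retry discovery later.
    """
    return "ready" if resource_is_served(api_resource_list) else "degraded"
```

Reporting a degraded mode explicitly, rather than failing silently inside a watch loop, makes the missing dependency visible to operators.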
715651

716652
* **What specific metrics should inform a rollback?**
717653

654+
Metrics should be configured using a metrics solution that implements the [Custom Metrics API](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/#full-metrics-pipeline), for example, the [metrics plugin for Custom Resources in kube-state-metrics](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md). Kubernetes does not provide default metrics for CRDs.
655+
718656
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
719-
Describe manual testing that was done and the outcomes.
720-
Longer term, we may want to require automated upgrade/rollback tests, but we
721-
are missing a bunch of machinery and tooling and can't do that now.
657+
Unit tests and manual tests confirm that the CRD is capable of being uninstalled and reinstalled.
722658

723659
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
724660
fields of API types, flags, etc.?**
725-
Even if applying deprecation policies, they may still surprise some users.
661+
No.
726662

727663
### Monitoring Requirements
728664

729665
_This section must be completed when targeting beta graduation to a release._
730666

731667
* **How can an operator determine if the feature is in use by workloads?**
732-
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
733-
checking if there are objects with field X set) may be a last resort. Avoid
734-
logs or events for this purpose.
668+
669+
Kubernetes does not provide default metrics for CRDs, so an operator would need to depend on custom metrics, or filter for 404s from the Kubernetes API server against this CRD.
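A minimal sketch of the 404-filtering approach, assuming API server audit logging is enabled with a JSON-lines log backend. It relies on the `requestURI` and `responseStatus.code` fields of audit events; the path prefix `/apis/about.k8s.io/` and the function name are assumptions for illustration.

```python
import json
from typing import Iterable

# ASSUMPTION: the API group path under which the CRD's resources are served.
ASSUMED_API_PREFIX = "/apis/about.k8s.io/"


def count_404s(audit_lines: Iterable[str], prefix: str = ASSUMED_API_PREFIX) -> int:
    """Count audit events for requests under `prefix` that returned HTTP 404."""
    count = 0
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        status = event.get("responseStatus") or {}
        if event.get("requestURI", "").startswith(prefix) and status.get("code") == 404:
            count += 1
    return count
```

Note that a 404 is also returned when a client requests a named object that does not exist, so a nonzero count shows that something is calling the API but does not by itself prove the CRD is missing.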
735670

736671
* **What are the SLIs (Service Level Indicators) an operator can use to determine
737672
the health of the service?**
738-
- [ ] Metrics
739-
- Metric name:
740-
- [Optional] Aggregation method:
741-
- Components exposing the metric:
742-
- [ ] Other (treat as last resort)
743-
- Details:
673+
674+
N/A: This KEP does not propose a service; it only leverages the existing Kubernetes API service and CRD extension mechanism.
744675

745676
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
746-
At a high level, this usually will be in the form of "high percentile of SLI
747-
per day <= X". It's impossible to provide comprehensive guidance, but at the very
748-
high level (needs more precise definitions) those may be things like:
749-
- per-day percentage of API calls finishing with 5XX errors <= 1%
750-
- 99% percentile over day of absolute value from (job creation time minus expected
751-
job creation time) for cron job <= 10%
752-
- 99,9% of /health requests per day finish with 200 code
677+
678+
N/A: This KEP does not propose a service; it only leverages the existing Kubernetes API service and CRD extension mechanism.
753679

754680
* **Are there any missing metrics that would be useful to have to improve observability
755681
of this feature?**
756-
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
757-
implementation difficulties, etc.).
682+
683+
Default metrics for CRDs in general, such as the number of requests by workload source, would improve observability.
758684

759685
### Dependencies
760686

761687
_This section must be completed when targeting beta graduation to a release._
762688

763689
* **Does this feature depend on any specific services running in the cluster?**
764-
Think about both cluster-level services (e.g. metrics-server) as well
765-
as node-level agents (e.g. specific version of CRI). Focus on external or
766-
optional services that are needed. For example, if this feature depends on
767-
a cloud provider API, or upon an external software-defined storage or network
768-
control plane.
769-
770-
For each of these, fill in the following—thinking about running existing user workloads
771-
and creating new ones, as well as about cluster-level services (e.g. DNS):
772-
- [Dependency name]
773-
- Usage description:
774-
- Impact of its outage on the feature:
775-
- Impact of its degraded performance or high-error rates on the feature:
690+
This feature depends only on the CustomResourceDefinition v1 API in the Kubernetes API server, available in Kubernetes versions 1.14+.
776691

777692

778693
### Scalability
@@ -786,45 +701,32 @@ _For GA, this section is required: approvers should be able to confirm the
786701
previous answers based on experience in the field._
787702

788703
* **Will enabling / using this feature result in any new API calls?**
789-
Describe them, providing:
790-
- API call type (e.g. PATCH pods)
791-
- estimated throughput
792-
- originating component(s) (e.g. Kubelet, Feature-X-controller)
793-
focusing mostly on:
794-
- components listing and/or watching resources they didn't before
795-
- API calls that may be triggered by changes of some Kubernetes resources
796-
(e.g. update of object X triggers new updates of object Y)
797-
- periodic API calls to reconcile state (e.g. periodic fetching state,
798-
heartbeats, leader election, etc.)
704+
705+
Installing the CRD will require a single API call to POST the new `CustomResourceDefinition` resource that represents it.
799706

800707
* **Will enabling / using this feature result in introducing new API types?**
801-
Describe them, providing:
802-
- API type
803-
- Supported number of objects per cluster
804-
- Supported number of objects per namespace (for namespace-scoped objects)
708+
709+
Yes, installing the CRD introduces the cluster-scoped `ClusterProperty` Kind. As there is no related service proposed as part of this KEP, there are no specific limits on the supported number of objects per cluster outside of Kubernetes API server storage limits.
805710

806711
* **Will enabling / using this feature result in any new calls to the cloud
807712
provider?**
808713

714+
No.
715+
809716
* **Will enabling / using this feature result in increasing size or count of
810717
the existing API objects?**
811-
Describe them, providing:
812-
- API type(s):
813-
- Estimated increase in size: (e.g., new annotation of size 32B)
814-
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
718+
719+
Besides the trivial single `CustomResourceDefinition` required to install this CRD, no other size or count of existing API objects will be affected by this KEP.
815720

816721
* **Will enabling / using this feature result in increasing time taken by any
817722
operations covered by [existing SLIs/SLOs]?**
818-
Think about adding additional work or introducing new steps in between
819-
(e.g. need to do X to start a container), etc. Please describe the details.
723+
724+
No, this KEP does not affect any of the operations covered by existing SLIs/SLOs, particularly since CustomResourceDefinitions are excluded from those SLOs.
820725

821726
* **Will enabling / using this feature result in non-negligible increase of
822727
resource usage (CPU, RAM, disk, IO, ...) in any components?**
823-
Things to keep in mind include: additional in-memory state, additional
824-
non-trivial computations, excessive access to disks (including increased log
825-
volume), significant amount of data sent and/or received over network, etc.
826-
This through this both in small and large cases, again with respect to the
827-
[supported limits].
728+
729+
This CRD will use the structural schema validation mechanism provided by the CRD extension, which requires some amount of resources to validate each create or update of a CR. However, the number of expected resources (2 as of this KEP) and their rate of change (related to clusterset membership changes, itself expected to be a human decision and rarely changing state) are expected to be trivial.
828730

829731
### Troubleshooting
830732

@@ -836,20 +738,23 @@ _This section must be completed when targeting beta graduation to a release._
836738

837739
* **How does this feature react if the API server and/or etcd is unavailable?**
838740

741+
This KEP itself proposes a CRD applied to the API server; if the API server and/or etcd is unavailable, so is this CRD. Features dependent on this CRD must assess the impact of this CRD's availability on their component's availability. Most concretely today, components of the mcs-controller are expected to serve as an admission controller to this CRD or are dependent on this CRD to program DNS. If the API server and/or etcd is unavailable, those controllers will be unable to update a cluster's ClusterProperty data regarding its well-known properties as part of a ClusterSet, or to program any updates to DNS, respectively.
742+
839743
* **What are other known failure modes?**
840-
For each of them, fill in the following information by copying the below template:
841-
- [Failure mode brief description]
842-
- Detection: How can it be detected via metrics? Stated another way:
843-
how can an operator troubleshoot without logging into a master or worker node?
744+
745+
- [CRD cannot be installed]
746+
- Detection: Custom metrics or dependent feature metrics; increased 404 rate on Kube API server for the CRD.
844747
- Mitigations: What can be done to stop the bleeding, especially for already
845748
running user workloads?
846749
- Diagnostics: What are the useful log messages and their required logging
847750
levels that could help debug the issue?
848-
Not required until feature graduated to beta.
849-
- Testing: Are there any tests for failure mode? If not, describe why.
751+
Warning and above, as this is the level at which 404s against the CRD will be seen.
752+
- Testing: Unit tests against generated CRD schema installation and usage of generated client.
850753

851754
* **What steps should be taken if SLOs are not being met to determine the problem?**
852755

756+
N/A: SLOs are not defined as there is no service provided by this KEP.
757+
853758
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
854759
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
855760
