update CRD field selector KEP for beta promotion

jpbetz · jpbetz · commit 8cf016519549 · 2024-04-24T11:06:54.000-04:00
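
The README answers in this change lean on the `request_total` metric and a planned `has_field_selector` label to detect usage and failures. As a minimal sketch of that check, the snippet below scans Prometheus text-format output and sums request counts per response code for samples where the label is `true`. It assumes the metric is exposed as `apiserver_request_total` and that the label is spelled `has_field_selector` with `true`/`false` values as the KEP proposes. The `example.com`/`widgets` sample data and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Matches one sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches key="value" pairs inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def field_selector_requests_by_code(metrics_text, metric="apiserver_request_total"):
    """Sum sample values per response code for requests that used a field selector."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match.group("name") != metric:
            continue  # comments, other metrics, or label-less samples
        labels = dict(LABEL_RE.findall(match.group("labels")))
        # Assumption: "has_field_selector" is the label the KEP plans to add.
        if labels.get("has_field_selector") == "true":
            totals[labels.get("code", "unknown")] += float(match.group("value"))
    return dict(totals)


# Hypothetical scrape output for a custom resource named "widgets".
SAMPLE = '''\
# TYPE apiserver_request_total counter
apiserver_request_total{code="200",group="example.com",has_field_selector="true",resource="widgets",verb="LIST"} 40
apiserver_request_total{code="500",group="example.com",has_field_selector="true",resource="widgets",verb="LIST"} 2
apiserver_request_total{code="200",group="example.com",has_field_selector="false",resource="widgets",verb="LIST"} 7
'''

if __name__ == "__main__":
    print(field_selector_requests_by_code(SAMPLE))
```

A non-empty result indicates that field selectors are in use, and counts under non-200 codes are the signal the rollback question refers to. In practice the same filter would more likely be written as a query in the monitoring system than as a script.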
diff --git a/keps/prod-readiness/sig-api-machinery/4358.yaml b/keps/prod-readiness/sig-api-machinery/4358.yaml
@@ -4,3 +4,5 @@
 kep-number: 4358
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"
diff --git a/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md b/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md
@@ -897,53 +897,31 @@ This section must be completed when targeting beta to a release.
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-<!--
-Try to be as paranoid as possible - e.g., what if some components will restart
-mid-rollout?
+This feature can only be used when enabled and does not persist any state.
 
-Be sure to consider highly-available clusters, where, for example,
-feature flags will be enabled on some API servers and not others during the
-rollout. Similarly, consider large clusters and how enablement/disablement
-will rollout across nodes.
--->
+Rollout could only fail if a defect in the implementation were to somehow impact
+list/watch code paths not using this feature.
+
+Once rolled out, rollback would only impact clients using the feature.
 
 ###### What specific metrics should inform a rollback?
 
-<!--
-What signals should users be paying attention to when the feature is young
-that might indicate a serious problem?
--->
+[request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81). This metric can be monitored for non-200 response codes.
+
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-<!--
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
--->
+Yes. Note, however, that an upgrade after a downgrade does not matter for this feature since it does not persist any state.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
-<!--
-Even if applying deprecation policies, they may still surprise some users.
--->
+No
 
 ### Monitoring Requirements
 
-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
 ###### How can an operator determine if the feature is in use by workloads?
 
-<!--
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
--->
+Check whether the "has_field_selector" label (planned to be added for beta) on [request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81) is set to true for requests to any custom resources.
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -961,30 +939,15 @@ Recall that end users cannot usually observe component logs or access metrics.
 - [ ] API .status
   - Condition name: 
   - Other field: 
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] Other (treat as last resort)
+  - Details: Use a field selector to filter when listing custom resources.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-<!--
-This is your opportunity to define what "normal" quality of service looks like
-for a feature.
-
-It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99.9% of /health requests per day finish with 200 code
-
-These goals will help you determine what you need to measure (SLIs) in the next
-question.
--->
+This feature reduces the expected sizes of list responses, and so falls under the SLOs for read responses.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-TODO: Should we add a "filtered" label to the below metrics? It would help isolate problems with selectors better.
-
 - [x] Metrics
   - Metric name:
     - [request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81)
@@ -997,154 +960,66 @@ High `request_duration_seconds` for list requests may indicate a performance pro
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+Yes, we will add "has_label_selector" and "has_field_selector" labels to the request_total and request_duration_seconds metrics. To keep cardinality low, each label will only take a true/false value.
 
 ### Dependencies
 
-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### Does this feature depend on any specific services running in the cluster?
 
-<!--
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high-error rates on the feature:
+No
 -->
 
 ### Scalability
 
-<!--
-For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them.
-
-For beta, this section is required: reviewers must answer these questions.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
 ###### Will enabling / using this feature result in any new API calls?
 
-<!--
-Describe them, providing:
-  - API call type (e.g. PATCH pods)
-  - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)
-Focusing mostly on:
-  - components listing and/or watching resources they didn't before
-  - API calls that may be triggered by changes of some Kubernetes resources
-    (e.g. update of object X triggers new updates of object Y)
-  - periodic API calls to reconcile state (e.g. periodic fetching state,
-    heartbeats, leader election, etc.)
--->
+No
 
 ###### Will enabling / using this feature result in introducing new API types?
 
-<!--
-Describe them, providing:
-  - API type
-  - Supported number of objects per cluster
-  - Supported number of objects per namespace (for namespace-scoped objects)
--->
+No
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
-<!--
-Describe them, providing:
-  - Which API(s):
-  - Estimated increase:
--->
+No
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-<!--
-Describe them, providing:
-  - API type(s):
-  - Estimated increase in size: (e.g., new annotation of size 32B)
-  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
--->
+No
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
-<!--
-Look at the [existing SLIs/SLOs].
-
-Think about adding additional work or introducing new steps in between
-(e.g. need to do X to start a container), etc. Please describe the details.
+No
 
-[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
--->
+Note that this feature does not fundamentally enable capabilities that are not already available. Today, users add labels to resources
+to enable filtering. This feature merely eliminates the need to "labelize" resources in this way.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
-<!--
-Things to keep in mind include: additional in-memory state, additional
-non-trivial computations, excessive access to disks (including increased log
-volume), significant amount of data sent and/or received over network, etc.
-This through this both in small and large cases, again with respect to the
-[supported limits].
-
-[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
--->
+No
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
-<!--
-Focus not just on happy cases, but primarily on more pathological cases
-(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
-If any of the resources can be exhausted, how this is mitigated with the existing limits
-(e.g. pods per node) or new limits added by this KEP?
-
-Are there any tests that were run/should be run to understand performance characteristics better
-and validate the declared limits?
--->
+No
 
 ### Troubleshooting
 
-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
+###### How does this feature react if the API server and/or etcd is unavailable?
 
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
+The feature is provided by the API server and is unavailable if the API server is unavailable.
 
-###### How does this feature react if the API server and/or etcd is unavailable?
+Requests served by etcd are also unavailable when etcd is unavailable. (Requests served from the watch cache remain available.)
 
 ###### What are other known failure modes?
 
-<!--
-For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
--->
+N/A
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+Check the request_duration_seconds metric where 'has_field_selector=true' on CRD types to identify whether filtered
+list requests exceed the SLO. Narrow the investigation further to a specific 'component', or inspect apiserver logs
+to identify the exact requests exceeding the SLO.
+
 ## Implementation History
 
 <!--
diff --git a/keps/sig-api-machinery/4358-custom-resource-field-selectors/kep.yaml b/keps/sig-api-machinery/4358-custom-resource-field-selectors/kep.yaml
@@ -16,16 +16,17 @@ see-also:
   - "/keps/sig-api-machinery/95-custom-resource-definitions"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.30"
+latest-milestone: "v1.31"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.30"
+  beta: "v1.31"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled