@@ -897,53 +897,31 @@ This section must be completed when targeting beta to a release.

 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-<!--
-Try to be as paranoid as possible - e.g., what if some components will restart
-mid-rollout?
+This feature can only be used when enabled and does not persist any state.

-Be sure to consider highly-available clusters, where, for example,
-feature flags will be enabled on some API servers and not others during the
-rollout. Similarly, consider large clusters and how enablement/disablement
-will rollout across nodes.
--->
+Rollout could only fail if a defect in the implementation were to somehow impact
+list/watch code paths not using this feature.
+
+Once rolled out, rollback would only impact clients using the feature.

 ###### What specific metrics should inform a rollback?

-<!--
-What signals should users be paying attention to when the feature is young
-that might indicate a serious problem?
--->
+Monitor [request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81) for an increase in non-200 response codes.
+

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-<!--
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
--->
+Yes. Note, however, that an upgrade after a downgrade does not matter for this feature, since it does not persist any state.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

-<!--
-Even if applying deprecation policies, they may still surprise some users.
--->
+No

 ### Monitoring Requirements

-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
 ###### How can an operator determine if the feature is in use by workloads?

-<!--
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
--->
+Check whether the "has_field_selector" label (planned to be added for beta) on [request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81) is set on requests for any CRDs.

 ###### How can someone using this feature know that it is working for their instance?

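As a rough illustration of the rollback signal described in this hunk (watching request_total for non-200 response codes), the sketch below computes the share of non-200 responses from a set of counter samples. The `sample` type and the `non200Ratio` helper are hypothetical names introduced only for this example; they are not part of the apiserver code. Because request_total is a cumulative counter, a real check would presumably compare deltas over a time window rather than raw totals.

```go
package main

import "fmt"

// sample is a hypothetical, simplified view of one request_total series:
// only the "code" label and the counter value are kept.
type sample struct {
	code  string
	count float64
}

// non200Ratio returns the fraction of requests whose response code is not "200".
// It returns 0 when no requests were recorded.
func non200Ratio(samples []sample) float64 {
	var total, non200 float64
	for _, s := range samples {
		total += s.count
		if s.code != "200" {
			non200 += s.count
		}
	}
	if total == 0 {
		return 0
	}
	return non200 / total
}

func main() {
	// Illustrative numbers only, not measurements.
	samples := []sample{{"200", 990}, {"400", 6}, {"500", 4}}
	fmt.Printf("non-200 ratio: %.3f\n", non200Ratio(samples))
}
```

A sustained rise in this ratio after enabling the feature, compared with a baseline taken before rollout, would be one possible trigger for considering a rollback.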
@@ -961,30 +939,15 @@ Recall that end users cannot usually observe component logs or access metrics.
 - [ ] API .status
   - Condition name:
   - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] Other (treat as last resort)
+  - Details: Use the feature to filter when listing CRDs.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-<!--
-This is your opportunity to define what "normal" quality of service looks like
-for a feature.
-
-It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-job creation time) for cron job <= 10%
-- 99.9% of /health requests per day finish with 200 code
-
-These goals will help you determine what you need to measure (SLIs) in the next
-question.
--->
+This feature reduces the expected sizes of list responses, and so falls under the SLOs for read responses.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-TODO: Should we add a "filtered" label to the below metrics? It would help isolate problems with selectors better.
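The "Other" answer in this hunk suggests verifying the feature by filtering while listing custom resources. A minimal sketch of what such a request could look like is shown below, using only the Go standard library to build the list path and `fieldSelector` query. The group `example.com`, the resource `widgets`, the field `spec.color`, and the helper name `listURL` are assumptions made up for illustration; which fields are actually selectable depends on the CRD.

```go
package main

import (
	"fmt"
	"net/url"
)

// listURL builds the request path and query for listing namespaced custom
// resources filtered by a field selector. All arguments are placeholders
// supplied by the caller.
func listURL(group, version, namespace, resource, fieldSelector string) string {
	u := url.URL{
		Path: fmt.Sprintf("/apis/%s/%s/namespaces/%s/%s", group, version, namespace, resource),
	}
	q := url.Values{}
	q.Set("fieldSelector", fieldSelector)
	u.RawQuery = q.Encode() // percent-encodes the "=" inside the selector
	return u.String()
}

func main() {
	// Hypothetical custom resource and field, for illustration only.
	fmt.Println(listURL("example.com", "v1", "default", "widgets", "spec.color=blue"))
}
```

If the feature is working, a list request of this shape should return only the objects whose field matches the selector; without the feature, the expectation would be an error response for an unsupported field selector.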
@@ -997,154 +960,66 @@ High `request_duration_seconds` for list requests may indicate a performance pro

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+Yes, we will add "has_label_selector" and "has_field_selector" labels to the request_total and request_duration_seconds metrics. We intend to keep them simple (for low cardinality) and only use true/false label values.

 ### Dependencies

-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### Does this feature depend on any specific services running in the cluster?

-<!--
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
-- Usage description:
-- Impact of its outage on the feature:
-- Impact of its degraded performance or high-error rates on the feature:
+No
 -->

 ### Scalability

-<!--
-For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them.
-
-For beta, this section is required: reviewers must answer these questions.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
 ###### Will enabling / using this feature result in any new API calls?
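The "missing metrics" answer earlier in this diff proposes "has_label_selector" and "has_field_selector" labels with only true/false values. One way such low-cardinality values could be derived from a request's query string is sketched below. The function name `selectorLabels` and the choice to treat an unparsable query as having no selectors are assumptions for illustration, not the apiserver's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// selectorLabels returns "true"/"false" label values indicating whether the
// raw query string carries a non-empty labelSelector or fieldSelector.
// An unparsable query is treated as having no selectors (an assumption).
func selectorLabels(rawQuery string) (hasLabelSelector, hasFieldSelector string) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "false", "false"
	}
	hasLabelSelector = strconv.FormatBool(q.Get("labelSelector") != "")
	hasFieldSelector = strconv.FormatBool(q.Get("fieldSelector") != "")
	return hasLabelSelector, hasFieldSelector
}

func main() {
	l, f := selectorLabels("fieldSelector=spec.color%3Dblue&limit=500")
	fmt.Printf("has_label_selector=%s has_field_selector=%s\n", l, f)
}
```

Recording only whether a selector is present, rather than its contents, keeps each label to two possible values, which matches the stated goal of low cardinality.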