Commit 8cf0165

update CRD field selector KEP for beta promotion
1 parent 15e9464 commit 8cf0165

3 files changed

+37
-159
lines changed

keps/prod-readiness/sig-api-machinery/4358.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
44
kep-number: 4358
55
alpha:
66
approver: "@deads2k"
7+
beta:
8+
approver: "@deads2k"

keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md

Lines changed: 32 additions & 157 deletions
@@ -897,53 +897,31 @@ This section must be completed when targeting beta to a release.
897897

898898
###### How can a rollout or rollback fail? Can it impact already running workloads?
899899

900-
<!--
901-
Try to be as paranoid as possible - e.g., what if some components will restart
902-
mid-rollout?
900+
This feature can only be used when enabled and does not persist any state.
903901

904-
Be sure to consider highly-available clusters, where, for example,
905-
feature flags will be enabled on some API servers and not others during the
906-
rollout. Similarly, consider large clusters and how enablement/disablement
907-
will rollout across nodes.
908-
-->
902+
Rollout could only fail if a defect in the implementation were to somehow impact
903+
list/watch code paths not using this feature.
904+
905+
Once rolled out, rollback would only impact clients using the feature.
909906

910907
###### What specific metrics should inform a rollback?
911908

912-
<!--
913-
What signals should users be paying attention to when the feature is young
914-
that might indicate a serious problem?
915-
-->
909+
[request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81). This metric can be monitored for non-200 response codes.
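As a rough illustration of the rollback signal described above, the sketch below computes the share of requests that did not return HTTP 200 from a set of counter samples. The `non_200_fraction` helper and the sample values are hypothetical and exist only for this example; the `code` label is assumed to carry the HTTP status of each request series.

```python
from typing import Iterable, Mapping, Tuple

# One counter sample: (labels, counter value).
Sample = Tuple[Mapping[str, str], float]


def non_200_fraction(samples: Iterable[Sample]) -> float:
    """Return the share of requests whose `code` label is not "200"."""
    total = 0.0
    failed = 0.0
    for labels, value in samples:
        total += value
        if labels.get("code") != "200":
            failed += value
    # Avoid dividing by zero when no requests were recorded.
    return failed / total if total else 0.0


# Made-up samples for illustration only.
samples = [
    ({"verb": "LIST", "code": "200"}, 90.0),
    ({"verb": "LIST", "code": "400"}, 6.0),
    ({"verb": "LIST", "code": "500"}, 4.0),
]
fraction = non_200_fraction(samples)
print(f"non-200 share of requests: {fraction:.2%}")
```

A sustained rise in this share after enabling the feature, compared with a baseline taken before rollout, is the kind of signal that might inform a rollback.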
910+
916911

917912
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
918913

919-
<!--
920-
Describe manual testing that was done and the outcomes.
921-
Longer term, we may want to require automated upgrade/rollback tests, but we
922-
are missing a bunch of machinery and tooling and can't do that now.
923-
-->
914+
Yes. Note, however, that an upgrade after a downgrade does not matter for this feature, since it does not persist any state.
924915

925916
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
926917

927-
<!--
928-
Even if applying deprecation policies, they may still surprise some users.
929-
-->
918+
No
930919

931920
### Monitoring Requirements
932921

933-
<!--
934-
This section must be completed when targeting beta to a release.
935-
936-
For GA, this section is required: approvers should be able to confirm the
937-
previous answers based on experience in the field.
938-
-->
939-
940922
###### How can an operator determine if the feature is in use by workloads?
941923

942-
<!--
943-
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
944-
checking if there are objects with field X set) may be a last resort. Avoid
945-
logs or events for this purpose.
946-
-->
924+
Check whether [request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81) reports the "has_field_selector" label (the plan is to add this label for beta) as set for requests on any CRDs.
947925

948926
###### How can someone using this feature know that it is working for their instance?
949927

@@ -961,30 +939,15 @@ Recall that end users cannot usually observe component logs or access metrics.
961939
- [ ] API .status
962940
- Condition name:
963941
- Other field:
964-
- [ ] Other (treat as last resort)
965-
- Details:
942+
- [x] Other (treat as last resort)
943+
- Details: Use the feature to filter when listing CRDs.
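A minimal sketch of what such a filtered list request could look like, assuming a hypothetical custom resource `widgets` in group `example.com` whose CRD declares `spec.color` as a selectable field. The `list_url` helper is illustrative only; the `fieldSelector` query parameter is the standard Kubernetes list parameter.

```python
from urllib.parse import urlencode


def list_url(group, version, plural, field_selector, namespace=None):
    """Build the API path for listing custom resources with a field selector."""
    base = f"/apis/{group}/{version}"
    if namespace:
        base += f"/namespaces/{namespace}"
    # urlencode percent-encodes the "=" inside the selector expression.
    query = urlencode({"fieldSelector": field_selector})
    return f"{base}/{plural}?{query}"


# Hypothetical group, resource, and field names.
url = list_url("example.com", "v1", "widgets", "spec.color=blue", namespace="default")
print(url)
```

If the response contains only objects matching the selector, the feature is working for that resource. The same check can be made with `kubectl get widgets --field-selector spec.color=blue`, again using the hypothetical names above.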
966944

967945
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
968946

969-
<!--
970-
This is your opportunity to define what "normal" quality of service looks like
971-
for a feature.
972-
973-
It's impossible to provide comprehensive guidance, but at the very
974-
high level (needs more precise definitions) those may be things like:
975-
- per-day percentage of API calls finishing with 5XX errors <= 1%
976-
- 99% percentile over day of absolute value from (job creation time minus expected
977-
job creation time) for cron job <= 10%
978-
- 99.9% of /health requests per day finish with 200 code
979-
980-
These goals will help you determine what you need to measure (SLIs) in the next
981-
question.
982-
-->
947+
This feature reduces the expected sizes of list responses, and so falls under the SLOs for read responses.
983948

984949
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
985950

986-
TODO: Should we add a "filtered" label to the below metrics? It would help isolate problems with selectors better.
987-
988951
- [x] Metrics
989952
- Metric name:
990953
- [request_total](https://github.com/kubernetes/kubernetes/blob/a97f4b7a3123c9768ec7136b6ca32be926e16cd6/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L81)
@@ -997,154 +960,66 @@ High `request_duration_seconds` for list requests may indicate a performance pro
997960

998961
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
999962

1000-
<!--
1001-
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
1002-
implementation difficulties, etc.).
1003-
-->
963+
Yes, we will add "has_label_selector" and "has_field_selector" labels to the request_total and request_duration_seconds metrics. We intend to keep this simple (for low cardinality) and only use a true/false label value.
1004964

1005965
### Dependencies
1006966

1007-
<!--
1008-
This section must be completed when targeting beta to a release.
1009-
-->
1010-
1011967
###### Does this feature depend on any specific services running in the cluster?
1012968

1013-
<!--
1014-
Think about both cluster-level services (e.g. metrics-server) as well
1015-
as node-level agents (e.g. specific version of CRI). Focus on external or
1016-
optional services that are needed. For example, if this feature depends on
1017-
a cloud provider API, or upon an external software-defined storage or network
1018-
control plane.
1019-
1020-
For each of these, fill in the following—thinking about running existing user workloads
1021-
and creating new ones, as well as about cluster-level services (e.g. DNS):
1022-
- [Dependency name]
1023-
- Usage description:
1024-
- Impact of its outage on the feature:
1025-
- Impact of its degraded performance or high-error rates on the feature:
969+
No
1026970
-->
1027971

1028972
### Scalability
1029973

1030-
<!--
1031-
For alpha, this section is encouraged: reviewers should consider these questions
1032-
and attempt to answer them.
1033-
1034-
For beta, this section is required: reviewers must answer these questions.
1035-
1036-
For GA, this section is required: approvers should be able to confirm the
1037-
previous answers based on experience in the field.
1038-
-->
1039-
1040974
###### Will enabling / using this feature result in any new API calls?
1041975

1042-
<!--
1043-
Describe them, providing:
1044-
- API call type (e.g. PATCH pods)
1045-
- estimated throughput
1046-
- originating component(s) (e.g. Kubelet, Feature-X-controller)
1047-
Focusing mostly on:
1048-
- components listing and/or watching resources they didn't before
1049-
- API calls that may be triggered by changes of some Kubernetes resources
1050-
(e.g. update of object X triggers new updates of object Y)
1051-
- periodic API calls to reconcile state (e.g. periodic fetching state,
1052-
heartbeats, leader election, etc.)
1053-
-->
976+
No
1054977

1055978
###### Will enabling / using this feature result in introducing new API types?
1056979

1057-
<!--
1058-
Describe them, providing:
1059-
- API type
1060-
- Supported number of objects per cluster
1061-
- Supported number of objects per namespace (for namespace-scoped objects)
1062-
-->
980+
No
1063981

1064982
###### Will enabling / using this feature result in any new calls to the cloud provider?
1065983

1066-
<!--
1067-
Describe them, providing:
1068-
- Which API(s):
1069-
- Estimated increase:
1070-
-->
984+
No
1071985

1072986
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
1073987

1074-
<!--
1075-
Describe them, providing:
1076-
- API type(s):
1077-
- Estimated increase in size: (e.g., new annotation of size 32B)
1078-
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
1079-
-->
988+
No
1080989

1081990
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
1082991

1083-
<!--
1084-
Look at the [existing SLIs/SLOs].
1085-
1086-
Think about adding additional work or introducing new steps in between
1087-
(e.g. need to do X to start a container), etc. Please describe the details.
992+
No
1088993

1089-
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
1090-
-->
994+
Note that this feature does not fundamentally enable capabilities not already available. Today, users add labels to resources
995+
to enable filtering. This feature merely eliminates the need to "labelize" resources in this way.
1091996

1092997
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
1093998

1094-
<!--
1095-
Things to keep in mind include: additional in-memory state, additional
1096-
non-trivial computations, excessive access to disks (including increased log
1097-
volume), significant amount of data sent and/or received over network, etc.
1098-
Think through this both in small and large cases, again with respect to the
1099-
[supported limits].
1100-
1101-
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
1102-
-->
999+
No
11031000

11041001
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
11051002

1106-
<!--
1107-
Focus not just on happy cases, but primarily on more pathological cases
1108-
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
1109-
If any of the resources can be exhausted, how this is mitigated with the existing limits
1110-
(e.g. pods per node) or new limits added by this KEP?
1111-
1112-
Are there any tests that were run/should be run to understand performance characteristics better
1113-
and validate the declared limits?
1114-
-->
1003+
No
11151004

11161005
### Troubleshooting
11171006

1118-
<!--
1119-
This section must be completed when targeting beta to a release.
1120-
1121-
For GA, this section is required: approvers should be able to confirm the
1122-
previous answers based on experience in the field.
1007+
###### How does this feature react if the API server and/or etcd is unavailable?
11231008

1124-
The Troubleshooting section currently serves the `Playbook` role. We may consider
1125-
splitting it into a dedicated `Playbook` document (potentially with some monitoring
1126-
details). For now, we leave it here.
1127-
-->
1009+
The feature is provided by the API server and is unavailable if the API server is unavailable.
11281010

1129-
###### How does this feature react if the API server and/or etcd is unavailable?
1011+
Requests served from etcd are also unavailable when etcd is unavailable. (Requests served from the watch cache remain available.)
11301012

11311013
###### What are other known failure modes?
11321014

1133-
<!--
1134-
For each of them, fill in the following information by copying the below template:
1135-
- [Failure mode brief description]
1136-
- Detection: How can it be detected via metrics? Stated another way:
1137-
how can an operator troubleshoot without logging into a master or worker node?
1138-
- Mitigations: What can be done to stop the bleeding, especially for already
1139-
running user workloads?
1140-
- Diagnostics: What are the useful log messages and their required logging
1141-
levels that could help debug the issue?
1142-
Not required until feature graduated to beta.
1143-
- Testing: Are there any tests for failure mode? If not, describe why.
1144-
-->
1015+
N/A
11451016

11461017
###### What steps should be taken if SLOs are not being met to determine the problem?
11471018

1019+
Check the request_duration_seconds metric with 'has_field_selector=true' on CRD types to determine whether filtered
1020+
list requests exceed the SLO. Then narrow the problem down to a specific 'component', or inspect apiserver logs
1021+
to identify the exact requests exceeding the SLO.
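The check described above can be sketched as follows, assuming the metric is scraped in the Prometheus text exposition format under the name `apiserver_request_duration_seconds` and that the planned `has_field_selector` label has been added. The `mean_duration` helper, the `widgets` resource, and the sample values are hypothetical. A mean is only a coarse signal, since the SLO itself is usually expressed as a percentile.

```python
import re

# Matches lines such as: metric_name{label="value",...} 12.5
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def mean_duration(text, selector):
    """Mean latency in seconds over histogram series matching `selector`."""
    total_sum = 0.0
    total_count = 0.0
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and unlabelled series
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels))
        if any(labels.get(key) != want for key, want in selector.items()):
            continue
        if name.endswith("_sum"):
            total_sum += float(value)
        elif name.endswith("_count"):
            total_count += float(value)
    return total_sum / total_count if total_count else None


# Made-up scrape output for illustration only.
scrape = """
apiserver_request_duration_seconds_sum{verb="LIST",resource="widgets",has_field_selector="true"} 12.0
apiserver_request_duration_seconds_count{verb="LIST",resource="widgets",has_field_selector="true"} 48
apiserver_request_duration_seconds_sum{verb="LIST",resource="widgets",has_field_selector="false"} 30.0
apiserver_request_duration_seconds_count{verb="LIST",resource="widgets",has_field_selector="false"} 60
"""
filtered = mean_duration(scrape, {"verb": "LIST", "has_field_selector": "true"})
unfiltered = mean_duration(scrape, {"verb": "LIST", "has_field_selector": "false"})
print(filtered, unfiltered)
```

Comparing the filtered and unfiltered series side by side helps show whether slow list requests are specific to field selectors or affect list requests in general.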
1022+
11481023
## Implementation History
11491024

11501025
<!--

keps/sig-api-machinery/4358-custom-resource-field-selectors/kep.yaml

Lines changed: 3 additions & 2 deletions
@@ -16,16 +16,17 @@ see-also:
1616
- "/keps/sig-api-machinery/95-custom-resource-definitions"
1717

1818
# The target maturity stage in the current dev cycle for this KEP.
19-
stage: alpha
19+
stage: beta
2020

2121
# The most recent milestone for which work toward delivery of this KEP has been
2222
# done. This can be the current (upcoming) milestone, if it is being actively
2323
# worked on.
24-
latest-milestone: "v1.30"
24+
latest-milestone: "v1.31"
2525

2626
# The milestone at which this feature was, or is targeted to be, at each stage.
2727
milestone:
2828
alpha: "v1.30"
29+
beta: "v1.31"
2930

3031
# The following PRR answers are required at alpha release
3132
# List the feature gate name and the components for which it must be enabled
