Merge pull request kubernetes#3177 from SergeyKanzhelev/grpcbeta

k8s-ci-robot · web-flow · commit ba726bfb46fc · 2022-02-01T11:45:49.000-08:00
Promote gRPC container probes to beta
diff --git a/keps/prod-readiness/sig-node/2727.yaml b/keps/prod-readiness/sig-node/2727.yaml
@@ -1,3 +1,5 @@
 kep-number: 2727
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"
diff --git a/keps/sig-node/2727-grpc-probe/README.md b/keps/sig-node/2727-grpc-probe/README.md
@@ -8,6 +8,7 @@
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Test Plan](#test-plan)
+  - [Alternative Considerations](#alternative-considerations)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
     - [Beta](#beta)
@@ -23,24 +24,26 @@
   - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Implementation History](#implementation-history-1)
+  - [Alpha](#alpha-1)
+  - [Beta](#beta-1)
 - [Alternatives](#alternatives)
 - [References](#references)
 <!-- /toc -->
 
 
 ## Release Signoff Checklist
 
-- [ ] Enhancement issue in release milestone, which links to KEP dir in
+- [X] Enhancement issue in release milestone, which links to KEP dir in
   [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] KEP approvers have approved the KEP status as `implementable`
-- [ ] Design details are appropriately documented
-- [ ] Test plan is in place, giving consideration to SIG Architecture
+- [X] KEP approvers have approved the KEP status as `implementable`
+- [X] Design details are appropriately documented
+- [X] Test plan is in place, giving consideration to SIG Architecture
       and SIG Testing input
-- [ ] Graduation criteria is in place
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in
+- [X] Graduation criteria is in place
+- [X] "Implementation History" section is up-to-date for milestone
+- [X] User-facing documentation has been created in
   [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation e.g., additional design documents,
+- [X] Supporting documentation e.g., additional design documents,
   links to mailing list discussions/SIG meetings, relevant PRs/issues,
   release notes
 
@@ -76,33 +79,25 @@ and `StartupProbe`. Example:
 ```
 
 This will result in the use of gRPC (using HTTP/2 over TLS) to use the
-standard healthcheck service to determine the health of the
-container. As spec'd, the `kubelet` probe will not allow use of client
+standard healthcheck service (`Check` method) to determine the health of the
+container. Using `Watch` method of the healthcheck service is not supported,
+but may be considered in future iterations.
+As spec'd, the `kubelet` probe will not allow use of client
 certificates nor verify the certificate on the container.  We do not
 support other protocols for the time being (unencrypted HTTP/2, QUIC).
 
-Note that `readinessProbe.grpc.service` may be confusing, some
-alternatives:
-
-- `serviceName`
-- `healthCheckServiceName`
-- `grpcService`
-- `grpcServiceName`
-
-These options can be added in Beta with user feedback.
-
 The healthcheck request will be identified with the following gRPC
 `User-Agent` metadata. This user agent will be statically defined (not
 configurable by the user):
 
 ```
-User-Agent: kubernetes/K8S_MAJOR_VER.K8S_MINOR_VER
+User-Agent: kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER
 ```
 
 Example:
 
 ```
-User-Agent: kubernetes/1.22
+User-Agent: kube-probe/1.23
 ```
 
 ### Risks and Mitigations
@@ -151,6 +146,18 @@ move users away from using the (portNum, portName) union type.
 - Unit test: Add unit tests to `pkg/kubelet/prober/...`
 - e2e: Add test case and conformance test to `e2e/common/node/container_probe.go`.
 
+### Alternative Considerations
+
+Note that `readinessProbe.grpc.service` may be confusing, some
+alternatives considered:
+
+- `serviceName`
+- `healthCheckServiceName`
+- `grpcService`
+- `grpcServiceName`
+
+There were no feedback on the selected name being confusing in the context of a probe definition.
+
 ### Graduation Criteria
 
 #### Alpha
@@ -160,11 +167,7 @@ move users away from using the (portNum, portName) union type.
 
 #### Beta
 
-- Solicit feedback from the Alpha. Validate that API is appropriate
-  for users. There are some potential tunables:
-  - `User-Agent`
-  - connect timeout
-  - protocol (HTTP, QUIC)
+- Solicit feedback from the Alpha.
 - Ensure tests are stable and passing.
 
 Depending on skew strategy:
@@ -174,7 +177,11 @@ Depending on skew strategy:
 
 #### GA
 
-- Address feedback from beta
+- Address feedback from beta usage
+- Validate that API is appropriate for users. There are some potential tunables:
+  - `User-Agent`
+  - connect timeout
+  - protocol (HTTP, QUIC)
 - Close on any remaining open issues & bugs
 
 ### Upgrade / Downgrade Strategy
@@ -286,102 +293,40 @@ No
 
 ### Monitoring Requirements
 
-TODO for Beta.
-
-<!--
-
 ###### How can an operator determine if the feature is in use by workloads?
 
-TODO for Beta.
-
-<!--
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
-->
+When gRPC probe is configured, Pod must be scheduled and, the metric
+`probe_total` can be observed to see the result of probe execution.
 
 ###### How can someone using this feature know that it is working for their instance?
 
-TODO for Beta.
+When gRPC probe is configured, Pod must be scheduled and, the metric
+`probe_total` can be observed to see the result of probe execution.
 
-<!--
-For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
-for each individual pod.
-Pick one more of these and delete the rest.
-Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
-and operation of this feature.
-Recall that end users cannot usually observe component logs or access metrics.
-->
-
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
-  - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+Event will be emitted for the failed probe and logs available in `kubelet.log`
+to troubleshoot the failing probes.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-<!--
-This is your opportunity to define what "normal" quality of service looks like
-for a feature.
-
-It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99.9% of /health requests per day finish with 200 code
-
-These goals will help you determine what you need to measure (SLIs) in the next
-question.
-->
+Probe must succeed whenever service has returned the correct response
+in defined timeout, and fail otherwise.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-<!--
-Pick one more of these and delete the rest.
-->
-
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+The metric `probe_total` can be used to check for the probe result. Event and
+`kubelet.log` log entries can be observed to troubleshoot issues.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+Creation of a probe duration metric is tracked in this issue:
+https://github.com/kubernetes/kubernetes/issues/101035 and out of scope for this
+KEP.
 
 ### Dependencies
 
-Beta TODO
-
-<!--
-This section must be completed when targeting beta to a release.
-->
-
 ###### Does this feature depend on any specific services running in the cluster?
 
-<!--
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high-error rates on the feature:
--->
+No
 
 ### Scalability
 
@@ -399,48 +344,37 @@ No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-Adds < 200 bytes to Pod.Spec.
-
+Adds < 200 bytes to Pod.Spec, which is consistent with other probe types.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
 No.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
-No.
+The overhead of executing probes is consistent with other probe types.
 
-### Troubleshooting
+We expect decrease of disk, RAM, and CPU use for many scenarios where the https://github.com/grpc-ecosystem/grpc-health-probe
+was used to probe gRPC endpoints.
 
-Beta TODO.
-<!--
-This section must be completed when targeting beta to a release.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
-->
+### Troubleshooting
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
+No dependency on etcd availability.
+
 ###### What are other known failure modes?
 
-<!--
-For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
-->
+None
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
--->
+- Make sure feature gate is set
+- Make sure configuration is correct and gRPC service is reacheable by kubelet.
+  This may be different when migrating off https://github.com/grpc-ecosystem/grpc-health-probe
+  and is covered in feature documentation.
+- `kubelet.log` log must be analyzed to understand why there is a mismatch of
+  service response and status reported by probe.
 
 ## Implementation History
 
@@ -462,6 +396,14 @@ Major milestones might include:
 * 2021-05-12: Cloned to this KEP to move the probe forward.
 * 2021-05-13: Updates.
 
+### Alpha
+
+Alpha feature was implemented in 1.23.
+
+### Beta
+
+Feature is promoted to beta in 1.24.
+
 ## Alternatives
 
 * 3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe
diff --git a/keps/sig-node/2727-grpc-probe/kep.yaml b/keps/sig-node/2727-grpc-probe/kep.yaml
@@ -16,9 +16,18 @@ reviewers:
   - "@SergeyKanzhelev"
 approvers:
   - "@thockin"
+  - "@dchen1107"
 see-also:
-replaces:
 prr-approvers:
   - "@johnbelarmic"
-stage: "alpha"
-latest-milestone: "v1.23"
+stage: "beta"
+latest-milestone: "v1.24"
+milestone:
+  alpha: "v1.23"
+  beta: "v1.24"
+feature-gates:
+  - name: GRPCContainerProbe
+    components:
+      - kube-apiserver
+      - kubelet
+disable-supported: true