Commit ba726bf

Merge pull request kubernetes#3177 from SergeyKanzhelev/grpcbeta
Promote gRPC container probes to beta
2 parents e693612 + 97f069d commit ba726bf

File tree

3 files changed: +85 -132 lines changed
Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
 kep-number: 2727
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"

keps/sig-node/2727-grpc-probe/README.md

Lines changed: 71 additions & 129 deletions
@@ -8,6 +8,7 @@
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
 - [Test Plan](#test-plan)
+- [Alternative Considerations](#alternative-considerations)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
 - [Beta](#beta)
@@ -23,24 +24,26 @@
 - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Implementation History](#implementation-history-1)
+- [Alpha](#alpha-1)
+- [Beta](#beta-1)
 - [Alternatives](#alternatives)
 - [References](#references)
 <!-- /toc -->


 ## Release Signoff Checklist

-- [ ] Enhancement issue in release milestone, which links to KEP dir in
+- [X] Enhancement issue in release milestone, which links to KEP dir in
   [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] KEP approvers have approved the KEP status as `implementable`
-- [ ] Design details are appropriately documented
-- [ ] Test plan is in place, giving consideration to SIG Architecture
+- [X] KEP approvers have approved the KEP status as `implementable`
+- [X] Design details are appropriately documented
+- [X] Test plan is in place, giving consideration to SIG Architecture
   and SIG Testing input
-- [ ] Graduation criteria is in place
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in
+- [X] Graduation criteria is in place
+- [X] "Implementation History" section is up-to-date for milestone
+- [X] User-facing documentation has been created in
   [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation e.g., additional design documents,
+- [X] Supporting documentation e.g., additional design documents,
   links to mailing list discussions/SIG meetings, relevant PRs/issues,
   release notes

@@ -76,33 +79,25 @@ and `StartupProbe`. Example:
 ```

 This will result in the use of gRPC (using HTTP/2 over TLS) to use the
-standard healthcheck service to determine the health of the
-container. As spec'd, the `kubelet` probe will not allow use of client
+standard healthcheck service (`Check` method) to determine the health of the
+container. Using the `Watch` method of the healthcheck service is not supported,
+but may be considered in future iterations.
+As spec'd, the `kubelet` probe will not allow use of client
 certificates nor verify the certificate on the container. We do not
 support other protocols for the time being (unencrypted HTTP/2, QUIC).

-Note that `readinessProbe.grpc.service` may be confusing, some
-alternatives:
-
-- `serviceName`
-- `healthCheckServiceName`
-- `grpcService`
-- `grpcServiceName`
-
-These options can be added in Beta with user feedback.
-
 The healthcheck request will be identified with the following gRPC
 `User-Agent` metadata. This user agent will be statically defined (not
 configurable by the user):

 ```
-User-Agent: kubernetes/K8S_MAJOR_VER.K8S_MINOR_VER
+User-Agent: kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER
 ```

 Example:

 ```
-User-Agent: kubernetes/1.22
+User-Agent: kube-probe/1.23
 ```

 ### Risks and Mitigations
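The probe configuration described in the hunk above can be sketched as a Pod manifest. This is an illustrative example, not taken from the KEP: the pod name, image, and port number are placeholders, while the `grpc.port` and `grpc.service` field names match the API this KEP defines.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: grpc-probe-demo              # placeholder name
spec:
  containers:
  - name: server
    image: example.com/grpc-server   # placeholder image
    ports:
    - containerPort: 2379
    readinessProbe:
      grpc:
        port: 2379     # port the container's gRPC server listens on
        service: ""    # optional; sent as HealthCheckRequest.service
      initialDelaySeconds: 5
```

With this spec, the kubelet calls `grpc.health.v1.Health/Check` on port 2379 and reports the container Ready only while the response status is `SERVING`.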
@@ -151,6 +146,18 @@ move users away from using the (portNum, portName) union type.
 - Unit test: Add unit tests to `pkg/kubelet/prober/...`
 - e2e: Add test case and conformance test to `e2e/common/node/container_probe.go`.

+### Alternative Considerations
+
+Note that `readinessProbe.grpc.service` may be confusing; some
+alternatives considered:
+
+- `serviceName`
+- `healthCheckServiceName`
+- `grpcService`
+- `grpcServiceName`
+
+There was no feedback on the selected name being confusing in the context of a probe definition.
+
 ### Graduation Criteria

 #### Alpha
@@ -160,11 +167,7 @@ move users away from using the (portNum, portName) union type.

 #### Beta

-- Solicit feedback from the Alpha. Validate that API is appropriate
-  for users. There are some potential tunables:
-  - `User-Agent`
-  - connect timeout
-  - protocol (HTTP, QUIC)
+- Solicit feedback from the Alpha.
 - Ensure tests are stable and passing.

 Depending on skew strategy:
@@ -174,7 +177,11 @@ Depending on skew strategy:

 #### GA

-- Address feedback from beta
+- Address feedback from beta usage
+- Validate that the API is appropriate for users. There are some potential tunables:
+  - `User-Agent`
+  - connect timeout
+  - protocol (HTTP, QUIC)
 - Close on any remaining open issues & bugs

 ### Upgrade / Downgrade Strategy
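Until the tunables deferred to GA above are considered, the only knobs available are the generic probe fields shared by all probe types. A hedged sketch with illustrative values:

```yaml
livenessProbe:
  grpc:
    port: 2379
  timeoutSeconds: 1     # overall probe timeout; a dedicated connect timeout is only a potential GA tunable
  periodSeconds: 10
  failureThreshold: 3
```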
@@ -286,102 +293,40 @@ No

 ### Monitoring Requirements

-TODO for Beta.
-
-<!--
-
 ###### How can an operator determine if the feature is in use by workloads?

-TODO for Beta.
-
-<!--
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
--->
+When the gRPC probe is configured, the Pod must be scheduled, and the metric
+`probe_total` can be observed to see the result of probe execution.

 ###### How can someone using this feature know that it is working for their instance?

-TODO for Beta.
+When the gRPC probe is configured, the Pod must be scheduled, and the metric
+`probe_total` can be observed to see the result of probe execution.

-<!--
-For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
-for each individual pod.
-Pick one more of these and delete the rest.
-Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
-and operation of this feature.
-Recall that end users cannot usually observe component logs or access metrics.
--->
-
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
-  - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+An Event will be emitted for the failed probe, and logs are available in `kubelet.log`
+to troubleshoot the failing probes.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-<!--
-This is your opportunity to define what "normal" quality of service looks like
-for a feature.
-
-It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-  job creation time) for cron job <= 10%
-- 99.9% of /health requests per day finish with 200 code
-
-These goals will help you determine what you need to measure (SLIs) in the next
-question.
--->
+The probe must succeed whenever the service has returned the correct response
+within the defined timeout, and fail otherwise.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-<!--
-Pick one more of these and delete the rest.
--->
-
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+The metric `probe_total` can be used to check for the probe result. Events and
+`kubelet.log` log entries can be observed to troubleshoot issues.

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-<!--
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
--->
+Creation of a probe duration metric is tracked in this issue:
+https://github.com/kubernetes/kubernetes/issues/101035 and is out of scope for this
+KEP.

 ### Dependencies

-Beta TODO
-
-<!--
-This section must be completed when targeting beta to a release.
--->
-
 ###### Does this feature depend on any specific services running in the cluster?

-<!--
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
-  - Usage description:
-    - Impact of its outage on the feature:
-    - Impact of its degraded performance or high-error rates on the feature:
--->
+No

 ### Scalability

@@ -399,48 +344,37 @@ No.

 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

-Adds < 200 bytes to Pod.Spec.
-
+Adds < 200 bytes to Pod.Spec, which is consistent with other probe types.

 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

 No.

 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

-No.
+The overhead of executing probes is consistent with other probe types.

-### Troubleshooting
+We expect a decrease of disk, RAM, and CPU use for many scenarios where
+https://github.com/grpc-ecosystem/grpc-health-probe
+was used to probe gRPC endpoints.

-Beta TODO.
-<!--
-This section must be completed when targeting beta to a release.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
+### Troubleshooting

 ###### How does this feature react if the API server and/or etcd is unavailable?

+No dependency on etcd availability.
+
 ###### What are other known failure modes?

-<!--
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-  - Detection: How can it be detected via metrics? Stated another way:
-    how can an operator troubleshoot without logging into a master or worker node?
-  - Mitigations: What can be done to stop the bleeding, especially for already
-    running user workloads?
-  - Diagnostics: What are the useful log messages and their required logging
-    levels that could help debug the issue?
-    Not required until feature graduated to beta.
-  - Testing: Are there any tests for failure mode? If not, describe why.
--->
+None

 ###### What steps should be taken if SLOs are not being met to determine the problem?

--->
+- Make sure the feature gate is set.
+- Make sure the configuration is correct and the gRPC service is reachable by kubelet.
+  This may be different when migrating off https://github.com/grpc-ecosystem/grpc-health-probe
+  and is covered in feature documentation.
+- The `kubelet.log` log must be analyzed to understand why there is a mismatch of
+  service response and status reported by probe.

 ## Implementation History

@@ -462,6 +396,14 @@ Major milestones might include:
 * 2021-05-12: Cloned to this KEP to move the probe forward.
 * 2021-05-13: Updates.

+### Alpha
+
+The Alpha feature was implemented in 1.23.
+
+### Beta
+
+The feature is promoted to beta in 1.24.
+
 ## Alternatives

 * 3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe
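Since the Alternatives section points at grpc-health-probe, a before/after sketch of migrating to the native probe may be useful. The binary path and port are illustrative:

```yaml
# Before: exec probe shelling out to the grpc-health-probe binary
livenessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:2379"]

# After: native gRPC probe (beta in v1.24)
livenessProbe:
  grpc:
    port: 2379
```

The native probe removes the need to bundle a probe binary in the image and avoids spawning a process for every probe execution.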

keps/sig-node/2727-grpc-probe/kep.yaml

Lines changed: 12 additions & 3 deletions

@@ -16,9 +16,18 @@ reviewers:
   - "@SergeyKanzhelev"
 approvers:
   - "@thockin"
+  - "@dchen1107"
 see-also:
-replaces:
 prr-approvers:
   - "@johnbelarmic"
-stage: "alpha"
-latest-milestone: "v1.23"
+stage: "beta"
+latest-milestone: "v1.24"
+milestone:
+  alpha: "v1.23"
+  beta: "v1.24"
+feature-gates:
+  - name: GRPCContainerProbe
+    components:
+      - kube-apiserver
+      - kubelet
+disable-supported: true
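The `GRPCContainerProbe` gate listed above can be set on the kubelet through its configuration file. A minimal sketch; as a beta feature the gate is expected to be on by default, so this is only needed where it was explicitly disabled:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GRPCContainerProbe: true
```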
