Commit 1cf0b07

2727 - promote grpc probes to GA (#3807)
* promote grpc probes to GA
* Update keps/sig-node/2727-grpc-probe/kep.yaml

  Co-authored-by: Mark Rossetti <[email protected]>
* updated to the latest template
* Filled up Drawbacks

---------

Co-authored-by: Mark Rossetti <[email protected]>
1 parent a86925e commit 1cf0b07

3 files changed, +132 -81 lines changed


keps/prod-readiness/sig-node/2727.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@ alpha:
33
approver: "@johnbelamaric"
44
beta:
55
approver: "@johnbelamaric"
6+
stable:
7+
approver: "@johnbelamaric"

keps/sig-node/2727-grpc-probe/README.md

Lines changed: 125 additions & 77 deletions
@@ -2,13 +2,19 @@
22

33
<!-- toc -->
44
- [Release Signoff Checklist](#release-signoff-checklist)
5-
- [Goals](#goals)
6-
- [Non-Goals](#non-goals)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
79
- [Proposal](#proposal)
810
- [Risks and Mitigations](#risks-and-mitigations)
911
- [Design Details](#design-details)
10-
- [Test Plan](#test-plan)
1112
- [Alternative Considerations](#alternative-considerations)
13+
- [Test Plan](#test-plan)
14+
- [Prerequisite testing updates](#prerequisite-testing-updates)
15+
- [Unit tests](#unit-tests)
16+
- [Integration tests](#integration-tests)
17+
- [e2e tests](#e2e-tests)
1218
- [Graduation Criteria](#graduation-criteria)
1319
- [Alpha](#alpha)
1420
- [Beta](#beta)
@@ -23,11 +29,13 @@
2329
- [Scalability](#scalability)
2430
- [Troubleshooting](#troubleshooting)
2531
- [Implementation History](#implementation-history)
26-
- [Implementation History](#implementation-history-1)
2732
- [Alpha](#alpha-1)
2833
- [Beta](#beta-1)
34+
- [GA](#ga-1)
35+
- [Drawbacks](#drawbacks)
2936
- [Alternatives](#alternatives)
3037
- [References](#references)
38+
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
3139
<!-- /toc -->
3240

3341

@@ -52,17 +60,35 @@
5260
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
5361
[kubernetes/website]: https://git.k8s.io/website
5462

55-
## Goals
63+
## Summary
64+
65+
Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe.
66+
67+
## Motivation
68+
69+
gRPC is a widespread RPC framework. Existing ways to add
70+
probes to gRPC apps, such as exposing an additional HTTP endpoint
71+
for health checks or packaging an external gRPC client in the
72+
image and using exec probes, have many limitations and add overhead.
73+
74+
Many load balancers support gRPC natively so adding it to
75+
Kubernetes aligns well with the industry.
76+
77+
Finally, the Kubernetes project actively uses gRPC, so adding built-in
78+
support for gRPC endpoints does not introduce any new dependencies
79+
to the project.
80+
81+
### Goals
5682

5783
Enable gRPC probe natively from Kubelet without requiring users to package a
5884
gRPC healthcheck binary with their container.
5985

6086
- https://github.com/grpc-ecosystem/grpc-health-probe
6187
- https://github.com/grpc/grpc/blob/master/doc/health-checking.md
6288

63-
## Non-Goals
89+
### Non-Goals
6490

65-
Add gRPC support in other areas of K8s (e.g. Services).
91+
- Add gRPC support in other areas of K8s (e.g. Services).
6692

6793
## Proposal
6894

@@ -141,11 +167,6 @@ Note that `GRPCAction.Port` is an int32, which is inconsistent with
141167
the other existing probe definitions. This is on purpose -- we want to
142168
move users away from using the (portNum, portName) union type.
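
The note above can be illustrated with a minimal Go sketch. The struct here is a local stand-in that mirrors the field shape the KEP describes (a numeric `Port` and an optional `Service`); it is not imported from the Kubernetes API packages, and `validatePort` and the sample values are hypothetical helpers added only for illustration.

```go
package main

import "fmt"

// GRPCAction is a local illustration of the probe shape described in the KEP.
// Port is a plain int32, so no named-port lookup is involved.
type GRPCAction struct {
	Port    int32
	Service *string // optional; nil means no service name is sent
}

// validatePort is a hypothetical helper: with a numeric-only port,
// validation reduces to a simple range check.
func validatePort(a GRPCAction) error {
	if a.Port < 1 || a.Port > 65535 {
		return fmt.Errorf("grpc probe port %d is outside 1-65535", a.Port)
	}
	return nil
}

func main() {
	svc := "my-service" // hypothetical service name
	probe := GRPCAction{Port: 9090, Service: &svc}
	if err := validatePort(probe); err != nil {
		fmt.Println("invalid probe:", err)
		return
	}
	fmt.Printf("probe port %d, service %q\n", probe.Port, *probe.Service)
}
```

Because the port is only ever a number, a consumer of this type never has to resolve a port name against the container's declared ports before probing.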
143169

144-
### Test Plan
145-
146-
- Unit test: Add unit tests to `pkg/kubelet/prober/...`
147-
- e2e: Add test case and conformance test to `e2e/common/node/container_probe.go`.
148-
149170
### Alternative Considerations
150171

151172
Note that `readinessProbe.grpc.service` may be confusing, some
@@ -158,6 +179,47 @@ alternatives considered:
158179

159180
There was no feedback that the selected name is confusing in the context of a probe definition.
160181

182+
### Test Plan
183+
184+
<!--
185+
**Note:** *Not required until targeted at a release.*
186+
The goal is to ensure that we don't accept enhancements with inadequate testing.
187+
188+
All code is expected to have adequate tests (eventually with coverage
189+
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
190+
when drafting this test plan.
191+
192+
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
193+
-->
194+
195+
[X] I/we understand the owners of the involved components may require updates to
196+
existing tests to make this code solid enough prior to committing the changes necessary
197+
to implement this enhancement.
198+
199+
##### Prerequisite testing updates
200+
201+
<!--
202+
Based on reviewers feedback describe what additional tests need to be added prior
203+
implementing this enhancement to ensure the enhancements have also solid foundations.
204+
-->
205+
206+
##### Unit tests
207+
208+
- `k8s.io/kubernetes/pkg/probe/grpc`: `2023/02/06` - `78.1%`
209+
210+
##### Integration tests
211+
212+
N/A, only unit tests and e2e coverage.
213+
214+
##### e2e tests
215+
216+
Tests in `test/e2e/common/node/container_probe.go`:
217+
218+
- should *not* be restarted with a GRPC liveness probe: [results](https://storage.googleapis.com/k8s-triage/index.html?test=Probing%20container%20should%20%5C*not%5C*%20be%20restarted%20with%20a%20GRPC%20liveness%20probe)
219+
- should be restarted with a GRPC liveness probe: [results](https://storage.googleapis.com/k8s-triage/index.html?test=should%20be%20restarted%20with%20a%20GRPC%20liveness%20probe)
220+
221+
TODO: stress test to validate the scale (see GA requirements).
222+
161223
### Graduation Criteria
162224

163225
#### Alpha
@@ -177,12 +239,14 @@ Depending on skew strategy:
177239

178240
#### GA
179241

180-
- Address feedback from beta usage
181-
- Validate that API is appropriate for users. There are some potential tunables:
242+
- [X] Address feedback from beta usage
243+
- [X] Validate that API is appropriate for users. There are some potential tunables:
182244
- `User-Agent`
183245
- connect timeout
184246
- protocol (HTTP, QUIC)
185-
- Close on any remaining open issues & bugs
247+
- [ ] Close on any remaining open issues & bugs
248+
- [ ] Promote tests to conformance
249+
- [ ] Implement a stress test
186250

187251
### Upgrade / Downgrade Strategy
188252

@@ -198,38 +262,12 @@ Downgrade: gRPC probes will not be supported in a downgrade from Alpha.
198262

199263
## Production Readiness Review Questionnaire
200264

201-
<!--
202-
203-
Production readiness reviews are intended to ensure that features merging into
204-
Kubernetes are observable, scalable and supportable; can be safely operated in
205-
production environments, and can be disabled or rolled back in the event they
206-
cause increased failures in production. See more in the PRR KEP at
207-
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
208-
209-
The production readiness review questionnaire must be completed and approved
210-
for the KEP to move to `implementable` status and be included in the release.
211-
212-
In some cases, the questions below should also have answers in `kep.yaml`. This
213-
is to enable automation to verify the presence of the review, and to reduce review
214-
burden and latency.
215-
216-
The KEP must have a approver from the
217-
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
218-
team. Please reach out on the
219-
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
220-
you need any help or guidance.
221-
-->
222-
223265
### Feature Enablement and Rollback
224266

225267
Feature enablement will be guarded by a feature gate flag.
226268

227269
###### How can this feature be enabled / disabled in a live cluster?
228270

229-
<!--
230-
Pick one of these and delete the rest.
231-
-->
232-
233271
- [x] Feature gate (also fill in values in `kep.yaml`)
234272
- Feature gate name: `GRPCContainerProbe`
235273
- Components depending on the feature gate: `kubelet` (probing), API
@@ -250,42 +288,26 @@ It becomes enabled again after the `kubelet` restart.
250288

251289
###### Are there any tests for feature enablement/disablement?
252290

253-
Y
254-
es, unit tests for the feature when enabled and disabled will be
291+
Yes, unit tests for the feature when enabled and disabled will be
255292
implemented in both kubelet and api server.
256293

257294
### Rollout, Upgrade and Rollback Planning
258295

259-
<!--
260-
This section must be completed when targeting beta to a release.
261-
-->
296+
We passed the version skew problem for the new API. No planning is required.
262297

263298
###### How can a rollout or rollback fail? Can it impact already running workloads?
264299

265-
<!--
266-
Try to be as paranoid as possible - e.g., what if some components will restart
267-
mid-rollout?
268-
269-
Be sure to consider highly-available clusters, where, for example,
270-
feature flags will be enabled on some API servers and not others during the
271-
rollout. Similarly, consider large clusters and how enablement/disablement
272-
will rollout across nodes.
273-
-->
300+
We passed the version skew problem - the API will be available on any supported
301+
version skew. So no issues are expected with rollout and rollback.
274302

275303
###### What specific metrics should inform a rollback?
276304

277-
<!--
278-
What signals should users be paying attention to when the feature is young
279-
that might indicate a serious problem?
280-
-->
305+
Rollback wouldn't address issues. Pods will need to stop using the new probe
306+
type.
281307

282308
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
283309

284-
<!--
285-
Describe manual testing that was done and the outcomes.
286-
Longer term, we may want to require automated upgrade/rollback tests, but we
287-
are missing a bunch of machinery and tooling and can't do that now.
288-
-->
310+
N/A
289311

290312
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
291313

@@ -357,8 +379,27 @@ The overhead of executing probes is consistent with other probe types.
357379
We expect a decrease in disk, RAM, and CPU use in many scenarios where https://github.com/grpc-ecosystem/grpc-health-probe
358380
was used to probe gRPC endpoints.
359381

382+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
383+
384+
Yes, gRPC probes use node resources to establish a connection.
385+
This may lead to issues like [kubernetes/kubernetes#89898](https://github.com/kubernetes/kubernetes/issues/89898).
386+
387+
The node resources for gRPC probes can be exhausted by a Pod with HostPort
388+
making many connections to different destinations or any other process on a node.
389+
This problem cannot be addressed generically.
390+
391+
However, the design where node resources are being used for gRPC probes works
392+
for most setups. The default maximum number of pods per node is `110`. There is currently
393+
no limit on the number of containers. The number of containers is limited by the
394+
amount of resources requested by these containers. With the fix limiting
395+
the `TIME_WAIT` for the socket to 1 second,
396+
[this calculation](https://github.com/kubernetes/kubernetes/issues/89898#issuecomment-1383207322)
397+
demonstrates it will be hard to reach the limits on sockets.
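
As a rough back-of-the-envelope sketch of this kind of estimate (not the linked calculation itself), the Go snippet below multiplies the probe rate by the time each closed socket lingers. Only the `110` pods figure and the 1-second `TIME_WAIT` come from the text. The containers per pod, probes per container, 1-second probe period, the 60-second comparison value, and the Linux default ephemeral port range of 32768-60999 are assumptions chosen for illustration.

```go
package main

import "fmt"

// socketsInTimeWait estimates how many probe sockets linger at any moment:
// probes fired per second multiplied by how long each closed socket lingers.
func socketsInTimeWait(probes int, periodSeconds, timeWaitSeconds float64) float64 {
	probesPerSecond := float64(probes) / periodSeconds
	return probesPerSecond * timeWaitSeconds
}

func main() {
	const (
		pods               = 110 // default per-node pod maximum, from the text
		containersPerPod   = 3   // assumption for illustration
		probesPerContainer = 2   // assumption: liveness and readiness both use gRPC
		periodSeconds      = 1.0 // assumption: the most aggressive probe period
		// Assumption: Linux default net.ipv4.ip_local_port_range.
		ephemeralPorts = 60999 - 32768 + 1
	)
	probes := pods * containersPerPod * probesPerContainer

	// Compare the 1-second limit from the text with an assumed 60-second linger.
	for _, timeWait := range []float64{1, 60} {
		n := socketsInTimeWait(probes, periodSeconds, timeWait)
		fmt.Printf("TIME_WAIT=%.0fs: ~%.0f lingering sockets, %d ephemeral ports\n",
			timeWait, n, ephemeralPorts)
	}
}
```

Under these assumed inputs the estimate is 660 lingering sockets with a 1-second `TIME_WAIT` and 39600 with 60 seconds, against 28232 ephemeral ports, which is consistent with the claim that the 1-second limit makes socket exhaustion hard to reach.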
398+
360399
### Troubleshooting
361400

401+
Logs and Pod events can be used to troubleshoot probe failures.
402+
362403
###### How does this feature react if the API server and/or etcd is unavailable?
363404

364405
No dependency on etcd availability.
@@ -378,19 +419,6 @@ None
378419

379420
## Implementation History
380421

381-
<!--
382-
Major milestones in the lifecycle of a KEP should be tracked in this section.
383-
Major milestones might include:
384-
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
385-
- the `Proposal` section being merged, signaling agreement on a proposed design
386-
- the date implementation started
387-
- the first Kubernetes release where an initial version of the KEP was available
388-
- the version of Kubernetes where the KEP graduated to general availability
389-
- when the KEP was retired or superseded
390-
-->
391-
392-
## Implementation History
393-
394422
* Original PR for k8 Prober: https://github.com/kubernetes/kubernetes/pull/89832
395423
* 2020-04-04: MR for k8 Prober
396424
* 2021-05-12: Cloned to this KEP to move the probe forward.
@@ -404,10 +432,30 @@ Alpha feature was implemented in 1.23.
404432

405433
Feature is promoted to beta in 1.24.
406434

435+
### GA
436+
437+
Feature is promoted to GA in 1.27.
438+
439+
## Drawbacks
440+
441+
See [Motivation](#motivation) on why gRPC was picked as another RPC framework
442+
to support natively.
443+
444+
Adding gRPC is a small increment to k8s functionality with very few side
445+
effects, while providing many "quality of life" improvements to gRPC apps.
446+
407447
## Alternatives
408448

409449
* 3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe
410450

411451
## References
412452

413453
* GRPC healthchecking: https://github.com/grpc/grpc/blob/master/doc/health-checking.md
454+
455+
## Infrastructure Needed (Optional)
456+
457+
<!--
458+
Use this section if you need things from the project/SIG. Examples include a
459+
new subproject, repos requested, or GitHub details. Listing these here allows a
460+
SIG to get the process for these resources started right away.
461+
-->

keps/sig-node/2727-grpc-probe/kep.yaml

Lines changed: 5 additions & 4 deletions
@@ -3,26 +3,27 @@ kep-number: 2727
33
authors:
44
- "@bowei"
55
- "@PxyUp"
6+
- "@SergeyKanzhelev"
67
owning-sig: sig-node
78
participating-sigs:
89
- sig-node
910
- sig-network
1011
status: implementable
1112
creation-date: 2020-04-04
12-
last-updated: 2021-05-12
13+
last-updated: 2023-01-31
1314
reviewers:
1415
- "@thockin"
1516
- "@mrunalp"
16-
- "@SergeyKanzhelev"
1717
approvers:
1818
- "@thockin"
1919
- "@dchen1107"
2020
see-also:
21-
stage: "beta"
22-
latest-milestone: "v1.24"
21+
stage: "stable"
22+
latest-milestone: "v1.27"
2323
milestone:
2424
alpha: "v1.23"
2525
beta: "v1.24"
26+
stable: "v1.27"
2627
feature-gates:
2728
- name: GRPCContainerProbe
2829
components:
