<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- - [Goals](#goals)
- - [Non-Goals](#non-goals)
+ - [Summary](#summary)
+ - [Motivation](#motivation)
+ - [Goals](#goals)
+ - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- - [Test Plan](#test-plan)
- [Alternative Considerations](#alternative-considerations)
+ - [Test Plan](#test-plan)
+ - [Prerequisite testing updates](#prerequisite-testing-updates)
+ - [Unit tests](#unit-tests)
+ - [Integration tests](#integration-tests)
+ - [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- - [Implementation History](#implementation-history-1)
- [Alpha](#alpha-1)
- [Beta](#beta-1)
+ - [GA](#ga-1)
+ - [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [References](#references)
+ - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

- ## Goals
+ ## Summary
+
+ Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe.
+
+ ## Motivation
+
+ gRPC is a widely used RPC framework. Existing ways to probe gRPC apps,
+ such as exposing an additional HTTP endpoint for health checks or
+ packaging an external gRPC client into the image and running it from
+ an exec probe, have many limitations and add overhead.
+
+ Many load balancers support gRPC natively, so adding it to
+ Kubernetes aligns well with the industry.
+
+ Finally, the Kubernetes project actively uses gRPC, so adding built-in
+ support for gRPC endpoints does not introduce any new dependencies
+ to the project.
+
+ ### Goals

Enable gRPC probe natively from Kubelet without requiring users to package a
gRPC healthcheck binary with their container.

- https://github.com/grpc-ecosystem/grpc-health-probe
- https://github.com/grpc/grpc/blob/master/doc/health-checking.md

- ## Non-Goals
+ ### Non-Goals

- Add gRPC support in other areas of K8s (e.g. Services).
+ - Add gRPC support in other areas of K8s (e.g. Services).

## Proposal

@@ -141,11 +167,6 @@ Note that `GRPCAction.Port` is an int32, which is inconsistent with
the other existing probe definitions. This is on purpose -- we want to
move users away from using the (portNum, portName) union type.
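
To illustrate the difference, here is a minimal sketch using hypothetical stand-in types written in plain Python. They are not the real Kubernetes API types. The sketch assumes that the gRPC handler takes a numeric port plus an optional service name, that valid ports are 1 to 65535, and that older handlers accept either a port number or a named container port.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class GRPCAction:
    """Hypothetical stand-in for the gRPC probe handler: numeric port only."""
    port: int
    service: Optional[str] = None

    def __post_init__(self):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.port, bool) or not isinstance(self.port, int):
            raise TypeError("grpc.port must be an integer, not a port name")
        # Assumed valid range for a port number.
        if not 1 <= self.port <= 65535:
            raise ValueError("grpc.port must be between 1 and 65535")


@dataclass
class TCPSocketAction:
    """Hypothetical stand-in for an older handler using the (portNum, portName) union."""
    port: Union[int, str]


if __name__ == "__main__":
    print(GRPCAction(port=9090, service="readiness"))
    print(TCPSocketAction(port="metrics"))  # a named port is accepted here
    try:
        GRPCAction(port="grpc")  # a named port is rejected for gRPC
    except TypeError as err:
        print("rejected:", err)
```

With a plain integer there is only one way to spell the port, so no name lookup against the container's declared ports is needed when the probe runs.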

- ### Test Plan
-
- - Unit test: Add unit tests to `pkg/kubelet/prober/...`
- - e2e: Add test case and conformance test to `e2e/common/node/container_probe.go`.
-
### Alternative Considerations

Note that `readinessProbe.grpc.service` may be confusing, some
@@ -158,6 +179,47 @@ alternatives considered:

There was no feedback that the selected name is confusing in the context of a probe definition.

+ ### Test Plan
+
+ <!--
+ **Note:** *Not required until targeted at a release.*
+ The goal is to ensure that we don't accept enhancements with inadequate testing.
+
+ All code is expected to have adequate tests (eventually with coverage
+ expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
+ when drafting this test plan.
+
+ [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
+ -->
+
+ [X] I/we understand the owners of the involved components may require updates to
+ existing tests to make this code solid enough prior to committing the changes necessary
+ to implement this enhancement.
+
+ ##### Prerequisite testing updates
+
+ <!--
+ Based on reviewers feedback describe what additional tests need to be added prior
+ implementing this enhancement to ensure the enhancements have also solid foundations.
+ -->
+
+ ##### Unit tests
+
+ - `k8s.io/kubernetes/pkg/probe/grpc`: `2023/02/06` - `78.1%`
+
+ ##### Integration tests
+
+ N/A, only unit tests and e2e coverage.
+
+ ##### e2e tests
+
+ Tests in `test/e2e/common/node/container_probe.go`:
+
+ - should *not* be restarted with a GRPC liveness probe: [results](https://storage.googleapis.com/k8s-triage/index.html?test=Probing%20container%20should%20%5C*not%5C*%20be%20restarted%20with%20a%20GRPC%20liveness%20probe)
+ - should be restarted with a GRPC liveness probe: [results](https://storage.googleapis.com/k8s-triage/index.html?test=should%20be%20restarted%20with%20a%20GRPC%20liveness%20probe)
+
+ TODO: stress test to validate the scale (see GA requirements).
+
### Graduation Criteria

#### Alpha
@@ -177,12 +239,14 @@ Depending on skew strategy:

#### GA

- - Address feedback from beta usage
- - Validate that API is appropriate for users. There are some potential tunables:
+ - [X] Address feedback from beta usage
+ - [X] Validate that API is appropriate for users. There are some potential tunables:
  - `User-Agent`
  - connect timeout
  - protocol (HTTP, QUIC)
- - Close on any remaining open issues & bugs
+ - [ ] Close on any remaining open issues & bugs
+ - [ ] Promote tests to conformance
+ - [ ] Implement a stress test

### Upgrade / Downgrade Strategy

@@ -198,38 +262,12 @@ Downgrade: gRPC probes will not be supported in a downgrade from Alpha.

## Production Readiness Review Questionnaire

- <!--
-
- Production readiness reviews are intended to ensure that features merging into
- Kubernetes are observable, scalable and supportable; can be safely operated in
- production environments, and can be disabled or rolled back in the event they
- cause increased failures in production. See more in the PRR KEP at
- https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
-
- The production readiness review questionnaire must be completed and approved
- for the KEP to move to `implementable` status and be included in the release.
-
- In some cases, the questions below should also have answers in `kep.yaml`. This
- is to enable automation to verify the presence of the review, and to reduce review
- burden and latency.
-
- The KEP must have a approver from the
- [`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
- team. Please reach out on the
- [#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
- you need any help or guidance.
- -->
-
### Feature Enablement and Rollback

Feature enablement will be guarded by a feature gate flag.

###### How can this feature be enabled / disabled in a live cluster?

- <!--
- Pick one of these and delete the rest.
- -->
-
- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `GRPCContainerProbe`
  - Components depending on the feature gate: `kubelet` (probing), API
@@ -250,42 +288,26 @@ It becomes enabled again after the `kubelet` restart.

###### Are there any tests for feature enablement/disablement?

- Y
- es, unit tests for the feature when enabled and disabled will be
+ Yes, unit tests for the feature when enabled and disabled will be
implemented in both kubelet and api server.

### Rollout, Upgrade and Rollback Planning

- <!--
- This section must be completed when targeting beta to a release.
- -->
+ We passed the version skew problem for the new API. No planning is required.

###### How can a rollout or rollback fail? Can it impact already running workloads?

- <!--
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
-
- Be sure to consider highly-available clusters, where, for example,
- feature flags will be enabled on some API servers and not others during the
- rollout. Similarly, consider large clusters and how enablement/disablement
- will rollout across nodes.
- -->
+ We passed the version skew problem - the API will be available on any supported
+ version skew. So no issues are expected with rollout and rollback.

###### What specific metrics should inform a rollback?

- <!--
- What signals should users be paying attention to when the feature is young
- that might indicate a serious problem?
- -->
+ Rollback wouldn't address issues. Pods will need to stop using the new probe
+ type.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

- <!--
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
- -->
+ N/A

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -357,8 +379,27 @@ The overhead of executing probes is consistent with other probe types.
We expect a decrease in disk, RAM, and CPU use for many scenarios where https://github.com/grpc-ecosystem/grpc-health-probe
was used to probe gRPC endpoints.

+ ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+ Yes, gRPC probes use node resources to establish connections.
+ This may lead to issues like [kubernetes/kubernetes#89898](https://github.com/kubernetes/kubernetes/issues/89898).
+
+ The node resources for gRPC probes can be exhausted by a Pod with HostPort
+ making many connections to different destinations, or by any other process on the node.
+ This problem cannot be addressed generically.
+
+ However, the design where node resources are used for gRPC probes works
+ for most setups. The default maximum number of pods per node is `110`. There is currently
+ no limit on the number of containers. The number of containers is limited by the
+ amount of resources requested by these containers. With the fix limiting
+ the `TIME_WAIT` for the socket to 1 second,
+ [this calculation](https://github.com/kubernetes/kubernetes/issues/89898#issuecomment-1383207322)
+ demonstrates it will be hard to reach the limits on sockets.
+
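
The socket estimate above can be approximated with simple arithmetic. The following is a rough sketch and does not reproduce the linked calculation. It assumes that every container runs a fixed number of gRPC probes with a fixed period, that each probe run opens one short-lived connection which then sits in `TIME_WAIT`, and that the node uses the common Linux default ephemeral port range of 32768 to 60999. The function name and the containers-per-pod, probes-per-container, and period values are illustrative assumptions.

```python
# Rough estimate of sockets held by gRPC probes on one node.
# Assumptions (not from the KEP): one connection per probe run, a fixed probe
# period, and the common Linux default ephemeral port range 32768-60999.

EPHEMERAL_PORTS = 60999 - 32768 + 1  # size of the assumed port range


def sockets_in_time_wait(pods, containers_per_pod, probes_per_container,
                         period_seconds, time_wait_seconds=1.0):
    """Return the steady-state number of probe sockets sitting in TIME_WAIT."""
    probes_per_second = (
        pods * containers_per_pod * probes_per_container / period_seconds
    )
    # Each finished probe leaves one socket in TIME_WAIT for time_wait_seconds.
    return probes_per_second * time_wait_seconds


if __name__ == "__main__":
    # 110 pods (the default maximum), with an assumed 10 containers per pod,
    # 3 probes per container, and every probe firing once per second.
    used = sockets_in_time_wait(110, 10, 3, 1)
    share = used / EPHEMERAL_PORTS
    print(f"{used:.0f} of {EPHEMERAL_PORTS} ephemeral ports ({share:.1%})")
```

Even with these deliberately aggressive inputs the estimate stays well under the size of the port range, which is consistent with the conclusion that the socket limit is hard to reach once `TIME_WAIT` is limited to 1 second.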
### Troubleshooting

+ Logs and Pod events can be used to troubleshoot probe failures.
+
###### How does this feature react if the API server and/or etcd is unavailable?

No dependency on etcd availability.
@@ -378,19 +419,6 @@ None

## Implementation History

- <!--
- Major milestones in the lifecycle of a KEP should be tracked in this section.
- Major milestones might include:
- - the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- - the `Proposal` section being merged, signaling agreement on a proposed design
- - the date implementation started
- - the first Kubernetes release where an initial version of the KEP was available
- - the version of Kubernetes where the KEP graduated to general availability
- - when the KEP was retired or superseded
- -->
-
- ## Implementation History
-
* Original PR for k8 Prober: https://github.com/kubernetes/kubernetes/pull/89832
* 2020-04-04: MR for k8 Prober
* 2021-05-12: Cloned to this KEP to move the probe forward.
@@ -404,10 +432,30 @@ Alpha feature was implemented in 1.23.

Feature is promoted to beta in 1.24.

+ ### GA
+
+ Feature is promoted to GA in 1.27.
+
+ ## Drawbacks
+
+ See [Motivation](#motivation) on why gRPC was picked as another RPC framework
+ to support natively.
+
+ Adding gRPC is a small increment to k8s functionality with very few side
+ effects, while providing a lot of "quality of life improvements" to gRPC apps.
+
## Alternatives

* 3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe

## References

* GRPC healthchecking: https://github.com/grpc/grpc/blob/master/doc/health-checking.md
+
+ ## Infrastructure Needed (Optional)
+
+ <!--
+ Use this section if you need things from the project/SIG. Examples include a
+ new subproject, repos requested, or GitHub details. Listing these here allows a
+ SIG to get the process for these resources started right away.
+ -->