  - [Risks and Mitigations](#risks-and-mitigations)
  - [Design Details](#design-details)
  - [Test Plan](#test-plan)
+ - [Alternative Considerations](#alternative-considerations)
  - [Graduation Criteria](#graduation-criteria)
  - [Alpha](#alpha)
  - [Beta](#beta)
@@ -23,24 +24,26 @@
  - [Troubleshooting](#troubleshooting)
  - [Implementation History](#implementation-history)
  - [Implementation History](#implementation-history-1)
+ - [Alpha](#alpha-1)
+ - [Beta](#beta-1)
  - [Alternatives](#alternatives)
  - [References](#references)
  <!-- /toc -->


  ## Release Signoff Checklist

- - [ ] Enhancement issue in release milestone, which links to KEP dir in
+ - [X] Enhancement issue in release milestone, which links to KEP dir in
  [kubernetes/enhancements] (not the initial KEP PR)
- - [ ] KEP approvers have approved the KEP status as `implementable`
- - [ ] Design details are appropriately documented
- - [ ] Test plan is in place, giving consideration to SIG Architecture
+ - [X] KEP approvers have approved the KEP status as `implementable`
+ - [X] Design details are appropriately documented
+ - [X] Test plan is in place, giving consideration to SIG Architecture
  and SIG Testing input
- - [ ] Graduation criteria is in place
- - [ ] "Implementation History" section is up-to-date for milestone
- - [ ] User-facing documentation has been created in
+ - [X] Graduation criteria is in place
+ - [X] "Implementation History" section is up-to-date for milestone
+ - [X] User-facing documentation has been created in
  [kubernetes/website], for publication to [kubernetes.io]
- - [ ] Supporting documentation e.g., additional design documents,
+ - [X] Supporting documentation e.g., additional design documents,
  links to mailing list discussions/SIG meetings, relevant PRs/issues,
  release notes

@@ -76,33 +79,25 @@ and `StartupProbe`. Example:
  ```

  This will result in the use of gRPC (using HTTP/2 over TLS) to use the
- standard healthcheck service to determine the health of the
- container. As spec'd, the `kubelet` probe will not allow use of client
+ standard healthcheck service (`Check` method) to determine the health of the
+ container. Using `Watch` method of the healthcheck service is not supported,
+ but may be considered in future iterations.
+ As spec'd, the `kubelet` probe will not allow use of client
  certificates nor verify the certificate on the container. We do not
  support other protocols for the time being (unencrypted HTTP/2, QUIC).

- Note that `readinessProbe.grpc.service` may be confusing, some
- alternatives:
-
- `serviceName`
- `healthCheckServiceName`
- `grpcService`
- `grpcServiceName`
-
- These options can be added in Beta with user feedback.
-
  The healthcheck request will be identified with the following gRPC
  `User-Agent` metadata. This user agent will be statically defined (not
  configurable by the user):

  ```
- User-Agent: kubernetes/K8S_MAJOR_VER.K8S_MINOR_VER
+ User-Agent: kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER
  ```

  Example:

  ```
- User-Agent: kubernetes/1.22
+ User-Agent: kube-probe/1.23
  ```
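The static `User-Agent` format described above can be illustrated with a minimal Go sketch. The helper name `probeUserAgent` and its integer parameters are hypothetical and are not taken from the kubelet source; the sketch only shows how a `kube-probe/MAJOR.MINOR` value could be assembled.

```go
package main

import "fmt"

// probeUserAgent is a hypothetical helper (not the kubelet's actual code).
// It formats the static User-Agent value described in the KEP text:
// kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER.
func probeUserAgent(major, minor int) string {
	return fmt.Sprintf("kube-probe/%d.%d", major, minor)
}

func main() {
	// For a 1.23 kubelet this matches the example in the diff.
	fmt.Println(probeUserAgent(1, 23))
}
```

Because the value is derived only from the version, it stays fixed for a given kubelet build, which is consistent with the statement that users cannot configure it.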

  ### Risks and Mitigations
@@ -151,6 +146,18 @@ move users away from using the (portNum, portName) union type.
  - Unit test: Add unit tests to `pkg/kubelet/prober/...`
  - e2e: Add test case and conformance test to `e2e/common/node/container_probe.go`.

+ ### Alternative Considerations
+
+ Note that `readinessProbe.grpc.service` may be confusing, some
+ alternatives considered:
+
+ - `serviceName`
+ - `healthCheckServiceName`
+ - `grpcService`
+ - `grpcServiceName`
+
+ There was no feedback on the selected name being confusing in the context of a probe definition.
+
  ### Graduation Criteria

  #### Alpha
@@ -160,11 +167,7 @@ move users away from using the (portNum, portName) union type.

  #### Beta

- - Solicit feedback from the Alpha. Validate that API is appropriate
- for users. There are some potential tunables:
- - `User-Agent`
- - connect timeout
- - protocol (HTTP, QUIC)
+ - Solicit feedback from the Alpha.
  - Ensure tests are stable and passing.

  Depending on skew strategy:
@@ -174,7 +177,11 @@ Depending on skew strategy:

  #### GA

- - Address feedback from beta
+ - Address feedback from beta usage
+ - Validate that API is appropriate for users. There are some potential tunables:
+ - `User-Agent`
+ - connect timeout
+ - protocol (HTTP, QUIC)
  - Close on any remaining open issues & bugs

  ### Upgrade / Downgrade Strategy
@@ -286,102 +293,40 @@ No

  ### Monitoring Requirements

- TODO for Beta.
-
- <!--
-
  ###### How can an operator determine if the feature is in use by workloads?

- TODO for Beta.
-
- <!--
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
- ->
+ When gRPC probe is configured, Pod must be scheduled, and the metric
+ `probe_total` can be observed to see the result of probe execution.

  ###### How can someone using this feature know that it is working for their instance?

- TODO for Beta.
+ When gRPC probe is configured, Pod must be scheduled, and the metric
+ `probe_total` can be observed to see the result of probe execution.

- <!--
- For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
- for each individual pod.
- Pick one more of these and delete the rest.
- Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
- and operation of this feature.
- Recall that end users cannot usually observe component logs or access metrics.
- ->
-
- - [ ] Events
- - Event Reason:
- - [ ] API .status
- - Condition name:
- - Other field:
- - [ ] Other (treat as last resort)
- - Details:
+ Event will be emitted for the failed probe and logs are available in `kubelet.log`
+ to troubleshoot the failing probes.

  ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

- <!--
- This is your opportunity to define what "normal" quality of service looks like
- for a feature.
-
- It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99.9% of /health requests per day finish with 200 code
-
- These goals will help you determine what you need to measure (SLIs) in the next
- question.
- ->
+ Probe must succeed whenever service has returned the correct response
+ within the defined timeout, and fail otherwise.

  ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- <!--
- Pick one more of these and delete the rest.
- ->
-
- - [ ] Metrics
- - Metric name:
- - [Optional] Aggregation method:
- - Components exposing the metric:
- - [ ] Other (treat as last resort)
- - Details:
+ The metric `probe_total` can be used to check for the probe result. Event and
+ `kubelet.log` log entries can be observed to troubleshoot issues.

  ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- <!--
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
- -->
+ Creation of a probe duration metric is tracked in this issue:
+ https://github.com/kubernetes/kubernetes/issues/101035 and out of scope for this
+ KEP.

  ### Dependencies

- Beta TODO
-
- <!--
- This section must be completed when targeting beta to a release.
- ->
-
  ###### Does this feature depend on any specific services running in the cluster?

- <!--
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
-
- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
- - Usage description:
- - Impact of its outage on the feature:
- - Impact of its degraded performance or high-error rates on the feature:
- -->
+ No

  ### Scalability

@@ -399,48 +344,37 @@ No.

  ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

- Adds < 200 bytes to Pod.Spec.
-
+ Adds < 200 bytes to Pod.Spec, which is consistent with other probe types.

  ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

  No.

  ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

- No.
+ The overhead of executing probes is consistent with other probe types.

- ### Troubleshooting
+ We expect decrease of disk, RAM, and CPU use for many scenarios where the https://github.com/grpc-ecosystem/grpc-health-probe
+ was used to probe gRPC endpoints.

- Beta TODO.
- <!--
- This section must be completed when targeting beta to a release.
-
- The Troubleshooting section currently serves the `Playbook` role. We may consider
- splitting it into a dedicated `Playbook` document (potentially with some monitoring
- details). For now, we leave it here.
- ->
+ ### Troubleshooting

  ###### How does this feature react if the API server and/or etcd is unavailable?

+ No dependency on etcd availability.
+
  ###### What are other known failure modes?

- <!--
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
- ->
+ None

  ###### What steps should be taken if SLOs are not being met to determine the problem?

- -->
+ - Make sure feature gate is set
+ - Make sure configuration is correct and gRPC service is reachable by kubelet.
+ This may be different when migrating off https://github.com/grpc-ecosystem/grpc-health-probe
+ and is covered in feature documentation.
+ - `kubelet.log` log must be analyzed to understand why there is a mismatch of
+ service response and status reported by probe.

  ## Implementation History

@@ -462,6 +396,14 @@ Major milestones might include:
  * 2021-05-12: Cloned to this KEP to move the probe forward.
  * 2021-05-13: Updates.

+ ### Alpha
+
+ Alpha feature was implemented in 1.23.
+
+ ### Beta
+
+ Feature is promoted to beta in 1.24.
+
  ## Alternatives

  * 3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe