@@ -206,8 +206,8 @@ The risks are:
of a failing kube-proxy on ingress connectivity. Kube-proxy currently has a
lot of metrics regarding how its health is doing, but no direct red/green
indicator of what the end result of its health is. A couple of such metrics
- could be `proxy_healthz_200_count` /
- `proxy_healthz_503_count` / `proxy_livez_200_count` / `proxy_livez_503_count`
+ could be `proxy_healthz_total` / `proxy_livez_total` with labels for the
+ HTTP status codes: 503 / 200 (a sketch of such counters follows this list).

4. The feature could be disabled for users who depend on such behavior by
flipping the feature flag to off.
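
To make the shape of these counters concrete, here is a minimal, illustrative Go sketch using the plain Prometheus client. The label name `code` and the registration call are assumptions for illustration only; the real kube-proxy metrics would be wired through its existing metrics registry.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: counters partitioned by the HTTP status code returned
// by the corresponding health check endpoint ("200" or "503").
var (
	proxyHealthzTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "proxy_healthz_total",
			Help: "Cumulative results of the /healthz check, by HTTP status code.",
		},
		[]string{"code"}, // assumed label name
	)
	proxyLivezTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "proxy_livez_total",
			Help: "Cumulative results of the /livez check, by HTTP status code.",
		},
		[]string{"code"},
	)
)

func init() {
	prometheus.MustRegister(proxyHealthzTotal, proxyLivezTotal)
}
```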
@@ -328,27 +328,27 @@ in that case.
###### What specific metrics should inform a rollback?

- The metric: `proxy_healthz_503_count` mentioned in [Monitoring
+ The metric: `proxy_healthz_total` (with label: 503) mentioned in [Monitoring
requirements](#monitoring-requirements) will inform on red `healthz`.
- `proxy_livez_503_count` will inform on red `livez` state. If the `healthz` count
- is increasing but the `livez` does not: then a problem might have occurred with
- the node related reconciliation logic.
+ `proxy_livez_total` (with label: 503) will inform on red `livez` state. If the
+ `healthz` count is increasing but the `livez` count is not: then a problem might
+ have occurred with the node-related reconciliation logic.
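
This comparison would normally be done as a monitoring query, but the logic can be sketched directly against kube-proxy's metrics endpoint. The snippet below is illustrative only: the address 127.0.0.1:10249 (kube-proxy's default metrics bind address) and the `code` label name are assumptions, not confirmed by this KEP.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// scrape503 returns the current value of the given counter for code="503"
// by scanning the Prometheus text exposition served by kube-proxy.
func scrape503(metric string) (float64, error) {
	resp, err := http.Get("http://127.0.0.1:10249/metrics")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	prefix := fmt.Sprintf(`%s{code="503"}`, metric)
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, prefix) {
			fields := strings.Fields(line)
			return strconv.ParseFloat(fields[len(fields)-1], 64)
		}
	}
	return 0, scanner.Err()
}

func main() {
	healthz, err := scrape503("proxy_healthz_total")
	if err != nil {
		panic(err)
	}
	livez, err := scrape503("proxy_livez_total")
	if err != nil {
		panic(err)
	}
	// If the healthz 503 count is ahead of the livez 503 count, /healthz is
	// failing for a node-related reason (e.g. the node is being deleted)
	// rather than because of a genuine proxy health problem.
	if healthz > livez {
		fmt.Println("healthz failing while livez is green: node-related reconciliation logic is involved")
	}
}
```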
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
- Once the change is implemented: the author will work with Kubernetes vendors to
- test the upgrade/downgrade scenario in a cloud environment.
+ Given that the feature is purely in-memory for kube-proxy and only determines how
+ it reports `/healthz`: an upgrade->rollback->upgrade path doesn't add coverage
+ on top of the regular feature tests.
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
### Monitoring Requirements
- Four new metrics
- `proxy_healthz_200_count`/`proxy_healthz_503_count`/`proxy_livez_200_count`/`proxy_livez_503_count`
- which will count the amount of reported successful/unsuccessful health check
- invocations. A drop in this metric can then be correlated to impacted ingress
+ Two new metrics `proxy_healthz_total`/`proxy_livez_total` which will count the
+ number of reported successful/unsuccessful health check invocations per `503`
+ and `200`. These metrics can then be correlated to impacted ingress
connectivity, for endpoints running on those nodes.
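
How the counters would track what callers of the endpoints actually see can be sketched as follows. This reuses the `proxyHealthzTotal` counter from the earlier sketch, and `isHealthy()` is a hypothetical stand-in for kube-proxy's real health state, not an existing function.

```go
package metrics

import (
	"net/http"
	"strconv"
)

// isHealthy is a hypothetical stand-in for kube-proxy's combined health
// state, which under this KEP also turns red once the node is marked for
// deletion.
func isHealthy() bool { return true }

// healthzHandler sketches how every /healthz response could be recorded in
// proxy_healthz_total, labelled with the HTTP status code that was served.
func healthzHandler(w http.ResponseWriter, r *http.Request) {
	status := http.StatusOK
	if !isHealthy() {
		status = http.StatusServiceUnavailable
	}
	proxyHealthzTotal.WithLabelValues(strconv.Itoa(status)).Inc()
	w.WriteHeader(status)
}
```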
###### How can an operator determine if the feature is in use by workloads?
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- [X] Metrics
- - Metric name: `proxy_healthz_200_count`
- - Metric name: `proxy_healthz_503_count`
- - Metric name: `proxy_livez_200_count`
- - Metric name: `proxy_livez_503_count`
+ - Metric name: `proxy_healthz_total`
+ - Metric name: `proxy_livez_total`
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -424,8 +422,43 @@ Not any different than today.
###### What are other known failure modes?
+ - Vendors of Kubernetes that deploy Kube-proxy and specify a `livenessProbe`
+   targeting `/healthz` are expected to start seeing a CrashLooping Kube-proxy
+   when the Node gets tainted with `ToBeDeletedByClusterAutoscaler`. This is
+   because: if we modify `/healthz` to fail when this taint gets added to the
+   Node, then the `livenessProbe` will fail, causing the Kubelet to restart the
+   Pod until the Node is deleted.
+   - Detection: the node is tainted with `ToBeDeletedByClusterAutoscaler`, upon
+     which Kube-proxy fails its `/healthz` check and starts Crashlooping.
+     Confirm this by validating that Kube-proxy has a `livenessProbe` defined
+     which targets `/healthz`.
+   - Mitigations:
+     - While in beta: disable the feature gate
+       `KubeProxyDrainingTerminatingNodes`.
+     - While in stable: update the `livenessProbe` to target `/livez` (a sketch
+       of such a probe follows this list). `ToBeDeletedByClusterAutoscaler` is a
+       taint placed on the Node by the cluster-autoscaler and indicates that the
+       node will be deleted, so Kube-proxy is going to terminate soon in any
+       case. If a Crashlooping Kube-proxy is problematic in such a situation
+       (e.g. it needs to handle service/endpoint updates until the node is
+       completely gone), then updating the `livenessProbe` to `/livez` resolves
+       the issue once the update has rolled out.
+   - Diagnostics:
+     - The metric `proxy_healthz_total` aggregated over the label `503` is
+       increasing while the metric `proxy_livez_total` aggregated over the label
+       `503` remains unchanged. This confirms that the `/healthz` endpoint is
+       failing because the node is being deleted, which is exactly the
+       difference between `/healthz` and `/livez`.
+   - Testing:
+     - Configure Kube-proxy with a `livenessProbe` targeting `/healthz` and
+       delete a Node. Kube-proxy on that Node should start failing its
+       `/healthz` check and start Crashlooping. Apply the fixes proposed in
+       `Mitigations` and verify that they resolve the issue.
+
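
For the stable mitigation, this is roughly what the updated probe could look like when expressed with client-go types (vendors would typically carry the equivalent in their kube-proxy DaemonSet manifest). Port 10256 is kube-proxy's default health check port, and the probe timings are purely illustrative assumptions.

```go
package manifest

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// livezProbe points the liveness check at /livez instead of /healthz, so the
// Kubelet no longer restarts kube-proxy merely because its node is marked for
// deletion by the cluster-autoscaler.
var livezProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/livez",
			Port: intstr.FromInt(10256), // kube-proxy's default health check port
		},
	},
	InitialDelaySeconds: 10, // illustrative timings
	PeriodSeconds:       10,
	FailureThreshold:    3,
}
```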
###### What steps should be taken if SLOs are not being met to determine the problem?
+ There are no SLOs for this KEP, see: "What are the reasonable SLOs (Service Level Objectives) for the enhancement?"
+
## Implementation History
- 2023-02-03: Initial proposal