Commit 25ada4e

Merge pull request kubernetes#4258 from alexanderConstantinescu/3836-v129

[KEP 3836]: Bump to beta for 1.29

2 parents: a81c32a + 6539f3b

3 files changed (+53 additions, -18 deletions)
Lines changed: 2 additions & 0 deletions

```diff
@@ -1,3 +1,5 @@
 kep-number: 3836
 alpha:
+  approver: "@wojtek-t"
+beta:
   approver: "@wojtek-t"
```

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md

Lines changed: 49 additions & 16 deletions
```diff
@@ -206,8 +206,8 @@ The risk are:
 of a failing kube-proxy on ingress connectivity. Kube-proxy currently has a
 lot of metrics regarding how its health is doing, but no direct red/green
 indicator of what the end result of its health is. A couple of such metric
-could be `proxy_healthz_200_count` /
-`proxy_healthz_503_count`/`proxy_livez_200_count` / `proxy_livez_503_count`
+could be `proxy_healthz_total`/`proxy_livez_total` with labels for the
+HTTP status codes: 503 / 200.
 
 4. The feature could be disabled for user who is dependent upon such behavior by
 means of flipping the feature flag to off.
```
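The consolidated counter shape can be pictured with a minimal Go sketch using the upstream Prometheus client. This is an illustration only, not kube-proxy's merged implementation: the `code` label name, help strings, and serving port are assumptions.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One CounterVec per endpoint, with the HTTP status code as a label,
// replaces the four separate *_200_count / *_503_count counters.
var (
	proxyHealthzTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "proxy_healthz_total",
			Help: "Cumulative count of proxy healthz responses, by HTTP status code.",
		},
		[]string{"code"},
	)
	proxyLivezTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "proxy_livez_total",
			Help: "Cumulative count of proxy livez responses, by HTTP status code.",
		},
		[]string{"code"},
	)
)

func main() {
	prometheus.MustRegister(proxyHealthzTotal, proxyLivezTotal)

	// A green health check increments the 200 series...
	proxyHealthzTotal.WithLabelValues("200").Inc()
	// ...and a red one (e.g. while the node is being deleted) the 503 series.
	proxyHealthzTotal.WithLabelValues("503").Inc()
	proxyLivezTotal.WithLabelValues("200").Inc()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```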
```diff
@@ -328,27 +328,27 @@ in that case.
 
 ###### What specific metrics should inform a rollback?
 
-The metric: `proxy_healthz_503_count` mentioned in [Monitoring
+The metric: `proxy_healthz_total` (with label: 503) mentioned in [Monitoring
 requirements](#monitoring-requirements) will inform on red `healthz`.
-`proxy_livez_503_count` will inform on red `livez` state. If the `healthz` count
-is increasing but the `livez` does not: then a problem might have occurred with
-the node related reconciliation logic.
+`proxy_livez_total` (with label: 503) will inform on red `livez` state. If the
+`healthz` count is increasing but the `livez` does not: then a problem might
+have occurred with the node related reconciliation logic.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-Once the change is implemented: the author will work with Kubernetes vendors to
-test the upgrade/downgrade scenario in a cloud environment.
+Given that the feature is purely in-memory for kube-proxy and determines the way
+it reports /healthz: upgrade-rollback-upgrade doesn't introduce additional value
+on top of regular feature tests.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 No
 
 ### Monitoring Requirements
 
-Four new metrics
-`proxy_healthz_200_count`/`proxy_healthz_503_count`/`proxy_livez_200_count`/`proxy_livez_503_count`
-which will count the amount of reported successful/unsuccessful health check
-invocations. A drop in this metric can then be correlated to impacted ingress
+Two new metrics `proxy_healthz_total`/`proxy_livez_total` which will count the
+amount of reported successful/unsuccessful health check invocations per `503`
+and `200`. These metrics can then be correlated to impacted ingress
 connectivity, for endpoints running on those nodes.
 
 ###### How can an operator determine if the feature is in use by workloads?
```
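The rollback signal in the hunk above (healthz 503s rising while livez 503s stay flat) can also be confirmed by hand on a node. A hedged Go sketch, assuming kube-proxy's default healthz address `localhost:10256` serves both `/healthz` and the new `/livez`: a 503 from `/healthz` alongside a 200 from `/livez` matches the node-deletion case rather than a genuinely unhealthy proxy.

```go
package main

import (
	"fmt"
	"net/http"
)

// Query both health endpoints on a node and compare status codes.
// Assumes the default kube-proxy healthz address (localhost:10256)
// and that /livez is served alongside /healthz.
func main() {
	for _, path := range []string{"/healthz", "/livez"} {
		resp, err := http.Get("http://localhost:10256" + path)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", path, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: HTTP %d\n", path, resp.StatusCode)
	}
}
```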
```diff
@@ -369,10 +369,8 @@ No
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 - [X] Metrics
-  - Metric name: `proxy_healthz_200_count`
-  - Metric name: `proxy_healthz_503_count`
-  - Metric name: `proxy_livez_200_count`
-  - Metric name: `proxy_livez_503_count`
+  - Metric name: `proxy_healthz_total`
+  - Metric name: `proxy_livez_total`
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
```
```diff
@@ -424,8 +422,43 @@ Not any different than today.
 
 ###### What are other known failure modes?
 
+- Vendors of Kubernetes which deploy Kube-proxy and specify a `livenessProbe`
+  targeting `/healthz` are expected to start seeing a CrashLooping Kube-proxy
+  when the Node gets tainted with `ToBeDeletedByClusterAutoscaler`. This is
+  because: if we modify `/healthz` to fail when this taint gets added on the
+  Node, then the `livenessProbe` will fail, causing the Kubelet to restart the
+  Pod until the Node is deleted.
+  - Detection: node is tainted with `ToBeDeletedByClusterAutoscaler` upon which
+    Kube-proxy fails its `/healthz` check and starts Crashlooping. Confirm this
+    by validating that Kube-proxy has a `livenessProbe` defined which targets
+    `/healthz`.
+  - Mitigations:
+    - While in beta: disable the feature gate
+      `KubeProxyDrainingTerminatingNodes`.
+    - While in stable: update the `livenessProbe` to target `/livez`.
+      `ToBeDeletedByClusterAutoscaler` is a taint placed on the Node by the
+      cluster-autoscaler and indicates that the node will be deleted. Kube-proxy
+      is therefore going to terminate soon in any case. If a Crashlooping
+      Kube-proxy is problematic in such a situation (e.g. it needs to handle
+      service/endpoint updates until the node is completely gone), then updating
+      the `livenessProbe` to `/livez` provides a mitigation and resolves the
+      issue once the update has rolled out.
+  - Diagnostics:
+    - The metric `proxy_healthz_total` aggregated over the label `503` is
+      increasing while the metric `proxy_livez_total` aggregated over the label
+      `503` remains unchanged. This indicates and confirms that the `/healthz`
+      endpoint is failing, and that the reason is: the node is being deleted.
+      This is the difference between `/healthz` and `/livez`.
+  - Testing:
+    - Configure Kube-proxy with a `livenessProbe` targeting `/healthz` and
+      delete a Node. Kube-proxy on that Node should start failing its `/healthz`
+      and start Crashlooping. Apply the fixes proposed in `Mitigations` and
+      verify that it resolves the issue.
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+There are no SLOs for this KEP, see: "What are the reasonable SLOs (Service Level Objectives) for the enhancement?"
+
 ## Implementation History
 
 - 2023-02-03: Initial proposal
```
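To illustrate the stable-stage mitigation under `Mitigations` in the hunk above (re-pointing the probe from `/healthz` to `/livez`), the following hedged Go sketch builds the corresponding `corev1.Probe` and prints it as YAML for pasting into a kube-proxy pod spec. The port (kube-proxy's default healthz port, 10256) and the timing values are assumptions for illustration, not vendor-specific guidance.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

// Sketch of the mitigation: a kube-proxy livenessProbe that targets
// /livez (which stays green while the node is being deleted) instead
// of /healthz. Port 10256 is kube-proxy's default healthz port;
// adjust it if your deployment overrides --healthz-bind-address.
func main() {
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/livez",
				Port: intstr.FromInt(10256),
			},
		},
		InitialDelaySeconds: 10,
		PeriodSeconds:       10,
	}
	out, err := yaml.Marshal(probe)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // paste into the kube-proxy container spec
}
```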

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/kep.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -7,8 +7,8 @@ reviewers: ['@thockin', '@danwinship', "@aojea"]
 approvers: ['@thockin']
 creation-date: "2023-02-03"
 status: implementable
-stage: alpha
-latest-milestone: "v1.28"
+stage: beta
+latest-milestone: "v1.29"
 milestone:
   alpha: "v1.28"
   beta: "v1.29"
```
