# KEP-3458: Remove transient node predicates from KCCM's service controller

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Risk](#risk)
    - [Mitigations](#mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website
## Summary

The service controller in the Kubernetes cloud controller manager (KCCM)
currently adds/removes Nodes from the load balancers' node set in the following
cases:

a) When a node gets the taint `ToBeDeletedByClusterAutoscaler` added/removed
b) When a node goes `Ready` / `NotReady`

b) however only applies to services with `externalTrafficPolicy: Cluster`. In
both cases: removing the Node in question from the load balancers' node set
causes all connections on that node to be terminated instantly. This can be
considered a bug / sub-optimal behavior for nodes which are experiencing a
transient readiness state or for terminating nodes, since connections are not
allowed to drain in those cases, even though the load balancer might support
that. Moreover: on large clusters with a lot of nodes and entropy, re-syncing
load balancers like this can lead to rate-limiting by the cloud provider due to
an excessive amount of update calls.

To enable connection draining, reduce cloud provider API calls and simplify the
KCCM's sync loop: this KEP proposes that the service controller stop
synchronizing the load balancer node set in these cases. Since the current
behavior has always been the default, a new feature gate
`StableLoadBalancerNodeSet` will be introduced, which will be used to enable
the more optimal behavior.
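For reference, the two transient predicates boil down to checks of the
following shape. This is a simplified sketch against the upstream
`k8s.io/api/core/v1` types; the function names are illustrative, not the exact
service controller code:

```go
package predicates

import v1 "k8s.io/api/core/v1"

// ToBeDeletedTaint is the taint the cluster autoscaler places on nodes
// it is about to delete (case a above).
const ToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// hasToBeDeletedTaint reports whether the CA has marked the node for
// deletion.
func hasToBeDeletedTaint(node *v1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == ToBeDeletedTaint {
			return true
		}
	}
	return false
}

// isReady reports whether the node's Ready condition is True (case b
// above; only consulted for externalTrafficPolicy: Cluster services).
func isReady(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}
```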
## Motivation

Abruptly terminating connections in the cases defined by a) and b) above can be
seen as buggy behavior and should be improved. By enabling connection draining,
applications can profit from graceful shutdown / termination as far as cluster
ingress connectivity is concerned. Users of Kubernetes will also see a
reduction in the number of cloud API calls stemming from syncing load balancers
with the Kubernetes cluster state.
### Goals

- Stop re-configuring the load balancers' node set for cases a) and b) above.

### Non-Goals

- Stop re-configuring the load balancers' node set for fully deleted /
  newly added cluster nodes, or for nodes which get labeled with
  `node.kubernetes.io/exclude-from-external-load-balancers`.
- Enable load balancer connection draining while a Node is draining. This
  requires health check changes.
## Proposal

### Risks and Mitigations

#### Risk

1. Cloud providers which do not allow VM deletion when the VM is referenced by
   other constructs will block the cluster auto-scaler (CA) from deleting the
   VM upon downscale. This will result in reduced downscale performance by the
   CA, or completely block VM deletion from happening. This is because the
   service controller will never proceed to de-reference the VM from the load
   balancer node set until the Node is fully deleted in the API server, which
   will never occur until the VM is deleted. The three major cloud providers
   (GCP/AWS/Azure) do however support this, and it is not expected that other
   providers don't.
2. Cloud providers which do not configure their load balancer health checks to
   target the service proxy's healthz, or alternatively: constructs which
   validate the endpoint's reachability across the data plane, risk
   experiencing regressions as a consequence of the removal of b). This would
   happen if a node is faced with a terminal error which does impact the Node's
   network connectivity. Doing this is considered incorrect, and therefore not
   expected to be the case.
3. By removing b) above we are delaying the removal of the Node from the load
   balancers' node set until the Node is completely deleted in the API server.
   This might have an impact on CA downscaling. The reason for this is: the CA
   deletes the VM and expects the node controller in the KCCM to notice this
   and delete the Node in Kubernetes as a consequence. If the node controller
   takes a while to sync that, and other Node related events trigger load
   balancer reconciliation while this is happening, then the service controller
   will error until the cluster reaches steady-state (because it's trying to
   sync Nodes for which the VM is non-existent). A mitigation to this is
   presented in [Mitigations](#mitigations).
#### Mitigations

- Cloud providers/workloads which do not support the behavior mentioned in
  [Risk](#risk) have the possibility to set the feature flag to false, thus
  defaulting back to the current mechanism.
- For point 3. we could implement the following change in the service
  controller: ensure it only enqueues Node UPDATEs on changes to
  `.metadata.labels["node.kubernetes.io/exclude-from-external-load-balancers"]`,
  as sketched below. When processing the sync: list only Nodes following the
  existing predicates defined for `externalTrafficPolicy: Cluster/Local`
  services (both currently exclude Nodes with the taint
  `ToBeDeletedByClusterAutoscaler`). This will ensure that whatever Nodes are
  included in the load balancer set always have a corresponding VM. Doing this
  is, however, a partial reversal of the goal of this KEP.
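A minimal sketch of that enqueue filter, assuming the controller compares old
and new Node objects on UPDATE events (the helper name and placement are
illustrative, not merged code):

```go
package predicates

import v1 "k8s.io/api/core/v1"

// excludeLBLabel is the label that permanently excludes a node from
// external load balancers.
const excludeLBLabel = "node.kubernetes.io/exclude-from-external-load-balancers"

// shouldSyncOnNodeUpdate reports whether a Node UPDATE warrants
// re-syncing the load balancers' node set: only a change to the
// exclude label does; transient state changes do not.
func shouldSyncOnNodeUpdate(oldNode, newNode *v1.Node) bool {
	oldVal, oldOK := oldNode.Labels[excludeLBLabel]
	newVal, newOK := newNode.Labels[excludeLBLabel]
	return oldOK != newOK || oldVal != newVal
}
```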
## Design Details

- Implement the change in the service controller and ensure it does not add /
  remove nodes from the load balancers' node set for cases a) and b) mentioned
  in [Summary](#summary); a sketch of the gated predicate logic follows this
  list.
- Add the feature gate `StableLoadBalancerNodeSet`, set it to "on" by
  default and promote it directly to Beta.
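A rough sketch of the resulting gated predicate, reusing the illustrative
helpers (`excludeLBLabel`, `isReady`, `hasToBeDeletedTaint`) from the earlier
sketches and assuming the gate is registered in
`k8s.io/kubernetes/pkg/features` in the conventional way:

```go
package predicates

import (
	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// includeNode reports whether a node belongs in the load balancers'
// node set. With StableLoadBalancerNodeSet enabled, only permanent
// properties (the exclude label) are considered; the transient
// readiness / CA-taint predicates are dropped.
func includeNode(node *v1.Node) bool {
	if _, excluded := node.Labels[excludeLBLabel]; excluded {
		return false
	}
	if utilfeature.DefaultFeatureGate.Enabled(features.StableLoadBalancerNodeSet) {
		return true
	}
	// Legacy behavior: also filter on transient node state.
	return isReady(node) && !hasToBeDeletedTaint(node)
}
```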
### Test Plan

#### Prerequisite testing updates

#### Unit tests

The service controller in the KCCM currently has a set of tests validating
expected syncs caused by Node predicates; these will need to be updated.

#### Integration tests

Kubernetes is mostly tested via unit tests and e2e, not integration, and this
is not expected to change.

#### e2e tests

Kubernetes in general needs to extend its load balancing test suite with
disruption tests; this might be the right effort to get that ball rolling.
Testing would include:

- validation that an application running on a deleting VM benefits from
  graceful termination of its TCP connection.
- validation that Node readiness state changes do not result in load balancer
  re-syncs.
### Graduation Criteria

#### Beta

This addresses sub-optimal behavior currently existing in Kubernetes, so the
feature gate will be moved to Beta and "on" by default from the start.

The feature flag should be kept available until we get sufficient evidence of
people not being affected by anything mentioned in [Risk](#risk) or otherwise.

#### GA

Given a lack of reported issues in Beta, the feature gate will be locked in at
GA.

The tentative timeline for this is v1.29. Services of `type: LoadBalancer` are
sufficiently common on any given Kubernetes cluster that any cloud provider
susceptible to the [Risk](#risk) will very likely report issues in Beta.
### Upgrade / Downgrade Strategy

An upgrade to a version enabling the feature, followed by a downgrade to a
version disabling it, is not expected to cause any impact. On upgrade: the
service controller will add all existing cluster nodes (bar excluded ones) to
the load balancer set. On downgrade: any nodes `NotReady` / tainted will get
reconciled by the service controller corresponding to the downgraded control
plane version and get removed from the load balancer set - as they should.

### Version Skew Strategy

This change is contained to the control plane only and is therefore not
impacted by any version skew.
## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `StableLoadBalancerNodeSet`
  - Components depending on the feature gate: Kubernetes cloud controller manager

###### Does enabling the feature change any default behavior?

Yes, Kubernetes Nodes will remain in the load balancers' node set until fully
deleted in the API server, as opposed to the current behavior, which adds /
removes the nodes from the set when the Node experiences transient state
changes. Cloud providers which do not support deleting VMs which are still
referenced by load balancers will be unable to do so upon downscaling by the
cluster auto-scaler when it attempts to delete the VM.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by resetting the feature gate back.
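On the KCCM this maps to its standard component flag, e.g.
`--feature-gates=StableLoadBalancerNodeSet=false`.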

###### What happens if we reenable the feature if it was previously rolled back?

The behavior will be restored immediately.

###### Are there any tests for feature enablement/disablement?

N/A
### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

Rollout and rollback are not expected to fail.

###### What specific metrics should inform a rollback?

Performance degradation by the CA when downscaling, or a flat-out inability to
delete VMs.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.
### Monitoring Requirements

The only mechanism currently implemented is events for syncing load balancers
in the KCCM. The events are triggered any time a service is synced or a Node
change triggers a re-sync of all services. This will not change and can be used
to monitor the implemented change. The implementation will result in fewer load
balancer re-syncs.

A new metric `load_balancer_sync_count` will be added for explicitly monitoring
the amount of load balancer related syncs performed by the service controller.
This will include load balancer syncs caused by Service and Node changes.

A new metric `nodesync_error_rate` will be added for explicitly monitoring the
amount of errors produced by syncing Node related events for load balancers.
The goal is to have an indicator of whether the service controller is impacted
by point 3. mentioned in [Risk](#risk), and at what frequency.
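A sketch of how the two metrics could be registered, following the
`k8s.io/component-base/metrics` conventions; the metric names come from this
section, while the `trigger` label and stability level are assumptions, not
merged code:

```go
package servicemetrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// loadBalancerSyncCount counts load balancer syncs performed by the
	// service controller, labeled by what triggered the sync.
	loadBalancerSyncCount = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name:           "load_balancer_sync_count",
			Help:           "Number of load balancer syncs performed by the service controller.",
			StabilityLevel: metrics.ALPHA,
		},
		[]string{"trigger"}, // e.g. "service" or "node"; label is an assumption
	)

	// nodeSyncErrorCount counts errors produced while syncing Node
	// related events to load balancers.
	nodeSyncErrorCount = metrics.NewCounter(
		&metrics.CounterOpts{
			Name:           "nodesync_error_rate",
			Help:           "Number of errors syncing Node events to load balancers.",
			StabilityLevel: metrics.ALPHA,
		},
	)
)

func init() {
	legacyregistry.MustRegister(loadBalancerSyncCount, nodeSyncErrorCount)
}
```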
###### How can an operator determine if the feature is in use by workloads?

Analyze events stemming from the API server, correlating node state changes
(readiness or addition / removal of the taint `ToBeDeletedByClusterAutoscaler`)
to load balancer re-syncs. The events should show a clear reduction in re-syncs
post the implementation and rollout of the change.

###### How can someone using this feature know that it is working for their instance?

When a node transitions between `Ready` and `NotReady`, the load balancers are
not re-synced and the load balancers' node set will remain unchanged for
`externalTrafficPolicy: Cluster` services. On downscaling by the CA, the node
will remain in the load balancers' set for a longer period of time, just until
the Node object in Kubernetes is fully deleted.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The total amount of load balancer re-syncs should be reduced, leading to fewer
cloud provider API calls. Also, and more subtly: connections will get a chance
to gracefully terminate when the CA downscales cluster nodes. For
`externalTrafficPolicy: Cluster` services, connections "traversing" a "nexthop"
node might not be impacted by that Node's readiness state anymore.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [X] Metrics
  - Events: The KCCM triggers events when syncing load balancers. The amount of
    these events should be reduced.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A
### Dependencies

###### Does this feature depend on any specific services running in the cluster?

It depends on:

- having a `type: LoadBalancer` service with `externalTrafficPolicy: Cluster`
### Scalability

###### Will enabling / using this feature result in any new API calls?

No.

###### Will enabling / using this feature result in introducing new API types?

No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.
### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

No differently than today.

###### What are other known failure modes?

None.

###### What steps should be taken if SLOs are not being met to determine the problem?

Validate that services of `type: LoadBalancer` exist on the cluster and that
Nodes are experiencing a transitioning readiness state, or alternatively that
the CA downscales and deletes VMs.
## Implementation History

- 2023-02-01: Initial proposal

## Drawbacks

## Alternatives
