Commit df246c8

Merge pull request kubernetes#4114 from RyanAoh/kep-1860
KEP-1860: Make Kubernetes aware of the LoadBalancer behavior (reopen)
2 parents 43ceaed + eea659a commit df246c8

3 files changed: +311 -12 lines changed
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# The KEP must have an approver from the
+# "prod-readiness-approvers" group
+# of http://git.k8s.io/enhancements/OWNERS_ALIASES
+kep-number: 1860
+alpha:
+  approver: "@wojtek-t"

keps/sig-network/1860-kube-proxy-IP-node-binding/README.md

Lines changed: 298 additions & 6 deletions
@@ -23,6 +23,13 @@
 - [Beta/GA](#betaga)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [Monitoring Requirements](#monitoring-requirements)
+- [Dependencies](#dependencies)
+- [Scalability](#scalability)
+- [Troubleshooting](#troubleshooting)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -32,13 +39,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
-- [x] (R) Graduation criteria is in place
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [ ] e2e Tests for all Beta API Operations (endpoints)
+- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
-- [ ] Production readiness review approved
+- [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
 - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


 ## Summary
## Summary
@@ -122,7 +133,6 @@ API changes to Service:
 Unit tests:
 - unit tests for the ipvs and iptables rules
 - unit tests for the validation
-- unit tests for a new util in pkg/proxy

 E2E tests:
 - The default behavior for `ipMode` does not break any existing e2e tests
@@ -140,7 +150,8 @@ Adds new field `ipMode` to Service, which is used when `LoadBalancerIPMode` feat

 ### Upgrade / Downgrade Strategy

-On upgrade, while the feature gate is disabled, nothing will change. Once the feature gate is enabled, all the previous LoadBalancer service will get an `ipMode` of `VIP`.
+On upgrade, while the feature gate is disabled, nothing will change. Once the feature gate is enabled,
+all existing LoadBalancer Services will get an `ipMode` of `VIP` from the defaulting function when they are read from kube-apiserver (xref https://github.com/kubernetes/kubernetes/pull/118895/files#r1248316868).
 If `kube-proxy` was not yet upgraded: the field will simply be ignored.
 If `kube-proxy` was upgraded, and the feature gate enabled, it will still behave as before if the `ipMode` is `VIP`, and will behave accordingly if the `ipMode` is `Proxy`.
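
The defaulting step described in the upgrade strategy can be pictured with a minimal Go sketch. The type and function names here (`Ingress`, `defaultIPMode`) are hypothetical stand-ins rather than the real Kubernetes API types, and the condition that only entries with an IP set are defaulted is an assumption made for illustration.

```go
package main

import "fmt"

// LoadBalancerIPMode is a simplified stand-in for the string-typed
// ipMode field that this KEP adds to the Service status.
type LoadBalancerIPMode string

const (
	IPModeVIP   LoadBalancerIPMode = "VIP"
	IPModeProxy LoadBalancerIPMode = "Proxy"
)

// Ingress is a hypothetical, trimmed-down load balancer ingress entry.
type Ingress struct {
	IP     string
	IPMode *LoadBalancerIPMode
}

// defaultIPMode fills in VIP for entries that have an IP but no mode.
// It does nothing while the feature gate is disabled, and it never
// overwrites a mode that is already set.
func defaultIPMode(ingress []Ingress, gateEnabled bool) {
	if !gateEnabled {
		return
	}
	for i := range ingress {
		if ingress[i].IP != "" && ingress[i].IPMode == nil {
			vip := IPModeVIP
			ingress[i].IPMode = &vip
		}
	}
}

func main() {
	proxy := IPModeProxy
	ingress := []Ingress{
		{IP: "203.0.113.10"},                 // created before the feature existed
		{IP: "203.0.113.11", IPMode: &proxy}, // set explicitly by a cloud provider
	}
	defaultIPMode(ingress, true)
	for _, in := range ingress {
		fmt.Println(in.IP, *in.IPMode)
	}
}
```

With the gate off the slice is left untouched, which matches the statement that nothing changes on upgrade until the gate is enabled.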

@@ -149,3 +160,284 @@ On downgrade, the feature gate will simply be disabled, and as long as `kube-pro
 ### Version Skew Strategy

 Version skew from the control plane to `kube-proxy` should be trivial since `kube-proxy` will simply ignore the `ipMode` field.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+- Feature gate name: LoadBalancerIPMode
+- Components depending on the feature gate: kube-proxy, kube-apiserver, cloud-controller-manager
+
+###### Does enabling the feature change any default behavior?
+
+No.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes, by disabling the feature gate. Disabling it in kube-proxy is necessary and sufficient to have a user-visible effect.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+It works. The forwarding rules for Services whose `ipMode` is set to "Proxy" will be removed by kube-proxy.
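
The enable, roll back and re-enable answers above can be sketched as a single decision in Go. The function `installLBRules` and the `IPMode` type are hypothetical names used only for illustration; this is not kube-proxy's actual code, and treating an unset mode like `VIP` is an assumption.

```go
package main

import "fmt"

// IPMode is a simplified stand-in for the Service ipMode field.
type IPMode string

const (
	VIP   IPMode = "VIP"
	Proxy IPMode = "Proxy"
)

// installLBRules reports whether a proxier would program node-local
// forwarding rules for a load balancer IP.
// gateEnabled models the LoadBalancerIPMode gate in kube-proxy;
// a nil mode models an unset field.
func installLBRules(gateEnabled bool, mode *IPMode) bool {
	if !gateEnabled || mode == nil {
		// Gate off or field unset: the field is ignored and the
		// pre-feature behavior applies.
		return true
	}
	// Gate on: only Proxy mode skips the forwarding rules.
	return *mode != Proxy
}

func main() {
	proxy := Proxy
	fmt.Println(installLBRules(false, &proxy)) // rolled back: prints true
	fmt.Println(installLBRules(true, &proxy))  // re-enabled: prints false
}
```

Under these assumptions, toggling the gate in kube-proxy alone flips the result for `Proxy` Services, which is why the rollback answer calls that step necessary and sufficient.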
+
+###### Are there any tests for feature enablement/disablement?
+
+Yes. It is tested by `TestUpdateServiceLoadBalancerStatus` in pkg/registry/core/service/storage/storage_test.go.
+
+### Rollout, Upgrade and Rollback Planning
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+<!--
+Try to be as paranoid as possible - e.g., what if some components will restart
+mid-rollout?
+
+Be sure to consider highly-available clusters, where, for example,
+feature flags will be enabled on some API servers and not others during the
+rollout. Similarly, consider large clusters and how enablement/disablement
+will rollout across nodes.
+-->
+
+###### What specific metrics should inform a rollback?
+
+<!--
+What signals should users be paying attention to when the feature is young
+that might indicate a serious problem?
+-->
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+<!--
+Describe manual testing that was done and the outcomes.
+Longer term, we may want to require automated upgrade/rollback tests, but we
+are missing a bunch of machinery and tooling and can't do that now.
+-->
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+<!--
+Even if applying deprecation policies, they may still surprise some users.
+-->
+
+### Monitoring Requirements
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### How can someone using this feature know that it is working for their instance?
+
+<!--
+For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
+for each individual pod.
+Pick one or more of these and delete the rest.
+Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
+and operation of this feature.
+Recall that end users cannot usually observe component logs or access metrics.
+-->
+
+- [ ] Events
+- Event Reason:
+- [ ] API .status
+- Condition name:
+- Other field:
+- [ ] Other (treat as last resort)
+- Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+<!--
+This is your opportunity to define what "normal" quality of service looks like
+for a feature.
+
+It's impossible to provide comprehensive guidance, but at the very
+high level (needs more precise definitions) those may be things like:
+- per-day percentage of API calls finishing with 5XX errors <= 1%
+- 99% percentile over day of absolute value from (job creation time minus expected
+job creation time) for cron job <= 10%
+- 99.9% of /health requests per day finish with 200 code
+
+These goals will help you determine what you need to measure (SLIs) in the next
+question.
+-->
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+<!--
+Pick one or more of these and delete the rest.
+-->
+
+- [ ] Metrics
+- Metric name:
+- [Optional] Aggregation method:
+- Components exposing the metric:
+- [ ] Other (treat as last resort)
+- Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+<!--
+Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+implementation difficulties, etc.).
+-->
+
+### Dependencies
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### Does this feature depend on any specific services running in the cluster?
+
+<!--
+Think about both cluster-level services (e.g. metrics-server) as well
+as node-level agents (e.g. specific version of CRI). Focus on external or
+optional services that are needed. For example, if this feature depends on
+a cloud provider API, or upon an external software-defined storage or network
+control plane.
+
+For each of these, fill in the following, thinking about running existing user workloads
+and creating new ones, as well as about cluster-level services (e.g. DNS):
+- [Dependency name]
+- Usage description:
+- Impact of its outage on the feature:
+- Impact of its degraded performance or high-error rates on the feature:
+-->
+
+### Scalability
+
+<!--
+For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them.
+
+For beta, this section is required: reviewers must answer these questions.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### Will enabling / using this feature result in any new API calls?
+
+<!--
+Describe them, providing:
+- API call type (e.g. PATCH pods)
+- estimated throughput
+- originating component(s) (e.g. Kubelet, Feature-X-controller)
+Focusing mostly on:
+- components listing and/or watching resources they didn't before
+- API calls that may be triggered by changes of some Kubernetes resources
+(e.g. update of object X triggers new updates of object Y)
+- periodic API calls to reconcile state (e.g. periodic fetching state,
+heartbeats, leader election, etc.)
+-->
+
+###### Will enabling / using this feature result in introducing new API types?
+
+<!--
+Describe them, providing:
+- API type
+- Supported number of objects per cluster
+- Supported number of objects per namespace (for namespace-scoped objects)
+-->
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+<!--
+Describe them, providing:
+- Which API(s):
+- Estimated increase:
+-->
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+<!--
+Describe them, providing:
+- API type(s):
+- Estimated increase in size: (e.g., new annotation of size 32B)
+- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+-->
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+<!--
+Look at the [existing SLIs/SLOs].
+
+Think about adding additional work or introducing new steps in between
+(e.g. need to do X to start a container), etc. Please describe the details.
+
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+-->
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+<!--
+Things to keep in mind include: additional in-memory state, additional
+non-trivial computations, excessive access to disks (including increased log
+volume), significant amount of data sent and/or received over network, etc.
+Think through this both in small and large cases, again with respect to the
+[supported limits].
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+-->
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+<!--
+For each of them, fill in the following information by copying the below template:
+- [Failure mode brief description]
+- Detection: How can it be detected via metrics? Stated another way:
+how can an operator troubleshoot without logging into a master or worker node?
+- Mitigations: What can be done to stop the bleeding, especially for already
+running user workloads?
+- Diagnostics: What are the useful log messages and their required logging
+levels that could help debug the issue?
+Not required until feature graduated to beta.
+- Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?

keps/sig-network/1860-kube-proxy-IP-node-binding/kep.yaml

Lines changed: 7 additions & 6 deletions
@@ -14,18 +14,19 @@ approvers:
   - "@thockin"
   - "@andrewsykim"

-# latest-milestone: "v1.21"
+stage: "alpha"
+
+latest-milestone: "v1.29"

 milestone:
-  alpha: "v1.21"
-  beta: "v1.22"
+  alpha: "v1.29"
+  beta: "v1.30"
+  stable: "v1.31"

 feature-gates:
   - name: LoadBalancerIPMode
     components:
       - kube-apiserver
       - kube-proxy
+      - cloud-controller-manager
 disable-supported: true
-
-latest-milestone: "0.0"
-stage: "alpha"
