Commit 430f4a7
committed
Propose beta for kep 1860
1 parent ce17c8b commit 430f4a7

1 file changed: +49 -159 lines
keps/sig-network/1860-kube-proxy-IP-node-binding/README.md

@@ -189,138 +189,76 @@ Yes. It is tested by `TestUpdateServiceLoadBalancerStatus` in pkg/registry/core/
### Rollout, Upgrade and Rollback Planning
###### How can a rollout or rollback fail? Can it impact already running workloads?

A rollout can fail if `ipMode` has been set to "Proxy" on a Service and running
workloads consuming that Service fail to reach it, because of an extra hop or a
misconfiguration on the load balancer.

A rollback can fail if kube-proxy is not able to detect the rollback and re-add
the LoadBalancer address to the interface.

###### What specific metrics should inform a rollback?

Workloads consuming a Service configured with the new `ipMode` that start failing
to reach the Service, or an increase in request latency, are signals that should
inform a rollback.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring Requirements
###### How can an operator determine if the feature is in use by workloads?

As this is a low-level operation, to check whether it is working an operator should:
* Verify a Service of `type=LoadBalancer` with this feature enabled
* Check and confirm that the IPs set in `.status.loadBalancer.ingress.ip` are not
  set on any node interface

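A minimal sketch of this verification logic, assuming the Service object and the node's interface addresses have already been collected (for example via `kubectl get svc -o json` and `ip -j addr`); the helper name and addresses are hypothetical:

```python
def lb_ips_unbound(service: dict, node_addrs: set) -> bool:
    """Return True when no LoadBalancer ingress IP with ipMode "Proxy"
    is bound to a node interface -- the expected state for this feature."""
    ingress = (service.get("status", {})
                      .get("loadBalancer", {})
                      .get("ingress", []) or [])
    proxy_ips = {e["ip"] for e in ingress if e.get("ipMode") == "Proxy"}
    return proxy_ips.isdisjoint(node_addrs)

# Example Service status with the feature enabled (placeholder IP):
svc = {"status": {"loadBalancer": {"ingress": [
    {"ip": "203.0.113.10", "ipMode": "Proxy"}]}}}

print(lb_ips_unbound(svc, {"10.0.0.5", "192.168.1.2"}))  # True: IP not bound anywhere
print(lb_ips_unbound(svc, {"203.0.113.10"}))             # False: IP still on an interface
```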
###### How can someone using this feature know that it is working for their instance?

- [ ] Events
  - Event Reason:
- [X] API .status
  - Condition name:
  - Other field: `.status.loadBalancer.ingress.ipMode` is not null
- [ ] Other (treat as last resort)
  - Details:

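For illustration, a Service status carrying the field could look like this (the address is a placeholder):

```yaml
status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10   # placeholder load-balancer address
      ipMode: Proxy      # a non-null ipMode indicates the feature is in use
```
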
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

No increase in error rate when a workload in the cluster targets a Service of
type LoadBalancer with the feature enabled.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [X] Other (treat as last resort)
  - Details: workload/application instrumentation capturing the error rate and
    latency of calls against other Services in the cluster

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

### Dependencies
###### Does this feature depend on any specific services running in the cluster?

- cloud controller manager / LoadBalancer controller
  - The LoadBalancer controller should set the right `.status` field for `ipMode`
  - In case of an outage of this feature, traffic may still be routed using the `VIP` mode
- kube-proxy
  - Network interface IP address programming
  - In case of an outage of this feature, network interfaces on the node may keep
    adding the LoadBalancer IP, which may cause wrong traffic routing
  - This dependency does not apply to clusters whose CNI replaces kube-proxy; in
    that case, the CNI must implement this feature itself.

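The kube-proxy side of this contract can be sketched as a single predicate; this is an illustrative reconstruction, not the actual kube-proxy code:

```python
def should_bind_lb_ip(ingress_entry: dict) -> bool:
    """Decide whether kube-proxy should program the LoadBalancer IP on a
    local interface: bind only when ipMode is absent or "VIP" (the
    traditional behavior); skip binding when ipMode is "Proxy"."""
    return ingress_entry.get("ipMode", "VIP") != "Proxy"

print(should_bind_lb_ip({"ip": "203.0.113.10"}))                     # True
print(should_bind_lb_ip({"ip": "203.0.113.10", "ipMode": "Proxy"}))  # False
```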
### Scalability
@@ -336,79 +274,36 @@ previous answers based on experience in the field.
###### Will enabling / using this feature result in any new API calls?

- API call type: PATCH (Service status)
- Estimated throughput: 1 per Service creation/reconciliation
- Originating component: cloud controller manager / LoadBalancer controller

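As a rough illustration (not the actual cloud-controller-manager code), the PATCH body could be assembled like this; the function name and address are hypothetical:

```python
import json

def build_status_patch(lb_ip: str, ip_mode: str = "Proxy") -> str:
    """Build a merge-patch body against the Service status, setting ipMode
    alongside the allocated load-balancer IP (illustrative sketch)."""
    return json.dumps({
        "status": {
            "loadBalancer": {
                "ingress": [{"ip": lb_ip, "ipMode": ip_mode}]
            }
        }
    })

print(build_status_patch("203.0.113.10"))
```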
###### Will enabling / using this feature result in introducing new API types?

No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

- API type: v1/Service
- Estimated increase in size: a new string field; the longest supported value at
  this time is `Proxy` (5 characters)
- Estimated amount of new objects: 0

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

### Troubleshooting
@@ -425,19 +320,14 @@ details). For now, we leave it here.
###### How does this feature react if the API server and/or etcd is unavailable?

As with any load balancer / cloud controller manager, the new IP and the new
status will not be set.

kube-proxy reacts to the IP status, so the Service LoadBalancer IP and its
configuration will remain pending.

###### What are other known failure modes?

N/A

###### What steps should be taken if SLOs are not being met to determine the problem?

N/A