@@ -189,138 +189,76 @@ Yes. It is tested by `TestUpdateServiceLoadBalancerStatus` in pkg/registry/core/
### Rollout, Upgrade and Rollback Planning

- <!--
- This section must be completed when targeting beta to a release.
- -->
-

###### How can a rollout or rollback fail? Can it impact already running workloads?

- <!--
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
+ A rollout can fail if the value of `ipMode` is set to "Proxy" on a Service and
+ running workloads consuming this Service fail to reach it because of an extra
+ hop or a misconfiguration on the LoadBalancer.

- Be sure to consider highly-available clusters, where, for example,
- feature flags will be enabled on some API servers and not others during the
- rollout. Similarly, consider large clusters and how enablement/disablement
- will rollout across nodes.
- -->
+ A rollback can fail if kube-proxy does not detect the rollback and fails to
+ re-add the LoadBalancer address to the interface.

###### What specific metrics should inform a rollback?

- <!--
- What signals should users be paying attention to when the feature is young
- that might indicate a serious problem?
- -->
+ Workloads consuming a Service configured to use the new `ipMode` that start
+ failing to reach the Service, or an increase in request latency, are signals
+ that should inform a rollback.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

- <!--
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
- -->
+ No.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

- <!--
- Even if applying deprecation policies, they may still surprise some users.
- -->
+ No.

### Monitoring Requirements

- <!--
- This section must be completed when targeting beta to a release.
-
- For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field.
- -->
-

###### How can an operator determine if the feature is in use by workloads?

- <!--
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
- -->
+ As this is a low-level operation, an operator should check that it is working
+ as follows (see the sketch after this list):
+ * Verify a Service of type=LoadBalancer with this feature enabled
+ * Check and confirm that the IPs set in `.status.loadBalancer.ingress.ip` are
+   not set on any node interface

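A minimal sketch of the first check, assuming client-go v0.29+ where the `IPMode` field and the `LoadBalancerIPModeProxy` constant exist; it lists LoadBalancer Services and prints every ingress IP whose `ipMode` is `Proxy`, i.e. the addresses that should then be absent from the nodes' interfaces:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; in-cluster config would work equally well.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		for _, ing := range svc.Status.LoadBalancer.Ingress {
			if ing.IPMode != nil && *ing.IPMode == corev1.LoadBalancerIPModeProxy {
				// This address should NOT show up on any node interface.
				fmt.Printf("%s/%s: ingress IP %s has ipMode=Proxy\n",
					svc.Namespace, svc.Name, ing.IP)
			}
		}
	}
}
```

Any IP printed here that still appears in a node's interface configuration indicates the feature is not taking effect on that node.
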
###### How can someone using this feature know that it is working for their instance?

- <!--
- For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
- for each individual pod.
- Pick one more of these and delete the rest.
- Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
- and operation of this feature.
- Recall that end users cannot usually observe component logs or access metrics.
- -->
-
- - [ ] Events
-   - Event Reason:
- - [ ] API .status
+ - [X] API .status
-   - Condition name:
-   - Other field:
+   - Other field: `.status.loadBalancer.ingress.ipMode` not null
- - [ ] Other (treat as last resort)
-   - Details:

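For context, a sketch of the slice of `k8s.io/api/core/v1` where this field lives (abridged from the upstream types; the real `LoadBalancerIngress` also carries `Hostname` and `Ports`):

```go
// Abridged sketch of the upstream types, not the complete definitions.
type LoadBalancerIPMode string

const (
	// "VIP" is the traditional behaviour: traffic is delivered to the node
	// with the destination set to the load balancer's IP and port.
	LoadBalancerIPModeVIP LoadBalancerIPMode = "VIP"
	// "Proxy" means traffic is delivered through the load balancer with the
	// destination set to the node's or pod's IP and port.
	LoadBalancerIPModeProxy LoadBalancerIPMode = "Proxy"
)

type LoadBalancerIngress struct {
	// IP is set for load-balancer ingress points that are IP based.
	IP string `json:"ip,omitempty"`
	// IPMode specifies how the load-balancer IP behaves; a non-nil value
	// here is what tells a user the feature is active for this entry.
	IPMode *LoadBalancerIPMode `json:"ipMode,omitempty"`
}
```
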
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

- <!--
- This is your opportunity to define what "normal" quality of service looks like
- for a feature.
-
- It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
-   - per-day percentage of API calls finishing with 5XX errors <= 1%
-   - 99% percentile over day of absolute value from (job creation time minus expected
-     job creation time) for cron job <= 10%
-   - 99.9% of /health requests per day finish with 200 code
-
- These goals will help you determine what you need to measure (SLIs) in the next
- question.
- -->
+ No increase in error rate when a workload in the cluster targets a Service of
+ type LoadBalancer with the feature enabled.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- <!--
- Pick one more of these and delete the rest.
- -->
-
- - [ ] Metrics
-   - Metric name:
-   - [Optional] Aggregation method:
-   - Components exposing the metric:
- - [ ] Other (treat as last resort)
-   - Details:
+ - [X] Other (treat as last resort)
+   - Details: Workload/Application instrumentation containing the error rate and
+     latency of calls against other Services in the cluster

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- <!--
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
- -->
+ N/A

### Dependencies

- <!--
- This section must be completed when targeting beta to a release.
- -->
-

###### Does this feature depend on any specific services running in the cluster?

- <!--
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
-
- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
-   - [Dependency name]
-     - Usage description:
-       - Impact of its outage on the feature:
-       - Impact of its degraded performance or high-error rates on the feature:
- -->
+ - cloud controller manager / LoadBalancer controller
+   - The LoadBalancer controller should set the right `.status` field for `ipMode`
+   - In case of an outage of this component, traffic may still be routed using
+     the `VIP` mode
+ - kube-proxy
+   - Programs the LoadBalancer IP address on the node's network interfaces (see
+     the sketch after this list)
+   - In case of an outage of this component, the node may keep adding the
+     LoadBalancer IP to its interfaces, which may cause incorrect traffic routing
+   - This dependency does not apply to clusters that use a CNI that replaces
+     kube-proxy; in that case, the CNI has to implement this feature itself

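The sketch referenced above: the per-ingress decision kube-proxy (or a replacing CNI) has to make, written against the upstream `corev1` constants. This is illustrative logic only, not the actual kube-proxy code:

```go
package proxy

import corev1 "k8s.io/api/core/v1"

// shouldBindLocally is a hypothetical helper: true means the node keeps the
// historic behaviour of claiming the LoadBalancer IP on a local interface;
// false means ipMode=Proxy asked the node to leave the address alone so that
// traffic always traverses the load balancer.
func shouldBindLocally(ing corev1.LoadBalancerIngress) bool {
	// A nil IPMode is treated as "VIP" for backward compatibility.
	return ing.IPMode == nil || *ing.IPMode == corev1.LoadBalancerIPModeVIP
}
```
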
### Scalability
@@ -336,79 +274,36 @@ previous answers based on experience in the field.
###### Will enabling / using this feature result in any new API calls?

- <!--
- Describe them, providing:
-   - API call type (e.g. PATCH pods)
-   - estimated throughput
-   - originating component(s) (e.g. Kubelet, Feature-X-controller)
- Focusing mostly on:
-   - components listing and/or watching resources they didn't before
-   - API calls that may be triggered by changes of some Kubernetes resources
-     (e.g. update of object X triggers new updates of object Y)
-   - periodic API calls to reconcile state (e.g. periodic fetching state,
-     heartbeats, leader election, etc.)
- -->
+ - API call type: PATCH of the Service status
+ - Estimated throughput: 1 per Service creation/reconciliation
+ - Originating component: cloud controller manager / LoadBalancer controller

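A minimal sketch of that single write, assuming client-go and a JSON merge patch against the Service status subresource (`patchIPMode` and the example address are hypothetical; the real controller code may shape the patch differently):

```go
package lbcontroller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchIPMode mirrors the one PATCH a LoadBalancer controller issues per
// Service reconciliation: it records the provisioned address together with
// ipMode=Proxy on the status subresource.
func patchIPMode(client kubernetes.Interface, namespace, name, ip string) error {
	patch := []byte(`{"status":{"loadBalancer":{"ingress":[{"ip":"` + ip + `","ipMode":"Proxy"}]}}}`)
	_, err := client.CoreV1().Services(namespace).Patch(
		context.TODO(), name, types.MergePatchType, patch,
		metav1.PatchOptions{}, "status")
	return err
}
```
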
###### Will enabling / using this feature result in introducing new API types?

- <!--
- Describe them, providing:
-   - API type
-   - Supported number of objects per cluster
-   - Supported number of objects per namespace (for namespace-scoped objects)
- -->
+ No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

- <!--
- Describe them, providing:
-   - Which API(s):
-   - Estimated increase:
- -->
+ No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

- <!--
- Describe them, providing:
-   - API type(s):
-   - Estimated increase in size: (e.g., new annotation of size 32B)
-   - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
- -->
+ - API type: v1/Service
+ - Estimated increase in size: a new string field; the longest supported option
+   at this time is 5 characters (`Proxy`)
+ - Estimated amount of new objects: 0

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

- <!--
- Look at the [existing SLIs/SLOs].
-
- Think about adding additional work or introducing new steps in between
- (e.g. need to do X to start a container), etc. Please describe the details.
+ No.

- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
- -->

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

- <!--
- Things to keep in mind include: additional in-memory state, additional
- non-trivial computations, excessive access to disks (including increased log
- volume), significant amount of data sent and/or received over network, etc.
- This through this both in small and large cases, again with respect to the
- [supported limits].
-
- [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
- -->
+ No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

- <!--
- Focus not just on happy cases, but primarily on more pathological cases
- (e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
- If any of the resources can be exhausted, how this is mitigated with the existing limits
- (e.g. pods per node) or new limits added by this KEP?
-
- Are there any tests that were run/should be run to understand performance characteristics better
- and validate the declared limits?
- -->
+ No.

### Troubleshooting
@@ -425,19 +320,14 @@ details). For now, we leave it here.
###### How does this feature react if the API server and/or etcd is unavailable?

+ The same as for any LoadBalancer / cloud controller manager: the new IP and the
+ new status will not be set.
+
+ kube-proxy reacts to the status IP, so the Service LoadBalancer IP and its
+ configuration will stay pending.

###### What are other known failure modes?

- <!--
- For each of them, fill in the following information by copying the below template:
-   - [Failure mode brief description]
-     - Detection: How can it be detected via metrics? Stated another way:
-       how can an operator troubleshoot without logging into a master or worker node?
-     - Mitigations: What can be done to stop the bleeding, especially for already
-       running user workloads?
-     - Diagnostics: What are the useful log messages and their required logging
-       levels that could help debug the issue?
-       Not required until feature graduated to beta.
-     - Testing: Are there any tests for failure mode? If not, describe why.
- -->
+ N/A

###### What steps should be taken if SLOs are not being met to determine the problem?

+ N/A