@@ -189,138 +189,125 @@ Yes. It is tested by `TestUpdateServiceLoadBalancerStatus` in pkg/registry/core/
### Rollout, Upgrade and Rollback Planning
- <!--
- This section must be completed when targeting beta to a release.
- -->
-
###### How can a rollout or rollback fail? Can it impact already running workloads?
- <!--
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
+ The rollout only enables a feature that is not yet in use: there is no failure scenario
+ on rollout itself, because the feature must additionally be enabled by the cloud provider
+ on the Service resources.
- Be sure to consider highly-available clusters, where, for example,
- feature flags will be enabled on some API servers and not others during the
- rollout. Similarly, consider large clusters and how enablement/disablement
- will rollout across nodes.
- -->
+ In case of a rollback, kube-proxy also rolls back to the default behavior, switching
+ back to VIP mode. This can fail for workloads that already rely on the new behavior
+ (e.g. sending traffic to the LoadBalancer expecting additional features, like PROXY
+ protocol and TLS termination, as per the Motivation section).

###### What specific metrics should inform a rollback?
- <!--
- What signals should users be paying attention to when the feature is young
- that might indicate a serious problem?
- -->
+ If using kube-proxy, the metrics `sync_proxy_rules_duration_seconds` and
+ `sync_proxy_rules_last_timestamp_seconds` can help identify problems and indicate
+ that a rollback is required.
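+
+ As a sketch, a Prometheus alert on kube-proxy's rule-sync latency could surface such a
+ regression; the `kubeproxy_`-prefixed metric name and the 10s threshold below are
+ illustrative assumptions, not something this KEP prescribes:
+
+ ```yaml
+ # Hypothetical alerting rule: fires if kube-proxy's p99 rule-sync latency
+ # stays high after the rollout.
+ groups:
+ - name: kube-proxy-rollout
+   rules:
+   - alert: KubeProxySyncRulesSlow
+     expr: histogram_quantile(0.99, rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) > 10
+     for: 15m
+ ```
+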
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
- <!--
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
- -->
+ Because this feature depends on a CCM/LoadBalancer controller, and none implements it
+ yet, the upgrade/downgrade/upgrade path was simulated by enabling and disabling the
+ feature flag and making the corresponding changes on the Service status subresources.
+
+ A LoadBalancer (metallb) runs in the environment and is responsible for the proper
+ LB IP allocation and announcement, but the rest of the test concerns whether kube-proxy
+ programs the iptables rules along this enablement/disablement path.
+
+ * Initial scenario
+   * Started with a v1.29 cluster with the feature flag enabled
+   * Created 3 Deployments:
+     * web1 - will be using the new feature
+     * web2 - will NOT be using the new feature
+     * client - "the client"
+   * Created the LoadBalancers for the two web services. By default both LBs have the default `VIP` value:
+     ```yaml
+     status:
+       loadBalancer:
+         ingress:
+         - ip: 172.18.255.200
+           ipMode: VIP
+     ```
+   * With the feature flag enabled but no change on the Service resources, tested that both
+     web deployments were accessible
+   * Verified that the iptables rules for both LBs exist on all nodes
+ * Testing the feature ("upgrade")
+   * Changed the `ipMode` of the first LoadBalancer to `Proxy`
+   * Verified that the iptables rule for the second LB still exists, while the first one is gone
+   * Because the LoadBalancer of the first service (metallb) is not aware of this new behavior,
+     the service is not accessible anymore from the client Pod
+   * The second service, whose `ipMode` is `VIP`, is still accessible from the Pods
+ * Disabling the feature flag ("downgrade")
+   * Edited the kube-apiserver manifest and disabled the feature flag
+   * Edited the kube-proxy ConfigMap, disabled the feature, and restarted the kube-proxy Pods
+     (see the configuration sketch after this list)
+   * Confirmed that both iptables rules are present even though the `ipMode` field was still
+     set to `Proxy`, confirming the feature is disabled. Both accesses work.
+
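+ A minimal sketch of the kube-proxy configuration toggle used in the steps above, assuming
+ the `LoadBalancerIPMode` feature gate name; the other fields of the real ConfigMap are
+ omitted:
+
+ ```yaml
+ # Fragment of the KubeProxyConfiguration embedded in the kube-proxy ConfigMap.
+ apiVersion: kubeproxy.config.k8s.io/v1alpha1
+ kind: KubeProxyConfiguration
+ featureGates:
+   LoadBalancerIPMode: true   # set to false for the "downgrade" step
+ ```
+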
+ Additionally, an apiserver and kube-proxy upgrade test was executed as follows:
+ * Created a KinD cluster with v1.28
+ * Created the same deployments and services as above
+ * Both loadbalancers are accessible
+ * Upgraded apiserver and kube-proxy to v1.29, and enabled the feature flag
+ * Set `ipMode` to `Proxy` on one of the services and executed the same tests as above
+   (a sketch of the resulting status follows this list)
+   * Observed the expected behavior: the iptables rule for the changed service is
+     not created
+   * Observed that the changed service was no longer accessible, as expected
+ * Disabled the feature flag
+ * Rolled back kube-apiserver and kube-proxy to v1.28
+ * Verified that both services work correctly on v1.28
+ * Upgraded again to v1.29, keeping the feature flag disabled
+ * Both loadbalancers worked as expected; the field is still present on
+   the changed service.
+
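+ For reference, a sketch of the Service status after the `ipMode` change in both tests,
+ reusing the example IP from above:
+
+ ```yaml
+ status:
+   loadBalancer:
+     ingress:
+     - ip: 172.18.255.200
+       ipMode: Proxy
+ ```
+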
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- <!--
- Even if applying deprecation policies, they may still surprise some users.
- -->
+ No.
### Monitoring Requirements
- <!--
- This section must be completed when targeting beta to a release.
-
- For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field.
- -->
-
###### How can an operator determine if the feature is in use by workloads?
- <!--
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
- -->
+ If the LB IP works correctly from Pods, then the feature is working. An operator can
+ also check for Services whose `.status.loadBalancer.ingress.ipMode` is set.
###### How can someone using this feature know that it is working for their instance?
- <!--
- For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
- for each individual pod.
- Pick one more of these and delete the rest.
- Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
- and operation of this feature.
- Recall that end users cannot usually observe component logs or access metrics.
- -->
-
- - [ ] Events
-   - Event Reason:
- - [ ] API .status
+ - [X] API .status
  - Condition name:
-   - Other field:
- - [ ] Other (treat as last resort)
-   - Details:
+   - Other field: `.status.loadBalancer.ingress.ipMode` not null
+ - [X] Other:
+   - Details: To detect if the traffic is being directed to the LoadBalancer and not
+     directly to another node, the user will need to rely on the LoadBalancer logs
+     and the destination workload logs to check whether the traffic is coming from
+     another Pod or from the LoadBalancer.
+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- <!--
- This is your opportunity to define what "normal" quality of service looks like
- for a feature.
-
- It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
-   job creation time) for cron job <= 10%
- - 99.9% of /health requests per day finish with 200 code
-
- These goals will help you determine what you need to measure (SLIs) in the next
- question.
- -->
+ The quality of service for clouds using this feature is the same as the existing
+ quality of service for clouds that don't need this feature.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- <!--
- Pick one more of these and delete the rest.
- -->
-
- - [ ] Metrics
-   - Metric name:
-   - [Optional] Aggregation method:
-   - Components exposing the metric:
- - [ ] Other (treat as last resort)
-   - Details:
+ N/A
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
- <!--
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
- -->
+ * On kube-proxy, a metric with the count of programmed LoadBalancer IPs per service type
+   would be useful to determine if the feature is being used, and if there is any drift
+   between nodes.

### Dependencies
- <!--
- This section must be completed when targeting beta to a release.
- -->
-
###### Does this feature depend on any specific services running in the cluster?
- <!--
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
-
- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
-   - Usage description:
-     - Impact of its outage on the feature:
-     - Impact of its degraded performance or high-error rates on the feature:
- -->
+ - cloud controller manager / LoadBalancer controller
+   - If there is an outage of the cloud controller manager, the result is the same
+     as if this feature wasn't in use; the LoadBalancers will get out of sync with Services.
+ - kube-proxy or other service proxy that implements this feature
+   - If there is a service proxy outage, the result is the same as if this feature wasn't in use.

### Scalability
@@ -336,79 +323,34 @@ previous answers based on experience in the field.
###### Will enabling / using this feature result in any new API calls?
- <!--
- Describe them, providing:
- - API call type (e.g. PATCH pods)
- - estimated throughput
- - originating component(s) (e.g. Kubelet, Feature-X-controller)
- Focusing mostly on:
- - components listing and/or watching resources they didn't before
- - API calls that may be triggered by changes of some Kubernetes resources
-   (e.g. update of object X triggers new updates of object Y)
- - periodic API calls to reconcile state (e.g. periodic fetching state,
-   heartbeats, leader election, etc.)
- -->
+ No.
###### Will enabling / using this feature result in introducing new API types?
- <!--
- Describe them, providing:
- - API type
- - Supported number of objects per cluster
- - Supported number of objects per namespace (for namespace-scoped objects)
- -->
+ No.
###### Will enabling / using this feature result in any new calls to the cloud provider?
- <!--
- Describe them, providing:
- - Which API(s):
- - Estimated increase:
- -->
+ No.
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
- <!--
- Describe them, providing:
- - API type(s):
- - Estimated increase in size: (e.g., new annotation of size 32B)
- - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
- -->
+ - API type: v1/Service
+ - Estimated increase in size: new string field; the longest supported value at this time is 5 characters (`Proxy`)
+ - Estimated amount of new objects: 0

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
- <!--
- Look at the [existing SLIs/SLOs].
-
- Think about adding additional work or introducing new steps in between
- (e.g. need to do X to start a container), etc. Please describe the details.
-
- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
- -->
+ No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
- <!--
- Things to keep in mind include: additional in-memory state, additional
- non-trivial computations, excessive access to disks (including increased log
- volume), significant amount of data sent and/or received over network, etc.
- This through this both in small and large cases, again with respect to the
- [supported limits].
-
- [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
- -->
+ No.
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
- <!--
- Focus not just on happy cases, but primarily on more pathological cases
- (e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
- If any of the resources can be exhausted, how this is mitigated with the existing limits
- (e.g. pods per node) or new limits added by this KEP?
-
- Are there any tests that were run/should be run to understand performance characteristics better
- and validate the declared limits?
- -->
+ No.
### Troubleshooting
@@ -425,19 +367,14 @@ details). For now, we leave it here.
###### How does this feature react if the API server and/or etcd is unavailable?
+ The same as for any loadbalancer/cloud controller manager: the new IP and the new
+ status will not be set.
+
+ kube-proxy reacts to the IP in the status, so the service's LoadBalancer IP and
+ configuration will stay pending.

###### What are other known failure modes?
- <!--
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
-   - Detection: How can it be detected via metrics? Stated another way:
-     how can an operator troubleshoot without logging into a master or worker node?
-   - Mitigations: What can be done to stop the bleeding, especially for already
-     running user workloads?
-   - Diagnostics: What are the useful log messages and their required logging
-     levels that could help debug the issue?
-     Not required until feature graduated to beta.
-   - Testing: Are there any tests for failure mode? If not, describe why.
- -->
+ N/A
###### What steps should be taken if SLOs are not being met to determine the problem?
+ N/A