@@ -44,7 +49,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [X] (R) Production readiness review completed
 - [X] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
<!--
@@ -321,7 +326,7 @@ Leader migration configuration works on all in-tree cloud providers.
 
 ### Upgrade / Downgrade Strategy
 
-See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migratoin) for upgrade strategy.
+See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migration-with-default-configuration) for upgrade strategy.
 Clusters can be downgraded and migration can be disabled by reversing the steps in the upgrade strategy assuming the behavior of each controller
 does not change incompatibly across those versions.
 
@@ -349,18 +354,131 @@ feature without providing a configuration, the default configuration will reflec
 Yes. Once the feature is enabled via feature gate, it can be disabled by unsetting `--enable-leader-migration` on KCM
 and CCM.
 
+###### What happens if we reenable the feature if it was previously rolled back?
+
+This feature can be re-enabled without side effects.
+
 ###### Are there any tests for feature enablement/disablement?
 
 Yes. Unit & integration tests include flag/configuration parsing. E2E test will have cases with the feature enabled and
 disabled.
 
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+The rollout may fail if the configuration file does not represent the correct controller-to-manager assignment.
+This can cause controllers referred to in the configuration file to either be unavailable or run in multiple instances.
+
+The rollback may fail if the leader election of the controller manager is not properly configured.
+For example, multiple instances of the same controller manager are running without election, or none of the instances becomes the leader.
+In these situations, all controllers will either be unavailable or run as conflicting instances.
+
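
The rollout failure described above can be illustrated with a small check. This is only a sketch under stated assumptions: the `diagnose` helper and the `views` structure are hypothetical and are not part of the KEP or of any Kubernetes API. Each manager is modeled by the list of controllers it would run under the configuration file it loaded.

```python
from collections import Counter


def diagnose(all_controllers, views):
    """Return (unavailable, duplicated) controller names.

    views maps a manager name to the controllers that manager would run
    under the configuration file it loaded (a hypothetical model).
    """
    # Count how many managers would run each controller.
    counts = Counter(name for owned in views.values() for name in set(owned))
    unavailable = sorted(c for c in all_controllers if counts[c] == 0)
    duplicated = sorted(c for c in all_controllers if counts[c] > 1)
    return unavailable, duplicated


# Hypothetical views: the two managers loaded files that disagree.
views = {
    "kube-controller-manager": ["route", "service"],
    "cloud-controller-manager": ["service"],
}
unavailable, duplicated = diagnose(
    ["cloud-node-lifecycle", "route", "service"], views
)
print(unavailable)  # ['cloud-node-lifecycle']
print(duplicated)   # ['service']
```

A controller that no manager runs corresponds to the "unavailable" case, and a controller claimed by more than one manager corresponds to the "multiple instances" case.
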
+###### What specific metrics should inform a rollback?
+
+If neither controller manager shows `leader_active` for the main leader lock or the migration lock, Leader Migration may have failed to activate and a rollback is needed.
+If any of the controllers indicate they are unavailable through their per-controller metric, Leader Migration may need reconfiguration.
+The metrics of each controller are specific to the implementation of each cloud provider and are out of scope of this KEP.
+
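
One possible reading of the `leader_active` rule above is sketched below. Everything except the metric name is an assumption: the `locks_without_leader` helper, the `(manager, lock)` sample layout, and the lock names are placeholders, and the way the metric is actually scraped and labeled is not specified here.

```python
def locks_without_leader(samples, locks):
    """Return the locks for which no controller manager reports leadership.

    samples maps (manager, lock) to the observed leader_active value.
    The layout is hypothetical; real metric labels may differ.
    """
    held = {lock for (_, lock), value in samples.items() if value == 1}
    return [lock for lock in locks if lock not in held]


# Hypothetical scrape results; the lock names are placeholders.
samples = {
    ("kube-controller-manager", "kcm-main"): 1,
    ("kube-controller-manager", "migration"): 0,
    ("cloud-controller-manager", "ccm-main"): 1,
    ("cloud-controller-manager", "migration"): 0,
}
stalled = locks_without_leader(samples, ["kcm-main", "ccm-main", "migration"])
print(stalled)  # ['migration']
```

Under this reading, a non-empty result would be a signal to investigate and possibly roll back.
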
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+Manual testing showed a clean takeover of the migration leader during both the upgrade and downgrade processes.
+This process will be tested as part of the e2e suite, as required by [Graduation Criteria](#graduation-criteria).
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+None.
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+N/A. This feature is never used by any user workloads.
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [X] Other (treat as last resort)
+  - Details: The cluster administrator SHOULD have access to logs and metrics during cluster upgrade.
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+N/A. This feature is short-lived and only active during cluster upgrade.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [X] Metrics
+  - leader_active
+  - other per-controller availability metrics.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+It would help if every controller that the controller manager hosts exposed metrics about its availability.
+However, per-controller metrics are out of scope of this KEP.
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+- API Server
+  - needed for leader election
+    - Impact of its outage on the feature: when leader election times out, controller managers will lose leadership and exit, causing an outage.
+    - Impact of its degraded performance or high-error rates on the feature: delayed or retried operations of leader election.
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+Leader Migration uses exactly one additional `coordination.k8s.io/v1.Lease` resource through the standard leader election process.
+Both `kube-controller-manager` and `cloud-controller-manager` will create, update, and watch the lease.
+
+If the service accounts are not granted access to the lease resources, the RBAC roles of each controller manager may need to be modified before the upgrade.
+
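
Before an upgrade, the RBAC requirement above could be checked with something like the following sketch. The `missing_lease_verbs` helper is hypothetical. It assumes the rules are already available as dictionaries using the RBAC `apiGroups`, `resources`, and `verbs` fields, and it takes the required verbs directly from the text above.

```python
# Verbs taken from the text above; adjust if the managers need others.
REQUIRED_VERBS = ("create", "update", "watch")


def missing_lease_verbs(rules, required=REQUIRED_VERBS):
    """Return the required verbs that the rules do not grant on leases.

    rules is a list of dicts shaped like RBAC policy rules. This sketch
    ignores resourceNames restrictions and non-resource rules.
    """
    granted = set()
    for rule in rules:
        groups = set(rule.get("apiGroups", []))
        resources = set(rule.get("resources", []))
        if not groups & {"coordination.k8s.io", "*"}:
            continue
        if not resources & {"leases", "*"}:
            continue
        granted.update(rule.get("verbs", []))
    if "*" in granted:
        return []
    return [verb for verb in required if verb not in granted]


# Hypothetical role contents for one controller manager.
rules = [
    {"apiGroups": [""], "resources": ["endpoints"],
     "verbs": ["get", "create", "update"]},
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"],
     "verbs": ["get", "create", "update"]},
]
print(missing_lease_verbs(rules))  # ['watch']
```

An empty result would suggest that the role already covers the lease operations listed above.
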
+###### Will enabling / using this feature result in introducing new API types?