@@ -44,7 +49,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
44
49
-[X] (R) Production readiness review completed
45
50
-[X] (R) Production readiness review approved
46
51
-[ ] "Implementation History" section is up-to-date for milestone
47
-
-[] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
52
+
-[X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
48
53
-[X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
49
54
50
55
<!--
@@ -314,14 +319,15 @@ unsetting the `--enable-migration-config` flag.
314
319
##### Alpha -> Beta Graduation
315
320
316
321
Leader migration configuration is tested end-to-end on at least 2 cloud providers.
322
+
The default migration configuration is implemented and tested.
317
323
318
324
##### Beta -> GA Graduation
319
325
320
326
Leader migration configuration works on all in-tree cloud providers.
321
327
322
328
### Upgrade / Downgrade Strategy
323
329
324
-
See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migratoin) for upgrade strategy.
330
+
See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migration-with-default-configuration) for upgrade strategy.
325
331
Clusters can be downgraded and migration can be disabled by reversing the steps in the upgrade strategy assuming the behavior of each controller
326
332
does not change incompatibly across those versions.
327
333
@@ -349,18 +355,136 @@ feature without providing a configuration, the default configuration will reflec
349
355
Yes. Once the feature is enabled via feature gate, it can be disabled by unsetting `--enable-leader-migration` on KCM
350
356
and CCM.
351
357
358
+
###### What happens if we reenable the feature if it was previously rolled back?
359
+
360
+
This feature can be re-enabled without side effects.
361
+
352
362
###### Are there any tests for feature enablement/disablement?
353
363
354
364
Yes. Unit & integration tests include flag/configuration parsing. E2E test will have cases with the feature enabled and
355
365
disabled.
356
366
367
+
### Rollout, Upgrade and Rollback Planning
368
+
369
+
###### How can a rollout or rollback fail? Can it impact already running workloads?
370
+
371
+
The rollout may fail if the configuration file does not represent the correct controller-to-manager assignment.
372
+
This can cause controllers referred to in the configuration file to either be unavailable or run in multiple instances.
373
+
374
+
The rollback may fail if the leader election of the controller manager is not properly configured.
375
+
For example, multiple instances of the same controller manager are running without election, or none of the instances become the leader.
376
+
In these situations, all controllers will either be unavailable or conflict across multiple instances.
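
The rollout failure mode above can be illustrated with a minimal sketch. It assumes a simplified model that is not the actual configuration API: `ControllerLeader`, `runnersOf`, and `findMisassigned` are hypothetical names, and the controller names are only illustrative. Each manager is assumed to run a controller only when the configuration it loaded assigns that controller to itself.

```go
package main

import (
	"fmt"
	"sort"
)

// ControllerLeader is a hypothetical, simplified stand-in for one entry of a
// migration configuration: which component should run a given controller.
type ControllerLeader struct {
	Name      string
	Component string
}

// runnersOf returns, for each controller, the managers that would run it.
// configs maps each manager's name to the configuration that manager loaded.
// Assumption: a manager runs a controller only if its own configuration
// assigns the controller to that manager.
func runnersOf(configs map[string][]ControllerLeader) map[string][]string {
	runners := make(map[string][]string)
	for manager, leaders := range configs {
		for _, l := range leaders {
			// Record the controller even if nobody ends up running it.
			if _, seen := runners[l.Name]; !seen {
				runners[l.Name] = nil
			}
			if l.Component == manager {
				runners[l.Name] = append(runners[l.Name], manager)
			}
		}
	}
	return runners
}

// findMisassigned reports controllers that no manager runs (unavailable)
// and controllers that more than one manager runs (duplicated).
func findMisassigned(configs map[string][]ControllerLeader) (unavailable, duplicated []string) {
	for name, managers := range runnersOf(configs) {
		switch {
		case len(managers) == 0:
			unavailable = append(unavailable, name)
		case len(managers) > 1:
			duplicated = append(duplicated, name)
		}
	}
	sort.Strings(unavailable)
	sort.Strings(duplicated)
	return unavailable, duplicated
}

func main() {
	// The two managers loaded configurations that disagree with each other.
	configs := map[string][]ControllerLeader{
		"kube-controller-manager": {
			{Name: "route", Component: "kube-controller-manager"},
			{Name: "service", Component: "cloud-controller-manager"},
		},
		"cloud-controller-manager": {
			{Name: "route", Component: "cloud-controller-manager"},
			{Name: "service", Component: "kube-controller-manager"},
		},
	}
	unavailable, duplicated := findMisassigned(configs)
	fmt.Println("unavailable:", unavailable)
	fmt.Println("duplicated:", duplicated)
}
```

In this example both managers claim `route`, so it would run in two instances, while neither claims `service`, so it would be unavailable. A check along these lines could run before the rollout to catch configurations that disagree.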
377
+
378
+
###### What specific metrics should inform a rollback?
379
+
380
+
If neither controller manager shows `leader_active` for the main leader lock or the migration lock, Leader Migration may have failed to activate and thus need rollback.
381
+
If any of the controllers indicate they are unavailable through their per-controller metric, Leader Migration may need reconfiguration.
382
+
The metrics of each controller are specific to the implementation of each cloud provider and out of scope of this KEP.
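
The decision rule above can be written down as a small sketch. It rests on one reading of the text, namely that rollback is indicated only when no controller manager reports `leader_active` for either lock. `LockStatus`, `Action`, and `recommend` are hypothetical names, and the inputs are assumed to have been scraped from the metrics already.

```go
package main

import "fmt"

// LockStatus is a hypothetical summary of whether one controller manager
// reports leader_active for the main leader lock and for the migration lock.
type LockStatus struct {
	MainActive      bool
	MigrationActive bool
}

// Action is the suggested operator response.
type Action string

const (
	NoAction    Action = "none"
	Reconfigure Action = "reconfigure"
	Rollback    Action = "rollback"
)

// recommend applies the rule under the stated assumption:
//   - no manager active on any lock          -> rollback
//   - some controller reports unavailability -> reconfigure
//   - otherwise                              -> no action
func recommend(managers map[string]LockStatus, controllerAvailable map[string]bool) Action {
	anyActive := false
	for _, s := range managers {
		if s.MainActive || s.MigrationActive {
			anyActive = true
		}
	}
	if !anyActive {
		return Rollback
	}
	for _, available := range controllerAvailable {
		if !available {
			return Reconfigure
		}
	}
	return NoAction
}

func main() {
	managers := map[string]LockStatus{
		"kube-controller-manager":  {MainActive: true, MigrationActive: false},
		"cloud-controller-manager": {MainActive: true, MigrationActive: true},
	}
	// Per-controller availability is provider specific; these values are made up.
	controllers := map[string]bool{"route": true, "service": false}
	fmt.Println("suggested action:", recommend(managers, controllers))
}
```

Because per-controller metrics differ between cloud providers, the `controllerAvailable` input is left as a plain map here; how it is populated is outside what this sketch covers.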
383
+
384
+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
385
+
386
+
Manual testing showed a clean takeover of the migration leader during both the upgrade and the downgrade process.
387
+
This process will be tested as part of the e2e suite, required by [Graduation Criteria](#graduation-criteria).
388
+
389
+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
390
+
None.
391
+
392
+
### Monitoring Requirements
393
+
394
+
###### How can an operator determine if the feature is in use by workloads?
395
+
396
+
N/A. This feature is never used by any user workloads.
397
+
398
+
###### How can someone using this feature know that it is working for their instance?
399
+
400
+
- [X] Other (treat as last resort)
401
+
- The `Lease` resource used in the migration can be watched for transition of leadership and timing information.
402
+
- Logs and metrics can directly indicate the status of migration.
403
+
404
+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
405
+
406
+
Leader Migration is designed to ensure availability of controller managers during upgrade,
407
+
and this feature will not affect SLOs of controller managers.
408
+
409
+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
410
+
411
+
- [X] Metrics
412
+
- leader_active
413
+
- other per-controller availability metrics.
414
+
415
+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
416
+
417
+
It would help if every controller that the controller manager hosts exposed metrics about its availability.
418
+
However, per-controller metrics are out of scope of this KEP.
419
+
420
+
Status of the migration lease, provided by the API server, can help observe the transition of holders
421
+
if exposed as resource metrics.
422
+
423
+
### Dependencies
424
+
425
+
###### Does this feature depend on any specific services running in the cluster?
426
+
427
+
- API Server
428
+
- needed for leader election
429
+
- Impact of its outage on the feature: when leader election times out, controller managers will lose leadership and exit, causing an outage.
430
+
- Impact of its degraded performance or high-error rates on the feature: delayed or retried operations of leader election.
431
+
432
+
### Scalability
433
+
434
+
###### Will enabling / using this feature result in any new API calls?
435
+
436
+
Leader Migration uses exactly one additional `coordination.k8s.io/v1.Lease` resource, through the standard leader election process.
437
+
Both `kube-controller-manager` and `cloud-controller-manager` will create, update, and watch the lease.
438
+
439
+
If the service accounts are not granted access to the lease resources, the RBAC roles of each controller manager may need to be modified before the upgrade.
440
+
441
+
###### Will enabling / using this feature result in introducing new API types?