
Commit d035eac

Merge pull request kubernetes#2687 from jiahuif/kep-2436/prr-beta
PRR: KEP-2436 Leader Migration for Controller Managers to beta
2 parents d78aa0c + 0172933 commit d035eac

3 files changed: +132 -5 lines changed

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,5 @@
11
kep-number: 2436
22
alpha:
33
approver: "@deads2k"
4+
beta:
5+
approver: "@deads2k"

keps/sig-cloud-provider/2436-controller-manager-leader-migration/README.md

Lines changed: 126 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -27,6 +27,11 @@
2727
- [Version Skew Strategy](#version-skew-strategy)
2828
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
2929
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
30+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
31+
- [Monitoring Requirements](#monitoring-requirements)
32+
- [Dependencies](#dependencies)
33+
- [Scalability](#scalability)
34+
- [Troubleshooting](#troubleshooting)
3035
- [Implementation History](#implementation-history)
3136
- [Drawbacks](#drawbacks)
3237
- [Alternatives](#alternatives)
@@ -44,7 +49,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4449
- [X] (R) Production readiness review completed
4550
- [X] (R) Production readiness review approved
4651
- [ ] "Implementation History" section is up-to-date for milestone
47-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
52+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
4853
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
4954

5055
<!--
@@ -314,14 +319,15 @@ unsetting the `--enable-migration-config` flag.
314319
##### Alpha -> Beta Graduation
315320

316321
Leader migration configuration is tested end-to-end on at least 2 cloud providers.
322+
The default migration configuration is implemented and tested.
317323

318324
##### Beta -> GA Graduation
319325

320326
Leader migration configuration works on all in-tree cloud providers.
321327

322328
### Upgrade / Downgrade Strategy
323329

324-
See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migratoin) for upgrade strategy.
330+
See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migration-with-default-configuration) for upgrade strategy.
325331
Clusters can be downgraded and migration can be disabled by reversing the steps in the upgrade strategy assuming the behavior of each controller
326332
does not change incompatibly across those versions.
327333

@@ -349,18 +355,136 @@ feature without providing a configuration, the default configuration will reflec
349355
Yes. Once the feature is enabled via feature gate, it can be disabled by unsetting `--enable-leader-migration` on KCM
350356
and CCM.
351357

358+
###### What happens if we reenable the feature if it was previously rolled back?
359+
360+
This feature can be re-enabled without side effects.
361+
352362
###### Are there any tests for feature enablement/disablement?
353363

354364
Yes. Unit & integration tests include flag/configuration parsing. E2E test will have cases with the feature enabled and
355365
disabled.
356366

367+
### Rollout, Upgrade and Rollback Planning
368+
369+
###### How can a rollout or rollback fail? Can it impact already running workloads?
370+
371+
The rollout may fail if the configuration file does not represent the correct controller-to-manager assignment.
372+
This can cause controllers referred to in the configuration file to either be unavailable or run in multiple instances.
373+
374+
The rollback may fail if the leader election of the controller manager is not properly configured.
375+
For example, multiple instances of the same controller manager are running without election, or none of the instances become the leader.
376+
In these situations, all controllers will either be unavailable or conflict across multiple instances.
377+
378+
###### What specific metrics should inform a rollback?
379+
380+
If neither controller manager shows `leader_active` for the main leader lock or the migration lock, Leader Migration may have failed to activate and may need to be rolled back.
381+
If any of the controllers indicate they are unavailable through their per-controller metric, Leader Migration may need reconfiguration.
382+
The metrics of each controller are specific to the implementation of each cloud provider and are out of scope of this KEP.
383+
384+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
385+
386+
Manual testing showed a clean takeover of the migration leader during both the upgrade and the downgrade process.
387+
This process will be tested as part of the e2e suite, required by [Graduation Criteria](#graduation-criteria).
388+
389+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
390+
None.
391+
392+
### Monitoring Requirements
393+
394+
###### How can an operator determine if the feature is in use by workloads?
395+
396+
N/A. This feature is never used by any user workloads.
397+
398+
###### How can someone using this feature know that it is working for their instance?
399+
400+
- [X] Other (treat as last resort)
401+
- The `Lease` resource used in the migration can be watched for transition of leadership and timing information.
402+
- Logs and metrics can directly indicate the status of migration.
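
As a rough illustration of the leadership and timing information mentioned above, a `coordination.k8s.io/v1` `Lease` object might look like the following sketch. The field names under `spec` are those of the standard Lease API; the lease name, namespace, holder identity, and all values are assumptions chosen for illustration and are not taken from this KEP.

```yaml
# Illustrative Lease object; every value below is hypothetical.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cloud-provider-extraction-migration   # assumed migration lease name
  namespace: kube-system                      # assumed namespace
spec:
  holderIdentity: control-plane-1_example-id  # hypothetical current migration leader
  leaseDurationSeconds: 15                    # hypothetical lease duration
  acquireTime: "2021-05-11T10:00:00.000000Z"  # when the current holder acquired the lease
  renewTime: "2021-05-11T10:00:12.000000Z"    # last renewal by the current holder
  leaseTransitions: 3                         # number of changes of holder so far
```

A change in `holderIdentity` together with an increment of `leaseTransitions` would indicate that the migration lock moved from one controller manager to another.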
403+
404+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
405+
406+
Leader Migration is designed to ensure availability of controller managers during upgrade,
407+
and this feature will not affect SLOs of controller managers.
408+
409+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
410+
411+
- [X] Metrics
412+
- leader_active
413+
- other per-controller availability metrics.
414+
415+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
416+
417+
It would help if every controller that the controller manager hosts exposed metrics about its availability.
418+
However, per-controller metrics are out of scope of this KEP.
419+
420+
The status of the migration lease, provided by the API server, can help observe the transition of holders
421+
if exposed as resource metrics.
422+
423+
### Dependencies
424+
425+
###### Does this feature depend on any specific services running in the cluster?
426+
427+
- API Server
428+
- needed for leader election
429+
  - Impact of its outage on the feature: when leader election times out, controller managers will lose leadership and exit, causing an outage.
430+
- Impact of its degraded performance or high-error rates on the feature: delayed or retried operations of leader election.
431+
432+
### Scalability
433+
434+
###### Will enabling / using this feature result in any new API calls?
435+
436+
Leader Migration uses exactly one additional `coordination.k8s.io/v1.Lease` resource, managed through the standard leader election process.
437+
Both `kube-controller-manager` and `cloud-controller-manager` will create, update, and watch the lease.
438+
439+
If the service accounts are not granted access to the lease resources, the RBAC roles of each controller manager may need to be modified before the upgrade.
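
A minimal sketch of such an RBAC change is shown below, assuming the migration lease is named `cloud-provider-extraction-migration` and lives in `kube-system`. The Role name is hypothetical, and the exact set of verbs a given deployment needs may differ. Because RBAC cannot restrict `create` requests by resource name, `create` is granted in a separate rule.

```yaml
# Hypothetical Role; the name, namespace, and lease name are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-migration-lease        # hypothetical Role name
  namespace: kube-system              # assumed namespace of the lease
rules:
  # Access to the one migration lease, restricted by name.
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["cloud-provider-extraction-migration"]  # assumed lease name
    verbs: ["get", "update", "watch"]
  # `create` cannot be limited by resourceNames, so it has its own rule.
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
```

The Role would still need to be bound to the service account of each controller manager with a RoleBinding.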
440+
441+
###### Will enabling / using this feature result in introducing new API types?
442+
443+
Type: `controllermanager.config.k8s.io/v1alpha1.LeaderMigrationConfiguration`
444+
This resource is only for configuration file parsing. The resource should never reach the API server.
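
For illustration only, a configuration file of this type might look like the sketch below. The field names follow the published user guide for the alpha release as best recalled here; the controller names and their assignment to `kube-controller-manager` are an assumed example, not the default configuration defined by this KEP.

```yaml
# Illustrative configuration file; read from disk by the controller managers
# and never sent to the API server.
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1alpha1
leaderName: cloud-provider-extraction-migration  # assumed name of the migration lease
resourceLock: leases
controllerLeaders:
  # Each entry assigns one controller to the manager that should run it.
  - name: route
    component: kube-controller-manager
  - name: service
    component: kube-controller-manager
  - name: cloud-node-lifecycle
    component: kube-controller-manager
```

Moving a controller to `cloud-controller-manager` would then be a matter of changing its `component` value, with both managers loading the same file.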
445+
446+
###### Will enabling / using this feature result in any new calls to the cloud provider?
447+
448+
No.
449+
450+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
451+
452+
This feature uses exactly one more `coordination.k8s.io/v1.Lease` resource. The RBAC roles of both controller managers will
453+
gain approximately 50 additional bytes because of the new lease listed under `resourceNames`.
454+
455+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
456+
457+
No.
458+
459+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
460+
461+
Both `kube-controller-manager` and `cloud-controller-manager` run an additional leader election process,
462+
which causes a negligible increase in CPU and memory usage, both during upgrade and under normal operation.
463+
464+
### Troubleshooting
465+
466+
###### How does this feature react if the API server and/or etcd is unavailable?
467+
468+
The existing implementation of controller managers will call `klog.Fatal` once leader election times out, which eventually happens if the API server is unavailable.
469+
Leader Migration does not change this behavior.
470+
471+
###### What are other known failure modes?
472+
473+
None. Leader Migration is known to fail only because of misconfiguration or unavailability of the API server, both of which are discussed above.
474+
475+
###### What steps should be taken if SLOs are not being met to determine the problem?
476+
N/A.
477+
357478
## Implementation History
358479

359480
- 07-25-2019 `Summary` and `Motivation` sections were merged signaling SIG acceptance
360481
- 01-21-2019 Implementation details are proposed to move KEP to `implementable` state.
361482
- 09-30-2020 `LeaderMigrationConfiguration` and `ControllerLeaderConfiguration` schemas merged as #94205.
362483
- 11-04-2020 Registration of both types merged as #96133
363484
- 12-28-2020 Parsing and validation merged as #96226
485+
- 03-10-2021 Implementation for alpha state completed, released in 1.21.
486+
- 03-30-2021 User guide published as kubernetes/website#26970
487+
- 05-11-2021 KEP updated to target beta.
364488

365489
## Drawbacks
366490

keps/sig-cloud-provider/2436-controller-manager-leader-migration/kep.yaml

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -20,12 +20,12 @@ see-also:
2020
- "/keps/sig-cloud-provider/20180530-cloud-controller-manager.md"
2121

2222
# The target maturity stage in the current dev cycle for this KEP.
23-
stage: alpha
23+
stage: beta
2424

2525
# The most recent milestone for which work toward delivery of this KEP has been
2626
# done. This can be the current (upcoming) milestone, if it is being actively
2727
# worked on.
28-
latest-milestone: "v1.21"
28+
latest-milestone: "v1.22"
2929

3030
# The milestone at which this feature was, or is targeted to be, at each stage.
3131
milestone:
@@ -43,4 +43,5 @@ feature-gates:
4343
disable-supported: true
4444

4545
# The following PRR answers are required at beta release
46-
metrics: []
46+
metrics:
47+
- leader_active
