Commit 04ff1ce

committed
update content for beta.
1 parent d8c229f commit 04ff1ce

2 files changed

+124
-5
lines changed

keps/sig-cloud-provider/2436-controller-manager-leader-migration/README.md

Lines changed: 120 additions & 2 deletions
@@ -27,6 +27,11 @@
2727
- [Version Skew Strategy](#version-skew-strategy)
2828
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
2929
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
30+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
31+
- [Monitoring Requirements](#monitoring-requirements)
32+
- [Dependencies](#dependencies)
33+
- [Scalability](#scalability)
34+
- [Troubleshooting](#troubleshooting)
3035
- [Implementation History](#implementation-history)
3136
- [Drawbacks](#drawbacks)
3237
- [Alternatives](#alternatives)
@@ -44,7 +49,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4449
- [X] (R) Production readiness review completed
4550
- [X] (R) Production readiness review approved
4651
- [ ] "Implementation History" section is up-to-date for milestone
47-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
52+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
4853
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
4954

5055
<!--
@@ -321,7 +326,7 @@ Leader migration configuration works on all in-tree cloud providers.
321326

322327
### Upgrade / Downgrade Strategy
323328

324-
See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migratoin) for upgrade strategy.
329+
See [Example Walkthrough of Controller Migration](#example-walkthrough-of-controller-migration-with-default-configuration) for upgrade strategy.
325330
Clusters can be downgraded and migration can be disabled by reversing the steps in the upgrade strategy assuming the behavior of each controller
326331
does not change incompatibly across those versions.
327332

@@ -349,18 +354,131 @@ feature without providing a configuration, the default configuration will reflec
349354
Yes. Once the feature is enabled via feature gate, it can be disabled by unsetting `--enable-leader-migration` on KCM
350355
and CCM.
351356

357+
###### What happens if we reenable the feature if it was previously rolled back?
358+
359+
This feature can be re-enabled without side effects.
360+
352361
###### Are there any tests for feature enablement/disablement?
353362

354363
Yes. Unit & integration tests include flag/configuration parsing. E2E test will have cases with the feature enabled and
355364
disabled.
356365

366+
### Rollout, Upgrade and Rollback Planning
367+
368+
###### How can a rollout or rollback fail? Can it impact already running workloads?
369+
370+
The rollout may fail if the configuration file does not describe the correct controller-to-manager assignment.
371+
This can cause controllers referred to in the configuration file to either be unavailable or run in multiple instances.
372+
373+
The rollback may fail if the leader election of the controller manager is not properly configured.
374+
For example, multiple instances of the same controller manager are running without election, or none of the instances become the leader.
375+
In these situations, all controllers will either be unavailable or conflict across multiple instances.
376+
377+
###### What specific metrics should inform a rollback?
378+
379+
If neither controller manager shows `leader_active` for the main leader lock or the migration lock, Leader Migration may have failed to activate and thus needs rollback.
380+
If any of the controllers indicate they are unavailable through their per-controller metric, Leader Migration may need reconfiguration.
381+
The metrics of each controller are specific to the implementation of each cloud provider and are out of scope of this KEP.
382+
383+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
384+
385+
Manual testing showed a clean takeover of the migration leader during both the upgrade and the downgrade process.
386+
This process will be tested as part of the e2e suite, required by [Graduation Criteria](#graduation-criteria).
387+
388+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
389+
None.
390+
391+
### Monitoring Requirements
392+
393+
###### How can an operator determine if the feature is in use by workloads?
394+
395+
N/A. This feature is never used by any user workloads.
396+
397+
###### How can someone using this feature know that it is working for their instance?
398+
399+
- [X] Other (treat as last resort)
400+
- Details: The cluster administrator SHOULD have access to logs and metrics during cluster upgrade.
401+
402+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
403+
404+
N/A. This feature is short-lived and only active during cluster upgrade.
405+
406+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
407+
408+
- [X] Metrics
409+
- leader_active
410+
- other per-controller availability metrics.
411+
412+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
413+
414+
It would help if every controller that the controller manager hosts exposed metrics about its availability.
415+
However, per-controller metrics are out of scope of this KEP.
416+
417+
### Dependencies
418+
419+
###### Does this feature depend on any specific services running in the cluster?
420+
421+
- API Server
422+
- needed for leader election
423+
- Impact of its outage on the feature: when leader election times out, controller managers will lose leadership and exit, causing an outage.
424+
- Impact of its degraded performance or high-error rates on the feature: delayed or retried operations of leader election.
425+
426+
### Scalability
427+
428+
###### Will enabling / using this feature result in any new API calls?
429+
430+
Leader Migration uses exactly one additional `coordination.k8s.io/v1.Lease` resource, managed through the standard leader election process.
431+
Both `kube-controller-manager` and `cloud-controller-manager` will create, update, and watch on the lease.
432+
433+
If the service accounts are not granted access to the lease resources, the RBAC roles of each controller manager may need to be modified before the upgrade.
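
As a rough sketch of the RBAC change described above, the rules below grant a controller manager access to a single migration lease. The Role name, the namespace, and the lease name `cloud-provider-extraction-migration` are assumptions for illustration; the actual lease name depends on the migration configuration in use. The verbs mirror the operations named above, plus `get`, which the leader election client is assumed to need.

```yaml
# Illustrative only: metadata.name, namespace, and the lease name are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-migration-lease        # hypothetical Role name
  namespace: kube-system              # assumed namespace of the lease
rules:
  # Access to the one migration lease, restricted by name.
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["cloud-provider-extraction-migration"]  # assumed lease name
    verbs: ["get", "update", "watch"]
  # "create" cannot be restricted by resourceNames, so it is granted separately.
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
```

A Role like this would still need a RoleBinding to the service account of each controller manager, which is omitted here.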
434+
435+
###### Will enabling / using this feature result in introducing new API types?
436+
437+
Type: `controllermanager.config.k8s.io/v1alpha1.LeaderMigrationConfiguration`
438+
This resource is only for configuration file parsing. The resource should never reach the API server.
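
For illustration, a configuration file of this type might look like the sketch below. The `leaderName` value and the listed controllers are assumptions chosen for the example and do not describe any particular cluster.

```yaml
# Illustrative sketch: leaderName and the controller list are assumptions.
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1alpha1
leaderName: cloud-provider-extraction-migration   # name of the migration lease
resourceLock: leases                              # lock type backing the migration lock
controllerLeaders:
  # Each entry assigns one controller to the component that should run it.
  - name: route
    component: kube-controller-manager
  - name: service
    component: kube-controller-manager
  - name: cloud-node-lifecycle
    component: kube-controller-manager
```

Each controller manager reads this file locally, which is consistent with the statement above that the resource never reaches the API server.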
439+
440+
###### Will enabling / using this feature result in any new calls to the cloud provider?
441+
442+
No.
443+
444+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
445+
446+
This feature uses exactly one more `coordination.k8s.io/v1.Lease` resource. The RBAC roles of both controller managers will
447+
gain an additional ~50 bytes because of the new lease listed under `resourceNames`.
448+
449+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
450+
451+
No.
452+
453+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
454+
455+
Both `kube-controller-manager` and `cloud-controller-manager` run another leader election process,
456+
which may slightly increase CPU usage during leader changes, and each needs extra memory to hold another lease.
457+
458+
### Troubleshooting
459+
460+
###### How does this feature react if the API server and/or etcd is unavailable?
461+
462+
The existing implementation of controller managers will `klog.Fatal` once the leader times out, which eventually happens if the API server is unavailable.
463+
Leader Migration will not change such behavior.
464+
465+
###### What are other known failure modes?
466+
467+
None. Leader Migration is known to fail only because of misconfiguration or unavailability of the API server, both of which are discussed above.
468+
469+
###### What steps should be taken if SLOs are not being met to determine the problem?
470+
N/A.
471+
357472
## Implementation History
358473

359474
- 07-25-2019 `Summary` and `Motivation` sections were merged signaling SIG acceptance
360475
- 01-21-2019 Implementation details are proposed to move KEP to `implementable` state.
361476
- 09-30-2020 `LeaderMigrationConfiguration` and `ControllerLeaderConfiguration` schemas merged as #94205.
362477
- 11-04-2020 Registration of both types merged as #96133
363478
- 12-28-2020 Parsing and validation merged as #96226
479+
- 03-10-2021 Implementation for alpha state completed, released in 1.21.
480+
- 03-30-2021 User guide published as kubernetes/website#26970
481+
- 05-11-2021 KEP updated to target beta.
364482

365483
## Drawbacks
366484

keps/sig-cloud-provider/2436-controller-manager-leader-migration/kep.yaml

Lines changed: 4 additions & 3 deletions
@@ -20,12 +20,12 @@ see-also:
2020
- "/keps/sig-cloud-provider/20180530-cloud-controller-manager.md"
2121

2222
# The target maturity stage in the current dev cycle for this KEP.
23-
stage: alpha
23+
stage: beta
2424

2525
# The most recent milestone for which work toward delivery of this KEP has been
2626
# done. This can be the current (upcoming) milestone, if it is being actively
2727
# worked on.
28-
latest-milestone: "v1.21"
28+
latest-milestone: "v1.22"
2929

3030
# The milestone at which this feature was, or is targeted to be, at each stage.
3131
milestone:
@@ -43,4 +43,5 @@ feature-gates:
4343
disable-supported: true
4444

4545
# The following PRR answers are required at beta release
46-
metrics: []
46+
metrics:
47+
- leader_active
