
Commit c66b2e9

Merge pull request #5088 from Jefftree/cle-beta
Coordinated Leader Election to Beta
2 parents 9d3935f + c3f73fa commit c66b2e9

File tree: 3 files changed (+123, -28 lines)

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
 kep-number: 4355
 alpha:
   approver: "@soltysh"
+beta:
+  approver: "@jpbetz"
+

keps/sig-api-machinery/4355-coordinated-leader-election/README.md

Lines changed: 116 additions & 26 deletions
@@ -99,6 +99,8 @@ SIG Architecture for cross-cutting KEPs).
 - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
   - [Alpha](#alpha)
+  - [Beta](#beta)
+  - [GA](#ga)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -741,7 +743,7 @@ extending the production code to implement this enhancement.
 -->

 - `staging/src/k8s.io/client-go/tools/leaderelection`: 76.8
-- `pkg/controller/leaderelection`: `TODO` - `new controller tests`
+- `pkg/controller/leaderelection`: `77.8`

 ##### Integration tests

@@ -760,7 +762,8 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
 https://storage.googleapis.com/k8s-triage/index.html
 -->

-- `test/integration/apiserver/coordinatedleaderelection`: New file
+
+- `test/integration/apiserver/coordinatedleaderelection/*`: New directory

 ##### e2e tests

@@ -841,14 +844,31 @@ in back-to-back releases.
 -->

 #### Alpha
+
 - Feature implemented behind a feature flag
-- The strategy `MinimumCompatibilityVersionStrategy` is implemented
+- The strategy `OldestEmulationVersion` is implemented
+
+#### Beta
+
+- e2e & integration tests for coordinated leader election covering various scenarios:
+  + a single LeaseCandidate
+  + multiple LeaseCandidates
+  + the lease is preempted when a more suitable candidate is found
+  + components that are unaware of coordination mixed with components that are aware
+  + downgrade to components that are unaware of coordination
+  + a custom third-party strategy controller
+- Lease pings are parallelized
+- Tests are included for third-party strategies
+- Tests for disablement of the feature gate
+
+#### GA
+
+- Load test Coordinated Leader Election
+- Feature is enabled by default

 ### Upgrade / Downgrade Strategy

-If the `--leader-elect-resource-lock=coordinatedleases` flag is set and a
-component is downgraded from beta to alpha, it will need to either remove the
-flag or enable the alpha feature. All other upgrades and downgrades are safe.
+Upgrading requires enabling the feature gate `CoordinatedLeaderElection` and the
+group version `coordination.k8s.io/v1alpha2`. Downgrading reverts to the old
+leader election mechanism, but extra data for `LeaseCandidate` objects under the
+`coordination.k8s.io/v1alpha2` group version may remain in etcd.

 <!--
 If applicable, how will the component be upgraded and downgraded? Make sure
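
For context on the upgrade requirement above, here is a minimal Go sketch (not prescribed by the KEP) that uses client-go's discovery client to confirm the `coordination.k8s.io/v1alpha2` group version is actually served once the feature gate and runtime config are enabled; the kubeconfig path is a placeholder.

```go
// Sketch: verify the LeaseCandidate API group version is served after upgrade.
// Assumes only a standard client-go discovery client; the group version string
// comes from the KEP text above.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Ask the apiserver whether coordination.k8s.io/v1alpha2 is served at all.
	resources, err := clientset.Discovery().ServerResourcesForGroupVersion("coordination.k8s.io/v1alpha2")
	if err != nil {
		fmt.Println("coordination.k8s.io/v1alpha2 is not served:", err)
		return
	}
	for _, r := range resources.APIResources {
		fmt.Printf("served resource: %s\n", r.Name) // expect "leasecandidates"
	}
}
```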
@@ -930,16 +950,12 @@ well as the [existing list] of feature gates.
   - kube-apiserver
   - kube-controller-manager
   - kube-scheduler
-- [ ] Other
-  - Describe the mechanism:
-  - Will enabling / disabling the feature require downtime of the control plane?
-  - Will enabling / disabling the feature require downtime or reprovisioning of
-    a node?

 ###### Does enabling the feature change any default behavior?

-No, even when the feature is enabled, a component must be configured with
-`--leader-elect-resource-lock=coordinatedleases` to use the feature.
+Yes, kube-scheduler and kube-controller-manager will use coordinated leader
+election instead of the default leader election mechanism if the feature is
+enabled.

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

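
The answer above covers kube-scheduler and kube-controller-manager. For a third-party controller built on client-go, opting in looks roughly like the sketch below; the `Coordinated` field reflects the client-go changes described by this KEP but should be treated as an assumption to verify against the client-go release in use, the namespace/lease name/identity are illustrative, and publishing a LeaseCandidate (for which client-go ships a helper) is omitted for brevity.

```go
// Sketch: a client-go based controller opting into coordinated leader election.
package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, _ := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	clientset := kubernetes.NewForConfigOrDie(config)

	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "my-controller", // illustrative namespace and lease name
		clientset.CoreV1(), clientset.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: "my-controller-instance-1"},
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Coordinated:   true, // assumption: opt-in added for coordinated leader election
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start work */ },
			OnStoppedLeading: func() { /* stop work */ },
		},
	})
}
```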
@@ -991,13 +1007,27 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->

+Rollouts and rollbacks can fail in many ways. During the first rollout of the
+feature, there will be a mixed state of control planes using and not using
+coordinated leader election. Components not using CLE will race to acquire the
+lease, while components using CLE will defer to the CLE controller to assign
+the leader. We cannot guarantee that the best leader is elected while versions
+are mixed, but leader election will still function.
+
+If the CLE controller has bugs, it may fail to select a leader or select an
+incorrect one, which could cause disruptions.
+
+If LeaseCandidate objects carry incorrect version information, the CLE
+controller may make an incorrect leader selection, potentially leading to
+version skew violations.
+
 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

+Leases failing to renew would be a signal to roll back.
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

 <!--
@@ -1006,12 +1036,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->

+Integration tests include testing for skew scenarios.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->

+No.
+
 ### Monitoring Requirements

 <!--
@@ -1029,6 +1063,11 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->

+The LeaseCandidate resource and the `CoordinatedLeaderElection` feature gate
+will be enabled. On the Lease object, a new `Strategy` field will be populated,
+indicating the strategy used by coordinated leader election to select the most
+suitable leader.
+
 ###### How can someone using this feature know that it is working for their instance?

 <!--
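
As a hedged sketch of the check described above: read a component's Lease and inspect the holder, renew time, and the new `Strategy` field. The lease name and namespace below are the conventional ones for kube-controller-manager and are illustrative; the `Strategy` field assumes a client-go version that includes this KEP's API changes.

```go
// Sketch: checking whether coordinated leader election manages a component's Lease.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, _ := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	clientset := kubernetes.NewForConfigOrDie(config)

	lease, err := clientset.CoordinationV1().Leases("kube-system").
		Get(context.Background(), "kube-controller-manager", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if lease.Spec.HolderIdentity != nil {
		fmt.Println("holder:", *lease.Spec.HolderIdentity)
	}
	if lease.Spec.RenewTime != nil {
		fmt.Println("last renew:", lease.Spec.RenewTime.Time)
	}
	// Strategy is only expected to be populated when CLE manages the lease
	// (field name per this KEP; verify against your client-go version).
	if lease.Spec.Strategy != nil {
		fmt.Println("strategy:", *lease.Spec.Strategy)
	}
}
```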
@@ -1040,13 +1079,10 @@ and operation of this feature.
 Recall that end users cannot usually observe component logs or access metrics.
 -->

-- [ ] Events
-  - Event Reason:
-- [ ] API .status
-  - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+- LeaseCandidate objects will exist for leader-elected components, and their
+  `RenewTime` and `PingTime` fields will be recent (within 30 minutes).
+- Lease objects for leader-elected components will have a holder assigned and
+  will be actively renewed.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

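
A minimal sketch of the "is it working" checks above: list LeaseCandidate objects and flag stale renew times. It assumes a client-go release that ships the `coordination.k8s.io/v1alpha2` client and the field names used by that API version, so treat the field names as assumptions to verify against the version your cluster serves.

```go
// Sketch: listing LeaseCandidates and flagging candidates whose RenewTime is stale.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, _ := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	clientset := kubernetes.NewForConfigOrDie(config)

	candidates, err := clientset.CoordinationV1alpha2().
		LeaseCandidates("kube-system").
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	staleAfter := 30 * time.Minute // threshold suggested by the KEP text above
	for _, c := range candidates.Items {
		stale := c.Spec.RenewTime == nil || time.Since(c.Spec.RenewTime.Time) > staleAfter
		// BinaryVersion/EmulationVersion/LeaseName field names are assumptions
		// based on the v1alpha2 API referenced in this KEP.
		fmt.Printf("candidate %s (lease %s): binary=%s emulation=%s stale=%v\n",
			c.Name, c.Spec.LeaseName, c.Spec.BinaryVersion, c.Spec.EmulationVersion, stale)
	}
}
```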
@@ -1065,18 +1101,22 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->

+When leader-elected components are in the cluster, the leader must be selected
+in a timely manner and propagated via the Lease object. The lease must be
+actively renewed.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 <!--
 Pick one more of these and delete the rest.
 -->

-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] Metrics
+  - Metric name: `apiserver_coordinated_leader_election_leader_changes{name="<component>"}`
+  - Metric name: `apiserver_coordinated_leader_election_leader_preemptions{name="<component>"}`
+  - Metric name: `apiserver_coordinated_leader_election_failures_total`
+  - Metric name: `apiserver_coordinated_leader_election_skew_preventions_total`
+  - Components exposing the metrics: kube-apiserver
+
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

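
The metrics above are exposed by kube-apiserver on its `/metrics` endpoint; a small sketch of pulling and filtering them via the clientset's REST client (only the metric names come from the KEP text, everything else is illustrative):

```go
// Sketch: scrape kube-apiserver /metrics and print coordinated leader election series.
package main

import (
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, _ := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	clientset := kubernetes.NewForConfigOrDie(config)

	raw, err := clientset.CoreV1().RESTClient().
		Get().AbsPath("/metrics").DoRaw(context.Background())
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(raw), "\n") {
		if strings.Contains(line, "coordinated_leader_election") {
			fmt.Println(line) // e.g. leader_changes, failures_total, skew_preventions_total
		}
	}
}
```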
@@ -1085,6 +1125,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
 implementation difficulties, etc.).
 -->

+n/a.
+
 ### Dependencies

 <!--
@@ -1108,6 +1150,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->

+No.
+
 ### Scalability

 <!--
@@ -1135,6 +1179,17 @@ Focusing mostly on:
 heartbeats, leader election, etc.)
 -->

+Yes.
+
+- API call type: PUT
+- Estimated throughput: steady state is 3 requests per leader-elected component
+  every 30 minutes to renew the LeaseCandidate. If there is churn in the control
+  plane, an extra 2N requests are performed on every change per leader-elected
+  component, where N is the number of available control plane instances. The
+  number is 2N because N requests are sent by the apiserver to ping all
+  candidates, and each request should be acknowledged by the client.
+- Watch on LeaseCandidate resources
+
 ###### Will enabling / using this feature result in introducing new API types?

 <!--
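
To make the estimate above concrete, here is a back-of-envelope sketch with assumed cluster sizes (3 control plane instances, kube-controller-manager and kube-scheduler as the leader-elected components); the per-component renewal and 2N figures come from the text, the cluster sizes and the per-candidate interpretation do not:

```go
// Back-of-envelope illustration of the steady-state and churn request estimates.
package main

import "fmt"

func main() {
	const (
		controlPlanes    = 3 // N: assumed number of control plane instances
		componentTypes   = 2 // e.g. kube-controller-manager and kube-scheduler
		steadyReqPerCand = 3 // requests per LeaseCandidate per period (from KEP text)
		periodMinutes    = 30
	)
	// One LeaseCandidate per leader-elected component per control plane instance.
	candidates := controlPlanes * componentTypes
	steadyPerPeriod := candidates * steadyReqPerCand
	fmt.Printf("steady state: %d requests per %d minutes (%.2f req/min)\n",
		steadyPerPeriod, periodMinutes, float64(steadyPerPeriod)/periodMinutes)

	// On control plane churn: an extra 2N requests per change per leader-elected
	// component type (N apiserver pings + N acks from candidates).
	perChange := 2 * controlPlanes * componentTypes
	fmt.Printf("per churn event: %d extra requests\n", perChange)
}
```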
@@ -1144,6 +1199,9 @@ Describe them, providing:
 - Supported number of objects per namespace (for namespace-scoped objects)
 -->

+- coordination.k8s.io/LeaseCandidate
+  - One candidate will exist for each leader-elected component on each control plane. The total count is `# leader elected components` * `# control plane instances`.
+
 ###### Will enabling / using this feature result in any new calls to the cloud provider?

 <!--
@@ -1152,6 +1210,8 @@ Describe them, providing:
 - Estimated increase:
 -->

+No.
+
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

 <!--
@@ -1161,6 +1221,8 @@ Describe them, providing:
 - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->

+An additional `Strategy` field will be populated on all Leases managed by CLE. This is a string enum.
+
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

 <!--
@@ -1172,6 +1234,8 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->

+No.
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

 <!--
@@ -1184,6 +1248,8 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->

+No.
+
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

 <!--
@@ -1196,6 +1262,8 @@ Are there any tests that were run/should be run to understand performance charac
 and validate the declared limits?
 -->

+This is a control plane feature and does not affect nodes.
+
 ### Troubleshooting

 <!--
@@ -1211,6 +1279,15 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+If the API server becomes unavailable, CLE cannot function, as it is built on
+top of the API server. It cannot monitor LeaseCandidates, update Leases, or
+elect new leaders. Existing leaders will continue to function until their Leases
+expire, but no new leaders will be elected until the API server recovers.
+
+If etcd is unavailable, similar issues arise. The underlying lease mechanism
+relies on etcd for storage and coordination. Without etcd, Leases cannot be
+created, renewed, or monitored.
+
 ###### What are other known failure modes?

 <!--
@@ -1226,8 +1303,21 @@ For each of them, fill in the following information by copying the below templat
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->

+- Leader election controller fails to elect a leader
+  - Detection: the metric `apiserver_coordinated_leader_election_failures_total` increases and no leader is present in the Lease object.
+  - Mitigations: operators can disable the feature gate.
+  - Diagnostics: check kube-apiserver logs for messages about failing to elect the leader. Look at the Lease object's renewal times and holder, along with the LeaseCandidate objects for the particular component.
+  - Testing: an integration test exists that blocks write access for the CLE controller and ensures that another controller takes over.
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?

+Check whether the CLE controller is operating properly, check that the API
+server is not overloaded, and in the worst case disable the feature by
+explicitly setting the feature gate to false. This information can be found in
+the controller and API server logs (`kube-apiserver.log`). Additionally, looking
+through the `lease` and `leasecandidate` objects will provide insight into
+whether the leases and candidates are renewing properly.
+
 ## Implementation History

 <!--

keps/sig-api-machinery/4355-coordinated-leader-election/kep.yaml

Lines changed: 4 additions & 2 deletions
@@ -15,12 +15,13 @@ approvers:
   - "@sttts"
 see-also:
   - "keps/sig-api-machinery/1965-kube-apiserver-identity"
-stage: alpha
-latest-milestone: "v1.31"
+stage: beta
+latest-milestone: "v1.33"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.31"
+  beta: "v1.33"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
@@ -29,6 +30,7 @@ feature-gates:
     components:
       - kube-apiserver
      - kube-controller-manager
+      - kube-scheduler
 disable-supported: true

 # The following PRR answers are required at beta release
