@@ -71,13 +71,16 @@ SIG Architecture for cross-cutting KEPs).
  - [Coordinated Election Controller](#coordinated-election-controller)
  - [Coordinated Lease Lock](#coordinated-lease-lock)
  - [Enabling on a component](#enabling-on-a-component)
+  - [Migrations](#migrations)
  - [Comparison of leader election](#comparison-of-leader-election)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
-    - [Risk: This breaks leader election in some super subtle way](#risk-this-breaks-leader-election-in-some-super-subtle-way)
+    - [Risk: Amount of writes performed by leader election increases substantially](#risk-amount-of-writes-performed-by-leader-election-increases-substantially)
+    - [Risk: leases watches increases apiserver load substantially](#risk-leases-watches-increases-apiserver-load-substantially)
+    - [Risk: We have to "start over" and build confidence in a new leader election algorithm](#risk-we-have-to-start-over-and-build-confidence-in-a-new-leader-election-algorithm)
    - [Risk: How is the election controller elected?](#risk-how-is-the-election-controller-elected)
    - [Risk: What if the election controller fails to elect a leader?](#risk-what-if-the-election-controller-fails-to-elect-a-leader)
- [Design Details](#design-details)
@@ -100,6 +103,8 @@ SIG Architecture for cross-cutting KEPs).
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
+  - [Component instances pick a leader without a coordinator](#component-instances-pick-a-leader-without-a-coordinator)
+- [Future Work](#future-work)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->
@@ -309,6 +314,36 @@ flowchart TD
 Components with a `--leader-elect-resource-lock` flag (kube-controller-manager,
 kube-scheduler) will accept `coordinatedleases` as a resource lock type.

+### Migrations
+
+So long as the API server is running a coordinated election controller, it is
+safe to migrate a component directly from Lease Based Leader Election to
+Coordinated Leader Election (or vice versa).
+
+During the upgrade, a mix of components will be running both election
+approaches. When the leader lease expires, there are two possibilities:
+
+- A controller instance using Lease Based Leader Election claims the leader
+  lease
+- The coordinated election controller picks a leader from the components that
+  have written identity leases, and claims the lease on the leader's behalf
+
+Both possibilities have acceptable outcomes during the migration: a component
+is elected leader and, once elected, remains leader so long as it keeps the
+lease renewed. The elected leader might not be the leader that Coordinated
+Leader Election would pick, but this is no worse than how leader election
+worked before the upgrade, and once the upgrade is complete, Coordinated
+Leader Election works as intended.
+
+One refinement could make migrations slightly cleaner: Coordinated Leader
+Election could add a `coordination.k8s.io/elected-by: leader-election-controller`
+annotation to any leases that it claims, and only mark leases as "end-of-term"
+when that annotation is present. Lease Based Leader Election would ignore
+"end-of-term" annotations anyway, so this isn't strictly needed, but it would
+reduce writes from the coordinated election controller to leases that were
+claimed by component instances not using Coordinated Leader Election.
+
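The annotation refinement described above can be sketched as follows. This is only an illustration of the check, not the KEP's implementation: the annotation key and value are taken from the text above, and a plain string map stands in for a Lease object's metadata annotations.

```go
package main

import "fmt"

// Annotation the coordinated election controller would set on leases it
// claims (key and value taken from the proposal text above).
const (
	electedByKey   = "coordination.k8s.io/elected-by"
	electedByValue = "leader-election-controller"
)

// shouldMarkEndOfTerm reports whether the coordinated election controller
// should mark a lease "end-of-term". Leases claimed directly by components
// still using Lease Based Leader Election lack the annotation, so they are
// skipped, avoiding unnecessary writes during a migration.
func shouldMarkEndOfTerm(annotations map[string]string) bool {
	return annotations[electedByKey] == electedByValue
}

func main() {
	coordinated := map[string]string{electedByKey: electedByValue}
	legacy := map[string]string{} // lease claimed by a legacy component
	fmt.Println(shouldMarkEndOfTerm(coordinated)) // true
	fmt.Println(shouldMarkEndOfTerm(legacy))      // false
}
```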
 ### Comparison of leader election

| | Lease Based Leader Election | Coordinated Leader Election |
@@ -359,12 +394,45 @@ This might be a good place to talk about core concepts and how they relate.
 ### Risks and Mitigations

-#### Risk: This breaks leader election in some super subtle way
+#### Risk: Amount of writes performed by leader election increases substantially
+
+This enhancement introduces an identity lease for each instance of each
+component.
+
+Example:
+
+- HA cluster with 3 control plane nodes
+- 3 elected components (kube-controller-manager, kube-scheduler,
+  cloud-controller-manager) per control plane node
+- 9 identity leases are created and renewed by the components
+
+Introducing this feature is roughly equivalent to adding the same lease load
+as adding 9 nodes to a Kubernetes cluster.
+
+The [API Server Identity enhancement](../1965-kube-apiserver-identity) also
+introduces similar leases. For comparison, in an HA cluster with 3 control
+plane nodes, API Server Identity adds 3 leases.
+
+This risk can be mitigated by scale testing and, if needed, by extending the
+lease duration and renewal times to reduce writes/s.
+
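To put that mitigation in concrete terms, the renewal write rate scales linearly with the number of identity leases and inversely with the renewal interval. A rough back-of-the-envelope calculation (the renewal intervals here are assumptions for illustration, not values specified by this KEP):

```go
package main

import "fmt"

// identityLeaseWritesPerSecond estimates steady-state renewal writes added
// by identity leases: one write per lease per renewal interval.
func identityLeaseWritesPerSecond(nodes, electedComponentsPerNode int, renewIntervalSeconds float64) float64 {
	leases := nodes * electedComponentsPerNode
	return float64(leases) / renewIntervalSeconds
}

func main() {
	// 3 control plane nodes x 3 elected components = 9 identity leases.
	fmt.Printf("%.2f writes/s\n", identityLeaseWritesPerSecond(3, 3, 2)) // 4.50 writes/s
	// Doubling the renewal interval halves the write rate:
	fmt.Printf("%.2f writes/s\n", identityLeaseWritesPerSecond(3, 3, 4)) // 2.25 writes/s
}
```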
+#### Risk: leases watches increases apiserver load substantially
+
+The [Unknown Version Interoperability Proxy (UVIP) enhancement](../4020-unknown-version-interoperability-proxy)
+also adds lease watches on [API Server Identity](../1965-kube-apiserver-identity)
+leases in the kube-system namespace. This enhancement would increase the
+expected number of lease resources being watched from ~3 (for UVIP) to ~12.
+
+#### Risk: We have to "start over" and build confidence in a new leader election algorithm
+
+We've built confidence in the existing leasing algorithm through an investment
+of engineering effort, and in core hours testing it and running it in
+production.

-The goal of this proposal is to minimize this risk by:
+Changing the algorithm "resets the clock" and forces us to rebuild confidence
+in the new algorithm.

--Continuing to renew leases in the same was as before and to never change leaders until a lease expires
--Fallback to direct lease claiming by components if a leader is not elected
+The goal of this proposal is to minimize this risk by reusing as much of the
+existing lease algorithm as possible:
+
+- Renew leases in exactly the same way as before
+- Leases can never be claimed by another leader until a lease expires

 #### Risk: How is the election controller elected?
@@ -373,7 +441,9 @@ Requires the election controller make careful use concurrency control primitives

 #### Risk: What if the election controller fails to elect a leader?

-Fallback to letting component instances self elect after a longer delay.
+Fall back to letting component instances claim the lease directly, after a
+longer delay that gives the coordinated election controller an opportunity to
+elect a leader before components resort to claiming leases themselves.

 ## Design Details
@@ -951,11 +1021,61 @@ Why should this KEP _not_ be implemented?

 ## Alternatives

-<!--
-What other approaches did you consider, and why did you rule them out? These do
-not need to be as detailed as the proposal, but should include enough
-information to express the idea and why it was not acceptable.
--->
+### Component instances pick a leader without a coordinator
+
+The idea of this alternative is to elect a leader from a set of candidates
+without a coordinated election controller.
+
+Some rough ideas of how this might be done:
+
+1. A candidate is picked at random to be an election coordinator, and the
+   coordinator picks the leader:
+   - Components race to claim the lease
+   - If a component claims the lease, the first thing it does is check whether
+     there is a better leader
+   - If it finds a better leader, it assigns the lease to that component
+     instead of itself
+
+   Pros:
+
+   - No coordinated election controller
+
+   Cons:
+
+   - All component instances must watch the identity leases
+   - All components must have the code to decide which component is the best
+     leader
+
+2. The candidates agree on the leader collectively:
+   - Leases have "election" and "term" states
+   - Leases are first created in the "election" state
+   - While in the "election" state, candidates self-nominate by updating the
+     lease with their identity and version information. Candidates only need
+     to self-nominate if they are a better candidate than the candidate
+     information already written to the lease.
+   - When the "election" timeout expires, the best candidate becomes the leader
+   - The leader sets the state to "term" and starts renewing the lease
+   - If the lease expires, it goes back to the "election" state
+
+   Pros:
+
+   - No coordinated election controller
+   - No identity leases
+
+   Cons:
+
+   - The complex election algorithm is distributed as a client-go library. A
+     bug in the algorithm cannot be fixed by upgrading Kubernetes alone; every
+     controller in the ecosystem with the bug must upgrade client-go and cut a
+     release to be fixed.
+   - More difficult to change/customize the criteria for which candidate is
+     best.
+
+If we decide in the future to shard controllers and wish to leverage
+coordinated leader election to balance shards, it's much easier to introduce
+the change in a controller in the apiserver than in client-go library code
+distributed to elected controllers.
+
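The second alternative's lease state machine could be sketched as follows. This is illustrative only (the alternative was not chosen), and the "higher version wins" comparison is an assumed stand-in for whatever criterion would actually rank candidates.

```go
package main

import "fmt"

type leaseState string

const (
	stateElection leaseState = "election" // candidates self-nominate
	stateTerm     leaseState = "term"     // a leader holds and renews the lease
)

type candidate struct {
	identity string
	version  int // higher version wins (assumed ranking criterion)
}

// electionLease models alternative 2: a lease created in the "election"
// state that collects self-nominations, then moves to "term" when the
// election window closes.
type electionLease struct {
	state leaseState
	best  candidate
}

// nominate records a candidate only if the lease is still in the election
// state and the candidate is better than the one already written.
func (l *electionLease) nominate(c candidate) {
	if l.state == stateElection && c.version > l.best.version {
		l.best = c
	}
}

// closeElection ends the election window: the best candidate becomes the
// leader and the lease enters its term.
func (l *electionLease) closeElection() string {
	l.state = stateTerm
	return l.best.identity
}

func main() {
	l := &electionLease{state: stateElection}
	l.nominate(candidate{"kcm-old", 1})
	l.nominate(candidate{"kcm-new", 2})
	fmt.Println(l.closeElection()) // kcm-new
}
```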
+## Future Work
+
+- Controller sharding could leverage coordinated leader election to load
+  balance controllers across apiservers.
+- Optimizations for graceful and performant failover can be built on this
+  enhancement.

 ## Infrastructure Needed (Optional)