Commit e61905f

Flesh out migrations and alternatives
1 parent 8b0d0c7 commit e61905f

1 file changed

+131
-11
lines changed
  • keps/sig-api-machinery/4355-coordinated-leader-election

keps/sig-api-machinery/4355-coordinated-leader-election/README.md

Lines changed: 131 additions & 11 deletions
@@ -71,13 +71,16 @@ SIG Architecture for cross-cutting KEPs).
7171
- [Coordinated Election Controller](#coordinated-election-controller)
7272
- [Coordinated Lease Lock](#coordinated-lease-lock)
7373
- [Enabling on a component](#enabling-on-a-component)
74+
- [Migrations](#migrations)
7475
- [Comparison of leader election](#comparison-of-leader-election)
7576
- [User Stories (Optional)](#user-stories-optional)
7677
- [Story 1](#story-1)
7778
- [Story 2](#story-2)
7879
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
7980
- [Risks and Mitigations](#risks-and-mitigations)
80-
- [Risk: This breaks leader election in some super subtle way](#risk-this-breaks-leader-election-in-some-super-subtle-way)
81+
- [Risk: Amount of writes performed by leader election increases substantially](#risk-amount-of-writes-performed-by-leader-election-increases-substantially)
82+
- [Risk: leases watches increases apiserver load substantially](#risk-leases-watches-increases-apiserver-load-substantially)
83+
- [Risk: We have to "start over" and build confidence in a new leader election algorithm](#risk-we-have-to-start-over-and-build-confidence-in-a-new-leader-election-algorithm)
8184
- [Risk: How is the election controller elected?](#risk-how-is-the-election-controller-elected)
8285
- [Risk: What if the election controller fails to elect a leader?](#risk-what-if-the-election-controller-fails-to-elect-a-leader)
8386
- [Design Details](#design-details)
@@ -100,6 +103,8 @@ SIG Architecture for cross-cutting KEPs).
100103
- [Implementation History](#implementation-history)
101104
- [Drawbacks](#drawbacks)
102105
- [Alternatives](#alternatives)
106+
- [Component instances pick a leader without a coordinator](#component-instances-pick-a-leader-without-a-coordinator)
107+
- [Future Work](#future-work)
103108
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
104109
<!-- /toc -->
105110

@@ -309,6 +314,36 @@ flowchart TD
309314
Components with a `--leader-elect-resource-lock` flag (kube-controller-manager,
310315
kube-scheduler) will accept `coordinatedleases` as a resource lock type.
311316

317+
### Migrations
318+
319+
So long as the API server is running a coordinated election controller, it is
320+
safe to directly migrate a component from Lease Based Leader Election to
321+
Coordinated Leader Election (or vice versa).
322+
323+
During the upgrade, a mix of components will be running both election
324+
approaches. When the leader lease expires, there are a couple of possibilities:
325+
326+
- A controller instance using Lease Based Leader Election claims the leader lease
327+
- The coordinated election controller picks a leader, from the components that
328+
have written identity leases, and claims the lease on the leader's behalf
329+
330+
Both possibilities have acceptable outcomes during the migration-- a component
331+
is elected leader, and once elected, remains leader so long as it keeps the
332+
lease renewed. The elected leader might not be the leader that Coordinated
333+
Leader Election would pick, but this is no worse than how leader election works
334+
before the upgrade, and once the upgrade is complete, Coordinated Leader
335+
Election works as intended.
336+
337+
There is one thing that could make migrations slightly cleaner: Coordinated
338+
Leader Election could add a
339+
`coordination.k8s.io/elected-by: leader-election-controller` annotation to any
340+
leases that it claims. It could then check for this annotation and only mark
341+
leases as "end-of-term" if that annotation is present. Lease Based Leader Election
342+
would ignore "end-of-term" annotations anyway, so this isn't strictly needed,
343+
but it would reduce writes from the coordinated election controller to leases that
344+
were claimed by component instances not using Coordinated Leader Election.
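
The annotation check described above could be sketched as follows. This is a minimal illustration, not the KEP's implementation: the `lease` struct is a hypothetical, simplified stand-in for a `coordination.k8s.io` Lease, and `claimFor`, `claimedByCoordinator`, and `shouldMarkEndOfTerm` are names invented for this example. Only the annotation key and value come from the text.

```go
package main

import "fmt"

// Annotation key and value as proposed in the text above.
const (
	electedByAnnotation = "coordination.k8s.io/elected-by"
	electedByController = "leader-election-controller"
)

// lease is a hypothetical, simplified stand-in for a Lease object;
// only the fields needed for this illustration are modeled.
type lease struct {
	Holder      string
	Annotations map[string]string
}

// claimFor returns a copy of l claimed on behalf of holder, with the
// elected-by annotation set by the coordinated election controller.
func claimFor(l lease, holder string) lease {
	annotations := make(map[string]string, len(l.Annotations)+1)
	for k, v := range l.Annotations {
		annotations[k] = v
	}
	annotations[electedByAnnotation] = electedByController
	return lease{Holder: holder, Annotations: annotations}
}

// claimedByCoordinator reports whether the lease carries the annotation.
// Reading from a nil map is safe in Go and yields the empty string.
func claimedByCoordinator(l lease) bool {
	return l.Annotations[electedByAnnotation] == electedByController
}

// shouldMarkEndOfTerm skips the "end-of-term" write for leases that were
// claimed directly by a component using Lease Based Leader Election.
func shouldMarkEndOfTerm(l lease, betterCandidateExists bool) bool {
	return betterCandidateExists && claimedByCoordinator(l)
}

func main() {
	coordinated := claimFor(lease{}, "kube-scheduler-a")
	direct := lease{Holder: "kube-scheduler-b"}
	fmt.Println(shouldMarkEndOfTerm(coordinated, true))
	fmt.Println(shouldMarkEndOfTerm(direct, true))
}
```

The check only saves writes; as noted above, a component using Lease Based Leader Election would ignore the "end-of-term" marker either way.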
345+
346+
312347
### Comparison of leader election
313348

314349
| | Lease Based Leader Election | Coordinated Leader Election |
@@ -359,12 +394,45 @@ This might be a good place to talk about core concepts and how they relate.
359394

360395
### Risks and Mitigations
361396

362-
#### Risk: This breaks leader election in some super subtle way
397+
#### Risk: Amount of writes performed by leader election increases substantially
398+
399+
This enhancement introduces an identity lease for each instance of each component.
400+
401+
Example:
402+
403+
- HA cluster with 3 control plane nodes
404+
- 3 elected components (kube-controller-manager, kube-scheduler, cloud-controller-manager) per control plane node
405+
- 9 identity leases are created and renewed by the components
406+
407+
Introducing this feature adds roughly the same lease load as
408+
adding 9 nodes to a Kubernetes cluster.
409+
410+
The [API Server Identity enhancement](../1965-kube-apiserver-identity) also
411+
introduces similar leases. For comparison, in a HA cluster with 3 control plane
412+
nodes, API Server Identity adds 3 leases.
413+
414+
This risk can be mitigated by scale testing and, if needed, extending the lease
415+
duration and renewal times to reduce writes/s.
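
As a rough illustration of the arithmetic in the example above, the sketch below counts identity leases and estimates the renewal write rate. The renewal intervals are hypothetical values chosen for illustration, not figures from this KEP, and the function names are invented for this example.

```go
package main

import (
	"fmt"
	"time"
)

// identityLeases returns the number of identity leases: one per instance
// of each elected component on each control plane node.
func identityLeases(controlPlaneNodes, electedComponentsPerNode int) int {
	return controlPlaneNodes * electedComponentsPerNode
}

// writesPerSecond estimates steady-state renewal writes, assuming every
// lease is renewed exactly once per renewInterval.
func writesPerSecond(leases int, renewInterval time.Duration) float64 {
	if renewInterval <= 0 {
		return 0
	}
	return float64(leases) / renewInterval.Seconds()
}

func main() {
	leases := identityLeases(3, 3) // 3 nodes, 3 elected components each
	fmt.Println("identity leases:", leases)

	// Hypothetical renewal intervals; a longer interval means fewer writes.
	for _, interval := range []time.Duration{10 * time.Second, 20 * time.Second} {
		fmt.Printf("renew every %v: %.2f writes/s\n",
			interval, writesPerSecond(leases, interval))
	}
}
```

Under this simple model, doubling the renewal interval halves the write rate, which is the lever the mitigation above refers to.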
416+
417+
#### Risk: leases watches increases apiserver load substantially
418+
419+
The [Unknown Version Interoperability Proxy (UVIP) enhancement](../4020-unknown-version-interoperability-proxy) also adds
420+
lease watches on [API Server Identity](../1965-kube-apiserver-identity) leases in the kube-system namespace.
421+
This enhancement would increase the expected number of resources being watched from ~3 (for UVIP) to ~12.
422+
423+
#### Risk: We have to "start over" and build confidence in a new leader election algorithm
424+
425+
We've built confidence in the existing leasing algorithm, through an investment
426+
of engineering effort, and in core hours testing it and running it in production.
363427

364-
The goal of this proposal is to minimize this risk by:
428+
Changing the algorithm "resets the clock" and forces us to rebuild confidence in
429+
the new algorithm.
365430

366-
- Continuing to renew leases in the same was as before and to never change leaders until a lease expires
367-
- Fallback to direct lease claiming by components if a leader is not elected
431+
The goal of this proposal is to minimize this risk by reusing as much of the
432+
existing lease algorithm as possible:
433+
434+
- Renew leases in exactly the same way as before
435+
- Leases can never be claimed by another leader until a lease expires
368436

369437
#### Risk: How is the election controller elected?
370438

@@ -373,7 +441,9 @@ Requires the election controller make careful use concurrency control primitives
373441

374442
#### Risk: What if the election controller fails to elect a leader?
375443

376-
Fallback to letting component instances self elect after a longer delay.
444+
Fall back to letting component instances claim the lease directly, after a longer
445+
delay, to give the coordinated election controller an opportunity to elect
446+
before resorting to the fallback.
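
The fallback timing could look roughly like the sketch below. The KEP text does not specify the delay, so `fallbackDelay` and the function name `mayClaimDirectly` are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// mayClaimDirectly reports whether a component instance may fall back to
// claiming the leader lease itself. sinceExpiry is how long ago the leader
// lease expired (negative if it has not expired yet). fallbackDelay is a
// hypothetical grace period that gives the coordinated election controller
// the first opportunity to elect a leader.
func mayClaimDirectly(sinceExpiry, fallbackDelay time.Duration) bool {
	return sinceExpiry >= fallbackDelay
}

func main() {
	const fallbackDelay = 10 * time.Second // illustrative value only

	for _, since := range []time.Duration{-2 * time.Second, 3 * time.Second, 12 * time.Second} {
		fmt.Printf("expired %v ago: may claim directly = %v\n",
			since, mayClaimDirectly(since, fallbackDelay))
	}
}
```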
377447

378448
## Design Details
379449

@@ -951,11 +1021,61 @@ Why should this KEP _not_ be implemented?
9511021

9521022
## Alternatives
9531023

954-
<!--
955-
What other approaches did you consider, and why did you rule them out? These do
956-
not need to be as detailed as the proposal, but should include enough
957-
information to express the idea and why it was not acceptable.
958-
-->
1024+
### Component instances pick a leader without a coordinator
1025+
1026+
The idea of this alternative is to elect a leader from a set of candidates
1027+
without a coordinated election controller.
1028+
1029+
Some rough ideas of how this might be done:
1030+
1031+
1. A candidate is picked at random to be an election coordinator, and the
1032+
coordinator picks the leader:
1033+
- Components race to claim the lease
1034+
- If a component claims the lease, the first thing it does is check to see if there is a better leader
1035+
- If it finds a better leader, it assigns the lease to that component instead of itself
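
A minimal sketch of the hand-off step in this alternative, under stated assumptions: the `candidate` type and its `Priority` field are hypothetical stand-ins for whatever criteria decide which instance is the better leader, and `assignLease` is a name invented for this example.

```go
package main

import "fmt"

// candidate is a hypothetical view of a component instance. Priority is an
// assumed ranking where a higher value means a better leader.
type candidate struct {
	Identity string
	Priority int
}

// assignLease is run by the instance that won the race for the lease. It
// returns the identity that should hold the lease: the claimer itself,
// unless a strictly better candidate exists among the others.
func assignLease(claimer candidate, others []candidate) string {
	best := claimer
	for _, c := range others {
		if c.Priority > best.Priority {
			best = c
		}
	}
	return best.Identity
}

func main() {
	claimer := candidate{Identity: "kcm-a", Priority: 1}
	others := []candidate{
		{Identity: "kcm-b", Priority: 3},
		{Identity: "kcm-c", Priority: 2},
	}
	fmt.Println("lease assigned to:", assignLease(claimer, others))
}
```

On ties the claimer keeps the lease, which avoids an unnecessary hand-off write.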
1036+
1037+
Pros:
1038+
1039+
- No coordinated election controller
1040+
Cons:
1041+
1042+
- All component instances must watch the identity leases
1043+
- All components must have the code to decide which component is the best leader
1044+
1045+
2. The candidates agree on the leader collectively
1046+
- Leases have "Election" and "Term" states
1047+
- Leases are first created in the "election" state.
1048+
- While in the "election" state, candidates self-nominate by updating the lease
1049+
with their identity and version information. Candidates only need to
1050+
self-nominate if they are a better candidate than the candidate information
1051+
already written to the lease.
1052+
- When the "election" timeout expires, the best candidate becomes the leader
1053+
- The leader sets the state to "term" and starts renewing the lease
1054+
- If the lease expires, it goes back to the "election" state
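
The "election" and "term" states above can be sketched as a small state machine. This is an illustration under assumptions, not a proposed API: the `lease` and `candidate` types, the `Priority` ranking, and the method names are all hypothetical, and real candidates would perform these steps through conflict-checked writes to a shared lease object.

```go
package main

import "fmt"

type state string

const (
	stateElection state = "election"
	stateTerm     state = "term"
)

// candidate is hypothetical; Priority stands in for the identity and
// version information a candidate writes, with higher meaning better.
type candidate struct {
	Identity string
	Priority int
}

// lease is a simplified in-memory model of the shared lease.
type lease struct {
	State   state
	Nominee *candidate // best self-nomination written so far
	Holder  string
}

// nominate records c only during an election and only if c is strictly
// better than the current nominee. It reports whether a write happened.
func (l *lease) nominate(c candidate) bool {
	if l.State != stateElection {
		return false
	}
	if l.Nominee != nil && c.Priority <= l.Nominee.Priority {
		return false
	}
	l.Nominee = &c
	return true
}

// closeElection runs when the election timeout expires: the best nominee
// becomes the leader and the lease enters the "term" state.
func (l *lease) closeElection() bool {
	if l.State != stateElection || l.Nominee == nil {
		return false
	}
	l.Holder = l.Nominee.Identity
	l.State = stateTerm
	return true
}

// expire models a term lease that was not renewed in time.
func (l *lease) expire() {
	l.State = stateElection
	l.Nominee = nil
	l.Holder = ""
}

func main() {
	l := &lease{State: stateElection}
	l.nominate(candidate{Identity: "a", Priority: 1})
	l.nominate(candidate{Identity: "b", Priority: 2})
	l.closeElection()
	fmt.Println(l.State, l.Holder)
	l.expire()
	fmt.Println(l.State)
}
```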
1055+
1056+
Pros:
1057+
1058+
- No coordinated election controller
1059+
- No identity leases
1060+
1061+
Cons:
1062+
1063+
- Complex election algorithm is distributed as a client-go library. A bug in the
1064+
algorithm cannot be fixed by only upgrading Kubernetes; all controllers
1065+
in the ecosystem with the bug must upgrade client-go and release to be fixed.
1066+
- More difficult to change/customize the criteria for which candidate is best.
1067+
1068+
If we decide in the future to shard controllers and wish to leverage coordinated
1069+
leader election to balance shards, it's much easier to introduce the change in a
1070+
controller in the apiserver than in client-go library code distributed to
1071+
elected controllers.
1072+
1073+
## Future Work
1074+
1075+
- Controller sharding could leverage coordinated leader election to load balance
1076+
controllers against apiservers.
1077+
- Optimizations for graceful and performant failover can be built on this
1078+
enhancement.
9591079

9601080
## Infrastructure Needed (Optional)
9611081
