Commit 570bf24

Update README.md with content about priority base CLE
1 parent 3a4c03f commit 570bf24

keps/sig-api-machinery/4355-coordinated-leader-election/README.md

Lines changed: 96 additions & 0 deletions
@@ -76,6 +76,17 @@ SIG Architecture for cross-cutting KEPs).
- [Creating a new LeaseConfiguration resource](#creating-a-new-leaseconfiguration-resource)
- [YAML/CLI configuration on the kube-apiserver](#yamlcli-configuration-on-the-kube-apiserver)
- [Strategy propagated from LeaseCandidate](#strategy-propagated-from-leasecandidate)
- [Priority-based Coordinated Leader Election](#priority-based-coordinated-leader-election)
- [LeaseCandidateSpec Update](#leasecandidatespec-update)
- [Behavior of the Priority Field](#behavior-of-the-priority-field)
- [Scenario Breakdown for priority based coordination leader election](#scenario-breakdown-for-priority-based-coordination-leader-election)
- [1. Initial State](#1-initial-state)
- [2. During Upgrade](#2-during-upgrade)
- [3. Priority Setting](#3-priority-setting)
- [4.1. Upgrade Completion](#41-upgrade-completion)
- [4.2 Update rollback](#42-update-rollback)
- [5. Priority Persistence](#5-priority-persistence)
- [Consideration for Stale Priorities](#consideration-for-stale-priorities)
- [Enabling on a component](#enabling-on-a-component)
- [Migrations](#migrations)
- [API](#api)
@@ -440,6 +451,90 @@ set of candidates and selected strategy is the same as before.
The obvious drawback is the need for a consensus protocol and extra information
in the `LeaseCandidate` object that may be unnecessary.

### Priority-based Coordinated Leader Election
To enhance control over leader assignment beyond existing CLE strategies like `OldestEmulatedVersion`, we propose adding an optional `Priority` field (unset by default; a higher value means higher priority) to `LeaseCandidateSpec`.

This field allows operators to explicitly designate a preferred leader.
The CLE system will select the candidate with the highest non-zero `Priority`. If multiple candidates share the same highest priority, the existing `v1.CoordinatedLeaseStrategy` acts as a tie-breaker. If no candidate has a priority set, the system falls back to the existing `v1.CoordinatedLeaseStrategy`.

This provides granular, temporary control without replacing the primary CLE mechanism.

#### LeaseCandidateSpec Update
A new `Priority` field is added to `LeaseCandidateSpec`:
```go
// LeaseCandidateSpec is a specification of a Lease.
type LeaseCandidateSpec struct {
	// ...

	// Priority is a new, optional field. A higher value means a higher
	// priority. When set, the value must be > 0.
	Priority int32 `json:"priority,omitempty" protobuf:"varint,7,opt,name=priority"`
}
```
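
As a rough illustration of what "unset by default" means for this field, the sketch below uses a trimmed-down, hypothetical copy of the spec. The lowercase type name and the `encode` helper are assumptions made for this example, and only `LeaseName` and `Priority` are kept. It shows that, with `omitempty`, a zero `Priority` is left out of the JSON encoding, so an unset priority and a priority of 0 look the same on the wire.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leaseCandidateSpec is a trimmed-down, hypothetical copy of
// LeaseCandidateSpec that keeps only the fields needed for this sketch.
type leaseCandidateSpec struct {
	LeaseName string `json:"leaseName"`
	Priority  int32  `json:"priority,omitempty"`
}

// encode returns the JSON form of the spec (hypothetical helper).
func encode(s leaseCandidateSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Priority left unset: the field is omitted from the output.
	fmt.Println(encode(leaseCandidateSpec{LeaseName: "kube-scheduler"}))
	// Priority set by an operator: the field is serialized.
	fmt.Println(encode(leaseCandidateSpec{LeaseName: "kube-scheduler", Priority: 100}))
}
```

The first call prints `{"leaseName":"kube-scheduler"}` and the second prints `{"leaseName":"kube-scheduler","priority":100}`.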

#### Behavior of the Priority Field
- Priority Value: The `Priority` field is an `int32`. A higher numerical value indicates a higher priority. When set, the value must be greater than 0.
- Selection Logic:
  - If one or more candidates have a `Priority` > 0: The candidate with the numerically highest `Priority` value will be selected as the leader.
  - Tie-Breaking for Equal Highest Priority: If multiple candidates share the same highest non-zero `Priority` value, the selection among these equally prioritized candidates will be resolved using their existing `v1.CoordinatedLeaseStrategy` (e.g., `OldestEmulatedVersion`).
  - If no candidate has a `Priority` set, the leader selection will proceed based purely on the existing `v1.CoordinatedLeaseStrategy`.
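
The selection rules above can be sketched roughly as follows. This is a simplified illustration, not the CLE controller's code: the `candidate` type, the integer `EmulationVersion`, and the `pickByStrategy` stand-in (which mimics `OldestEmulatedVersion` and breaks remaining ties by name) are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a simplified, hypothetical stand-in for a LeaseCandidate.
type candidate struct {
	Name             string
	Priority         int32 // 0 means the priority is unset
	EmulationVersion int   // simplified to an integer for this sketch
}

// pickByStrategy is a hypothetical stand-in for the existing strategy.
// It prefers the lowest emulation version and, as an assumption of this
// sketch only, breaks remaining ties by name.
func pickByStrategy(cs []candidate) string {
	if len(cs) == 0 {
		return ""
	}
	sorted := append([]candidate(nil), cs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].EmulationVersion != sorted[j].EmulationVersion {
			return sorted[i].EmulationVersion < sorted[j].EmulationVersion
		}
		return sorted[i].Name < sorted[j].Name
	})
	return sorted[0].Name
}

// pickLeader applies the priority rule first and uses the strategy
// either as a tie-breaker or as the fallback when no priority is set.
func pickLeader(cs []candidate) string {
	var highest int32
	for _, c := range cs {
		if c.Priority > highest {
			highest = c.Priority
		}
	}
	if highest == 0 {
		// No candidate has a priority: use the strategy alone.
		return pickByStrategy(cs)
	}
	var top []candidate
	for _, c := range cs {
		if c.Priority == highest {
			top = append(top, c)
		}
	}
	// One or more candidates share the highest priority.
	return pickByStrategy(top)
}

func main() {
	cs := []candidate{
		{Name: "C1", Priority: 100, EmulationVersion: 2},
		{Name: "C2", EmulationVersion: 1},
		{Name: "C3", EmulationVersion: 1},
	}
	fmt.Println(pickLeader(cs)) // C1: highest priority wins
	cs[0].Priority = 0
	fmt.Println(pickLeader(cs)) // C2: falls back to the strategy
}
```

Filtering to the highest-priority group before calling the strategy keeps the strategy code unchanged, which matches the intent that priority supplements the primary CLE mechanism without replacing it.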

#### Scenario Breakdown for priority based coordination leader election
The following step-by-step scenarios illustrate priority-based leader election during an upgrade.

##### 1. Initial State
At the beginning, all components (C1, C2, and C3) are running Binary Version 1 and are emulating Version 1.

| Component | Binary Version | Emulation Version | Leader |
|-----------|----------------|-------------------|--------|
| C1 | V1 | V1 | Y |
| C2 | V1 | V1 | |
| C3 | V1 | V1 | |

##### 2. During Upgrade
During the upgrade, C1 and C2 are updated to Binary Version 2, but C3 remains on an earlier version. C2 is momentarily elected as the leader.

| Component | Binary Version | Emulation Version | Leader |
|-----------|----------------|-------------------|--------|
| C1 | V2 | V2 | |
| C2 | V2 | V1 | Y |
| C3 | V2 | V1 | |

##### 3. Priority Setting
The cluster administrator chooses C1 to be the leader by setting its priority to 100.

| Component | Binary Version | Emulation Version | Priority | Leader |
|-----------|----------------|-------------------|----------|--------|
| C1 | V2 | V2 | 100 | Y |
| C2 | V2 | V1 | | |
| C3 | V2 | V1 | | |

##### 4.1. Upgrade Completion
After the upgrade is finished, all components are running Binary Version 2 and are emulating Version 2. C1 remains the leader due to its set priority.

| Component | Binary Version | Emulation Version | Priority | Leader |
|-----------|----------------|-------------------|----------|--------|
| C1 | V2 | V2 | 100 | Y |
| C2 | V2 | V2 | | |
| C3 | V2 | V2 | | |

##### 4.2 Update rollback
Should an issue arise with C1 that requires a rollback, the administrator can unset its priority. CLE can then select C2, which has the oldest emulated version.

| Component | Binary Version | Emulation Version | Priority | Leader |
|-----------|----------------|-------------------|----------|--------|
| C1 | V2 -> V1 | V2 -> V1 | | |
| C2 | V2 | V1 | | Y |
| C3 | V2 | V1 | | |

##### 5. Priority Persistence
Unless the cluster administrator resets the priority, C1 remains the leader. However, when a component is upgraded or downgraded, it may create a new `LeaseCandidate`, causing the priority to reset.

#### Consideration for Stale Priorities
A concern with the priority field is the potential for "stale priorities": a priority that was set temporarily and never cleared. A stale priority could prevent the Coordinated Leader Election (CLE) system from selecting a more appropriate leader.
We considered exposing a Time-To-Live (TTL) for priority in the `LeaseCandidateSpec`, where the CLE system would ignore a priority once its TTL expired. While this directly addresses the "temporary" nature of many priority assignments, we've decided not to include it in this initial phase due to several complexities:
- Implementation and Semantics: Defining the precise data type and behavior for a TTL (e.g., `time.Duration` vs. `time.Time`, resetting logic) adds significant complexity.
- User Rationalization: Adding a third field (`ttl`) to an already multi-faceted leader election logic (strategy + priority) greatly increases the cognitive load for users to understand and manage leader selection effectively.

Therefore, in this initial iteration, managing priority lifecycles will be an operational responsibility, requiring manual clearance or updates. We may revisit TTL or similar automated mechanisms in future iterations after gaining more experience with the priority field.

### Enabling on a component

Components with a `--leader-elect-resource-lock` flag (kube-controller-manager,
@@ -865,6 +960,7 @@ in back-to-back releases.

- Load test Coordinated Leader Election
- Feature is enabled by default
- A tested solution for stale priorities is implemented, either through improved user validation that prevents them or through an automated system that corrects them.

### Upgrade / Downgrade Strategy
