
Leader Election is a distributed systems concept.

👉 In a cluster of distributed servers (like database replicas, job schedulers, or coordination services), you often need one node to act as the “leader” to coordinate work, while others act as followers.

  • The leader decides things (e.g., assigns tasks, handles writes, maintains consistency).
  • The followers replicate the state and can take over if the leader fails.

Why is leader election needed?

  • To avoid conflicts (e.g., two nodes writing at the same time).
  • To ensure consistency and avoid "split-brain" problems.
  • To automatically recover from failures (if the leader crashes, a new one is elected).

Tools commonly used

  • Zookeeper: Provides primitives like ephemeral znodes to implement leader election.
  • etcd (used by Kubernetes): Provides distributed consensus using Raft protocol for leader election.

Example Scenarios

  • Database clusters (e.g., MongoDB, Elasticsearch) → one primary (leader), many replicas.
  • Kubernetes control plane → one leader kube-controller-manager, others standby.
  • Distributed job schedulers → only leader assigns jobs to avoid duplication.

➡️ Trade-off: Leader election introduces some latency and complexity but is critical for high availability and fault tolerance in distributed systems.


Leader election can be done in different ways depending on the system. Two of the most common approaches are:


🔹 1. Zookeeper-based Leader Election

Zookeeper is a distributed coordination service. It doesn’t elect a leader by itself, but it provides the primitives (ephemeral sequential znodes and watches) you can use to build one.

How it works (using Ephemeral Sequential Znodes):

  1. All candidate servers (say S1, S2, S3, S4) connect to Zookeeper.

  2. Each one creates an ephemeral sequential znode under /election.

    • Example:

      /election/node_0001
      /election/node_0002
      /election/node_0003
      
  3. Zookeeper assigns numbers automatically (like tickets).

  4. The server with the lowest sequence number becomes the leader.

  5. If the leader crashes, its ephemeral znode is deleted automatically.

  6. The next lowest node becomes the new leader.

👉 This ensures only one leader at a time and automatic failover.


🔹 2. Consensus Protocol (Raft, Paxos, etc.)

This is used by etcd, Consul, Kubernetes, and modern databases.

How Raft Leader Election Works:

  1. Nodes start in Follower state.
  2. If a follower doesn’t hear from a leader within a timeout, it becomes a Candidate.
  3. The candidate requests votes from other nodes.
  4. If it gets majority votes, it becomes the Leader.
  5. The leader periodically sends heartbeats to followers to prevent new elections.
  6. If the leader fails, the process repeats.
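The election round above boils down to a majority vote, which can be sketched as a toy vote count. This is a simplified illustration, not real Raft — it omits terms, log comparison, and randomized timeouts — and `run_election` / `request_vote` are hypothetical names:

```python
# Toy sketch of one Raft election round: the candidate votes for itself,
# requests votes from peers, and wins only with a strict majority.

def request_vote(voter_alive):
    """A reachable follower grants its vote (no term/log checks here)."""
    return voter_alive

def run_election(candidate, peers):
    """Return True if `candidate` wins a strict majority of the cluster."""
    votes = 1                                   # candidate's self-vote
    votes += sum(request_vote(alive) for alive in peers.values())
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2            # strict majority required

# 5-node cluster: candidate S1 plus 4 peers, two of them unreachable
peers = {"S2": True, "S3": True, "S4": False, "S5": False}
print(run_election("S1", peers))   # 3 of 5 votes -> True, S1 leads
```

The strict-majority rule is what prevents split-brain: two partitions can never both contain a majority of the same cluster, so at most one leader can be elected per term.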

👉 Advantage:

  • Ensures strong consistency (a leader needs a majority vote).
  • Tolerates network partitions (as long as a majority of nodes can still communicate).
  • Fully automatic and robust (no hand-rolled election logic on top of Zookeeper primitives).

✅ Comparison

| Aspect | Zookeeper (Ephemeral Znodes) | Raft/etcd |
| --- | --- | --- |
| Simplicity | Simpler; the app handles leader logic | Built-in consensus protocol |
| Consistency | Needs careful design | Strong consistency (via majority votes) |
| Recovery | Ephemeral znode deletion triggers new leader | Automatic re-election |
| Use Cases | Custom job schedulers, Hadoop, Kafka | Kubernetes, databases, key-value stores |

💡 Example in real systems

  • Kafka: Uses Zookeeper to elect a controller broker (leader); newer versions replace this with the Raft-based KRaft mode.
  • Kubernetes: Uses etcd (Raft-based) for leader election.
  • MongoDB: Replica sets elect a primary using Raft-like consensus.

👉 In system design interviews, if you’re asked about leader election, you can mention:

  • Zookeeper (ephemeral znodes) → simpler, external service.
  • Raft/etcd → more robust, built-in consensus.