Leader Election

Why Distributed Systems Need a Leader

Some operations are unsafe to parallelize: writing to a single partition, scheduling a distributed cron job, sequencing log entries, managing cluster membership. If two nodes simultaneously believe they are the authoritative writer for the same partition, they produce conflicting writes. This condition is called split-brain: two leaders, each convinced it is the legitimate one, accepting conflicting mutations. Resolving split-brain after the fact is expensive and sometimes impossible without data loss.

Leader election is the mechanism by which a group of nodes agrees on exactly one of them to serve as the authoritative coordinator for a given responsibility.

Algorithms

Bully algorithm: any node that notices the leader has failed starts an election by messaging every node with a higher ID. If no higher-ID node responds within a timeout, the initiator declares itself leader and notifies all lower-ID nodes. If a higher-ID node is alive, it responds and takes over the election, so the live node with the highest ID always wins. The algorithm assumes reliable message delivery and accurate timeouts. It generates O(n^2) messages in the worst case (when the lowest-ID node starts the election), which makes it impractical for large clusters.
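The message cost is easier to see in a small simulation. The following is a minimal sketch, assuming a synchronous model in which a message to a crashed node simply times out; the function name `bully_election` and its parameters are hypothetical, and only ELECTION messages are counted (not replies or the final leader announcement).

```python
def bully_election(initiator, all_ids, alive):
    """Simulate one bully election; return (leader, election_messages_sent).

    Assumption: messages to crashed nodes time out, messages to live nodes
    are always answered.
    """
    messages = 0
    pending = [initiator]   # nodes that still have to run their own round
    started = set()         # nodes that have already run a round
    leader = None
    while pending:
        node = pending.pop(0)
        if node in started:
            continue
        started.add(node)
        higher = [n for n in all_ids if n > node]
        messages += len(higher)          # one ELECTION message per higher ID
        responders = [n for n in higher if n in alive]
        if not responders:
            leader = node                # nobody higher answered: this node wins
        else:
            pending.extend(responders)   # every live higher node takes over
    return leader, messages


if __name__ == "__main__":
    # Node 5 (the old leader) has crashed; node 1 notices and starts the election.
    print(bully_election(1, [1, 2, 3, 4, 5], {1, 2, 3, 4}))
```

With five nodes and the lowest-ID node as initiator, the run above elects node 4 after 10 election messages (4 + 3 + 2 + 1), which is the quadratic growth the text describes.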

Raft leader election: nodes start as followers. A follower that times out without hearing from a leader increments its term, becomes a candidate, and requests votes. Each node grants at most one vote per term, and election timeouts are randomized so that candidates rarely split the vote. A candidate that collects votes from a majority of nodes becomes leader. Term numbers (monotonically increasing integers) prevent stale leaders: a node that receives a message with a higher term than its own adopts that term and immediately reverts to follower. Raft is widely used for embedded consensus in distributed systems (etcd, CockroachDB, Consul).
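The voting rules can be sketched in a few lines. This is a simplified illustration, not a Raft implementation: it assumes in-process calls instead of RPCs, omits timeouts and persistence, and leaves out the rule that a voter only grants its vote to a candidate whose log is at least as up to date as its own. The names `RaftNode`, `observe_term`, `handle_vote_request`, and `run_election` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RaftNode:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"

    def observe_term(self, term: int) -> None:
        """Adopt a newer term and step down, as any message with a higher term requires."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = "follower"

    def handle_vote_request(self, candidate_id: str, term: int) -> bool:
        """Grant at most one vote per term; refuse requests from older terms."""
        self.observe_term(term)
        if term < self.current_term:
            return False
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: RaftNode, peers: List[RaftNode]) -> int:
    """Candidate starts a new term, votes for itself, and asks every peer for a vote."""
    candidate.current_term += 1
    candidate.role = "candidate"
    candidate.voted_for = candidate.node_id
    votes = 1
    for peer in peers:
        if peer.handle_vote_request(candidate.node_id, candidate.current_term):
            votes += 1
    if votes > (len(peers) + 1) // 2:    # strict majority of the whole cluster
        candidate.role = "leader"
    return votes
```

In a three-node cluster, a second election started by another node raises the term, and the previous leader reverts to follower as soon as it sees the higher term in the vote request.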

ZooKeeper ephemeral znodes: an external coordinator approach. All candidates create an ephemeral sequential znode under a common path. The candidate that holds the znode with the lowest sequence number is leader. When the leader's session ends (crash or session timeout), ZooKeeper automatically deletes its ephemeral znode, and the candidate with the next lowest sequence number takes over. This offloads the election protocol to ZooKeeper.
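The decision each candidate makes after listing the children of the election path can be sketched as a pure function. This sketch does not talk to ZooKeeper; it assumes the child names have already been fetched, and it relies on ZooKeeper appending a 10-digit zero-padded counter to sequential znode names. The `candidate-` prefix and the function names are hypothetical. Returning the predecessor reflects the common recipe in which each candidate watches only the znode just ahead of its own, so a leader failure wakes one candidate rather than all of them.

```python
def sequence_number(znode_name):
    """Extract the 10-digit counter ZooKeeper appends to sequential znodes."""
    return int(znode_name[-10:])


def elect(children, my_znode):
    """Return (leader_znode, znode_to_watch) from this candidate's point of view.

    znode_to_watch is None when this candidate is itself the leader.
    """
    ordered = sorted(children, key=sequence_number)
    position = ordered.index(my_znode)
    leader = ordered[0]                      # lowest sequence number wins
    predecessor = ordered[position - 1] if position > 0 else None
    return leader, predecessor


if __name__ == "__main__":
    children = [
        "candidate-0000000012",
        "candidate-0000000007",
        "candidate-0000000009",
    ]
    print(elect(children, "candidate-0000000009"))
```

When the watched predecessor disappears, the candidate would list the children again and repeat the same check.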

Term Numbers and Lease-Based Leadership

Term numbers (also called epochs) are essential for preventing a stale leader from causing damage after it is replaced. Every write the leader issues includes its current term number. If a node receives a write with an older term than it has seen, it rejects it. A leader that was partitioned from the cluster, missed its replacement, and then reconnects cannot write stale data because its term number is outdated.

Lease-based leadership: the leader holds a time-limited lease (commonly 5-10 seconds) issued by the election protocol. The leader must renew its lease before it expires. If it fails to renew (network partition, crash), the lease expires and a new election begins. The leader must not take any authoritative action after its lease expires, even if it believes it is still leader.
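A leader-side lease check might look like the following sketch. It assumes the leader tracks expiry on its own monotonic clock and stops acting a little early (a hypothetical `safety_margin_s`) to allow for clock drift and renewal latency; the `Lease` class and `act_as_leader` helper are illustrative names, and the call that actually renews the lease with the election service is left out.

```python
import time


class Lease:
    def __init__(self, duration_s, safety_margin_s=0.0, clock=time.monotonic):
        self.duration_s = duration_s
        self.safety_margin_s = safety_margin_s
        self._clock = clock          # injectable so the logic can be tested
        self._expires_at = None      # no lease held until the first renewal

    def renew(self):
        """Record a successful renewal.

        Assumption: called only after the election service has acknowledged
        the renewal. A more conservative version would measure from the time
        the renewal request was sent.
        """
        self._expires_at = self._clock() + self.duration_s

    def is_valid(self):
        """True only while the lease, minus the safety margin, has not expired."""
        if self._expires_at is None:
            return False
        return self._clock() < self._expires_at - self.safety_margin_s


def act_as_leader(lease, action):
    """Run `action` only while the lease is still valid; report whether it ran."""
    if not lease.is_valid():
        return False
    action()
    return True
```

The check and the action are not atomic: the lease can expire after `is_valid()` returns and before `action()` finishes, so a lease check alone narrows the window for stale actions without closing it.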

Fencing Tokens

Lease expiry does not always prevent a stale leader from acting. A leader whose lease expires might still be executing an operation it started before expiry (for example, writing to a storage system with a slow network). Fencing tokens address this: the election system issues a monotonically increasing integer token with each new leader. Every write to the storage system must include the fencing token. The storage system rejects writes with a token lower than the highest it has seen. Even if the old leader eventually completes its write, the storage rejects it because the new leader has a higher token.
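The storage-side check is small. The sketch below assumes an in-memory store and a hypothetical `FencedStore` class; a real storage system would need to perform the comparison and the write atomically.

```python
class FencedStore:
    """Toy key-value store that enforces fencing tokens on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        """Apply the write unless a newer leader's token has already been seen."""
        if token < self.highest_token:
            return False                 # stale leader: reject
        self.highest_token = token
        self.data[key] = value
        return True


if __name__ == "__main__":
    store = FencedStore()
    store.write("config", "from old leader", token=33)
    store.write("config", "from new leader", token=34)
    # The old leader's delayed write arrives after the new leader has written.
    print(store.write("config", "late write from old leader", token=33))
    print(store.data["config"])
```

The delayed write with token 33 is rejected and the value written under token 34 is preserved, because the store has already seen the higher token.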

Production Usage

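Kafka: each partition has a leader broker that handles all writes and, by default, all reads for that partition. Partition leaders are chosen from the in-sync replicas by the cluster controller, which is itself elected through ZooKeeper in older deployments or through KRaft (Kafka's own Raft implementation) in newer ones.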

Kubernetes: control-plane components such as kube-scheduler and kube-controller-manager run leader election through Lease objects held in the API server, which persists them in etcd, to ensure only one replica of each component is active at a time.

Elasticsearch: each shard has one primary copy that handles writes before they are replicated. The master node, itself elected by a quorum of master-eligible nodes, promotes an in-sync replica to primary when the current primary fails.

Interview Tip

Fencing tokens are the probe that separates candidates who have read about leader election from those who have reasoned about it under failure. The scenario: the old leader holds a lease that expires, a new leader is elected, but the old leader does not know it has been replaced and tries to write. Without fencing, both leaders write. With fencing, the storage system rejects the old leader's write because the new leader's token is higher. The mechanism only works if the storage system participates in enforcement. A leader that writes to a system that does not check fencing tokens is not protected. This conditional is what L6-level answers include.
