Leader Election in System Design: Raft, Bully Algorithm, Heartbeats & Fencing (Visualized)
Leader election is how a cluster of nodes autonomously agrees on exactly one coordinator โ without a central authority. This guide covers why clusters need a leader, how Raft and the Bully algorithm elect one, heartbeat failure detection, split-brain prevention via terms and quorum, lease-based leadership, and real implementations in ZooKeeper, etcd, and Kubernetes.
Leader election is the process by which a group of distributed nodes collectively agrees on a single node โ the leader โ that is authorised to make decisions on behalf of the entire cluster. All other nodes become followers that replicate the leader's decisions or delegate coordination tasks to it. When the leader fails, the remaining nodes detect the failure and run a new election to pick a replacement, restoring the cluster's ability to make progress without human intervention.
The need for a leader arises whenever a cluster must perform work that requires a single authoritative source: writing to a replicated log in order, issuing distributed locks, assigning tasks to workers, or making schema changes. Without a designated coordinator, any two nodes might issue conflicting writes simultaneously โ a condition called a split-brain โ producing data corruption that is extremely difficult to untangle. Leader election is the distributed systems primitive that prevents split-brain by guaranteeing, at any given moment, at most one node believes it is in charge.
Why a Cluster Needs a Single Coordinator
Consider a three-node database cluster where all nodes can accept writes. Two clients update the same row at the same moment, each hitting a different node. Without coordination, both writes succeed locally and the nodes now disagree on the value โ a conflict. Resolving conflicts after the fact is expensive and sometimes impossible for non-commutative operations. The simpler invariant is to route all writes through one leader that serialises them, then replicates the result to followers. Reads can still be spread across followers for throughput. The cluster sacrifices some write throughput in exchange for strong consistency: any write the leader acknowledges is durable and ordered.
Beyond databases, a scheduler (like Kubernetes) needs one active control-plane instance to avoid scheduling the same pod twice. A distributed lock service needs one node to grant locks. A Kafka partition has one leader broker that accepts producer writes. In every case, the pattern is the same: elect one, replicate through it, re-elect on failure.
Raft-Style Election: Randomised Timeouts and Majority Votes
Raft is the most widely understood modern consensus algorithm and its election sub-protocol is elegant. Each follower runs an election timeout โ a timer reset every time it hears a heartbeat from the current leader. The timeout is chosen randomly (typically 150โ300 ms in Raft's reference implementation). If a follower's timer expires before it hears from the leader, it assumes the leader is dead, increments its term number (a monotonically increasing integer that acts as a logical epoch), and transitions to the candidate state.
The candidate immediately votes for itself and broadcasts a RequestVote RPC to every other node, including its current term and the index of the last log entry it holds. A follower grants its vote if: (1) it has not already voted in this term, and (2) the candidate's log is at least as up-to-date as the follower's. A candidate wins the election the moment it receives votes from a majority (floor(N/2) + 1) of nodes, including itself. It then becomes leader and immediately sends heartbeats to all followers to suppress any competing elections. Because only one candidate can collect a majority in a given term (pigeonhole principle), at most one leader can be elected per term.
Randomised timeouts are the key insight that avoids split votes โ the situation where two candidates each get half the votes and neither wins. If two candidates start at exactly the same time, both will time out and restart at a new random interval. Eventually one fires first, collects a majority, and suppresses all competitors. In practice, split votes are rare and resolve within milliseconds.
The Bully Algorithm
The Bully algorithm, published by Garcia-Molina in 1982, takes a different approach: the node with the highest ID always wins. When a node notices the leader is gone, it sends an ELECTION message to all nodes with higher IDs. If no higher-ID node responds within a timeout, the initiator declares itself leader and broadcasts a COORDINATOR message. If a higher-ID node does respond, it takes over the election process and the original node steps back. The result is deterministic: the highest-numbered live node always becomes leader, hence the name โ it bullies lower-ID nodes out of the race.
The Bully algorithm is simple to implement but generates O(Nยฒ) messages in the worst case (every node simultaneously detects failure and starts an election). It also assumes reliable message delivery and synchronised clocks for timeouts, which makes it fragile in high-latency or partitioned networks. Modern systems generally prefer Raft or Paxos over Bully for these reasons.
Heartbeats and Failure Detection
A leader broadcasts heartbeats โ small empty RPC calls โ to all followers at a fixed interval (typically 50โ150 ms). Followers reset their election timer on each heartbeat. If a follower does not receive a heartbeat within its election timeout window, it infers the leader has failed and starts an election. The gap between heartbeat interval and election timeout is intentional: one or two dropped heartbeats due to network jitter should not trigger an unnecessary election. A common ratio is a 150 ms heartbeat with a 300โ500 ms election timeout.
Failure detection is inherently probabilistic: a node that appears unreachable might be slow, not dead. Phi Accrual detectors (used by Akka and Cassandra) model the inter-arrival time of heartbeats as a probability distribution and output a suspicion level rather than a binary alive/dead signal, letting applications tune their sensitivity. Raft's simple timer is a coarser but sufficient approximation for most systems.
Terms and Epochs: Preventing Two Leaders at Once
The most dangerous failure mode in leader election is having two nodes simultaneously believe they are leader. This can happen if a leader is slow (not dead) and a follower times out and wins a new election before the old leader realises it has been deposed. Without additional safeguards, the old leader could continue accepting writes, causing a split-brain.
Raft solves this with terms. Every message carries the sender's current term. When a node receives a message with a higher term than its own, it immediately steps down to follower and updates its term. A leader that discovers it is behind in terms โ because it receives a heartbeat or vote response from a node with a higher term โ immediately reverts to follower. This guarantees that a deposed leader cannot continue acting as leader: as soon as it communicates with any other node, it learns the new term and stands down. The old leader's uncommitted writes are overwritten by log replication from the new leader.
Split-Brain and Fencing
Even with terms, a leader isolated by a network partition may not immediately receive the higher-term message that would demote it. During that window, both sides of the partition could have a leader. Raft's quorum requirement already limits the damage: the old leader (minority partition) cannot commit any new log entries because it cannot get acknowledgements from a majority. But if the storage backend is shared (e.g., an NFS mount or distributed filesystem), the old leader might still write to it directly.
Fencing tokens address this at the storage layer. When a leader acquires its lease, it receives a monotonically increasing fencing token (effectively the term number). Every write to the shared storage must include this token. The storage backend rejects any write carrying a token older than the highest it has seen. So even if the old leader attempts to write, the storage rejects it because the new leader's higher fencing token has already been observed. This is the pattern used by HDFS, ZooKeeper, and Kubernetes etcd-backed resources.
Lease-Based Leadership
A leader lease bounds the duration of leadership explicitly in wall-clock time. The elected leader writes a lease record to a consensus store (or receives a lease duration from the election protocol) and is guaranteed to be the sole leader for that duration. Followers will not start a new election until the lease expires. This improves read performance: the leader can serve reads locally without a round-trip to followers because it knows no other leader can exist until its lease expires.
Kubernetes uses this pattern directly: each control-plane component (kube-scheduler, kube-controller-manager) uses a Lease resource in the kube-system namespace as a distributed lock. The holder renews its lease every few seconds; if renewal stops (process crash, network issue), the lease expires and another instance wins the lock and takes over. The lease duration is typically 15 seconds with a renewal period of 10 seconds, giving 5 seconds of buffer before a takeover is triggered.
Leader Election in Real Systems
ZooKeeper provides election via ephemeral sequential nodes. Candidates create sequential znodes under a path like /election/candidate-. The candidate with the lowest sequence number is the leader. Each non-leader watches the znode immediately below its own; when a predecessor is deleted (because that node died and ZooKeeper garbage-collected its ephemeral node), the next candidate checks if it is now the lowest and, if so, takes over. This creates a watch chain that avoids the thundering-herd problem of all followers simultaneously trying to become leader.
etcd exposes a higher-level concurrency.Election API built on its MVCC key-value store and lease primitive. A candidate calls Campaign(ctx, value); if no leader exists, it wins immediately. The election is backed by a lease, so the leader must call Proclaim periodically or the lease expires and the election reopens. etcd powers Kubernetes' entire control plane, as well as Vitess (MySQL sharding), CockroachDB node coordination, and CoreDNS service discovery.
Apache Kafka historically used ZooKeeper for broker leader election but migrated to KRaft (Kafka Raft metadata) in Kafka 3.x, running its own Raft-based controller quorum without ZooKeeper. Each partition still has a leader broker elected by the controller; the controller itself is now elected by Raft among the controller-eligible brokers.
Algorithm Comparison
| Raft Election | Bully Algorithm | ZooKeeper Ephemeral Nodes | |
|---|---|---|---|
| Election trigger | Election timeout (randomised 150โ300 ms) | Any node detects leader failure | Ephemeral node deletion by ZK session expiry |
| Winner determined by | First candidate to collect majority votes | Highest node ID among live nodes | Lowest sequential znode number |
| Message complexity | O(N) per election | O(Nยฒ) worst case | O(1) per candidate (watch chain) |
| Split-brain prevention | Term numbers + quorum majority | Single winner by ID ordering | ZooKeeper session fencing + sequential ordering |
| Failure detection | Heartbeat timeout | Heartbeat timeout or probing | ZooKeeper session expiry (ephemeral TTL) |
| Liveness under partition | Majority partition makes progress | Highest-ID live node wins | Quorum of ZK servers must be reachable |
| Used in | etcd, CockroachDB, TiKV, Kafka KRaft | Academic / simple embedded systems | Apache Curator recipes, legacy Kafka, HBase |
Leader Election and the Consensus Problem
Leader election is a special case of the more general consensus problem: getting a group of processes to agree on a single value. In Raft, the election is agreement on which node is leader; subsequent log replication is agreement on the sequence of log entries. The two are inseparable โ Raft's log safety guarantee depends on only one leader existing per term, and the election mechanism enforces that invariant.
The FLP impossibility result (Fischer, Lynch, Paterson 1985) proved that no deterministic asynchronous distributed algorithm can always achieve consensus in the presence of even one failed process. Raft and Paxos sidestep this by using randomisation (election timeouts) and liveness assumptions (eventual message delivery) โ they are not guaranteed to terminate in every execution, but they terminate with high probability in practice.
# Simplified etcd leader election with the official etcd3 Python client
import etcd3
import time
ETCD_HOST = 'localhost'
ETCD_PORT = 2379
LEASE_TTL = 15 # seconds
ELECTION_KEY = '/my-service/leader'
def run_leader_election(node_id: str):
client = etcd3.client(host=ETCD_HOST, port=ETCD_PORT)
while True:
# Acquire a lease: if we stop renewing, etcd deletes our key automatically
lease = client.lease(LEASE_TTL)
# Atomic compare-and-swap: write our node_id only if the key doesn't exist
success = client.put_if_not_exists(
ELECTION_KEY, node_id.encode(), lease=lease
)
if success:
print(f'[{node_id}] Elected as leader (lease={lease.id})')
try:
# Renew the lease and perform leader work in a tight loop
while True:
lease.refresh() # heartbeat to etcd
do_leader_work(node_id)
time.sleep(LEASE_TTL // 3) # renew well before expiry
except Exception as e:
print(f'[{node_id}] Lost leadership: {e}')
lease.revoke()
else:
print(f'[{node_id}] Another node is leader -- watching for changes')
# Block until the current leader key is deleted (lease expires or revoked)
events, cancel = client.watch(ELECTION_KEY)
for event in events:
if isinstance(event, etcd3.events.DeleteEvent):
cancel()
break # re-enter election loop
def do_leader_work(node_id: str):
print(f'[{node_id}] Performing leader duties ...')
Frequently Asked Questions
What happens if two nodes win an election at the same time?
In Raft this cannot happen within the same term because only one node can receive votes from a strict majority โ a node can only vote once per term, so at most one candidate can collect floor(N/2) + 1 votes. If two candidates start simultaneously and split the votes, neither wins; both time out at a new random interval and a fresh election begins in the next term. The randomisation ensures the probability of repeated splits drops exponentially with each round, resolving elections in milliseconds in practice.
How long does a leader election take in practice?
In a typical etcd or Raft cluster on a LAN, an election completes in 150โ500 ms: one election-timeout interval (150โ300 ms) plus one round-trip for vote RPCs (1โ10 ms on a LAN). Kubernetes tolerates up to 15 seconds of leader-lease expiry before a replacement takes over, which is deliberately conservative to avoid flapping. For latency-sensitive workloads, Raft's pre-vote extension lets a candidate verify it can win before incrementing the term, reducing unnecessary disruption from isolated nodes that would never win.
Is leader election the same as distributed locking?
They are closely related but not identical. A distributed lock is acquired for a short-lived critical section and released immediately. Leader election is a long-lived lock: the winner holds it continuously, renewing it via heartbeats or lease renewals, until it crashes or voluntarily steps down. Both use the same primitives โ compare-and-swap in etcd, ephemeral nodes in ZooKeeper โ but a leader also takes on responsibility for cluster coordination during its tenure, making correctness concerns like fencing and term management more critical than they are for a typical short-lived lock.
A cluster without leader election is a committee without a chair โ everyone talks, no one decides. Election gives you one voice backed by the consent of the majority, and term numbers ensure that voice can never be duplicated.
โ alokknight Engineering
